walk: Send WorkerResults in batches by tavianator · Pull Request #1422 · sharkdp/fd

Fixes #1408, fixes #1362.

Benchmark results

Complete traversal

linux v6.5 (86,380 files)

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
bfs bench/corpus/linux -false	19.3 ± 0.4	18.6	20.3	1.02 ± 0.08
find bench/corpus/linux -false	96.7 ± 0.4	96.0	97.2	5.14 ± 0.39
fd -u '^$' bench/corpus/linux	229.0 ± 29.4	135.6	239.2	12.16 ± 1.82
fd-master -u '^$' bench/corpus/linux	61.4 ± 18.2	29.7	74.4	3.26 ± 1.00
fd-batch -u '^$' bench/corpus/linux	18.8 ± 1.4	16.0	20.8	1.00

rust 1.72.1 (192,714 files)

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
bfs bench/corpus/rust -false	53.7 ± 1.9	51.2	58.4	1.57 ± 0.11
find bench/corpus/rust -false	304.5 ± 0.9	302.7	305.8	8.91 ± 0.51
fd -u '^$' bench/corpus/rust	360.0 ± 0.9	358.7	361.4	10.53 ± 0.61
fd-master -u '^$' bench/corpus/rust	70.9 ± 20.6	44.5	91.1	2.07 ± 0.62
fd-batch -u '^$' bench/corpus/rust	34.2 ± 2.0	30.9	37.3	1.00

chromium 119.0.6036.2 (2,119,292 files)

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
bfs bench/corpus/chromium -false	516.9 ± 9.1	504.1	532.8	2.10 ± 0.04
find bench/corpus/chromium -false	3218.6 ± 9.7	3205.6	3242.2	13.07 ± 0.12
fd -u '^$' bench/corpus/chromium	2522.9 ± 50.4	2484.3	2602.1	10.25 ± 0.22
fd-master -u '^$' bench/corpus/chromium	281.3 ± 21.3	259.5	306.2	1.14 ± 0.09
fd-batch -u '^$' bench/corpus/chromium	246.2 ± 2.1	243.5	250.1	1.00

Printing paths

Without colors

linux v6.5

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
bfs bench/corpus/linux	32.0 ± 1.5	29.4	34.8	1.47 ± 0.11
find bench/corpus/linux	102.4 ± 1.0	101.1	104.5	4.70 ± 0.27
fd -u --search-path bench/corpus/linux	152.3 ± 41.8	132.7	248.7	7.00 ± 1.96
fd-master -u --search-path bench/corpus/linux	72.4 ± 22.0	46.3	98.7	3.33 ± 1.03
fd-batch -u --search-path bench/corpus/linux	21.8 ± 1.2	19.4	23.7	1.00

chromium 119.0.6036.2

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
bfs bench/corpus/chromium	707.0 ± 28.2	668.0	768.4	2.32 ± 0.10
find bench/corpus/chromium	3378.1 ± 11.0	3368.4	3399.6	11.09 ± 0.12
fd -u --search-path bench/corpus/chromium	2495.7 ± 64.0	2440.6	2577.8	8.20 ± 0.23
fd-master -u --search-path bench/corpus/chromium	776.0 ± 19.8	742.4	820.3	2.55 ± 0.07
fd-batch -u --search-path bench/corpus/chromium	304.5 ± 3.1	297.2	307.3	1.00

With colors

linux v6.5

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
bfs bench/corpus/linux -color	221.5 ± 2.8	215.2	226.3	4.43 ± 0.26
fd -u --search-path bench/corpus/linux --color=always	172.4 ± 51.1	133.9	251.0	3.45 ± 1.04
fd-master -u --search-path bench/corpus/linux --color=always	81.1 ± 18.3	69.0	120.2	1.62 ± 0.38
fd-batch -u --search-path bench/corpus/linux --color=always	50.0 ± 2.9	47.4	56.9	1.00

chromium 119.0.6036.2

Command	Mean [s]	Min [s]	Max [s]	Relative
bfs bench/corpus/chromium -color	5.644 ± 0.022	5.612	5.685	4.64 ± 0.07
fd -u --search-path bench/corpus/chromium --color=always	2.502 ± 0.072	2.448	2.614	2.06 ± 0.07
fd-master -u --search-path bench/corpus/chromium --color=always	4.738 ± 0.156	4.496	5.037	3.89 ± 0.14
fd-batch -u --search-path bench/corpus/chromium --color=always	1.218 ± 0.018	1.199	1.250	1.00

Parallelism

rust 1.72.1

`-j1`

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
bfs -j1 bench/corpus/rust -false	219.0 ± 1.7	216.7	221.5	1.00
fd -j1 -u '^$' bench/corpus/rust	271.1 ± 2.7	269.2	278.8	1.24 ± 0.02
fd-master -j1 -u '^$' bench/corpus/rust	275.5 ± 1.7	273.6	279.0	1.26 ± 0.01
fd-batch -j1 -u '^$' bench/corpus/rust	275.2 ± 2.3	271.8	278.8	1.26 ± 0.01

`-j2`

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
bfs -j2 bench/corpus/rust -false	199.7 ± 3.3	196.1	207.9	1.30 ± 0.03
fd -j2 -u '^$' bench/corpus/rust	219.3 ± 6.8	210.0	229.6	1.42 ± 0.05
fd-master -j2 -u '^$' bench/corpus/rust	158.4 ± 1.7	155.0	160.6	1.03 ± 0.02
fd-batch -j2 -u '^$' bench/corpus/rust	154.2 ± 2.0	152.0	159.8	1.00

`-j3`

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
bfs -j3 bench/corpus/rust -false	118.0 ± 7.0	110.3	129.7	1.08 ± 0.07
fd -j3 -u '^$' bench/corpus/rust	214.3 ± 7.4	198.6	224.2	1.95 ± 0.07
fd-master -j3 -u '^$' bench/corpus/rust	115.7 ± 3.5	111.0	120.3	1.06 ± 0.04
fd-batch -j3 -u '^$' bench/corpus/rust	109.6 ± 1.7	105.9	112.5	1.00

`-j4`

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
bfs -j4 bench/corpus/rust -false	85.7 ± 5.2	79.0	95.4	1.00
fd -j4 -u '^$' bench/corpus/rust	221.8 ± 6.8	204.5	234.2	2.59 ± 0.18
fd-master -j4 -u '^$' bench/corpus/rust	94.1 ± 4.3	88.4	100.0	1.10 ± 0.08
fd-batch -j4 -u '^$' bench/corpus/rust	86.5 ± 1.3	82.7	88.8	1.01 ± 0.06

`-j6`

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
bfs -j6 bench/corpus/rust -false	62.4 ± 1.8	59.8	65.5	1.00
fd -j6 -u '^$' bench/corpus/rust	231.9 ± 13.6	201.8	246.2	3.71 ± 0.24
fd-master -j6 -u '^$' bench/corpus/rust	76.2 ± 5.7	66.1	81.8	1.22 ± 0.10
fd-batch -j6 -u '^$' bench/corpus/rust	62.5 ± 1.3	60.6	66.7	1.00 ± 0.04

`-j8`

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
bfs -j8 bench/corpus/rust -false	53.0 ± 1.3	51.0	55.9	1.03 ± 0.04
fd -j8 -u '^$' bench/corpus/rust	230.0 ± 4.4	223.8	237.5	4.49 ± 0.14
fd-master -j8 -u '^$' bench/corpus/rust	59.4 ± 6.6	53.9	76.3	1.16 ± 0.13
fd-batch -j8 -u '^$' bench/corpus/rust	51.2 ± 1.2	49.3	53.4	1.00

`-j12`

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
bfs -j12 bench/corpus/rust -false	53.6 ± 2.8	48.2	62.7	1.32 ± 0.08
fd -j12 -u '^$' bench/corpus/rust	245.5 ± 14.9	224.8	268.6	6.03 ± 0.40
fd-master -j12 -u '^$' bench/corpus/rust	56.5 ± 11.8	48.2	75.8	1.39 ± 0.29
fd-batch -j12 -u '^$' bench/corpus/rust	40.7 ± 1.1	39.0	43.8	1.00

`-j16`

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
bfs -j16 bench/corpus/rust -false	68.8 ± 7.0	54.7	79.2	1.88 ± 0.21
fd -j16 -u '^$' bench/corpus/rust	246.9 ± 10.5	238.5	276.0	6.76 ± 0.42
fd-master -j16 -u '^$' bench/corpus/rust	54.6 ± 14.9	45.2	81.1	1.50 ± 0.41
fd-batch -j16 -u '^$' bench/corpus/rust	36.5 ± 1.7	33.7	39.7	1.00

Process spawning

linux v6.5

One file per process

Command	Mean [s]	Min [s]	Max [s]	Relative
bfs bench/corpus/linux -maxdepth 2 -exec true -- {} \;	1.391 ± 0.066	1.309	1.469	9.65 ± 0.75
find bench/corpus/linux -maxdepth 2 -exec true -- {} \;	1.351 ± 0.028	1.312	1.396	9.37 ± 0.61
fd -u --search-path bench/corpus/linux --max-depth=2 -x true --	0.274 ± 0.059	0.182	0.349	1.90 ± 0.42
fd -j1 -u --search-path bench/corpus/linux --max-depth=2 -x true --	1.015 ± 0.028	0.970	1.050	7.04 ± 0.48
fd-master -u --search-path bench/corpus/linux --max-depth=2 -x true --	0.157 ± 0.021	0.136	0.191	1.09 ± 0.16
fd-master -j1 -u --search-path bench/corpus/linux --max-depth=2 -x true --	1.013 ± 0.015	0.998	1.047	7.03 ± 0.44
fd-batch -u --search-path bench/corpus/linux --max-depth=2 -x true --	0.144 ± 0.009	0.127	0.158	1.00
fd-batch -j1 -u --search-path bench/corpus/linux --max-depth=2 -x true --	1.013 ± 0.028	0.973	1.064	7.03 ± 0.47

Many files per process

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
bfs bench/corpus/linux -exec true -- {} +	73.0 ± 1.8	69.8	76.3	1.00
find bench/corpus/linux -exec true -- {} +	256.1 ± 0.7	255.2	257.3	3.51 ± 0.08
fd -u --search-path bench/corpus/linux -X true --	311.0 ± 33.1	217.1	328.5	4.26 ± 0.47
fd-master -u --search-path bench/corpus/linux -X true --	198.9 ± 20.4	170.4	215.6	2.72 ± 0.29
fd-batch -u --search-path bench/corpus/linux -X true --	146.5 ± 1.8	144.3	152.0	2.01 ± 0.05

Spawn in parent directory

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
bfs bench/corpus/linux -maxdepth 3 -execdir true -- {} +	972.9 ± 15.2	943.3	995.2	1.00
find bench/corpus/linux -maxdepth 3 -execdir true -- {} +	1030.5 ± 25.0	991.8	1062.4	1.06 ± 0.03

Details

Versions

$ bfs --version | head -n1
bfs 3.0.4
$ find --version | head -n1
find (GNU findutils) 4.9.0
$ fd --version
fd 8.7.1
$ fd-master --version
fd 8.7.1
$ fd-batch --version
fd 8.7.1

tavianator@tachyon $ hyperfine -w2 fd{,-{master,batch}}" -u '^$' /tmp/empty"
Benchmark 1: fd -u '^$' /tmp/empty
  Time (mean ± σ):     143.1 ms ±   5.3 ms    [User: 9.0 ms, System: 132.6 ms]
  Range (min … max):   134.2 ms … 152.7 ms    21 runs

Benchmark 2: fd-master -u '^$' /tmp/empty
  Time (mean ± σ):      63.6 ms ±   8.6 ms    [User: 6.0 ms, System: 58.2 ms]
  Range (min … max):    51.7 ms …  80.0 ms    43 runs

Benchmark 3: fd-batch -u '^$' /tmp/empty
  Time (mean ± σ):       6.8 ms ±   2.0 ms    [User: 2.6 ms, System: 7.6 ms]
  Range (min … max):     4.3 ms …  12.4 ms    283 runs

  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the -N/--shell=none option to disable the shell completely.

Summary
  fd-batch -u '^$' /tmp/empty ran
    9.33 ± 3.08 times faster than fd-master -u '^$' /tmp/empty
   20.99 ± 6.36 times faster than fd -u '^$' /tmp/empty

let items = batch.as_mut().unwrap();
items.push(item);

if items.len() == 1 {

I wonder if it would be better to send over batches after reaching a certain size. That would mean we could know how large to set the initial capacity of the Vec, and possibly avoid contention on the mutex. However, it means receiver threads could end up waiting longer to get results, especially if it takes a while to find more results in the sender threads, so what you have might be better.

I just measured the average batch size: 209. I don't think we have to try too hard to send larger batches :)

The risk of a minimum batch size is it could stall a very long time if most results are being filtered out. We could do something like always send the batch after N entries are encountered, regardless of whether they're added to the batch. But I doubt it's worth it.

There isn't very much mutex contention with this design anyway. It's only between the receiver and at most one sender, and the receiver critical section is extremely short (just `lock().take().unwrap().into_iter()`). I actually just checked with `perf trace` and the receiver only blocked 136 times over the whole Chromium benchmark (2.1M files). Each sender blocked between 3 and 12 times.
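For readers following along, here is a minimal sketch of the batching scheme discussed above, written against `std::sync::mpsc` and `std::sync::Mutex` only. It is not the PR's code: the names `Slot`, `BatchSender`, `drain`, and `run_demo` are made up for illustration, and the real implementation may use different channel and lock types. The idea it models is that a sender keeps appending to a shared slot and only notifies the receiver when the slot goes from empty to non-empty, while the receiver takes the whole batch in one short critical section.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::{Arc, Mutex};
use std::thread;

/// Shared slot for one sender's in-flight batch (hypothetical name).
/// `None` means the receiver has already taken the previous batch.
type Slot<T> = Arc<Mutex<Option<Vec<T>>>>;

/// Sender half: appends to the current batch and notifies the receiver
/// only when a batch goes from empty to one item.
struct BatchSender<T> {
    slot: Slot<T>,
    tx: Sender<Slot<T>>,
}

impl<T> BatchSender<T> {
    fn new(tx: Sender<Slot<T>>) -> Self {
        BatchSender { slot: Arc::new(Mutex::new(None)), tx }
    }

    fn send(&mut self, item: T) {
        let mut guard = self.slot.lock().unwrap();
        let items = guard.get_or_insert_with(Vec::new);
        items.push(item);
        if items.len() == 1 {
            // First item of a fresh batch: hand the slot to the receiver.
            // An error here only means the receiver is gone, so ignore it.
            let _ = self.tx.send(Arc::clone(&self.slot));
        }
    }
}

/// Receiver half: take each announced batch whole, then release the lock.
fn drain<T>(rx: Receiver<Slot<T>>) -> Vec<T> {
    let mut out = Vec::new();
    for slot in rx {
        let items = slot.lock().unwrap().take().unwrap();
        out.extend(items);
    }
    out
}

/// Spawn `senders` threads that each send `per_sender` distinct numbers,
/// and return everything the receiver collected, sorted.
fn run_demo(senders: usize, per_sender: usize) -> Vec<usize> {
    let (tx, rx) = channel();
    let handles: Vec<_> = (0..senders)
        .map(|s| {
            let mut batch_tx = BatchSender::new(tx.clone());
            thread::spawn(move || {
                for i in 0..per_sender {
                    batch_tx.send(s * per_sender + i);
                }
            })
        })
        .collect();
    drop(tx); // the channel closes once every sender thread has finished
    let mut got = drain(rx);
    for handle in handles {
        handle.join().unwrap();
    }
    got.sort_unstable();
    got
}
```

Because a slot is only announced on the empty-to-non-empty transition, each sender has at most one handle in the channel at a time, which is why the receiver's `take().unwrap()` cannot hit an empty slot in this sketch.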

Thanks @tmccombs! I'll let @sharkdp review it too since he has his own benchmarks and he found a fatal flaw in the last attempt :)

So I am comparing master @ d62bbbb with this branch @ 815b3b1. After struggling a bit with thermal throttling affecting the benchmark results, I now have clean results — and they look great!

I see the expected massive improvement in startup speed (last benchmark below), which is great
I see a large (20%) performance gain on searches with lots of results, which is great
I see a small (< 5%) performance gain on searches with few results, which is also great
I see a medium (6-8%) performance loss on command execution benchmarks, which might be acceptable — but maybe something we could still look into?

`fd` regression benchmark

No pattern

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
./fd-master --hidden --no-ignore '' '/some/folder'	798.4 ± 33.6	746.0	842.6	1.18 ± 0.05
./fd-1422 --hidden --no-ignore '' '/some/folder'	677.7 ± 8.3	667.9	698.9	1.00

Simple pattern

Command	Mean [s]	Min [s]	Max [s]	Relative
./fd-master '.*[0-9]\.jpg$' '/some/folder'	1.492 ± 0.006	1.483	1.504	1.01 ± 0.01
./fd-1422 '.*[0-9]\.jpg$' '/some/folder'	1.479 ± 0.008	1.466	1.494	1.00

Simple pattern (-HI)

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
./fd-master -HI '.*[0-9]\.jpg$' '/some/folder'	590.3 ± 2.8	585.7	594.3	1.05 ± 0.01
./fd-1422 -HI '.*[0-9]\.jpg$' '/some/folder'	562.8 ± 2.4	557.5	566.0	1.00

File extension

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
./fd-master -HI --extension jpg '' '/some/folder'	644.9 ± 2.7	640.1	647.6	1.04 ± 0.01
./fd-1422 -HI --extension jpg '' '/some/folder'	620.1 ± 2.4	616.3	622.9	1.00

File type

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
./fd-master -HI --type l '' '/some/folder'	605.1 ± 8.6	599.0	628.6	1.05 ± 0.02
./fd-1422 -HI --type l '' '/some/folder'	575.2 ± 4.0	570.2	580.7	1.00

Command execution

Command	Mean [s]	Min [s]	Max [s]	Relative
./fd-master 'ab' '/some/folder' --exec echo	4.711 ± 0.031	4.688	4.792	1.00
./fd-1422 'ab' '/some/folder' --exec echo	5.109 ± 0.036	5.057	5.176	1.08 ± 0.01

Command execution (large output)

Command	Mean [s]	Min [s]	Max [s]	Relative
./fd-master -tf 'ab' '/some/folder' --exec cat	4.719 ± 0.028	4.682	4.779	1.00
./fd-1422 -tf 'ab' '/some/folder' --exec cat	5.015 ± 0.053	4.915	5.087	1.06 ± 0.01

Empty folder benchmark

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
./fd-master -u . /tmp/empty	28.2 ± 0.6	27.3	31.2	7.65 ± 0.44
./fd-1422 -u . /tmp/empty	3.7 ± 0.2	3.5	5.3	1.00

Yeah I see the --exec regression in my benchmarks too. I'm guessing it's because we have N receiver threads in that case, not just 1, so there's more contention. I'm testing a fix.

Actually the problem is not contention, it's that results are not evenly distributed to the receivers because the batch sizes vary wildly. So one `exec::job()` can get way more results to process than the others. E.g. I just saw this distribution:

-x: 0
-x: 0
-x: 0
-x: 3
-x: 2
-x: 33
-x: 45
-x: 55
-x: 63
-x: 66
-x: 47
-x: 73
-x: 116
-x: 133
-x: 132
-x: 115
-x: 125
-x: 151
-x: 186
-x: 421

Maybe lowering the max batch size for -x will help.
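To put a number on the imbalance in the distribution above, the following sketch computes the total, the maximum, and the ratio of the maximum to the mean for the 20 per-receiver counts quoted in this comment. The function and constant names are illustrative only.

```rust
/// Per-receiver result counts copied from the `-x:` lines above.
const OBSERVED: [u32; 20] = [
    0, 0, 0, 3, 2, 33, 45, 55, 63, 66, 47, 73, 116, 133, 132, 115, 125, 151, 186, 421,
];

/// Returns (total, max, max / mean) for a set of per-receiver counts.
fn imbalance(counts: &[u32]) -> (u32, u32, f64) {
    let total: u32 = counts.iter().sum();
    let max = counts.iter().copied().max().unwrap_or(0);
    let mean = total as f64 / counts.len().max(1) as f64;
    let ratio = if mean > 0.0 { max as f64 / mean } else { 0.0 };
    (total, max, ratio)
}
```

For the observed counts this gives 1,766 results in total, with the busiest receiver handling 421 of them, roughly 4.8 times the mean of about 88 per receiver.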

> Maybe lowering the max batch size for -x will help.

Yep, that works! --exec perf is now better than master, and nothing else seems to have regressed. New benchmark results are in the top comment.

I was actually thinking that for the --exec case, it might be more performant to execute on the sender thread instead of having N receiver threads. I'm not sure if that would be worth the complexity though.

Unfortunately, it looks like things got worse for me with 1469bf3

Command	Mean [s]	Min [s]	Max [s]	Relative
./fd-master 'ab' '/some/folder' --exec echo	4.742 ± 0.099	4.691	5.022	1.00
./fd-1422 'ab' '/some/folder' --exec echo	5.791 ± 0.031	5.759	5.846	1.22 ± 0.03

I was also wondering what happens in case of (few search results but) longer running programs? Would batching lead to situations where one receiver would run significantly longer than the rest?

> Unfortunately, it looks like things got worse for me with 1469bf3

Interesting! Doesn't reproduce for me, even with a similar test (echo instead of true, no -u, similar execution time):

tavianator@tachyon $ hyperfine -w2 fd-{batch,master}" --search-path ~/code/linux -d4 -x echo"
Benchmark 1: fd-batch --search-path ~/code/linux -d4 -x echo
  Time (mean ± σ):      3.622 s ±  0.012 s    [User: 36.389 s, System: 32.154 s]
  Range (min … max):    3.602 s …  3.640 s    10 runs

Benchmark 2: fd-master --search-path ~/code/linux -d4 -x echo
  Time (mean ± σ):      3.693 s ±  0.011 s    [User: 36.529 s, System: 32.544 s]
  Range (min … max):    3.674 s …  3.707 s    10 runs

Summary
  fd-batch --search-path ~/code/linux -d4 -x echo ran
    1.02 ± 0.00 times faster than fd-master --search-path ~/code/linux -d4 -x echo

> I was also wondering what happens in case of (few search results but) longer running programs? Would batching lead to situations where one receiver would run significantly longer than the rest?

In the degenerate case, any batch size N > 1 can lead to an Nx slowdown: if there are exactly N matches, all ending up in the same batch (unlikely but possible), they will be executed sequentially rather than in parallel.

Would you mind trying some variations?

Set the batch size to 1 in --exec mode (line 453)
Set the channel capacity to 2 * config.threads (line 641)
Both of the above

Neither of those makes a big difference for me, but they may help on your machine.
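The two variations listed above can be modelled outside of fd with a small standard-library sketch: results are split into batches of at most `limit` items and pushed through a channel bounded at `2 * threads` batches, with `threads` receivers pulling from it. This is only an approximation for experimenting with the two knobs. The function name `distribute`, the use of `std::sync::mpsc::sync_channel`, and the shared `Mutex<Receiver>` are assumptions of this sketch, not what lines 453 and 641 of the PR actually contain.

```rust
use std::sync::mpsc::sync_channel;
use std::sync::{Arc, Mutex};
use std::thread;

/// Send `items` to `threads` receivers in batches of at most `limit`,
/// through a channel that holds at most `2 * threads` pending batches.
/// Returns how many items each receiver ended up processing.
fn distribute(items: Vec<u32>, threads: usize, limit: usize) -> Vec<usize> {
    assert!(threads > 0, "need at least one receiver");
    let (tx, rx) = sync_channel::<Vec<u32>>(2 * threads);
    let rx = Arc::new(Mutex::new(rx));

    let workers: Vec<_> = (0..threads)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || {
                let mut handled = 0;
                loop {
                    // The lock is held only while waiting for the next batch.
                    let batch = match rx.lock().unwrap().recv() {
                        Ok(batch) => batch,
                        Err(_) => break, // channel closed and drained
                    };
                    handled += batch.len();
                }
                handled
            })
        })
        .collect();

    // A limit of 1 corresponds to "batch size 1 in --exec mode".
    for chunk in items.chunks(limit.max(1)) {
        tx.send(chunk.to_vec()).unwrap();
    }
    drop(tx);

    workers.into_iter().map(|w| w.join().unwrap()).collect()
}
```

With `limit` set to 1 every receiver competes for individual results, so no single receiver can be handed a large batch. The bounded channel keeps the sender from running far ahead of slow receivers.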

> I was actually thinking that for the --exec case, it might be more performant to execute on the sender thread instead of having N receiver threads. I'm not sure if that would be worth the complexity though.

That's worth trying!

> if there are exactly N matches, all ending up in the same batch (unlikely but possible), they will be executed sequentially rather than in parallel

It's not that unlikely. I often use fd -x to simply perform a task in parallel across a list of files in the current directory (or "closeby", i.e. the search is extremely fast compared to the executed tasks). For example:

> mkdir /tmp/test
> cd /tmp/test
> touch $(seq 12)

and then:

> fd-master -x bash -c "sleep 1 && echo {}"
[takes 1 second]

> fd-1422 -x bash -c "sleep 1 && echo {}"
[takes 11 seconds]
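A rough model makes the timings above easier to reason about. The sketch below assumes each batch is executed sequentially by one receiver, batches go to whichever receiver is free first, and every task takes the same time. The function name `makespan` is made up for illustration. The 11-second result would be consistent with the first file being sent alone and the other 11 landing in a second batch, but that split is an assumption here, not something stated in the thread.

```rust
/// Wall-clock seconds needed to run all batches when each batch is run
/// sequentially by a single worker and every task takes `task_secs`.
/// Batches are assigned, in order, to the worker that becomes free first.
fn makespan(batches: &[u64], workers: usize, task_secs: u64) -> u64 {
    assert!(workers > 0, "need at least one worker");
    let mut busy_until = vec![0u64; workers];
    for &len in batches {
        // Pick the worker that becomes idle first and queue the batch there.
        let next = busy_until.iter_mut().min().unwrap();
        *next += len * task_secs;
    }
    busy_until.into_iter().max().unwrap_or(0)
}
```

With 12 receivers and 1-second tasks, twelve batches of one item finish in 1 second, while the assumed split into batches of 1 and 11 takes 11 seconds, and a single batch of 12 would take 12.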

> I was actually thinking that for the --exec case, it might be more performant to execute on the sender thread instead of having N receiver threads.

I think that would have the same problem. We really want to decouple the search from the command execution in order to balance the load in both of these cases: (1) long search and fast task execution (2) fast search and long task execution.

@sharkdp Actually it seems like both the changes I suggested in #1422 (comment) are beneficial in general, so I included them in this PR. Can you give it another try?

-let mut results: Vec<ExitCode> = Vec::new();
-loop {
+let mut ret = ExitCode::Success;
+for result in results {
 // Obtain the next result from the receiver, else if the channel
 // has closed, exit from the loop

-// has closed, exit from the loop
+// has closed, exit from the loop.

@@ -36,13 +36,91 @@ enum ReceiverMode {

/// The Worker threads can result in a valid entry having PathBuf or an error.

-/// The Worker threads can result in a valid entry having PathBuf or an error.
+/// The Worker threads can result in a valid entry having `PathBuf` or an error.

pub enum WorkerResult {
// Errors should be rare, so it's probably better to allow large_enum_variant than
// to box the Entry variant
Entry(DirEntry),
Error(ignore::Error),
}

/// A batch of WorkerResults to send over a channel.

-/// A batch of WorkerResults to send over a channel.
+/// A batch of `WorkerResult`s to send over a channel.

Ok(())
}
}

/// Maximum size of the output buffer before flushing results to the console

-/// Maximum size of the output buffer before flushing results to the console
+/// Maximum size of the output buffer before flushing results to the console.

@@ -319,13 +403,13 @@ impl WorkerState {

/// Run the receiver work, either on this thread or a pool of background
/// threads (for --exec).

-/// threads (for --exec).
+/// threads (for `--exec`).

I re-ran the benchmarks, comparing master @ 5903dec with this branch @ b8a5f95.

I do see comparable performance between this branch and master for larger searches
I do see comparable performance for --exec scenarios
I still see the massive speedup for small/empty folders.

But it might be worth noting that the speedups for longer-running searches that I saw in previous iterations are now apparently gone. This still seems worth merging though to increase the startup speed!

Should we look into this comment, now that this is merged? #1412 (comment)

Yeah probably. I'd be curious to see how it scales on different CPUs. Here are my results for

$ hyperfine -w1 -P j 4 48 -D 2 "fd -j{j} -u --search-path ~/code/bfs/bench/corpus/chromium"

In both cases the sweet spot is near the physical core count, so maybe we should try num_cpus::get_physical() as the default (with a higher cap like 64).

Threadripper 3960X (24 cores, 48 threads)

Intel Core i7-1280P (14 cores (6 P-cores, 8 E-cores), 20 threads)

I assume the high variance at -j14 is due to sometimes getting scheduled on a hyperthread.
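The default suggested above (physical core count, with a cap such as 64) amounts to a simple clamp. The sketch below keeps the core count as a plain parameter so it stays dependency-free. In practice the value would presumably come from `num_cpus::get_physical()` as mentioned above, and the function name `default_threads` is hypothetical.

```rust
/// Default thread count under the proposal: the number of physical cores,
/// clamped to the range [1, cap]. `physical_cores` is passed in by the
/// caller (for example from `num_cpus::get_physical()`).
fn default_threads(physical_cores: usize, cap: usize) -> usize {
    physical_cores.clamp(1, cap.max(1))
}
```

For the two machines measured here this yields 24 threads on the Threadripper 3960X and 14 on the Core i7-1280P, and it only starts limiting the count above 64 cores.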

Mainly I'm curious about lower core count CPUs. E.g. on a 1-core, 2-thread CPU, I assume -j2 is better than -j1. On a 2-core, 4-thread CPU, -j4 is probably better than -j2. On a 4-core, 8-thread CPU? I don't really know. Where's the crossover point where we want to stop using hyperthreads?

> Mainly I'm curious about lower core count CPUs. E.g. on a 1-core, 2-thread CPU, I assume -j2 is better than -j1. On a 2-core, 4-thread CPU, -j4 is probably better than -j2. On a 4-core, 8-thread CPU? I don't really know. Where's the crossover point where we want to stop using hyperthreads?

Well this is interesting. I tried to simulate this using taskset to pin fd to various CPUs. Not sure how closely this matches real CPUs with fewer cores, but here's what I found:

Even when pinned to a single core/thread, -j8 is much better than -j1. I don't really understand why yet. I mean, -j8 lets it do more I/O in parallel, but everything is cached in this benchmark so that shouldn't matter.

Intel Core i7-10850H, 6 cores, 12 threads

Does not seem to be so clear in this case. Setting it to num_cpus::get_physical() == 6 in this case would not lead to the best performance.

And this is from a server with nproc == 6. Apparently it's an Intel Xeon E5-2680, which should have 8 cores and 16 threads, but I guess there is some virtualization going on.

Reviewers

tmccombs approved these changes

sharkdp approved these changes
