walk: Send WorkerResults in batches by tavianator · Pull Request #1422 · sharkdp/fd
Benchmark results
Complete traversal
linux v6.5 (86,380 files)
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
bfs bench/corpus/linux -false | 19.3 ± 0.4 | 18.6 | 20.3 | 1.02 ± 0.08 |
find bench/corpus/linux -false | 96.7 ± 0.4 | 96.0 | 97.2 | 5.14 ± 0.39 |
fd -u '^$' bench/corpus/linux | 229.0 ± 29.4 | 135.6 | 239.2 | 12.16 ± 1.82 |
fd-master -u '^$' bench/corpus/linux | 61.4 ± 18.2 | 29.7 | 74.4 | 3.26 ± 1.00 |
fd-batch -u '^$' bench/corpus/linux | 18.8 ± 1.4 | 16.0 | 20.8 | 1.00 |
rust 1.72.1 (192,714 files)
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
bfs bench/corpus/rust -false | 53.7 ± 1.9 | 51.2 | 58.4 | 1.57 ± 0.11 |
find bench/corpus/rust -false | 304.5 ± 0.9 | 302.7 | 305.8 | 8.91 ± 0.51 |
fd -u '^$' bench/corpus/rust | 360.0 ± 0.9 | 358.7 | 361.4 | 10.53 ± 0.61 |
fd-master -u '^$' bench/corpus/rust | 70.9 ± 20.6 | 44.5 | 91.1 | 2.07 ± 0.62 |
fd-batch -u '^$' bench/corpus/rust | 34.2 ± 2.0 | 30.9 | 37.3 | 1.00 |
chromium 119.0.6036.2 (2,119,292 files)
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
bfs bench/corpus/chromium -false | 516.9 ± 9.1 | 504.1 | 532.8 | 2.10 ± 0.04 |
find bench/corpus/chromium -false | 3218.6 ± 9.7 | 3205.6 | 3242.2 | 13.07 ± 0.12 |
fd -u '^$' bench/corpus/chromium | 2522.9 ± 50.4 | 2484.3 | 2602.1 | 10.25 ± 0.22 |
fd-master -u '^$' bench/corpus/chromium | 281.3 ± 21.3 | 259.5 | 306.2 | 1.14 ± 0.09 |
fd-batch -u '^$' bench/corpus/chromium | 246.2 ± 2.1 | 243.5 | 250.1 | 1.00 |
Printing paths
Without colors
linux v6.5
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
bfs bench/corpus/linux | 32.0 ± 1.5 | 29.4 | 34.8 | 1.47 ± 0.11 |
find bench/corpus/linux | 102.4 ± 1.0 | 101.1 | 104.5 | 4.70 ± 0.27 |
fd -u --search-path bench/corpus/linux | 152.3 ± 41.8 | 132.7 | 248.7 | 7.00 ± 1.96 |
fd-master -u --search-path bench/corpus/linux | 72.4 ± 22.0 | 46.3 | 98.7 | 3.33 ± 1.03 |
fd-batch -u --search-path bench/corpus/linux | 21.8 ± 1.2 | 19.4 | 23.7 | 1.00 |
chromium 119.0.6036.2
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
bfs bench/corpus/chromium | 707.0 ± 28.2 | 668.0 | 768.4 | 2.32 ± 0.10 |
find bench/corpus/chromium | 3378.1 ± 11.0 | 3368.4 | 3399.6 | 11.09 ± 0.12 |
fd -u --search-path bench/corpus/chromium | 2495.7 ± 64.0 | 2440.6 | 2577.8 | 8.20 ± 0.23 |
fd-master -u --search-path bench/corpus/chromium | 776.0 ± 19.8 | 742.4 | 820.3 | 2.55 ± 0.07 |
fd-batch -u --search-path bench/corpus/chromium | 304.5 ± 3.1 | 297.2 | 307.3 | 1.00 |
With colors
linux v6.5
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
bfs bench/corpus/linux -color | 221.5 ± 2.8 | 215.2 | 226.3 | 4.43 ± 0.26 |
fd -u --search-path bench/corpus/linux --color=always | 172.4 ± 51.1 | 133.9 | 251.0 | 3.45 ± 1.04 |
fd-master -u --search-path bench/corpus/linux --color=always | 81.1 ± 18.3 | 69.0 | 120.2 | 1.62 ± 0.38 |
fd-batch -u --search-path bench/corpus/linux --color=always | 50.0 ± 2.9 | 47.4 | 56.9 | 1.00 |
chromium 119.0.6036.2
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
bfs bench/corpus/chromium -color | 5.644 ± 0.022 | 5.612 | 5.685 | 4.64 ± 0.07 |
fd -u --search-path bench/corpus/chromium --color=always | 2.502 ± 0.072 | 2.448 | 2.614 | 2.06 ± 0.07 |
fd-master -u --search-path bench/corpus/chromium --color=always | 4.738 ± 0.156 | 4.496 | 5.037 | 3.89 ± 0.14 |
fd-batch -u --search-path bench/corpus/chromium --color=always | 1.218 ± 0.018 | 1.199 | 1.250 | 1.00 |
Parallelism
rust 1.72.1
-j1
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
bfs -j1 bench/corpus/rust -false | 219.0 ± 1.7 | 216.7 | 221.5 | 1.00 |
fd -j1 -u '^$' bench/corpus/rust | 271.1 ± 2.7 | 269.2 | 278.8 | 1.24 ± 0.02 |
fd-master -j1 -u '^$' bench/corpus/rust | 275.5 ± 1.7 | 273.6 | 279.0 | 1.26 ± 0.01 |
fd-batch -j1 -u '^$' bench/corpus/rust | 275.2 ± 2.3 | 271.8 | 278.8 | 1.26 ± 0.01 |
-j2
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
bfs -j2 bench/corpus/rust -false | 199.7 ± 3.3 | 196.1 | 207.9 | 1.30 ± 0.03 |
fd -j2 -u '^$' bench/corpus/rust | 219.3 ± 6.8 | 210.0 | 229.6 | 1.42 ± 0.05 |
fd-master -j2 -u '^$' bench/corpus/rust | 158.4 ± 1.7 | 155.0 | 160.6 | 1.03 ± 0.02 |
fd-batch -j2 -u '^$' bench/corpus/rust | 154.2 ± 2.0 | 152.0 | 159.8 | 1.00 |
-j3
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
bfs -j3 bench/corpus/rust -false | 118.0 ± 7.0 | 110.3 | 129.7 | 1.08 ± 0.07 |
fd -j3 -u '^$' bench/corpus/rust | 214.3 ± 7.4 | 198.6 | 224.2 | 1.95 ± 0.07 |
fd-master -j3 -u '^$' bench/corpus/rust | 115.7 ± 3.5 | 111.0 | 120.3 | 1.06 ± 0.04 |
fd-batch -j3 -u '^$' bench/corpus/rust | 109.6 ± 1.7 | 105.9 | 112.5 | 1.00 |
-j4
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
bfs -j4 bench/corpus/rust -false | 85.7 ± 5.2 | 79.0 | 95.4 | 1.00 |
fd -j4 -u '^$' bench/corpus/rust | 221.8 ± 6.8 | 204.5 | 234.2 | 2.59 ± 0.18 |
fd-master -j4 -u '^$' bench/corpus/rust | 94.1 ± 4.3 | 88.4 | 100.0 | 1.10 ± 0.08 |
fd-batch -j4 -u '^$' bench/corpus/rust | 86.5 ± 1.3 | 82.7 | 88.8 | 1.01 ± 0.06 |
-j6
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
bfs -j6 bench/corpus/rust -false | 62.4 ± 1.8 | 59.8 | 65.5 | 1.00 |
fd -j6 -u '^$' bench/corpus/rust | 231.9 ± 13.6 | 201.8 | 246.2 | 3.71 ± 0.24 |
fd-master -j6 -u '^$' bench/corpus/rust | 76.2 ± 5.7 | 66.1 | 81.8 | 1.22 ± 0.10 |
fd-batch -j6 -u '^$' bench/corpus/rust | 62.5 ± 1.3 | 60.6 | 66.7 | 1.00 ± 0.04 |
-j8
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
bfs -j8 bench/corpus/rust -false | 53.0 ± 1.3 | 51.0 | 55.9 | 1.03 ± 0.04 |
fd -j8 -u '^$' bench/corpus/rust | 230.0 ± 4.4 | 223.8 | 237.5 | 4.49 ± 0.14 |
fd-master -j8 -u '^$' bench/corpus/rust | 59.4 ± 6.6 | 53.9 | 76.3 | 1.16 ± 0.13 |
fd-batch -j8 -u '^$' bench/corpus/rust | 51.2 ± 1.2 | 49.3 | 53.4 | 1.00 |
-j12
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
bfs -j12 bench/corpus/rust -false | 53.6 ± 2.8 | 48.2 | 62.7 | 1.32 ± 0.08 |
fd -j12 -u '^$' bench/corpus/rust | 245.5 ± 14.9 | 224.8 | 268.6 | 6.03 ± 0.40 |
fd-master -j12 -u '^$' bench/corpus/rust | 56.5 ± 11.8 | 48.2 | 75.8 | 1.39 ± 0.29 |
fd-batch -j12 -u '^$' bench/corpus/rust | 40.7 ± 1.1 | 39.0 | 43.8 | 1.00 |
-j16
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
bfs -j16 bench/corpus/rust -false | 68.8 ± 7.0 | 54.7 | 79.2 | 1.88 ± 0.21 |
fd -j16 -u '^$' bench/corpus/rust | 246.9 ± 10.5 | 238.5 | 276.0 | 6.76 ± 0.42 |
fd-master -j16 -u '^$' bench/corpus/rust | 54.6 ± 14.9 | 45.2 | 81.1 | 1.50 ± 0.41 |
fd-batch -j16 -u '^$' bench/corpus/rust | 36.5 ± 1.7 | 33.7 | 39.7 | 1.00 |
Process spawning
linux v6.5
One file per process
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
bfs bench/corpus/linux -maxdepth 2 -exec true -- {} \; | 1.391 ± 0.066 | 1.309 | 1.469 | 9.65 ± 0.75 |
find bench/corpus/linux -maxdepth 2 -exec true -- {} \; | 1.351 ± 0.028 | 1.312 | 1.396 | 9.37 ± 0.61 |
fd -u --search-path bench/corpus/linux --max-depth=2 -x true -- | 0.274 ± 0.059 | 0.182 | 0.349 | 1.90 ± 0.42 |
fd -j1 -u --search-path bench/corpus/linux --max-depth=2 -x true -- | 1.015 ± 0.028 | 0.970 | 1.050 | 7.04 ± 0.48 |
fd-master -u --search-path bench/corpus/linux --max-depth=2 -x true -- | 0.157 ± 0.021 | 0.136 | 0.191 | 1.09 ± 0.16 |
fd-master -j1 -u --search-path bench/corpus/linux --max-depth=2 -x true -- | 1.013 ± 0.015 | 0.998 | 1.047 | 7.03 ± 0.44 |
fd-batch -u --search-path bench/corpus/linux --max-depth=2 -x true -- | 0.144 ± 0.009 | 0.127 | 0.158 | 1.00 |
fd-batch -j1 -u --search-path bench/corpus/linux --max-depth=2 -x true -- | 1.013 ± 0.028 | 0.973 | 1.064 | 7.03 ± 0.47 |
Many files per process
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
bfs bench/corpus/linux -exec true -- {} + | 73.0 ± 1.8 | 69.8 | 76.3 | 1.00 |
find bench/corpus/linux -exec true -- {} + | 256.1 ± 0.7 | 255.2 | 257.3 | 3.51 ± 0.08 |
fd -u --search-path bench/corpus/linux -X true -- | 311.0 ± 33.1 | 217.1 | 328.5 | 4.26 ± 0.47 |
fd-master -u --search-path bench/corpus/linux -X true -- | 198.9 ± 20.4 | 170.4 | 215.6 | 2.72 ± 0.29 |
fd-batch -u --search-path bench/corpus/linux -X true -- | 146.5 ± 1.8 | 144.3 | 152.0 | 2.01 ± 0.05 |
Spawn in parent directory
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
bfs bench/corpus/linux -maxdepth 3 -execdir true -- {} + | 972.9 ± 15.2 | 943.3 | 995.2 | 1.00 |
find bench/corpus/linux -maxdepth 3 -execdir true -- {} + | 1030.5 ± 25.0 | 991.8 | 1062.4 | 1.06 ± 0.03 |
Details
Versions
$ bfs --version | head -n1
bfs 3.0.4
$ find --version | head -n1
find (GNU findutils) 4.9.0
$ fd --version
fd 8.7.1
$ fd-master --version
fd 8.7.1
$ fd-batch --version
fd 8.7.1
tavianator@tachyon $ hyperfine -w2 fd{,-{master,batch}}" -u '^$' /tmp/empty"
Benchmark 1: fd -u '^$' /tmp/empty
  Time (mean ± σ): 143.1 ms ± 5.3 ms [User: 9.0 ms, System: 132.6 ms]
  Range (min … max): 134.2 ms … 152.7 ms 21 runs

Benchmark 2: fd-master -u '^$' /tmp/empty
  Time (mean ± σ): 63.6 ms ± 8.6 ms [User: 6.0 ms, System: 58.2 ms]
  Range (min … max): 51.7 ms … 80.0 ms 43 runs

Benchmark 3: fd-batch -u '^$' /tmp/empty
  Time (mean ± σ): 6.8 ms ± 2.0 ms [User: 2.6 ms, System: 7.6 ms]
  Range (min … max): 4.3 ms … 12.4 ms 283 runs

  Warning: Command took less than 5 ms to complete. Note that the results might be inaccurate because hyperfine can not calibrate the shell startup time much more precise than this limit. You can try to use the `-N`/`--shell=none` option to disable the shell completely.

Summary
  fd-batch -u '^$' /tmp/empty ran
    9.33 ± 3.08 times faster than fd-master -u '^$' /tmp/empty
    20.99 ± 6.36 times faster than fd -u '^$' /tmp/empty
let items = batch.as_mut().unwrap();
items.push(item);
if items.len() == 1 {
I wonder if it would be better to send over batches after reaching a certain size. That would mean we could know how large to set the initial capacity of the Vec, and possibly avoid contention on the mutex. However, it means receiver threads could end up waiting longer to get results, especially if it takes a while to find more results in the sender threads, so what you have might be better.
I just measured the average batch size: 209. I don't think we have to try too hard to send larger batches :)
The risk of a minimum batch size is it could stall a very long time if most results are being filtered out. We could do something like always send the batch after N entries are encountered, regardless of whether they're added to the batch. But I doubt it's worth it.
There isn't very much mutex contention with this design anyway. It's only between the receiver and at most one sender, and the receiver critical section is extremely short (just `lock().take().unwrap().into_iter()`). I actually just checked with `perf trace` and the receiver only blocked 136 times over the whole Chromium benchmark (2.1M files). Each sender blocked between 3-12 times.
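For readers following along, here is a minimal sketch of the shape discussed in this thread, not the code from the PR. It assumes a single shared `Option<Vec<T>>` per sender and the standard library's unbounded `mpsc` channel; the names `SharedBatch`, `BatchSender`, and `recv_batch` are made up for illustration. The sender only touches the channel when a batch goes from empty to one item, and the receiver's critical section is a single `take()`.

```rust
use std::sync::mpsc::{Receiver, SendError, Sender};
use std::sync::{Arc, Mutex};

/// A batch shared between one sender and the receiver (hypothetical name).
/// `None` means the receiver has already taken the items.
pub type SharedBatch<T> = Arc<Mutex<Option<Vec<T>>>>;

/// Hypothetical sender wrapper: appends to the shared batch and notifies the
/// receiver only when the batch goes from empty to non-empty.
pub struct BatchSender<T> {
    batch: SharedBatch<T>,
    tx: Sender<SharedBatch<T>>,
}

impl<T> BatchSender<T> {
    pub fn new(tx: Sender<SharedBatch<T>>) -> Self {
        Self { batch: Arc::new(Mutex::new(None)), tx }
    }

    pub fn send(&mut self, item: T) -> Result<(), SendError<SharedBatch<T>>> {
        let mut guard = self.batch.lock().unwrap();
        // Start a fresh Vec if the receiver took the previous one.
        let items = guard.get_or_insert_with(Vec::new);
        items.push(item);
        let is_first = items.len() == 1;
        drop(guard); // release the lock before touching the channel
        if is_first {
            // One channel message per batch, however large the batch grows.
            self.tx.send(Arc::clone(&self.batch))?;
        }
        Ok(())
    }
}

/// Receiver side: take the whole batch in one short critical section.
/// Returns `None` if no batch is currently queued.
pub fn recv_batch<T>(rx: &Receiver<SharedBatch<T>>) -> Option<Vec<T>> {
    let shared = rx.try_recv().ok()?;
    let items = shared.lock().unwrap().take().unwrap();
    Some(items)
}
```

In this sketch, items pushed while the receiver is busy pile up in the same `Vec`, which is how batches can grow large without any minimum batch size being enforced.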
Thanks @tmccombs! I'll let @sharkdp review it too since he has his own benchmarks and he found a fatal flaw in the last attempt :)
So I am comparing master @ d62bbbb with this branch @ 815b3b1. After struggling a bit with thermal throttling affecting the benchmark results, I now have clean results — and they look great!
- I see the expected massive improvement in startup speed (last benchmark below), which is great
- I see a large (20%) performance gain on searches with lots of results, which is great
- I see a small (< 5%) performance gain on searches with few results, which is also great
- I see a medium (6-8%) performance loss on command execution benchmarks, which might be acceptable — but maybe something we could still look into?
`fd` regression benchmark
No pattern
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
./fd-master --hidden --no-ignore '' '/some/folder' | 798.4 ± 33.6 | 746.0 | 842.6 | 1.18 ± 0.05 |
./fd-1422 --hidden --no-ignore '' '/some/folder' | 677.7 ± 8.3 | 667.9 | 698.9 | 1.00 |
Simple pattern
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
./fd-master '.*[0-9]\.jpg$' '/some/folder' | 1.492 ± 0.006 | 1.483 | 1.504 | 1.01 ± 0.01 |
./fd-1422 '.*[0-9]\.jpg$' '/some/folder' | 1.479 ± 0.008 | 1.466 | 1.494 | 1.00 |
Simple pattern (-HI)
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
./fd-master -HI '.*[0-9]\.jpg$' '/some/folder' | 590.3 ± 2.8 | 585.7 | 594.3 | 1.05 ± 0.01 |
./fd-1422 -HI '.*[0-9]\.jpg$' '/some/folder' | 562.8 ± 2.4 | 557.5 | 566.0 | 1.00 |
File extension
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
./fd-master -HI --extension jpg '' '/some/folder' | 644.9 ± 2.7 | 640.1 | 647.6 | 1.04 ± 0.01 |
./fd-1422 -HI --extension jpg '' '/some/folder' | 620.1 ± 2.4 | 616.3 | 622.9 | 1.00 |
File type
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
./fd-master -HI --type l '' '/some/folder' | 605.1 ± 8.6 | 599.0 | 628.6 | 1.05 ± 0.02 |
./fd-1422 -HI --type l '' '/some/folder' | 575.2 ± 4.0 | 570.2 | 580.7 | 1.00 |
Command execution
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
./fd-master 'ab' '/some/folder' --exec echo | 4.711 ± 0.031 | 4.688 | 4.792 | 1.00 |
./fd-1422 'ab' '/some/folder' --exec echo | 5.109 ± 0.036 | 5.057 | 5.176 | 1.08 ± 0.01 |
Command execution (large output)
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
./fd-master -tf 'ab' '/some/folder' --exec cat | 4.719 ± 0.028 | 4.682 | 4.779 | 1.00 |
./fd-1422 -tf 'ab' '/some/folder' --exec cat | 5.015 ± 0.053 | 4.915 | 5.087 | 1.06 ± 0.01 |
Empty folder benchmark
Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
---|---|---|---|---|
./fd-master -u . /tmp/empty | 28.2 ± 0.6 | 27.3 | 31.2 | 7.65 ± 0.44 |
./fd-1422 -u . /tmp/empty | 3.7 ± 0.2 | 3.5 | 5.3 | 1.00 |
Yeah I see the `--exec` regression in my benchmarks too. I'm guessing it's because we have N receiver threads in that case, not just 1, so there's more contention. I'm testing a fix.
Actually the problem is not contention, it's that results are not evenly distributed to the receivers because the batch sizes vary wildly. So one `exec::job()` can get way more results to process than the others. E.g. I just saw this distribution:
-x: 0
-x: 0
-x: 0
-x: 3
-x: 2
-x: 33
-x: 45
-x: 55
-x: 63
-x: 66
-x: 47
-x: 73
-x: 116
-x: 133
-x: 132
-x: 115
-x: 125
-x: 151
-x: 186
-x: 421
Maybe lowering the max batch size for `-x` will help.
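To put a number on that skew, here is a small sketch that computes how far the busiest receiver is from the mean. It assumes each `-x:` line above is the result count for one receiver; the function name `imbalance` and the constant `OBSERVED` are made up for illustration.

```rust
/// Per-receiver result counts copied from the distribution above
/// (assumption: one line per `exec::job()` receiver).
pub const OBSERVED: [usize; 20] = [
    0, 0, 0, 3, 2, 33, 45, 55, 63, 66, 47, 73, 116, 133, 132, 115, 125, 151, 186, 421,
];

/// Returns (total, max, max / mean) for a set of per-receiver counts,
/// or `None` if there are no receivers or no results at all.
pub fn imbalance(counts: &[usize]) -> Option<(usize, usize, f64)> {
    let max = *counts.iter().max()?;
    let total: usize = counts.iter().sum();
    if total == 0 {
        return None;
    }
    let mean = total as f64 / counts.len() as f64;
    Some((total, max, max as f64 / mean))
}
```

For the counts above this gives a total of 1766 results with a maximum of 421, so the busiest receiver handled roughly 4.8 times the mean of about 88. A perfectly even split would give a ratio of 1.0.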
> Maybe lowering the max batch size for `-x` will help.
Yep, that works! `--exec` perf is now better than `master`, and nothing else seems to have regressed. New benchmark results are in the top comment.
I was actually thinking that for the `--exec` case, it might be more performant to execute on the sender thread instead of having N receiver threads. I'm not sure if that would be worth the complexity though.
Unfortunately, it looks like things got worse for me with 1469bf3
Command | Mean [s] | Min [s] | Max [s] | Relative |
---|---|---|---|---|
./fd-master 'ab' '/some/folder' --exec echo | 4.742 ± 0.099 | 4.691 | 5.022 | 1.00 |
./fd-1422 'ab' '/some/folder' --exec echo | 5.791 ± 0.031 | 5.759 | 5.846 | 1.22 ± 0.03 |
I was also wondering what happens in case of (few search results but) longer running programs? Would batching lead to situations where one receiver would run significantly longer than the rest?
> Unfortunately, it looks like things got worse for me with 1469bf3
Interesting! Doesn't reproduce for me, even with a similar test (`echo` instead of `true`, no `-u`, similar execution time):
tavianator@tachyon $ hyperfine -w2 fd-{batch,master}" --search-path ~/code/linux -d4 -x echo"
Benchmark 1: fd-batch --search-path ~/code/linux -d4 -x echo
  Time (mean ± σ): 3.622 s ± 0.012 s [User: 36.389 s, System: 32.154 s]
  Range (min … max): 3.602 s … 3.640 s 10 runs

Benchmark 2: fd-master --search-path ~/code/linux -d4 -x echo
  Time (mean ± σ): 3.693 s ± 0.011 s [User: 36.529 s, System: 32.544 s]
  Range (min … max): 3.674 s … 3.707 s 10 runs

Summary
  fd-batch --search-path ~/code/linux -d4 -x echo ran
    1.02 ± 0.00 times faster than fd-master --search-path ~/code/linux -d4 -x echo
> I was also wondering what happens in case of (few search results but) longer running programs? Would batching lead to situations where one receiver would run significantly longer than the rest?
In the degenerate case, any batch size N > 1 can lead to an Nx slowdown: if there are exactly N matches, all ending up in the same batch (unlikely but possible), they will be executed sequentially rather than in parallel.
Would you mind trying some variations?
- Set the batch size to 1 in `--exec` mode (line 453)
- Set the channel capacity to `2 * config.threads` (line 641)
- Both of the above
Neither of those makes a big difference for me, but they may help on your machine.
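As a rough illustration of those two knobs (not the PR's code), the sketch below picks a batch limit of 1 in exec mode and sizes a bounded channel at twice the thread count. `Tuning`, `tuning`, `bounded_channel`, and `DEFAULT_MAX_BATCH` are hypothetical names, and the non-exec limit of 256 is a placeholder value, not a number from this thread. The bounded channel is the standard library's `sync_channel`.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

/// Placeholder batch limit outside exec mode; NOT a value taken from the PR.
pub const DEFAULT_MAX_BATCH: usize = 256;

/// Hypothetical bundle of the two settings being varied.
pub struct Tuning {
    pub max_batch: usize,
    pub channel_capacity: usize,
}

/// Batch size 1 in exec mode, channel capacity 2 * threads in every mode.
pub fn tuning(exec_mode: bool, threads: usize) -> Tuning {
    Tuning {
        max_batch: if exec_mode { 1 } else { DEFAULT_MAX_BATCH },
        channel_capacity: 2 * threads,
    }
}

/// A bounded channel: senders block (or `try_send` fails) once
/// `channel_capacity` messages are queued and not yet received.
pub fn bounded_channel<T>(t: &Tuning) -> (SyncSender<T>, Receiver<T>) {
    sync_channel(t.channel_capacity)
}
```

With a batch limit of 1, every result becomes its own channel message, so the receivers can pick results up one at a time and no single receiver is handed a long sequential run.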
> I was actually thinking that for the `--exec` case, it might be more performant to execute on the sender thread instead of having N receiver threads. I'm not sure if that would be worth the complexity though.
That's worth trying!
> if there are exactly N matches, all ending up in the same batch (unlikely but possible), they will be executed sequentially rather than in parallel
It's not that unlikely. I often use `fd -x` to simply perform a task in parallel across a list of files in the current directory (or "closeby", i.e. the search is extremely fast compared to the executed tasks). For example:
> mkdir /tmp/test
> cd /tmp/test
> touch $(seq 12)
and then:
> fd-master -x bash -c "sleep 1 && echo {}"
[takes 1 second]
> fd-1422 -x bash -c "sleep 1 && echo {}"
[takes 11 seconds]
> I was actually thinking that for the `--exec` case, it might be more performant to execute on the sender thread instead of having N receiver threads.
I think that would have the same problem. We really want to decouple the search from the command execution in order to balance the load in both of these cases: (1) long search and fast task execution (2) fast search and long task execution.
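The cost of batching in the "fast search, long task" case can be estimated with a toy model. This is a sketch under stated assumptions: every task takes the same time, each batch is run sequentially by one worker, and each batch goes to whichever worker frees up first. The function name `makespan_secs` is made up for illustration.

```rust
/// Estimated wall-clock time in seconds when each batch is processed
/// sequentially by a single worker and batches are handed, in order,
/// to the worker that becomes free first.
pub fn makespan_secs(batch_sizes: &[u64], workers: usize, task_secs: u64) -> u64 {
    assert!(workers > 0, "need at least one worker");
    let mut busy_until = vec![0u64; workers];
    for &size in batch_sizes {
        // Pick the worker with the earliest finish time so far.
        let next = busy_until.iter_mut().min_by_key(|t| **t).unwrap();
        *next += size * task_secs;
    }
    busy_until.into_iter().max().unwrap_or(0)
}
```

For twelve one-second tasks and twelve workers, twelve batches of one item finish in 1 second, while a single batch of twelve takes 12 seconds. A split of `[1, 11]` would take 11 seconds, which is consistent with the timing reported above, although the actual batch split in that run was not measured.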
@sharkdp Actually it seems like both the changes I suggested in #1422 (comment) are beneficial in general, so I included them in this PR. Can you give it another try?
let mut results: Vec = Vec::new();
loop {
let mut ret = ExitCode::Success;
for result in results {
// Obtain the next result from the receiver, else if the channel
// has closed, exit from the loop
-// has closed, exit from the loop
+// has closed, exit from the loop.
@@ -36,13 +36,91 @@ enum ReceiverMode {
/// The Worker threads can result in a valid entry having PathBuf or an error.
-/// The Worker threads can result in a valid entry having PathBuf or an error.
+/// The Worker threads can result in a valid entry having `PathBuf` or an error.
pub enum WorkerResult {
// Errors should be rare, so it's probably better to allow large_enum_variant than
// to box the Entry variant
Entry(DirEntry),
Error(ignore::Error),
}
/// A batch of WorkerResults to send over a channel.
-/// A batch of WorkerResults to send over a channel.
+/// A batch of `WorkerResult`s to send over a channel.
Ok(())
}
}
/// Maximum size of the output buffer before flushing results to the console
-/// Maximum size of the output buffer before flushing results to the console
+/// Maximum size of the output buffer before flushing results to the console.
@@ -319,13 +403,13 @@ impl WorkerState {
/// Run the receiver work, either on this thread or a pool of background
/// threads (for --exec).
-/// threads (for --exec).
+/// threads (for `--exec`).
I re-ran the benchmarks, comparing master @ 5903dec with this branch @ b8a5f95.
- I do see comparable performance between this branch and master for larger searches
- I do see comparable performance for `--exec` scenarios
- I still see the massive speedup for small/empty folders.
But it might be worth noting that the speedups for longer-running searches that I saw in previous iterations are now apparently gone. This still seems worth merging though to increase the startup speed!
Should we look into this comment, now that this is merged? #1412 (comment)
Yeah probably. I'd be curious to see how it scales on different CPUs. Here's my results for
$ hyperfine -w1 -P j 4 48 -D 2 "fd -j{j} -u --search-path ~/code/bfs/bench/corpus/chromium"
In both cases the sweet spot is near the physical core count, so maybe we should try num_cpus::get_physical() as the default (with a higher cap like 64).
- Threadripper 3960X (24 cores, 48 threads)
- Intel Core i7-1280P (14 cores (6 P-cores, 8 E-cores), 20 threads)
I assume the high variance at `-j14` is due to sometimes getting scheduled on a hyperthread.
Mainly I'm curious about lower core count CPUs. E.g. on a 1-core, 2-thread CPU, I assume -j2 is better than -j1. On a 2-core, 4-thread CPU, -j4 is probably better than -j2. On a 4-core, 8-thread CPU? I don't really know. Where's the crossover point where we want to stop using hyperthreads?
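The default proposed above (physical core count, capped at something like 64) could look like the sketch below. It takes the core count as a parameter so it stays standard-library only; `default_threads` is a hypothetical name, and whether this is a good default is exactly the open question in this thread.

```rust
/// Proposed default thread count: the physical core count, clamped to
/// the range 1..=cap. The caller supplies `physical_cores`, for example
/// from `num_cpus::get_physical()` (external crate, not used here).
pub fn default_threads(physical_cores: usize, cap: usize) -> usize {
    // `cap.max(1)` keeps the clamp range valid even if cap is 0.
    physical_cores.clamp(1, cap.max(1))
}
```

On the 24-core Threadripper mentioned above, `default_threads(24, 64)` would yield 24. A machine reporting more than 64 physical cores would be held at 64.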
> Mainly I'm curious about lower core count CPUs. E.g. on a 1-core, 2-thread CPU, I assume -j2 is better than -j1. On a 2-core, 4-thread CPU, -j4 is probably better than -j2. On a 4-core, 8-thread CPU? I don't really know. Where's the crossover point where we want to stop using hyperthreads?
Well this is interesting. I tried to simulate this using `taskset` to pin `fd` to various CPUs. Not sure how closely this matches real CPUs with fewer cores, but here's what I found:
Even when pinned to a single core/thread, `-j8` is much better than `-j1`. I don't really understand why yet. I mean, `-j8` lets it do more I/O in parallel, but everything is cached in this benchmark so that shouldn't matter.
Intel Core i7-10850H, 6 cores, 12 threads
Does not seem to be so clear in this case. Setting it to `num_cpus::get_physical() == 6` in this case would not lead to the best performance.
And this is from a server with `nproc == 6`. Apparently it's an Intel Xeon E5-2680 (which should have 8/16 cores/threads, but I guess there is some virtualization going on?).
Reviewers
tmccombs approved these changes
sharkdp approved these changes