Optimize core::str::Lines::count by thomcc · Pull Request #123606 · rust-lang/rust


thomcc

@thomcc

@rustbot rustbot added S-waiting-on-review

Status: Awaiting review from the assignee but also interested parties.

T-libs

Relevant to the library team, which will review and decide on the PR/issue.

labels

Apr 7, 2024

@thomcc thomcc added S-waiting-on-author

Status: This is awaiting some action (such as code changes or more information) from the author.

and removed S-waiting-on-review

Status: Awaiting review from the assignee but also interested parties.

labels

Apr 7, 2024


bors added a commit to rust-lang-ci/rust that referenced this pull request

Apr 17, 2024

@bors

Optimize core::str::Lines::count

s.lines().count()+1 is somewhat common as a way to find the line number given a byte position, so it'd be nice if it were faster.
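As a rough illustration of that use case, here is a minimal sketch of a line-number lookup built on lines().count(). The helper name line_number_at is hypothetical, and the handling of the "+1" is an assumption about what callers typically want: it is only added when the prefix is empty or ends in a newline, because lines() already counts a partial last line.

```rust
/// Returns the 1-based line number that contains byte offset `pos`.
/// Hypothetical helper for illustration; not part of this PR.
/// Panics if `pos` is out of bounds or not on a char boundary.
fn line_number_at(s: &str, pos: usize) -> usize {
    let prefix = &s[..pos];
    // `lines()` yields one item per line in the prefix, including a
    // trailing partial line that has no terminating newline yet.
    let lines = prefix.lines().count();
    if prefix.is_empty() || prefix.ends_with('\n') {
        // `pos` sits at the start of a fresh line that `lines()` did not count.
        lines + 1
    } else {
        // `pos` sits inside the last (partial) line, which was already counted.
        lines
    }
}
```

The result is the same as counting '\n' bytes before pos and adding one, so any speedup in Lines::count carries over directly to lookups like this.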

This just generalizes the SWAR-optimized char counting code so that it can also be used for SWAR-optimized line counting, so the PR is actually not very complex.
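To give a feel for the technique, the following is a simplified sketch of SWAR ("SIMD within a register") newline counting. It is not the libcore code from this PR: the function names, the fixed u64 word size, and the simple chunking are all assumptions made for illustration. It relies on the observation that the number of lines equals the number of '\n' bytes, plus one if the string is non-empty and does not end in '\n'.

```rust
const LO7: u64 = 0x7F7F_7F7F_7F7F_7F7F; // low 7 bits of every byte
const NL: u64 = 0x0A0A_0A0A_0A0A_0A0A; // b'\n' repeated in every byte

/// Counts the bytes equal to b'\n' in one 8-byte word without branching.
fn newlines_in_word(word: u64) -> u32 {
    // After the XOR, a byte is zero exactly where the input byte was b'\n'.
    let x = word ^ NL;
    // Per byte: the top bit ends up set iff the byte of `x` is non-zero.
    // The addition cannot carry between bytes (0x7F + 0x7F = 0xFE).
    let nonzero = ((x & LO7) + LO7) | x | LO7;
    // Inverting leaves 0x80 in each zero byte and 0x00 everywhere else.
    (!nonzero).count_ones()
}

/// Counts b'\n' bytes, eight at a time, with a scalar loop for the tail.
fn count_newlines_swar(bytes: &[u8]) -> usize {
    let mut chunks = bytes.chunks_exact(8);
    let mut total = 0usize;
    for chunk in &mut chunks {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(chunk);
        total += newlines_in_word(u64::from_ne_bytes(buf)) as usize;
    }
    total + chunks.remainder().iter().filter(|&&b| b == b'\n').count()
}

/// Line count with the same result as `s.lines().count()`.
fn line_count(s: &str) -> usize {
    let newlines = count_newlines_swar(s.as_bytes());
    // A final line without a trailing '\n' still counts as a line.
    newlines + usize::from(!s.is_empty() && !s.ends_with('\n'))
}
```

Because each word is processed with the same fixed sequence of operations, the running time of a loop like this depends on the input length rather than on how many newlines it contains.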

TODO

Benchmarks

case00_libcore is the new version, and case01_fold_increment is the previous implementation (the default impl of Iterator::count() is a fold that increments a counter).
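For reference, a fold-and-increment count looks roughly like the sketch below. The name count_by_fold is hypothetical, and the exact body of the case01_fold_increment benchmark is not shown in this PR text, so treat this as an assumption about its shape.

```rust
/// Counts the items of any iterator by folding with an incrementing counter,
/// which visits every item (here: every line) one at a time.
fn count_by_fold<I: Iterator>(iter: I) -> usize {
    iter.fold(0, |count, _item| count + 1)
}
```

Applied to s.lines(), this has to locate every line boundary individually, which is why its cost grows with the number of newlines.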

    str::line_count::all_newlines_32kib::case00_libcore           4.35µs/iter  +/- 11.00ns
    str::line_count::all_newlines_32kib::case01_fold_increment  779.99µs/iter   +/- 8.43µs
    str::line_count::all_newlines_4kib::case00_libcore          562.00ns/iter   +/- 5.00ns
    str::line_count::all_newlines_4kib::case01_fold_increment    97.81µs/iter   +/- 1.48µs
    str::line_count::all_newlines_64b::case00_libcore            21.00ns/iter   +/- 0.00ns
    str::line_count::all_newlines_64b::case01_fold_increment      1.49µs/iter  +/- 32.00ns

    str::line_count::en_huge::case00_libcore                     45.58µs/iter +/- 122.00ns
    str::line_count::en_huge::case01_fold_increment             167.62µs/iter +/- 609.00ns
    str::line_count::en_large::case00_libcore                   734.00ns/iter   +/- 6.00ns
    str::line_count::en_large::case01_fold_increment              2.62µs/iter   +/- 9.00ns
    str::line_count::en_medium::case00_libcore                  100.00ns/iter   +/- 0.00ns
    str::line_count::en_medium::case01_fold_increment           347.00ns/iter   +/- 0.00ns
    str::line_count::en_small::case00_libcore                    18.00ns/iter   +/- 1.00ns
    str::line_count::en_small::case01_fold_increment             60.00ns/iter   +/- 2.00ns
    str::line_count::en_tiny::case00_libcore                      6.00ns/iter   +/- 0.00ns
    str::line_count::en_tiny::case01_fold_increment              60.00ns/iter   +/- 0.00ns

    str::line_count::zh_huge::case00_libcore                     40.63µs/iter  +/- 85.00ns
    str::line_count::zh_huge::case01_fold_increment             205.10µs/iter   +/- 1.62µs
    str::line_count::zh_large::case00_libcore                   655.00ns/iter   +/- 1.00ns
    str::line_count::zh_large::case01_fold_increment              3.21µs/iter  +/- 21.00ns
    str::line_count::zh_medium::case00_libcore                   92.00ns/iter   +/- 0.00ns
    str::line_count::zh_medium::case01_fold_increment           420.00ns/iter   +/- 2.00ns
    str::line_count::zh_small::case00_libcore                    20.00ns/iter   +/- 1.00ns
    str::line_count::zh_small::case01_fold_increment             63.00ns/iter   +/- 1.00ns
    str::line_count::zh_tiny::case00_libcore                      6.00ns/iter   +/- 0.00ns
    str::line_count::zh_tiny::case01_fold_increment              21.00ns/iter   +/- 0.00ns

This is a speedup of around 2x-4x most of the time, but for some highly unrealistic scenarios (32KiB of newlines) it's almost 200x faster. That is because the time taken by the version in this PR does not depend on the number of newlines in the input, whereas the old version gets slower the more newlines are present. It's also much faster for small inputs, especially if they contain newlines (10x faster for en_tiny).
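The ratios can be recomputed from the table. The sketch below does that for a handful of rows, with all timings converted to nanoseconds; the helper names are hypothetical and only the listed rows are included.

```rust
/// Speedup factor of the new implementation over the old one.
fn speedup(old_ns: f64, new_ns: f64) -> f64 {
    old_ns / new_ns
}

/// Selected rows from the benchmark table, as (case name, speedup) pairs.
fn reported_speedups() -> Vec<(&'static str, f64)> {
    vec![
        // 779.99µs vs 4.35µs
        ("all_newlines_32kib", speedup(779_990.0, 4_350.0)),
        // 167.62µs vs 45.58µs
        ("en_huge", speedup(167_620.0, 45_580.0)),
        // 60.00ns vs 6.00ns
        ("en_tiny", speedup(60.0, 6.0)),
        // 205.10µs vs 40.63µs
        ("zh_huge", speedup(205_100.0, 40_630.0)),
        // 63.00ns vs 20.00ns
        ("zh_small", speedup(63.0, 20.0)),
    ]
}
```

For these rows the all_newlines_32kib case comes out at roughly 179x, en_tiny at exactly 10x, and the other three at between about 3x and 5x.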

Real-world cases will vary, so don't read too much into these numbers. I would expect a 2x-4x speedup in general, since that's what it gets on the most realistic examples.

Obviously a SIMD impl will beat this, but users who are really bottlenecked on this operation should probably just reach for crates.io (even if we provided a SIMD version, libcore can't use runtime CPU feature detection so they'd still be better off with something from crates.io).
