Optimize integer pow by removing the exit branch by mzabaluev · Pull Request #122884 · rust-lang/rust

mzabaluev

The branch at the end of the pow implementations is redundant with multiplication code already present in the loop. By rotating the exit check, this branch can be largely removed, improving code size and reducing instruction cache misses.

Testing on my machine (x86_64, 11th Gen Intel Core i5-1135G7 @ 2.40GHz), the num::int_pow benchmarks improve by some 40% for the unchecked operations and show some slight improvement for the checked operations as well.
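To make the described change concrete, here is a minimal sketch of the two loop shapes. It is modeled on the description above and is not the verbatim library source: the function names `pow_exit_multiply` and `pow_rotated` are hypothetical, and `wrapping_mul` on `u64` is an assumption chosen so the example never panics on overflow.

```rust
/// Shape with a trailing multiply: loop while `exp > 1`, then multiply once more.
fn pow_exit_multiply(mut base: u64, mut exp: u32) -> u64 {
    if exp == 0 {
        return 1;
    }
    let mut acc: u64 = 1;
    while exp > 1 {
        if exp & 1 == 1 {
            acc = acc.wrapping_mul(base);
        }
        exp /= 2;
        base = base.wrapping_mul(base);
    }
    // exp is exactly 1 here, so a final multiply is always needed.
    acc.wrapping_mul(base)
}

/// Rotated shape: the exit check sits inside the odd-bit arm, so the
/// multiply in the loop body also serves as the final multiply.
fn pow_rotated(mut base: u64, mut exp: u32) -> u64 {
    if exp == 0 {
        return 1;
    }
    let mut acc: u64 = 1;
    loop {
        if exp & 1 == 1 {
            acc = acc.wrapping_mul(base);
            // exp was non-zero on entry, so its highest set bit is
            // consumed exactly when exp == 1.
            if exp == 1 {
                return acc;
            }
        }
        exp /= 2;
        base = base.wrapping_mul(base);
    }
}

fn main() {
    // Both shapes should agree for every exponent.
    for exp in 0..40 {
        assert_eq!(pow_exit_multiply(3, exp), pow_rotated(3, exp));
    }
    println!("3^13 = {}", pow_rotated(3, 13));
}
```

In the rotated version the only multiply into `acc` lives inside the loop, which is the redundancy the description refers to.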

@rustbot

r? @Amanieu

rustbot has assigned @Amanieu.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added S-waiting-on-review

Status: Awaiting review from the assignee but also interested parties.

T-libs

Relevant to the library team, which will review and decide on the PR/issue.

labels

Mar 22, 2024

@Amanieu

@bors

📌 Commit 76d2530 has been approved by Amanieu

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors

Status: Waiting on bors to run and complete tests. Bors will change the label on completion.

and removed S-waiting-on-review

Status: Awaiting review from the assignee but also interested parties.

labels

Apr 11, 2024

bors added a commit to rust-lang-ci/rust that referenced this pull request

Apr 12, 2024

@bors

…Amanieu

Optimize integer pow by removing the exit branch

@bors

@bors

@bors bors added S-waiting-on-review

Status: Awaiting review from the assignee but also interested parties.

and removed S-waiting-on-bors

Status: Waiting on bors to run and complete tests. Bors will change the label on completion.

labels

Apr 12, 2024

@mzabaluev

The job x86_64-gnu-llvm-18 failed! Check out the build log: (web) (plain)
Click to see the possible cause of the failure (guessed by this bot)

failures:

---- [codegen] tests/codegen/issues/issue-34947-pow-i32.rs stdout ----

error: verification with 'FileCheck' failed
status: exit status: 1
command: "/usr/lib/llvm-18/bin/FileCheck" "--input-file" "/checkout/obj/build/x86_64-unknown-linux-gnu/test/codegen/issues/issue-34947-pow-i32/issue-34947-pow-i32.ll" "/checkout/tests/codegen/issues/issue-34947-pow-i32.rs" "--check-prefix=CHECK" "--check-prefix" "NONMSVC" "--allow-unused-prefixes" "--dump-input-context" "100"
--- stderr -------------------------------
--- stderr -------------------------------
/checkout/tests/codegen/issues/issue-34947-pow-i32.rs:9:17: error: CHECK-NEXT: is not on the line after the previous match
// CHECK-NEXT: mul

I'm not familiar with this check, so I don't understand what's failing here or what the fix should be.

@Amanieu

It seems that your PR has introduced a regression: LLVM is no longer able to optimize pow(5) down to just 3 multiply instructions.

@mzabaluev

It seems that your PR has introduced a regression: LLVM is no longer able to optimize pow(5) down to just 3 multiply instructions.

Does this mean the modified code performs worse in this specific case?

@Amanieu

Yes, it will perform much worse in that specific case since LLVM is unable to optimize the loop away. See https://godbolt.org/z/nMY79Gn8r for the current code that is being generated. It might be possible to re-arrange the code so that you still get the performance benefit of this PR while still letting LLVM optimize the loop, but I'm not sure.
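For reference, the three-multiply form that the failing check expects for `pow(5)` can be written by hand as follows. This is only an illustrative sketch: `pow5_three_muls` is a hypothetical name, and `wrapping_mul` is an assumption used to keep the example free of overflow panics.

```rust
/// x^5 computed as (x^2)^2 * x: three multiplications and no loop.
fn pow5_three_muls(x: i32) -> i32 {
    let x2 = x.wrapping_mul(x); // x^2
    let x4 = x2.wrapping_mul(x2); // x^4
    x4.wrapping_mul(x) // x^5
}

fn main() {
    assert_eq!(pow5_three_muls(3), 243);
    assert_eq!(pow5_three_muls(-2), -32);
}
```

When the exponent is a constant and the loop is fully unrolled, the optimizer can reduce the generic loop to this straight-line sequence; the regression is that the rotated loop no longer gets unrolled.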

@Amanieu

@RalfJung

@bors r-
(bors sync fixup)

@bors bors added S-waiting-on-author

Status: This is awaiting some action (such as code changes or more information) from the author.

and removed S-waiting-on-review

Status: Awaiting review from the assignee but also interested parties.

labels

Apr 17, 2024

@oskgo

@mzabaluev

The lack of optimization in case of a small const argument value is unfortunate.

I briefly tried to salvage it by giving the optimizer an easier time without re-introducing redundancy in the dynamic case, but didn't come up with any good ideas.

Maybe an unrolled fast path for argument values in the 0..=6 range? This would feel like an exercise in tricking the optimizer and placating the benchmarks.

@mzabaluev

The branch at the end of the pow implementations is redundant with multiplication code already present in the loop. By rotating the exit check, this branch can be largely removed, improving code size and reducing instruction cache misses.

@oskgo

If I understand correctly, you don't know how to fix the regression in a satisfactory manner, and you're not going to argue that the regression is tolerable?

If I'm right you should probably close this. You can always reopen if you get some new inspiration or can find guidance.

@Amanieu

It might be worth trying something with is_val_statically_known to have 2 different paths depending on whether the input argument is a constant.

@mzabaluev

The newly optimized loop has introduced a regression in the case when pow is called with a small constant exponent. LLVM is no longer able to unroll the loop and the generated code is larger and slower than what's expected in tests.

Match and handle small exponent values separately by branching out to an explicit multiplication sequence for that exponent. Most powers larger than 6 need more than three multiplications, so these cases are less likely to benefit from this optimization. Such constant exponents are also less likely to be used in practice. For uses with a non-constant exponent, this might also provide a performance benefit if the exponent is small and does not vary between successive calls, so the same match arm tends to be taken as a predicted branch.
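A rough sketch of the match-based fast path described in this commit message might look like the following. The name `pow_small_exp_match` is hypothetical, `wrapping_mul` is an assumption to avoid overflow panics, and the fallback arm simply calls the standard `wrapping_pow` as a stand-in for the loop-based implementation.

```rust
/// Explicit multiplication sequences for exponents 0..=6,
/// with a generic fallback for everything else.
fn pow_small_exp_match(base: i32, exp: u32) -> i32 {
    match exp {
        0 => 1,
        1 => base,
        2 => base.wrapping_mul(base),
        3 => {
            let sq = base.wrapping_mul(base);
            sq.wrapping_mul(base)
        }
        4 => {
            let sq = base.wrapping_mul(base);
            sq.wrapping_mul(sq)
        }
        5 => {
            let sq = base.wrapping_mul(base);
            sq.wrapping_mul(sq).wrapping_mul(base)
        }
        6 => {
            let sq = base.wrapping_mul(base);
            sq.wrapping_mul(sq).wrapping_mul(sq)
        }
        // Stand-in for the loop-based implementation.
        _ => base.wrapping_pow(exp),
    }
}

fn main() {
    assert_eq!(pow_small_exp_match(2, 6), 64);
    assert_eq!(pow_small_exp_match(3, 5), 243);
    assert_eq!(pow_small_exp_match(2, 10), 1024);
}
```

Each arm up to exponent 6 uses at most three multiplications, which matches the cutoff chosen in the message.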

@mzabaluev

It might be worth trying something with is_val_statically_known to have 2 different paths depending on whether the input argument is a constant.

I will combine this with my suggestion for the statically known case, thanks for the tip!

@oskgo it looks like we've found a way to resolve the regression, don't close this yet.

@mzabaluev

I get this error when trying to use is_val_statically_known inside pow methods:

error: `is_val_statically_known` is not yet stable as a const fn

@mzabaluev

@Amanieu

@mzabaluev

It is what it says on the tin: pow is annotated as const-stable, so it cannot call the const-unstable is_val_statically_known.
Your playground examples don't (can't) use stability attributes.

@Amanieu

Right, in that case maybe it's best to go back to the version with the unrolled loop.

@mzabaluev

Oh, I get it: rustc_allow_const_fn_unstable is an item attribute that is enabled by the feature.

@mzabaluev

In the dynamic exponent case, it's preferred to not increase code size, so use solely the loop-based implementation there. This shows about 4% penalty in the variable exponent benchmarks on x86_64.

@oskgo

pinging @rust-lang/wg-const-eval due to new usage of rustc_allow_const_fn_unstable. It should be fine since this PR is purely an optimization and can always be reverted.

@RalfJung

is_val_statically_known is a very harmless intrinsic from a const-eval perspective, so this seems fine to me.

@oskgo oskgo added S-waiting-on-review

Status: Awaiting review from the assignee but also interested parties.

and removed S-waiting-on-author

Status: This is awaiting some action (such as code changes or more information) from the author.

labels

Jul 19, 2024

Amanieu

// This gives the optimizer a way to efficiently inline call sites
// for the most common use cases with constant exponents.
// Currently, LLVM is unable to unroll the loop below.
match exp {

Rather than this special casing could we instead just have the original loop (which LLVM knows how to unroll) for the is_val_statically_known case and your new loop for the non-constant case?

And do the same for all the other pow functions.

I have updated pow and wrapping_pow as suggested.
I'm not sure the extra complication is justified for the checked operations, but I guess the optimizer will have better opportunities with the original loop there as well. I will try to make a macro so that uniform code is used everywhere without repetition.
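The two-path structure suggested in this review thread could be sketched roughly as below. This is not the library code: the intrinsic `is_val_statically_known` is unstable and unavailable to ordinary crates, so this sketch replaces it with a plain `bool` parameter, `exp_statically_known`, which is purely a stand-in. All function names here are hypothetical.

```rust
/// Original loop shape, which LLVM can unroll for a constant exponent.
fn pow_unrollable(mut base: u32, mut exp: u32) -> u32 {
    if exp == 0 {
        return 1;
    }
    let mut acc: u32 = 1;
    while exp > 1 {
        if exp & 1 == 1 {
            acc = acc.wrapping_mul(base);
        }
        exp /= 2;
        base = base.wrapping_mul(base);
    }
    acc.wrapping_mul(base)
}

/// Rotated loop shape, intended for exponents only known at run time.
fn pow_compact(mut base: u32, mut exp: u32) -> u32 {
    if exp == 0 {
        return 1;
    }
    let mut acc: u32 = 1;
    loop {
        if exp & 1 == 1 {
            acc = acc.wrapping_mul(base);
            if exp == 1 {
                return acc;
            }
        }
        exp /= 2;
        base = base.wrapping_mul(base);
    }
}

/// Dispatch between the two shapes. In the real change the condition would
/// come from the compiler intrinsic; here it is an ordinary argument
/// (an assumption made so the sketch runs on stable Rust).
fn pow_dispatch(base: u32, exp: u32, exp_statically_known: bool) -> u32 {
    if exp_statically_known {
        pow_unrollable(base, exp)
    } else {
        pow_compact(base, exp)
    }
}

fn main() {
    // Both paths must compute the same value; only the generated code differs.
    assert_eq!(pow_dispatch(3, 5, true), 243);
    assert_eq!(pow_dispatch(3, 5, false), 243);
}
```

Because the condition is resolved at compile time in the intrinsic-based version, only one of the two loops would remain in the generated code at any given call site.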

@Amanieu Amanieu added S-waiting-on-author

Status: This is awaiting some action (such as code changes or more information) from the author.

and removed S-waiting-on-review

Status: Awaiting review from the assignee but also interested parties.

labels

Jul 23, 2024

@mzabaluev

Give LLVM the original, optimizable loop in the pow and wrapping_pow functions when the exponent is statically known.

@Dylan-DPC Dylan-DPC added S-waiting-on-review

Status: Awaiting review from the assignee but also interested parties.

and removed S-waiting-on-author

Status: This is awaiting some action (such as code changes or more information) from the author.

labels

Aug 13, 2024

@Amanieu

@bors

📌 Commit ac88b33 has been approved by Amanieu

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors

Status: Waiting on bors to run and complete tests. Bors will change the label on completion.

and removed S-waiting-on-review

Status: Awaiting review from the assignee but also interested parties.

labels

Aug 13, 2024

bors added a commit to rust-lang-ci/rust that referenced this pull request

Aug 13, 2024

@bors

…iaskrgr

Rollup of 7 pull requests

Successful merges:

r? @ghost @rustbot modify labels: rollup

rust-timer added a commit to rust-lang-ci/rust that referenced this pull request

Aug 14, 2024

@rust-timer

Rollup merge of rust-lang#122884 - mzabaluev:pow-remove-exit-branch, r=Amanieu

Optimize integer pow by removing the exit branch

Labels

S-waiting-on-bors

Status: Waiting on bors to run and complete tests. Bors will change the label on completion.

T-libs

Relevant to the library team, which will review and decide on the PR/issue.