Avoid `alloca`s in codegen for simple mir::Aggregate statements by scottmcm · Pull Request #123886 · rust-lang/rust
Conversation
The core idea here is to remove the abstraction penalty of simple newtypes in codegen.
Even something simple like constructing a `#[repr(transparent)] struct Foo(u32);` forces an `alloca` to be generated in nightly right now.
Certainly LLVM can optimize that away, but it would be nice if it didn't have to.
Quick example:
#[repr(transparent)]
pub struct Transparent32(u32);

#[no_mangle]
pub fn make_transparent(x: u32) -> Transparent32 {
    let a = Transparent32(x);
    a
}
on nightly we produce https://rust.godbolt.org/z/zcvoM79ae
define noundef i32 @make_transparent(i32 noundef %x) unnamed_addr #0 {
  %a = alloca i32, align 4
  store i32 %x, ptr %a, align 4
  %0 = load i32, ptr %a, align 4, !noundef !3
  ret i32 %0
}
but after this PR we produce
define noundef i32 @make_transparent(i32 noundef %x) unnamed_addr #0 {
start:
  ret i32 %x
}
(even before the optimizer runs).
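The commit title below also mentions simple pairs. As an illustrative sketch (the `make_pair` function is my own example, not code from the PR), a two-scalar aggregate like this is the kind of case that can now be built directly from its field immediates rather than round-tripping through a stack slot:

```rust
// Hypothetical companion example: a simple two-field aggregate.
// With this change the pair can be constructed from the two scalar
// immediates, without an alloca, before any LLVM optimizations run.
#[no_mangle]
pub fn make_pair(a: u32, b: u64) -> (u32, u64) {
    (a, b)
}

fn main() {
    assert_eq!(make_pair(1, 2), (1, 2));
}
```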
r? @fee1-dead
rustbot has assigned @fee1-dead.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.
Use `r?` to explicitly pick a reviewer.
rustbot added the S-waiting-on-review label (Status: Awaiting review from the assignee but also interested parties) and a label marking this as relevant to the compiler team, which will review and decide on the PR/issue.
bors added a commit to rust-lang-ci/rust that referenced this pull request
Avoid `alloca`s in codegen for simple pairs and simple transparent structs
bors added a commit to rust-lang-ci/rust that referenced this pull request
Avoid `alloca`s in codegen for simple pairs and simple transparent structs
☀️ Try build successful - checks-actions
Build commit: 1d0d3d3 (1d0d3d3e219daac686e2fd9110e09772b6d18dfc)
Review comment on the quoted code:

    // According to `rvalue_creates_operand`, only ZST
    // aggregate rvalues are allowed to be operands.
    // repat rvalues are allowed to be operands.

Typo: "repat" should be "repeat".
Comment on lines 731 to 729:

    let ty = self.monomorphize(ty);
    let layout = self.cx.layout_of(self.monomorphize(ty));

Is the second call to `monomorphize` required?
(I've seen regex and wg-grammar being noisy in other PRs)
@matthewjasper just wanted to make sure you have a notification for this review, since the `r?` might not have sent you one.
📌 Commit c38f75c has been approved by matthewjasper
It is now in the queue for this repository.
bors added the S-waiting-on-bors label (Status: Waiting on bors to run and complete tests; bors will change the label on completion) and removed S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties).
This was referenced on May 10, 2024.
Finished benchmarking commit (6e1d947): comparison URL.
Overall result: ❌✅ regressions and improvements - ACTION NEEDED
Next Steps: If you can justify the regressions found in this perf run, please indicate this with `@rustbot label: +perf-regression-triaged` along with sufficient written justification. If you cannot justify the regressions, please open an issue or create a new PR that fixes the regressions, add a comment linking to the newly created issue or PR, and then add the perf-regression-triaged label to this PR.
@rustbot label: +perf-regression
cc @rust-lang/wg-compiler-performance
Instruction count
This is a highly reliable metric that was used to determine the overall result at the top of this comment.
| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 0.2% | [0.2%, 0.3%] | 6 |
| Regressions ❌ (secondary) | 0.5% | [0.2%, 1.3%] | 4 |
| Improvements ✅ (primary) | -0.6% | [-1.0%, -0.3%] | 6 |
| Improvements ✅ (secondary) | -0.7% | [-1.1%, -0.4%] | 2 |
| All ❌✅ (primary) | -0.2% | [-1.0%, 0.3%] | 12 |
Max RSS (memory usage)
Results
This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | 3.4% | [3.4%, 3.4%] | 1 |
| Regressions ❌ (secondary) | 4.8% | [4.8%, 4.8%] | 1 |
| Improvements ✅ (primary) | -1.8% | [-3.0%, -0.6%] | 4 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | -0.8% | [-3.0%, 3.4%] | 5 |
Cycles
This benchmark run did not return any relevant results for this metric.
Binary size
Results
This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
| | mean | range | count |
|---|---|---|---|
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -0.7% | [-1.6%, -0.0%] | 73 |
| Improvements ✅ (secondary) | -0.5% | [-1.6%, -0.0%] | 11 |
| All ❌✅ (primary) | -0.7% | [-1.6%, -0.0%] | 73 |
Bootstrap: 673.906s -> 673.586s (-0.05%)
Artifact size: 315.77 MiB -> 315.84 MiB (0.02%)
scottmcm deleted the more-rvalue-operands branch
Nice binary size reductions. They are almost entirely in debug builds, indicating that "Certainly LLVM can optimize that away" is correct but that only happens in opt builds :)
This was referenced on May 11, 2024.
Yeah; I was hoping that LLVM not needing to optimize them away would be more of a perf win in opt, but I'll take instruction-neutral improvements to debug build perf too :)
Comment on the codegen test:

    std::cell::Cell::new(b)
    }
    // CHECK-LABLE: { i8, i16 } @make_cell_of_bool_and_short(i1 noundef zeroext %b, i16 noundef %s)

Typo here: "CHECK-LABLE" should be "CHECK-LABEL". The test is presumably not working properly.

Good eye; thank you. I've added a fix for that to #124999.
bors added a commit to rust-lang-ci/rust that referenced this pull request
RalfJung pushed a commit to RalfJung/miri that referenced this pull request
bors added a commit to rust-lang-ci/rust that referenced this pull request
Unify `Rvalue::Aggregate` paths in cg_ssa

In rust-lang#123840 and rust-lang#123886 I added two different codepaths for `Rvalue::Aggregate` in cg_ssa. This merges them into one, since raw pointers are also immediates that can be built from the immediates of their "fields".
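For a sense of what "a raw pointer built from the immediates of its fields" looks like at the source level, here is a small, self-contained sketch (the function name and the use of `slice_from_raw_parts` are my own illustration, not taken from the PR): the wide `*const [u8]` is assembled from a data pointer and a length, i.e. from two immediates.

```rust
use core::ptr;

// Illustrative only: a wide raw pointer whose two "fields" (data pointer
// and length) are each ordinary immediates. The unified aggregate path in
// cg_ssa can build such a value without spilling to a stack slot.
#[no_mangle]
pub fn make_wide_ptr(data: *const u8, len: usize) -> *const [u8] {
    ptr::slice_from_raw_parts(data, len)
}

fn main() {
    let bytes = [1u8, 2, 3];
    let p = make_wide_ptr(bytes.as_ptr(), bytes.len());
    // Reading back through the pointer is fine here because `bytes` is live.
    assert_eq!(unsafe { &*p }, bytes.as_slice());
}
```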
Nice binary size reductions. The regressions are limited to a single benchmark (`bitmaps`), while the wins are across three different (primary) benchmarks.
@rustbot label: +perf-regression-triaged
bors added a commit to rust-lang-ci/rust that referenced this pull request
Allow `enum` and `union` literals to also create SSA values

Today, `Some(x)` always goes through an `alloca`, even in trivial cases where the niching means the constructor doesn't even change the value.
For example, <https://rust.godbolt.org/z/6KG6PqoYz>
pub fn demo(r: &i32) -> Option<&i32> {
Some(r)
}
currently emits the IR
define align 4 ptr @demo(ptr align 4 %r) unnamed_addr {
start:
%_0 = alloca [8 x i8], align 8
store ptr %r, ptr %_0, align 8
%0 = load ptr, ptr %_0, align 8
ret ptr %0
}
but with this PR it becomes just
define align 4 ptr @demo(ptr align 4 %r) unnamed_addr {
start:
ret ptr %r
}
(Of course the optimizer can clean that up, but it'd be nice if it didn't have to -- especially in debug where it doesn't run. This is like rust-lang#123886, but that only handled non-simd `struct`s -- this PR generalizes it to all non-simd ADTs.)
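A quick, self-contained way to see the niche the description relies on (the `size_of` check below is my addition, not part of the PR): `Option<&i32>` uses the null-pointer niche of `&i32`, so the `Some` constructor doesn't change the underlying bits, which is why the IR above can simply return `%r`.

```rust
use std::mem::size_of;

pub fn demo(r: &i32) -> Option<&i32> {
    Some(r)
}

fn main() {
    // `Option<&i32>` is the same size as `&i32`: the `None` case is the
    // null pointer, so `Some(r)` needs no extra discriminant storage.
    assert_eq!(size_of::<&i32>(), size_of::<Option<&i32>>());

    let x = 5;
    assert_eq!(demo(&x).copied(), Some(5));
}
```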
There's two commits you can review independently:

- The first simplifies how the aggregate handling works. Past-me wrote something overly complicated, needing arrayvecs and zipping, depending on a careful iteration order of the fields, and fragile enough that even for just structs it needed extra double-checks to make sure it even made the right variant. It's replaced with something far more direct that works just like `extract_field`: use the offset to put it in exactly the correct immediate in the `OperandValue`. This doesn't support anything new, just refactors -- including moving some things off `FunctionCx` that had no reason to be there. (I have no idea why my past self put them there.)
- The second extends that work to support more ADTs. That means handling variants other than `FIRST_VARIANT`, handling the active field for unions, refactoring the discriminant code so the Place and Operand parts can share the calculation, etc. (A sketch of the newly supported shapes follows this list.)
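A minimal, self-contained sketch of the two newly supported shapes from the second commit (the type and function names here are illustrative, not from the PR): a union literal, where only the active field is written, and an enum literal for a variant other than `FIRST_VARIANT`, which also needs its discriminant set.

```rust
// Illustrative types; any union/enum with scalar fields would do.
#[allow(dead_code)]
pub union Bits {
    int: u32,
    float: f32,
}

#[allow(dead_code)]
pub enum Either {
    Left(u32),
    Right(u32), // not the first variant, so the discriminant must be set
}

#[no_mangle]
pub fn make_union(x: f32) -> Bits {
    Bits { float: x } // "active field" case for unions
}

#[no_mangle]
pub fn make_right(x: u32) -> Either {
    Either::Right(x) // variant other than FIRST_VARIANT
}

fn main() {
    let b = make_union(1.0);
    // Reading a union field is unsafe; we read back the field we just set.
    assert_eq!(unsafe { b.float }, 1.0);
    assert!(matches!(make_right(7), Either::Right(7)));
}
```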