Avoid `alloca`s in codegen for simple `mir::Aggregate` statements by scottmcm · Pull Request #123886 · rust-lang/rust (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation54 Commits3 Checks6 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

The core idea here is to remove the abstraction penalty of simple newtypes in codegen.

Even something simple like constructing a

#[repr(transparent)] struct Foo(u32);

forces an alloca to be generated in nightly right now.

Certainly LLVM can optimize that away, but it would be nice if it didn't have to.

Quick example:

#[repr(transparent)] pub struct Transparent32(u32);

#[no_mangle] pub fn make_transparent(x: u32) -> Transparent32 { let a = Transparent32(x); a }

on nightly we produce https://rust.godbolt.org/z/zcvoM79ae

define noundef i32 @make_transparent(i32 noundef %x) unnamed_addr #0 { %a = alloca i32, align 4 store i32 %x, ptr %a, align 4 %0 = load i32, ptr %a, align 4, !noundef !3 ret i32 %0 }

but after this PR we produce

define noundef i32 @make_transparent(i32 noundef %x) unnamed_addr #0 { start: ret i32 %x }

(even before the optimizer runs).

r? @fee1-dead

rustbot has assigned @fee1-dead.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

rustbot added S-waiting-on-review

Status: Awaiting review from the assignee but also interested parties.

T-compiler

Relevant to the compiler team, which will review and decide on the PR/issue.

labels

Apr 13, 2024

This comment has been minimized.

bors added a commit to rust-lang-ci/rust that referenced this pull request

Apr 13, 2024

Avoid allocas in codegen for simple pairs and simple transparent structs

Even something simple like constructing a

#[repr(transparent)] struct Foo(u32);

forces an alloca to be generated in nightly right now.

Certainly LLVM can optimize that away, but it would be nice if it didn't have to.

Quick example:

#[repr(transparent)]
pub struct Transparent32(u32);

#[no_mangle]
pub fn make_transparent(x: u32) -> Transparent32 {
    let a = Transparent32(x);
    a
}

on nightly we produce <https://rust.godbolt.org/z/zcvoM79ae>

define noundef i32 `@make_transparent(i32` noundef %x) unnamed_addr #0 {
  %a = alloca i32, align 4
  store i32 %x, ptr %a, align 4
  %0 = load i32, ptr %a, align 4, !noundef !3
  ret i32 %0
}

but after this PR we produce

define noundef i32 `@make_transparent(i32` noundef %x) unnamed_addr #0 {
start:
  ret i32 %x
}

(even before the optimizer runs).

This comment has been minimized.

bors added a commit to rust-lang-ci/rust that referenced this pull request

Apr 13, 2024

Avoid allocas in codegen for simple pairs and simple transparent structs

Even something simple like constructing a

#[repr(transparent)] struct Foo(u32);

forces an alloca to be generated in nightly right now.

Certainly LLVM can optimize that away, but it would be nice if it didn't have to.

Quick example:

#[repr(transparent)]
pub struct Transparent32(u32);

#[no_mangle]
pub fn make_transparent(x: u32) -> Transparent32 {
    let a = Transparent32(x);
    a
}

on nightly we produce <https://rust.godbolt.org/z/zcvoM79ae>

define noundef i32 `@make_transparent(i32` noundef %x) unnamed_addr #0 {
  %a = alloca i32, align 4
  store i32 %x, ptr %a, align 4
  %0 = load i32, ptr %a, align 4, !noundef !3
  ret i32 %0
}

but after this PR we produce

define noundef i32 `@make_transparent(i32` noundef %x) unnamed_addr #0 {
start:
  ret i32 %x
}

(even before the optimizer runs).

☀️ Try build successful - checks-actions
Build commit: 1d0d3d3 (1d0d3d3e219daac686e2fd9110e09772b6d18dfc)

1 similar comment

This comment was marked as duplicate.

This comment has been minimized.

// According to `rvalue_creates_operand`, only ZST
// aggregate rvalues are allowed to be operands.
// repat rvalues are allowed to be operands.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

repeat

Comment on lines 731 to 729

let ty = self.monomorphize(ty);
let layout = self.cx.layout_of(self.monomorphize(ty));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is second call to monomorphize required?

This comment was marked as outdated.

(I've seen regex and wg-grammar being noisy in other PRs)

This comment has been minimized.

@matthewjasper just wanted to make sure you have a notification for this review since the r? might not have sent you one.

📌 Commit c38f75c has been approved by matthewjasper

It is now in the queue for this repository.

bors added S-waiting-on-bors

Status: Waiting on bors to run and complete tests. Bors will change the label on completion.

and removed S-waiting-on-review

Status: Awaiting review from the assignee but also interested parties.

labels

May 10, 2024

This was referenced

May 10, 2024

Finished benchmarking commit (6e1d947): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Next Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please open an issue or create a new PR that fixes the regressions, add a comment linking to the newly created issue or PR, and then add the perf-regression-triaged label to this PR.

@rustbot label: +perf-regression
cc @rust-lang/wg-compiler-performance

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

mean	range	count
Regressions ❌ (primary)	0.2%	[0.2%, 0.3%]	6
Regressions ❌ (secondary)	0.5%	[0.2%, 1.3%]	4
Improvements ✅ (primary)	-0.6%	[-1.0%, -0.3%]	6
Improvements ✅ (secondary)	-0.7%	[-1.1%, -0.4%]	2
All ❌✅ (primary)	-0.2%	[-1.0%, 0.3%]	12

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean	range	count
Regressions ❌ (primary)	3.4%	[3.4%, 3.4%]	1
Regressions ❌ (secondary)	4.8%	[4.8%, 4.8%]	1
Improvements ✅ (primary)	-1.8%	[-3.0%, -0.6%]	4
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	-0.8%	[-3.0%, 3.4%]	5

Cycles

This benchmark run did not return any relevant results for this metric.

Binary size

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-0.7%	[-1.6%, -0.0%]	73
Improvements ✅ (secondary)	-0.5%	[-1.6%, -0.0%]	11
All ❌✅ (primary)	-0.7%	[-1.6%, -0.0%]	73

Bootstrap: 673.906s -> 673.586s (-0.05%)
Artifact size: 315.77 MiB -> 315.84 MiB (0.02%)

scottmcm deleted the more-rvalue-operands branch

May 11, 2024 01:12

Nice binary size reductions. They are almost entirely in debug builds, indicating that "Certainly LLVM can optimize that away" is correct but that only happens in opt builds :)

This was referenced

May 11, 2024

Yeah; I was hoping that LLVM not needing to optimize them away would be more of a perf win in opt, but I'll take instruction-neutral improvements to debug build perf too :)

std::cell::Cell::new(b)
}

// CHECK-LABLE: { i8, i16 } @make_cell_of_bool_and_short(i1 noundef zeroext %b, i16 noundef %s)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo here: "CHECK LABLE" should be "LABEL". The test is presumably not working properly.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good eye; thank you. I've added a fix for that to #124999

bors added a commit to rust-lang-ci/rust that referenced this pull request

May 11, 2024

RalfJung pushed a commit to RalfJung/miri that referenced this pull request

May 12, 2024

bors added a commit to rust-lang-ci/rust that referenced this pull request

May 13, 2024

Unify Rvalue::Aggregate paths in cg_ssa

In rust-lang#123840 and rust-lang#123886 I added two different codepaths for Rvalue::Aggregate in cg_ssa.

This merges them into one, since raw pointers are also immediates that can be built from the immediates of their "fields".

Nice binary size reductions. The regressions are limited to a single benchmark (bitmaps), while the wins are across three different (primary) benchmarks.

@rustbot label: +perf-regression-triaged

bors added a commit to rust-lang-ci/rust that referenced this pull request

Mar 20, 2025

Allow enum and union literals to also create SSA values

Today, Some(x) always goes through an alloca, even in trivial cases where the niching means the constructor doesn't even change the value.

For example, <https://rust.godbolt.org/z/6KG6PqoYz>

pub fn demo(r: &i32) -> Option<&i32> {
    Some(r)
}

currently emits the IR

define align 4 ptr `@demo(ptr` align 4 %r) unnamed_addr {
start:
  %_0 = alloca [8 x i8], align 8
  store ptr %r, ptr %_0, align 8
  %0 = load ptr, ptr %_0, align 8
  ret ptr %0
}

but with this PR it becomes just

define align 4 ptr `@demo(ptr` align 4 %r) unnamed_addr {
start:
  ret ptr %r
}

(Of course the optimizer can clean that up, but it'd be nice if it didn't have to -- especially in debug where it doesn't run. This is like rust-lang#123886, but that only handled non-simd structs -- this PR generalizes it to all non-simd ADTs.)

There's two commits you can review independently:

The first is simplifying how the aggregate handling works. Past-me wrote something overly complicated, needing arrayvecs and zipping, depending on a careful iteration order of the fields, and fragile enough that even for just structs it needed extra double-checks to make sure it even made the right variant. It's replaced with something far more direct that works just like extract_field: use the offset to put it in exactly the correct immediate in the OperandValue. This doesn't support anything new, just refactors -- including moving some things off FunctionCx that had no reason to be there. (I have no idea why my past self put them there.)
The second extends that work to support more ADTs. That means handing variants other than FIRST_VARIANT, handling the active field for unions, refactoring the discriminant code so the Place and Operand parts can share the calculation, etc.

Avoid allocas in codegen for simple mir::Aggregate statements by scottmcm · Pull Request #123886 · rust-lang/rust (original) (raw)

Conversation

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment was marked as duplicate.

This comment has been minimized.

Choose a reason for hiding this comment

Choose a reason for hiding this comment

This comment was marked as outdated.

This comment has been minimized.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Instruction count

Max RSS (memory usage)

Cycles

Binary size

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Avoid `alloca`s in codegen for simple `mir::Aggregate` statements by scottmcm · Pull Request #123886 · rust-lang/rust (original) (raw)