x86: use SSE2 to pass float and SIMD types by RalfJung · Pull Request #135408 · rust-lang/rust (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation183 Commits1 Checks6 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

RalfJung

This builds on the new X86Sse2 ABI landed in #137037 to actually make it a separate ABI from the default x86 ABI, and use SSE2 registers. Specifically, we use it in two ways: to return f64 values in a register rather than by-ptr, and to pass vectors of size up to 128bit in a register (or, well, whatever LLVM does when passing <4 x float> by-val, I don't actually know if this ends up in a register).

Cc @workingjubilee
Fixes #133611

try-job: aarch64-apple
try-job: aarch64-gnu
try-job: aarch64-gnu-debug
try-job: test-various
try-job: x86_64-gnu-nopt
try-job: dist-i586-gnu-i586-i686-musl
try-job: x86_64-msvc-1

@rustbot

r? @SparrowLii

rustbot has assigned @SparrowLii.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot rustbot added O-apple

Operating system: Apple (macOS, iOS, tvOS, visionOS, watchOS)

S-waiting-on-review

Status: Awaiting review from the assignee but also interested parties.

T-compiler

Relevant to the compiler team, which will review and decide on the PR/issue.

labels

Jan 12, 2025

@rustbot

RalfJung

@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

RalfJung

//@[sse] needs-llvm-components: x86
// We make SSE available but don't use it for the ABI.
//@[nosse] compile-flags: --target i586-unknown-linux-gnu -Ctarget-feature=+sse2 -Ctarget-cpu=pentium4
//@[nosse] needs-llvm-components: x86

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tidy is being silly and doesn't let us set the needs-llvm-components: x86 uniformly for all revisions.

RalfJung

// FIXME: the MIR opt still works, but the ABI logic now introduces
// an alloca here.
// CHECK: alloca
// CHECK: store <4 x float> %x, ptr %_0, align 16

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea what this was trying to test, but it probably doesn't test that any more. The alloca is emitted by the ABI handling, and this test disables LLVM optimizations, so there's no way we can avoid the alloca.

It seems like this is intended to test mir-opts, but then why is it not a mir-opt test...?

Cc @scottmcm (who added the test in d757c4b)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a regression test for an ICE in cg_ssa: d757c4b

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why would that be a codegen test...?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably the intent was that the alloca didn't come from the transmute -- since transmute used to always just make an alloca and read-write it -- but certainly if the alloca is from the ABI handling it's fine to have it there.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scottmcm So should I remove the test then? Or is there some way to still test what actually matters here?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up removing the test; I can't figure out how to write a test that does not contain alloca. We always emit alloca for the implicit transmutes caused by the ABI, so there's no way to reasonably test for their absence in a test that disables LLVM opts.

@scottmcm is it worth opening an issue to track improving our codegen for the implicit ABI transmutes? It would maybe use some of the same optimizations as explicit transmutes... or maybe not, I have no idea.

RalfJung

// CHECK-NEXT: store <3 x float> [[VREG]], ptr [[RET_VREG]], [[RET_ALIGN]]
// CHECK-NEXT: ret void
// opt3-NEXT: ret <3 x float> [[VREG:%[a-z0-9_]+]]
// noopt: ret <3 x float> [[VREG:%[a-z0-9_]+]]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea if this test still makes any sense for the "noopt" revision... it seems like for some reason the call to load does not get inlined any more or so?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OTOH the "noopt" test was already very odd before... the load <3 x float> there referred to loading the return value of load() which was returned into the alloca.

So I made this care only about opt3 for the square_packed_full part of the test.

RalfJung

@@ -1,4 +1,5 @@
//@ revisions:opt3 noopt
//@ only-x86_64

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is checking our LLVM ABI lowering as much as it is checking anything packed-simd specific, so it will be very hard to make this work uniformly across targets that use a different ABI lowering.

Noratrieb

bjorn3

bjorn3

bjorn3

{
// This is a single scalar that fits into an SSE register.
// FIXME: We cannot return 128-bit-floats this way since scalars larger than
// 64bit must be returned indirectly to make cranelift happy. See the comment
// in `adjust_for_rust_abi`.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

f128 floats are only partially supported by Cranelift, but even so returning them in a vector register should work just fine I think. Returning f128 in a vector register doesn't have the same issue that returning i128 in integer registers has as f128 fits in a single vector register, while i128 doesn't fit in a single integer register.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that the way the "return large things indirectly" is implemented is not great, it leads to ICEs if other adjustments have already decided the ABI for one of these return types: make_indirect cannot be called if someone already called cast_to.

IMO this is a backend bug, backends should support all scalar types as return types.

@rust-log-analyzer

This comment has been minimized.

@rust-log-analyzer

This comment has been minimized.

RalfJung

@@ -6,6 +6,9 @@
//@ normalize-stdout: "libthe_backend.dylib" -> "libthe_backend.so"
//@ normalize-stdout: "the_backend.dll" -> "libthe_backend.so"
// Pick a target that requires no target features, so that no warning is shown
// about missing target features.
//@ compile-flags: --target arm-unknown-linux-gnueabi

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bjorn3 does the cranelift backend return anything for target_features_cfg? If not, there might be warnings now about missing target features, depending on the ABI info for the current target.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it currently hard codes sse and sse2 for all x86_64 targets that are not bare-metal:

fn target_features_cfg(
&self,
sess: &Session,
_allow_unstable: bool,
) -> Vec<rustc_span::Symbol> {
// FIXME return the actually used target features. this is necessary for #[cfg(target_feature)]
if sess.target.arch == "x86_64" && sess.target.os != "none" {
// x86_64 mandates SSE2 support
vec![sym::fsxr, sym::sse, sym::sse2]
} else if sess.target.arch == "aarch64" {
match &*sess.target.os {
"none" => vec![],
// On macOS the aes, sha2 and sha3 features are enabled by default and ring
// fails to compile on macOS when they are not present.
"macos" => vec![sym::neon, sym::aes, sym::sha2, sym::sha3],
// AArch64 mandates Neon support
_ => vec![sym::neon],
}
} else {
vec![]
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests/ui-fulldeps/codegen-backend/ doesn't actually use cg_clif. It uses the backend in tests/ui-fulldeps/codegen-backend/auxiliary/the_backend.rs. It is fine to implement target_features_cfg there as always returning sse and sse2. It doesn't compile anything to machine code anyway. It is just a test that -Zcodegen-backend with an external codegen backend functions.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I get that. But it still seemed easiest to just use a target that doesn't require any features.

@rust-log-analyzer

This comment has been minimized.

workingjubilee

@bors

@bors bors added S-waiting-on-review

Status: Awaiting review from the assignee but also interested parties.

and removed S-waiting-on-bors

Status: Waiting on bors to run and complete tests. Bors will change the label on completion.

labels

Feb 18, 2025

@RalfJung RalfJung changed the titlex86: make SSE2 required for i686 hardfloat targets and use it to pass SIMD types x86: use SSE2 to pass float and SIMD types

Feb 18, 2025

@RalfJung

@bors

bors added a commit to rust-lang-ci/rust that referenced this pull request

Feb 18, 2025

@bors

x86: use SSE2 to pass float and SIMD types

This builds on the new X86Sse2 ABI landed in rust-lang#137037 to actually make it a separate ABI from the default x86 ABI, and use SSE2 registers. Specifically, we use it in two ways: to return f64 values in a register rather than by-ptr, and to pass vectors of size up to 128bit in a register (or, well, whatever LLVM does when passing <4 x float> by-val, I don't actually know if this ends up in a register).

Cc @workingjubilee Fixes rust-lang#133611

try-job: aarch64-apple try-job: aarch64-gnu try-job: aarch64-gnu-debug try-job: test-various try-job: x86_64-gnu-nopt try-job: dist-i586-gnu-i586-i686-musl try-job: x86_64-msvc-1

@rust-log-analyzer

This comment has been minimized.

@RalfJung

@RalfJung

bors added a commit to rust-lang-ci/rust that referenced this pull request

Feb 18, 2025

@bors

x86: use SSE2 to pass float and SIMD types

This builds on the new X86Sse2 ABI landed in rust-lang#137037 to actually make it a separate ABI from the default x86 ABI, and use SSE2 registers. Specifically, we use it in two ways: to return f64 values in a register rather than by-ptr, and to pass vectors of size up to 128bit in a register (or, well, whatever LLVM does when passing <4 x float> by-val, I don't actually know if this ends up in a register).

Cc @workingjubilee Fixes rust-lang#133611

try-job: aarch64-apple try-job: aarch64-gnu try-job: aarch64-gnu-debug try-job: test-various try-job: x86_64-gnu-nopt try-job: dist-i586-gnu-i586-i686-musl try-job: x86_64-msvc-1

@bors

@rust-log-analyzer

This comment was marked as outdated.

@bors

☀️ Try build successful - checks-actions
Build commit: 0937552 (09375525af5bce535584f59b98b0e97eaf5c49ed)

@workingjubilee

@bors

📌 Commit 803feb5 has been approved by workingjubilee

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors

Status: Waiting on bors to run and complete tests. Bors will change the label on completion.

and removed S-waiting-on-review

Status: Awaiting review from the assignee but also interested parties.

labels

Feb 18, 2025

@bors

@bors

@rust-timer

Finished benchmarking commit (17c1c32): comparison URL.

Overall result: ✅ improvements - no action needed

@rustbot label: -perf-regression

Instruction count

This is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.

mean range count
Regressions ❌ (primary) - - 0
Regressions ❌ (secondary) - - 0
Improvements ✅ (primary) - - 0
Improvements ✅ (secondary) -0.3% [-0.3%, -0.3%] 1
All ❌✅ (primary) - - 0

Max RSS (memory usage)

Results (primary -1.1%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌ (primary) - - 0
Regressions ❌ (secondary) - - 0
Improvements ✅ (primary) -1.1% [-1.1%, -1.1%] 1
Improvements ✅ (secondary) - - 0
All ❌✅ (primary) -1.1% [-1.1%, -1.1%] 1

Cycles

Results (secondary -9.3%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌ (primary) - - 0
Regressions ❌ (secondary) - - 0
Improvements ✅ (primary) - - 0
Improvements ✅ (secondary) -9.3% [-9.3%, -9.3%] 1
All ❌✅ (primary) - - 0

Binary size

Results (primary -0.0%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean range count
Regressions ❌ (primary) - - 0
Regressions ❌ (secondary) - - 0
Improvements ✅ (primary) -0.0% [-0.0%, -0.0%] 16
Improvements ✅ (secondary) - - 0
All ❌✅ (primary) -0.0% [-0.0%, -0.0%] 16

Bootstrap: 775.604s -> 773.157s (-0.32%)
Artifact size: 360.33 MiB -> 360.31 MiB (-0.01%)

Labels

A-testsuite

Area: The testsuite used to check the correctness of rustc

merged-by-bors

This PR was explicitly merged by bors.

O-apple

Operating system: Apple (macOS, iOS, tvOS, visionOS, watchOS)

O-x86_32

Target: x86 processors, 32 bit (like i686-*) (IA-32)

relnotes

Marks issues that should be documented in the release notes of the next release.

S-waiting-on-bors

Status: Waiting on bors to run and complete tests. Bors will change the label on completion.

T-compiler

Relevant to the compiler team, which will review and decide on the PR/issue.

T-infra

Relevant to the infrastructure team, which will review and decide on the PR/issue.