Build `rustc` with a single CGU on x64 Linux by Kobzol · Pull Request #115554 · rust-lang/rust (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation77 Commits1 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

This PR adds the rust.codegen-units=1 setting when compiling the 64-bit Linux rustc artifact (the one used for try builds and Linux rustup distribution). This had mixed results in the past, however after the bump to LLVM 17, the results now seem pretty incredible. Instruction counts, cycles, wall time, max RSS and even artifact sizes see large improvements.

The last try build with this setting took 1h 8m, which is basically the same duration for try builds that we have seen recently. So there shouldn't be any large hit to CI/build time.

I hope that this could potentially also reduce codegen noise of rustc a little bit, since small changes within a single rustc crate should no longer perturb optimizations because of CGU movement. We still do cross-crate LTO, so it won't eliminate it though.

r? @Mark-Simulacrum

rustbot added A-testsuite

Area: The testsuite used to check the correctness of rustc

S-waiting-on-review

Status: Awaiting review from the assignee but also interested parties.

T-infra

Relevant to the infrastructure team, which will review and decide on the PR/issue.

labels

Sep 5, 2023

This comment has been minimized.

⌛ Trying commit 1bffa88 with merge 2d8544092e7b47541698b5a6425f9f1aa131b994...

Why only on the Linux builder? This should be done on every builder.

Sure, but we don't have to do that in a single step. We can easily gauge the CI build time and compile performance effects of this on Linux, thanks to try builds and perf.RLO, but it's not that easy for other targets. It's possible that on Windows/macOS this might be a hit to CI times, so we should first investigate that (along with checking how it affects their performance, although that should hopefully be an improvement).

It's also possible that this might cause some issues for nightly users, so I would like to test it for a few weeks on nightly on Linux first, before we enable it globally across the board.

It's possible that on Windows/macOS this might be a hit to CI times, so we should first investigate that (along with checking how it affects their performance, although that should hopefully be an improvement)

Are we expecting LVM17 to be different from our previous measurements of CI times and performance?

Well, some things might have changed in the meantime. Apple CI is now much faster (so that's a good thing). And performance has changed a lot on Linux (but again, towards improvements, so hopefully this will be true on other platforms too). It looks like things are improving in general, but I'd still like to do this on Linux first to see if there are no unexpected issues with 1 CGU.

☀️ Try build successful - checks-actions
Build commit: 2d8544092e7b47541698b5a6425f9f1aa131b994 (2d8544092e7b47541698b5a6425f9f1aa131b994)

This comment has been minimized.

I'm only talking about the comment about other builders, not this PR or the landing strategy: do we need to re-run the 1 CGU PRs on the other builders to gather up-to-date timings on LLVM17 and the new apple builder?

I think that it would be good, yes. It also depends how we want to enable it for other builders. We can allowlist it for Windows and macOS and test these two separately, or just enable it across the board for everything and then take a look at the total CI time.

Finished benchmarking commit (2d8544092e7b47541698b5a6425f9f1aa131b994): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

mean	range	count
Regressions ❌ (primary)	0.8%	[0.3%, 2.2%]	36
Regressions ❌ (secondary)	1.4%	[0.3%, 2.4%]	21
Improvements ✅ (primary)	-1.2%	[-4.1%, -0.3%]	89
Improvements ✅ (secondary)	-1.7%	[-3.6%, -0.4%]	114
All ❌✅ (primary)	-0.6%	[-4.1%, 2.2%]	125

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-2.1%	[-4.9%, -0.6%]	74
Improvements ✅ (secondary)	-3.0%	[-7.6%, -0.8%]	109
All ❌✅ (primary)	-2.1%	[-4.9%, -0.6%]	74

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	2.7%	[2.4%, 3.1%]	4
Improvements ✅ (primary)	-2.6%	[-6.9%, -0.9%]	101
Improvements ✅ (secondary)	-4.1%	[-15.4%, -0.9%]	121
All ❌✅ (primary)	-2.6%	[-6.9%, -0.9%]	101

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 627.535s -> 623.723s (-0.61%)
Artifact size: 316.30 MiB -> 271.45 MiB (-14.18%)

📌 Commit 1bffa88 has been approved by Mark-Simulacrum

It is now in the queue for this repository.

bors added S-waiting-on-bors

Status: Waiting on bors to run and complete tests. Bors will change the label on completion.

and removed S-waiting-on-review

Status: Awaiting review from the assignee but also interested parties.

labels

Sep 10, 2023

bors added a commit to rust-lang-ci/rust that referenced this pull request

Sep 10, 2023

Build rustc with a single CGU on x64 Linux

This PR adds the rust.codegen-units=1 setting when compiling the 64-bit Linux rustc artifact (the one used for try builds and Linux rustup distribution). This had mixed results in the past, however after the bump to LLVM 17, the results now seem pretty [incredible](rust-lang#115554 (comment)). Instruction counts, cycles, wall time, max RSS and even artifact sizes see large improvements.

The last try build with this setting took 1h 8m, which is basically the same duration for try builds that we have seen recently. So there shouldn't be any large hit to CI/build time.

I hope that this could potentially also reduce codegen noise of rustc a little bit, since small changes within a single rustc crate should no longer perturn optimizations because of CGU movement. We still do cross-crate LTO, so it won't eliminate it though.

r? @Mark-Simulacrum

But at the same time, libstd.so size didn't changed at all, strange.

This comment has been minimized.

📌 Commit ca59652 has been approved by Mark-Simulacrum

It is now in the queue for this repository.

bors added S-waiting-on-bors

Status: Waiting on bors to run and complete tests. Bors will change the label on completion.

and removed S-waiting-on-author

Status: This is awaiting some action (such as code changes or more information) from the author.

labels

Oct 1, 2023

lqd mentioned this pull request

Oct 1, 2023

Finished benchmarking commit (871407a): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Next Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please open an issue or create a new PR that fixes the regressions, add a comment linking to the newly created issue or PR, and then add the perf-regression-triaged label to this PR.

@rustbot label: +perf-regression
cc @rust-lang/wg-compiler-performance

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

mean	range	count
Regressions ❌ (primary)	0.8%	[0.3%, 2.6%]	39
Regressions ❌ (secondary)	1.6%	[0.7%, 2.5%]	23
Improvements ✅ (primary)	-1.2%	[-3.9%, -0.2%]	89
Improvements ✅ (secondary)	-1.6%	[-3.5%, -0.4%]	124
All ❌✅ (primary)	-0.6%	[-3.9%, 2.6%]	128

Max RSS (memory usage)

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	2.9%	[2.9%, 2.9%]	1
Improvements ✅ (primary)	-2.5%	[-4.6%, -0.5%]	90
Improvements ✅ (secondary)	-3.0%	[-6.7%, -0.6%]	118
All ❌✅ (primary)	-2.5%	[-4.6%, -0.5%]	90

Cycles

Results

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	6.0%	[4.7%, 7.3%]	6
Improvements ✅ (primary)	-2.2%	[-5.9%, -0.6%]	134
Improvements ✅ (secondary)	-3.7%	[-13.9%, -1.0%]	140
All ❌✅ (primary)	-2.2%	[-5.9%, -0.6%]	134

Binary size

This benchmark run did not return any relevant results for this metric.

Bootstrap: 630.664s -> 627.128s (-0.56%)
Artifact size: 317.36 MiB -> 273.31 MiB (-13.88%)

Cycle count, Max RSS and artifact size is overwhelmingly positive.

@rustbot label: +perf-regression-triaged

Summary of perf results:

icounts: -0.49% mean change across all measurements
cycles: -1.91% (ditto)
wall-time: -1.57% (ditto)
max-rss: -1.96% (ditto)
librustc_driver size: -31.6%
rustdoc artifact size: -20.7%
cargo artifact size: -19.9%

Fantastic results!

The bootstrap time also went down a tiny bit (-0.56%) but that is misleading. The bootstrap benchmark doesn't use the same configuration as the shipped compiler: no codegen-units=1 and no PGO/BOLT/LTO. @Kobzol told me that try build times have gone from something like 1h8m to 1h14m.

This was referenced

Oct 1, 2023

pvdrz mentioned this pull request

Oct 4, 2023

bors-ferrocene bot added a commit to ferrocene/ferrocene that referenced this pull request

Oct 5, 2023

RalfJung pushed a commit to RalfJung/miri that referenced this pull request

Oct 6, 2023

Build rustc with a single CGU on x64 Linux

This PR adds the rust.codegen-units=1 setting when compiling the 64-bit Linux rustc artifact (the one used for try builds and Linux rustup distribution). This had mixed results in the past, however after the bump to LLVM 17, the results now seem pretty [incredible](rust-lang/rust#115554 (comment)). Instruction counts, cycles, wall time, max RSS and even artifact sizes see large improvements.

The last try build with this setting took 1h 8m, which is basically the same duration for try builds that we have seen recently. So there shouldn't be any large hit to CI/build time.

r? @Mark-Simulacrum

bors-ferrocene bot added a commit to ferrocene/ferrocene that referenced this pull request

Oct 6, 2023

bors-ferrocene bot added a commit to ferrocene/ferrocene that referenced this pull request

Oct 9, 2023

bors-ferrocene bot added a commit to ferrocene/ferrocene that referenced this pull request

Oct 9, 2023

bors added a commit to rust-lang-ci/rust that referenced this pull request

Oct 9, 2023

Optimize librustc_driver.so with BOLT

This PR optimizes librustc_driver.so on 64-bit Linux CI with BOLT.

Code

One thing that's not clear yet to me how to resolve is how to best pass a linker flag that we need for BOLT (the second commit). It is currently passed unconditionally, which is not a good idea. We somehow have to:

Only pass it when we actually plan to use BOLT. How to best do that? config.toml entry? Environment variable? CLI flag for bootstrap? BOLT optimization is done by opt-dist, therefore bootstrap doesn't know about it by default.
Only pass it to librustc_driver.so (see performance below).

Some discussion of this flag already happened on Zulip.

Performance

Latest perf. results can be found [here](rust-lang#102487 (comment)). Note that instruction counts are not very interesting here, there are only regressions on hello world programs. Probably caused by a larger C++ libstd (?).

Summary:

✔️ -1.8% mean improvement in cycle counts across many primary benchmarks.
✔️ -1.8% mean Max-RSS improvement.
✖️ 34 MiB (+48%) artifact size regression of librustc_driver.so.
- This is caused by building librustc_driver.so with relocations (which are required for BOLT). Hopefully, it will be fixed in the future with BOLT improvements, but now trying to reduce this size increase is tricky.
- Note that the size of this file was recently reduced in rust-lang#115554 by pretty much the same amount (33 MiB). So the size after this PR is basically the same as it was for the last ~year.
✖️ 1.4 MiB (+53%) artifact size regression of rustc.
- This is annoying and pretty much unnecessary. It is caused by the way relocations are currently applied in this PR, because they are applied both to librustc_driver.so (where they are needed) and for rustc (where they aren't needed), since both are built with a single cargo invocation. We might need e.g. some tricks in the bootstrap rustc shim to only apply the relocation flag for the shared library and not for rustc.

CI time

CI (try build) got slower by ~5 minutes, which is fine, IMO. It can be further reduced by running LLVM and librustc_driver BOLT profile gathering at the same time (now they are gathered separately for LLVM and librustc_driver).

r? @Mark-Simulacrum

Also CC @onur-ozkan, primarily for the bootstrap linker flag issue.

bors added a commit to rust-lang-ci/rust that referenced this pull request

Oct 11, 2023

Optimize librustc_driver.so with BOLT

This PR optimizes librustc_driver.so on 64-bit Linux CI with BOLT.

Code

Only pass it when we actually plan to use BOLT. How to best do that? config.toml entry? Environment variable? CLI flag for bootstrap? BOLT optimization is done by opt-dist, therefore bootstrap doesn't know about it by default.
Only pass it to librustc_driver.so (see performance below).

Some discussion of this flag already happened on Zulip.

Performance

Summary:

✔️ -1.8% mean improvement in cycle counts across many primary benchmarks.
✔️ -1.8% mean Max-RSS improvement.
✖️ 34 MiB (+48%) artifact size regression of librustc_driver.so.
- This is caused by building librustc_driver.so with relocations (which are required for BOLT). Hopefully, it will be fixed in the future with BOLT improvements, but now trying to reduce this size increase is tricky.
- Note that the size of this file was recently reduced in rust-lang#115554 by pretty much the same amount (33 MiB). So the size after this PR is basically the same as it was for the last ~year.
✖️ 1.4 MiB (+53%) artifact size regression of rustc.
- This is annoying and pretty much unnecessary. It is caused by the way relocations are currently applied in this PR, because they are applied both to librustc_driver.so (where they are needed) and for rustc (where they aren't needed), since both are built with a single cargo invocation. We might need e.g. some tricks in the bootstrap rustc shim to only apply the relocation flag for the shared library and not for rustc.

CI time

r? @Mark-Simulacrum

Also CC @onur-ozkan, primarily for the bootstrap linker flag issue.

bors added a commit to rust-lang-ci/rust that referenced this pull request

Oct 14, 2023

…ulacrum

Optimize librustc_driver.so with BOLT

This PR optimizes librustc_driver.so on 64-bit Linux CI with BOLT.

Code

Only pass it when we actually plan to use BOLT. How to best do that? config.toml entry? Environment variable? CLI flag for bootstrap? BOLT optimization is done by opt-dist, therefore bootstrap doesn't know about it by default.
Only pass it to librustc_driver.so (see performance below).

Some discussion of this flag already happened on Zulip.

Performance

Summary:

✔️ -1.8% mean improvement in cycle counts across many primary benchmarks.
✔️ -1.8% mean Max-RSS improvement.
✖️ 34 MiB (+48%) artifact size regression of librustc_driver.so.
- This is caused by building librustc_driver.so with relocations (which are required for BOLT). Hopefully, it will be fixed in the future with BOLT improvements, but now trying to reduce this size increase is tricky.
- Note that the size of this file was recently reduced in rust-lang#115554 by pretty much the same amount (33 MiB). So the size after this PR is basically the same as it was for the last ~year.
✖️ 1.4 MiB (+53%) artifact size regression of rustc.
- This is annoying and pretty much unnecessary. It is caused by the way relocations are currently applied in this PR, because they are applied both to librustc_driver.so (where they are needed) and for rustc (where they aren't needed), since both are built with a single cargo invocation. We might need e.g. some tricks in the bootstrap rustc shim to only apply the relocation flag for the shared library and not for rustc.

CI time

r? @Mark-Simulacrum

Also CC @onur-ozkan, primarily for the bootstrap linker flag issue.

github-actions bot pushed a commit to rust-lang/miri that referenced this pull request

Oct 17, 2023

Optimize librustc_driver.so with BOLT

This PR optimizes librustc_driver.so on 64-bit Linux CI with BOLT.

Code

Only pass it when we actually plan to use BOLT. How to best do that? config.toml entry? Environment variable? CLI flag for bootstrap? BOLT optimization is done by opt-dist, therefore bootstrap doesn't know about it by default.
Only pass it to librustc_driver.so (see performance below).

Some discussion of this flag already happened on Zulip.

Performance

Latest perf. results can be found [here](rust-lang/rust#102487 (comment)). Note that instruction counts are not very interesting here, there are only regressions on hello world programs. Probably caused by a larger C++ libstd (?).

Summary:

✔️ -1.8% mean improvement in cycle counts across many primary benchmarks.
✔️ -1.8% mean Max-RSS improvement.
✖️ 34 MiB (+48%) artifact size regression of librustc_driver.so.
- This is caused by building librustc_driver.so with relocations (which are required for BOLT). Hopefully, it will be fixed in the future with BOLT improvements, but now trying to reduce this size increase is tricky.
- Note that the size of this file was recently reduced in rust-lang/rust#115554 by pretty much the same amount (33 MiB). So the size after this PR is basically the same as it was for the last ~year.
✖️ 1.4 MiB (+53%) artifact size regression of rustc.
- This is annoying and pretty much unnecessary. It is caused by the way relocations are currently applied in this PR, because they are applied both to librustc_driver.so (where they are needed) and for rustc (where they aren't needed), since both are built with a single cargo invocation. We might need e.g. some tricks in the bootstrap rustc shim to only apply the relocation flag for the shared library and not for rustc.

CI time

r? @Mark-Simulacrum

Also CC @onur-ozkan, primarily for the bootstrap linker flag issue.

lnicola pushed a commit to lnicola/rust-analyzer that referenced this pull request

Apr 7, 2024

Refactor opt-dist to simplify local building

This PR refactors the opt-dist tool to make it easier to invoke it locally, outside of CI, and thus simplify building PGO/BOLT optimized rustc builds e.g. for distro maintainers. It should also make it easier to run the PGO/BOLT workflow locally e.g. to profile performance or debug issues (looking at you, rust-lang/rust#115554).

RalfJung pushed a commit to RalfJung/rust-analyzer that referenced this pull request

Apr 27, 2024

Refactor opt-dist to simplify local building

Labels

A-testsuite

Area: The testsuite used to check the correctness of rustc

merged-by-bors

This PR was explicitly merged by bors.

perf-regression

Performance regression.

perf-regression-triaged

The performance regression has been triaged.