Move coroutine upvars into locals for better memory economy by dingxiangfei2009 · Pull Request #135527 · rust-lang/rust
Replaces #127522
Related to #62958
The problem statement
#62958 demonstrates two problems. First, upvars are unconditionally promoted to prefix data fields of the state machine. Second, the opportunity for a more compact data layout is lost because captured upvars are not subjected to liveness analysis: the memory occupied by an upvar is never reclaimed and made available for other data saved across yield points, even when the upvar is dead at those suspension points.
The second problem is better demonstrated with this code snippet.
async fn work(another_fut: impl Future) {
    let _ = another_fut.await;
    // now `another_fut` is consumed
    let next_fut = async { .. };
    next_fut.await;
}
// `work`'s layout needs to reserve space for both `another_fut` and `next_fut`,
// while there is a clear missed opportunity to overlap the memory for
// `another_fut` and `next_fut` for better memory economy.
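One way to observe this on a given toolchain is to measure the futures directly. The following is a minimal sketch, not part of the PR: the helper `big`, its 1 KiB buffer, and the function `sizes` are hypothetical names chosen for illustration, and the exact numbers printed depend on the compiler version and on whether this change is applied.

```rust
use std::future::Future;
use std::mem::size_of_val;

/// Hypothetical payload future: its argument is captured by the future,
/// so the future is at least as large as the 1 KiB buffer.
async fn big(buf: [u8; 1024]) -> usize {
    buf.len()
}

/// Same shape as the `work` example: the captured future is consumed
/// before the second future is created.
async fn work(another_fut: impl Future<Output = usize>) -> usize {
    let first = another_fut.await;
    let next_fut = big([0u8; 1024]);
    first + next_fut.await
}

/// Returns (size of the captured future, size of the `work` future).
pub fn sizes() -> (usize, usize) {
    let inner = big([1u8; 1024]);
    let inner_size = size_of_val(&inner);
    // `inner` becomes an upvar of the future returned by `work`.
    let outer = work(inner);
    (inner_size, size_of_val(&outer))
}

fn main() {
    let (inner, outer) = sizes();
    println!("captured future: {inner} bytes, `work` future: {outer} bytes");
}
```

If the upvar and `next_fut` cannot share memory, the `work` future is expected to be roughly twice the size of the captured future; if they can overlap, it should be close to the size of one of them.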
The difficulty lies with the fact that captured upvars do not receive their own locals inside a coroutine body. If we can assign locals to them somehow, we can run the layout scheme as usual and the optimisation on the data layout comes into effect out of the box in most cases.
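As a rough source-level analogy only (the PR operates on MIR, not on source code), moving a captured value into a body-local at entry looks like the sketch below. The names `relocated`, `block_on`, and `NoopWake` are hypothetical helpers for this illustration, and writing code this way by hand is not claimed to change the layout on any particular compiler.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// A waker that does nothing; enough to drive futures that never park.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Minimal busy-polling executor, for demonstration purposes only.
fn block_on<F: Future>(fut: F) -> F::Output {
    let mut fut = pin!(fut);
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

fn relocated(captured: String) -> impl Future<Output = usize> {
    async move {
        // Hand-written analogue of the relocation: the upvar is moved
        // into an ordinary local as the first action of the body.
        let local = captured;
        let len = local.len();
        // The local is dead before the suspension point below, which is
        // the kind of fact a liveness analysis on locals can discover.
        drop(local);
        std::future::ready(()).await;
        len
    }
}

fn main() {
    println!("{}", block_on(relocated(String::from("upvar"))));
}
```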
Proposed changes
This is initial work to improve the memory economy of coroutines and `async` futures by reducing the unnecessary promotion of captured upvars into the state prefix.
The changes are broken into commits so that each can be reviewed in isolation. They are as follows.
- Changes to the UI and diagnostic messages, placed upfront so that the results are visible.
- Introduction of a `RelocateUpvar` MIR pass that inserts a MIR gadget, through which values captured by coroutine or `async` bodies or closures are moved into inner MIR locals. This makes it possible to subject the captured upvars to the same liveness analysis and to determine which of them need to be stored in the coroutine state during suspension.
- With this gadget, we do not have to keep all upvars in the so-called `prefix` data regions of coroutine states. Instead, they are moved into the `Unresumed` state, which is by convention the first variant of the state ADT.
- In addition, in case some upvars are eventually used across more than one suspension point, which leads to their promotion into the `prefix` after all, we further arrange the coroutine state data layout so that their offsets in the `Unresumed` state coincide with their memory slots after promotion. This means that during codegen, the additional moves introduced by the `RelocateUpvar` gadget are actually elided. The relevant change is implemented in `rustc_abi`.
- We then have to translate direct field accesses to the upvars into accesses behind the `Unresumed` variant.
- We have to update diagnostics so that they are better informed about captured values and make more sense in view of this change.
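The effect of moving upvars out of the prefix and into the `Unresumed` variant can be modelled with ordinary Rust types. The sketch below is a hand-written approximation, not the layout code in `rustc_abi`: the type names `PrefixLayout`, `PrefixVariants`, and `RelocatedLayout`, and the 1 KiB field sizes, are assumptions made for illustration.

```rust
use std::mem::size_of;

// Stand-ins for a captured upvar and a local saved across a yield point.
type Upvar = [u8; 1024];
type Saved = [u8; 1024];

/// Model of the variants when upvars live outside of them.
#[allow(dead_code)]
enum PrefixVariants {
    Unresumed,
    Suspend0 { next_fut: Saved },
    Returned,
}

/// Model of the existing scheme: the upvar sits in a prefix field that
/// is reserved for the whole lifetime of the state machine.
#[allow(dead_code)]
struct PrefixLayout {
    another_fut: Upvar,
    state: PrefixVariants,
}

/// Model of the proposed scheme: the upvar lives in `Unresumed`, so its
/// storage can overlap with data saved in later variants.
#[allow(dead_code)]
enum RelocatedLayout {
    Unresumed { another_fut: Upvar },
    Suspend0 { next_fut: Saved },
    Returned,
}

/// Returns (size of the prefix model, size of the relocated model).
pub fn layout_sizes() -> (usize, usize) {
    (size_of::<PrefixLayout>(), size_of::<RelocatedLayout>())
}

fn main() {
    let (prefix, relocated) = layout_sizes();
    println!("prefix model: {prefix} bytes, relocated model: {relocated} bytes");
}
```

In this model the prefix variant pays for both buffers, while the relocated variant pays only for the larger of the two plus a discriminant. An upvar that stays live across several suspension points would still need a slot shared by those states, which is the case the fourth item in the list addresses.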
Other than upvars, the coroutine state data layout scheme remains largely the same.
Further optimisation to be implemented behind a feature gate
Point 4 mentions that any local to be saved across suspensions is promoted whenever it is live across two or more yield locations. We would like to run an experiment behind a feature gate on improvements to the layout scheme. For ease of reviewing, it is better to drop this part of the work from this PR. Nevertheless, the idea follows the implementation in #127522, and we intend to propose a second PR just for that.
Old PR description
Good day, this PR is related to #127522, and it makes it easier for the public to test out the new coroutine/`async` state machine directly.
Prepare the compiler for tests
To start, you may build the compiler as described in the `rustc-dev-guide` instructions. If testing in a Docker container is desirable, you may build this compiler with `src/ci/docker/run.sh dist-x86_64-linux --dev` for `x86_64` and package the compiler with `../x dist` to produce the artifacts in `obj/dist-x86_64-linux/build/dist`. This Dockerfile gets you a working Rust builder image which allows you to build your Rust applications in `bookworm`.
The state of performance
So far with this patch, I have been studying the performance impact on `tokio`'s single- and multi-threaded runtimes, as well as on a simple `axum` HTTP service. As far as I can see, I can find a change in performance characteristics that is statistically significant at a one-sided p = 0.05.
This time, I would like to ask for your valuable assessments and thoughts on this patch. I kindly request that you run experiments and, hopefully, provide regression cases with `perf record -e cycles:u,instructions:u,cache-misses:u` reports.
Thank you all so much! 🙇