Move coroutine upvars into locals for better memory economy by dingxiangfei2009 · Pull Request #135527 · rust-lang/rust

Replaces #127522
Related to #62958

The problem statement

#62958 demonstrates two problems. First, upvars are always unconditionally promoted to prefix data fields of the state machine. Second, the opportunity for a more compact data layout is lost because captured upvars are not subjected to liveness analysis: the memory occupied by an upvar is never reclaimed and made available for other data saved across yield points, even when the upvar is dead at those suspension locations.

The second problem is better demonstrated with this code snippet.

async fn work(another_fut: impl Future) {
    let _ = another_fut.await;
    // now another_fut is consumed
    let next_fut = async { .. };
    next_fut.await;
}

// work's layout needs to reserve space for both another_fut and next_fut, while there is a clear missed opportunity
// to overlap the memory for another_fut and next_fut for better memory economy.
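To observe the effect on a given toolchain, one can measure the size of the future returned by `work`. The following is a minimal sketch, not part of this PR: `Big` is a hypothetical hand-written future carrying 1 KiB of inline state, used here in place of the elided `async { .. }` block so that the example is self-contained. The exact size printed depends on the compiler version, so no expected output is given.

```rust
use std::future::Future;
use std::mem::{size_of, size_of_val};
use std::pin::Pin;
use std::task::{Context, Poll};

/// Hypothetical stand-in for a large future: 1 KiB of inline state,
/// ready on the first poll.
#[allow(dead_code)]
struct Big([u8; 1024]);

impl Future for Big {
    type Output = ();

    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        Poll::Ready(())
    }
}

async fn work(another_fut: impl Future) {
    let _ = another_fut.await;
    // now another_fut is consumed
    let next_fut = Big([0; 1024]);
    next_fut.await;
}

/// Size in bytes of the state machine behind `work`.
/// The future is only measured, never polled.
fn work_size() -> usize {
    size_of_val(&work(Big([0; 1024])))
}

fn main() {
    println!("size_of(Big)  = {} bytes", size_of::<Big>());
    println!("size_of(work) = {} bytes", work_size());
}
```

If the memory of `another_fut` and `next_fut` were overlapped, the second number would be close to the first; a result of two or more times `size_of(Big)` indicates the missed opportunity described above.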

The difficulty lies in the fact that captured upvars do not receive their own locals inside a coroutine body. If we can assign locals to them somehow, we can run the layout scheme as usual and the data layout optimisation takes effect out of the box in most cases.

Proposed changes

This is initial work to improve the memory economy of coroutines and async futures by reducing the unnecessary promotion of captured upvars into the state prefix.

The changes are broken into commits so that each can be reviewed in isolation. They are as follows.

  1. Changes to the UI and diagnostic messages, upfront for results
  2. Introduction of a RelocateUpvar MIR pass that inserts a MIR gadget, through which values captured by coroutine bodies, async bodies or closures are moved into the inner MIR locals. This opens the opportunity to subject the captured upvars to the same liveness analysis and to determine which of them actually need to be stored in the coroutine state during suspension.
  3. With this gadget, we no longer have to keep all upvars in the so-called prefix data region of the coroutine state. Instead, they are moved into the Unresumed state, which by convention is the first variant of the state ADT.
  4. In addition, in case some upvars are eventually used across more than one suspension point, which leads to their promotion into the prefix after all, we further arrange the coroutine state data layout so that their offsets in the Unresumed state coincide with their memory slots after promotion. This means that during codegen, the additional moves introduced by the RelocateUpvar gadget are actually elided. The relevant change is implemented in rustc_abi.
  5. We then have to do the bookkeeping of translating direct field accesses to the upvars into accesses behind the Unresumed variant.
  6. We have to update diagnostics so that they are more informed about captured values and they make more sense in view of this change.
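The effect of moving upvars out of the prefix and into the Unresumed variant can be illustrated with a hand-written model of the state for the `work` example above. This is only a sketch under stated assumptions: the type names `StateBefore`, `VariantsBefore` and `StateAfter` are hypothetical, ordinary Rust enums stand in for the coroutine layout computed by the compiler, and 1 KiB byte arrays stand in for the two futures. It is not the layout that rustc actually generates.

```rust
use std::mem::size_of;

// Hypothetical stand-ins for the two futures in `work`.
type AnotherFut = [u8; 1024]; // the captured upvar
type NextFut = [u8; 1024]; // the local created after the first await

// Model of the old scheme: the upvar sits in a prefix field that exists
// for the whole lifetime of the state machine, next to the variants.
#[allow(dead_code)]
enum VariantsBefore {
    Unresumed,
    Suspend0,
    Suspend1 { next_fut: NextFut },
    Returned,
}

#[allow(dead_code)]
struct StateBefore {
    another_fut: AnotherFut, // prefix: never reclaimed
    variants: VariantsBefore,
}

// Model of the new scheme: the upvar starts in the Unresumed variant and
// is only kept in the variants where it is still live, so its memory can
// overlap with `next_fut`.
#[allow(dead_code)]
enum StateAfter {
    Unresumed { another_fut: AnotherFut },
    Suspend0 { another_fut: AnotherFut },
    Suspend1 { next_fut: NextFut },
    Returned,
}

fn main() {
    println!("prefix model:    {} bytes", size_of::<StateBefore>());
    println!("unresumed model: {} bytes", size_of::<StateAfter>());
}
```

In this model the prefix version must hold both arrays at once, while the variant-based version needs room for only one of them plus a discriminant.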

Other than upvars, the coroutine state data layout scheme remains largely the same.

Further optimisation to be implemented behind a feature gate

Point 4 mentions that any local to be saved across suspensions will be promoted whenever it is live across two or more yield locations. We would like to run an experiment behind a feature gate on improvements to the layout scheme. For ease of reviewing, it is better to drop this part of the work from this PR. Nevertheless, the idea follows the implementation in #127522 and we intend to propose a second PR just for that.

Old PR description

Good day. This PR is related to #127522 and makes it easier for the public to test out a new coroutine/`async` state machine directly.

Prepare the compiler for tests

To start, you may build the compiler as described in the rustc-dev-guide. If you would rather test in a Docker container, you may build this compiler with src/ci/docker/run.sh dist-x86_64-linux --dev for x86_64 and package the compiler with ../x dist to produce the artifacts in obj/dist-x86_64-linux/build/dist. This Dockerfile gets you a working Rust builder image which allows you to build your Rust applications in bookworm.

The state of performance

So far with this patch, I have been studying the performance impact on the cases of tokio's single- and multi-threaded runtime, as well as a simple axum HTTP service. As far as I can see, I can find a change in performance characteristics that are statistically significant, one-sided p = 0.05.

This time, I would like to call for your valuable assessments and thoughts on this patch. I kindly ask you to run experiments and, if you find regressions, to provide the regression cases together with perf record -e cycles:u,instructions:u,cache-misses:u reports.

Thank you all so much! 🙇