emscripten (original) (raw)

Background

The Emterpreter and Asyncify features help with problem of synchronous source code that can't run synchronously on the Web, like a synchronous network access in C, etc. In general main loops can be rewritten using the emscripten_set_main_loop method, but that doesn't work for many other things, where it would just be too much work to rewrite code with many synchronous operations.

Asyncify

Asyncify transforms the CFG at the LLVM level. It makes it possible to jump from the program entry to any location that an async operation might happen, so that you can leave the function and resume there later. It also can save and reload all the values in LLVM's virtual registers.

Asyncify works well, but on large functions every new edge that is added can lead to a great many phis, exponentially so in bad cases. This can lead to very large amounts of code.

Emterpreter

The second effort in this space, the Emterpreter, aimed to avoid Ayncify's worst-case code size blowups by running interpreted bytecode. There is then a guarantee that the bytecode is of similar size to the original code (usually smaller, in fact). While the code can run much slower, the hope was that projects can selectively emterpret only the necessary code, for example, the main loop might be emterpreted but physics simulation code that it calls does not need to, if async operations do not happen during physics. And a bytecode made sense because it also allowed experimenting with other things like fast startup (avoiding JS parse time or asm.js AOT time).

The Emterpreter works fairly well, and it does turn out that selective emterpretation is practical in many cases. However, the other use cases for a bytecode like fast startup have become irrelevant in time, as browsers have improved their startup speeds for JS and wasm.

The current challenge

We are close to switching our backend from fastcomp to the upstream LLVM backend. Both Asyncify and the Emterpreter can't work with that backend, Asyncify because it's an LLVM pass inside fastcomp, and the Emterpreter because it runs on asm.js output.

In general we hoped that an async solution for the wasm backend may not be needed, if threads show up fast enough, as they allow pausing and resuming execution at least on background threads. However, Spectre has slowed adoption of threads, with only Chrome shipping it currently, and only on desktop. And in any case, only allowing it on the main thread is limiting as well.

One option here might be to compile wasm to JS using wasm2js, then run the normal Emterpreter code on that. A problem though is that it might run more slowly than the current Emterpreter, since wasm2js does not emit validating asm.js (it emits more flexible JS in order to be smaller and support more wasm features). Running the emterpreted code more slowly might be fine, but it also means running the rest more slowly, unless we split it out and don't convert all the original program to JS, which is possible but not trivial. So we don't have a quick solution here, and must do at least some amount of actual work.

Proposal

We can implement a new solution in Binaryen, using the lessons learned from Asyncify and the Emterpreter:

A Binaryen pass can break up functions into basic blocks, emulating control flow between them using a loop in a switch. This allows leaving and resuming code fairly easily. We can then also add the code to save and load the locals as needed.
Doing this at the Binaryen level means we don't need to add phis. We will have mostly the same amount of locals as before, with the addition of one for the control flow emulation, and extra ones to handle corner cases of nested control flow that we must flatten out. But code size and local counts will probably not increase by more than a factor of 2 or so, so this should be smaller and more efficient than Asyncify (which had the right idea overall, but doing it at the LLVM level was the problem.)
As a future optimization, we can emulate control flow only around calls where an async operation might begin. But based on the Emterpreter experience, focusing on speed may not be that crucial here, as users can selectively apply this transformation only to the necessary code. And this will be much faster than the Emterpreter even if it emulates all control flow, since each basic block will run at full speed.