gh-104144: Optimize gather to finish eagerly when all futures complete eagerly by itamaro · Pull Request #104138 · python/cpython

gh-97696 introduced the eager task factory, which speeds up some async-heavy workloads by up to 50% when opted in.

Installing the eager task factory takes effect out of the box when gathering futures with asyncio.gather(...), e.g.:

asyncio.get_event_loop().set_task_factory(asyncio.eager_task_factory)
await asyncio.gather(coro1, coro2, coro3)

coro{1,2,3} will each eagerly execute their first step, and may complete without ever being scheduled on the event loop if the coroutines don't block.
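
To illustrate the eager first step described above, here is a minimal sketch. It assumes Python 3.12+ for asyncio.eager_task_factory and falls back to the default task factory on older versions; no_block and main are hypothetical names used only for illustration.

```python
import asyncio


async def no_block(log, name):
    # Hypothetical coroutine: it never suspends, so it can finish in its first step.
    log.append(name)
    return name


async def main():
    loop = asyncio.get_running_loop()
    # eager_task_factory only exists on Python 3.12+; fall back gracefully otherwise.
    factory = getattr(asyncio, "eager_task_factory", None)
    if factory is not None:
        loop.set_task_factory(factory)

    log = []
    task = asyncio.ensure_future(no_block(log, "coro1"))
    # With the eager factory, the body has already run at this point,
    # before main() has yielded control to the event loop.
    ran_before_yield = log == ["coro1"]
    await task
    return factory is not None, ran_before_yield, task.result()


eager_available, ran_before_yield, result = asyncio.run(main())
print(eager_available, ran_before_yield, result)
```

With the default (non-eager) factory, the task body only runs once main() awaits, so ran_before_yield is expected to match eager_available.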

However, the implementation of gather uses done callbacks internally, and these end up being scheduled on the event loop even if all the futures finished synchronously. This blocks the coroutine in which gather() was awaited, preventing that task from completing eagerly even when it otherwise could.

Applications that use multiple levels of nested gathers can benefit significantly from eagerly completing multiple levels without blocking. This PR implements that by skipping the scheduling of done callbacks for futures that are already done (e.g. finished eagerly).
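
The idea can be sketched roughly as follows. This is a simplified illustration, not the actual asyncio.gather code: gather_sketch and on_done are hypothetical names, and the sketch ignores exceptions, cancellation, and duplicate arguments. Futures that are already done have their callback invoked directly instead of going through add_done_callback, so the outer future can be complete by the time it is returned.

```python
import asyncio


def gather_sketch(*futures, loop):
    """Hypothetical, simplified stand-in for gather(): results only."""
    outer = loop.create_future()
    if not futures:
        outer.set_result([])
        return outer

    remaining = len(futures)

    def on_done(fut):
        nonlocal remaining
        remaining -= 1
        if remaining == 0 and not outer.done():
            outer.set_result([f.result() for f in futures])

    already_done = []
    for fut in futures:
        if fut.done():
            # Already finished (e.g. eagerly): no need to schedule a callback.
            already_done.append(fut)
        else:
            fut.add_done_callback(on_done)

    # Run the callbacks for finished futures synchronously, after outer exists.
    for fut in already_done:
        on_done(fut)
    return outer


async def main():
    loop = asyncio.get_running_loop()
    a, b = loop.create_future(), loop.create_future()
    a.set_result(1)
    b.set_result(2)
    outer = gather_sketch(a, b, loop=loop)
    done_immediately = outer.done()  # True: no trip through the event loop
    return done_immediately, await outer


done_immediately, results = asyncio.run(main())
print(done_immediately, results)
```

Because the outer future is already done when it is awaited, the awaiting coroutine does not need to suspend, which is what lets nested gathers complete eagerly level by level.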

Benchmarks

This makes the async pyperformance benchmarks up to 3x faster (!!), measured using a patch to pyperformance that adds "eager" flavors of the benchmarks:

3.12-base.20230503.async.4.json
===============================

Performance version: 1.0.7
Python version: 3.12.0a7+ (64-bit) revision da1980afcb
Report on Linux-5.15.0-1033-aws-x86_64-with-glibc2.31
Number of logical CPUs: 72
Start date: 2023-05-03 23:27:23.329046
End date: 2023-05-03 23:46:37.706326

3.12-nogf.20230503.async.2.json
===============================

Performance version: 1.0.7
Python version: 3.12.0a7+ (64-bit) revision 5397cd9f62
Report on Linux-5.15.0-1033-aws-x86_64-with-glibc2.31
Number of logical CPUs: 72
Start date: 2023-05-03 23:05:45.011427
End date: 2023-05-03 23:22:44.908094

+-------------------------------+---------------------------------+---------------------------------+--------------+------------------------+
| Benchmark                     | 3.12-base.20230503.async.4.json | 3.12-nogf.20230503.async.2.json | Change       | Significance           |
+===============================+=================================+=================================+==============+========================+
| async_tree_cpu_io_mixed       | 868 ms                          | 859 ms                          | 1.01x faster | Not significant        |
+-------------------------------+---------------------------------+---------------------------------+--------------+------------------------+
| async_tree_eager              | 391 ms                          | 129 ms                          | 3.03x faster | Significant (t=209.74) |
+-------------------------------+---------------------------------+---------------------------------+--------------+------------------------+
| async_tree_eager_cpu_io_mixed | 756 ms                          | 490 ms                          | 1.54x faster | Significant (t=167.41) |
+-------------------------------+---------------------------------+---------------------------------+--------------+------------------------+
| async_tree_eager_io           | 1.51 sec                        | 1.50 sec                        | 1.00x faster | Not significant        |
+-------------------------------+---------------------------------+---------------------------------+--------------+------------------------+
| async_tree_eager_memoization  | 595 ms                          | 314 ms                          | 1.89x faster | Significant (t=70.25)  |
+-------------------------------+---------------------------------+---------------------------------+--------------+------------------------+
| async_tree_io                 | 1.39 sec                        | 1.40 sec                        | 1.00x slower | Not significant        |
+-------------------------------+---------------------------------+---------------------------------+--------------+------------------------+
| async_tree_memoization        | 677 ms                          | 683 ms                          | 1.01x slower | Not significant        |
+-------------------------------+---------------------------------+---------------------------------+--------------+------------------------+
| async_tree_none               | 575 ms                          | 574 ms                          | 1.00x faster | Not significant        |
+-------------------------------+---------------------------------+---------------------------------+--------------+------------------------+
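
As a quick sanity check on the "Change" column, the snippet below recomputes the speedups for the three significant rows. It assumes the column is the ratio of the first timing to the second, and uses the rounded millisecond values copied from the table, so it is an approximation of what the benchmark tool computes from its full-precision data.

```python
# Timings in ms copied from the table above: (first file, second file).
timings_ms = {
    "async_tree_eager": (391, 129),
    "async_tree_eager_cpu_io_mixed": (756, 490),
    "async_tree_eager_memoization": (595, 314),
}

# Assumption: "Change" is first / second, rounded to two decimals.
speedups = {
    name: round(first / second, 2)
    for name, (first, second) in timings_ms.items()
}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x faster")
```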