Improve HTTP/2 scaling performance by stephentoub · Pull Request #35694 · dotnet/runtime (original) (raw)
Contributes to #35184
cc: @scalablecory, @wfurt, @JamesNK
I used James' repro from the linked issue, but focused on the "r" mode, which is the one that uses HttpClient directly rather than going through higher layers. I also made some tweaks to the code, in JamesNK/Http2Perf#1, to remove some of the overhead from that layer.
Net net, at 100 concurrent requests in this repro this PR improves RPS by ~2x.
These numbers were gathered on top of #35575.
Before, "r 1 false", Kestrel, client GC
Successfully processed 13424; RPS 7723; Errors 0; Last elapsed 0.1111ms
Successfully processed 21233; RPS 7807; Errors 0; Last elapsed 0.1194ms
Successfully processed 29050; RPS 7815; Errors 0; Last elapsed 0.1277ms
Successfully processed 36905; RPS 7853; Errors 0; Last elapsed 0.1207ms
Successfully processed 44841; RPS 7935; Errors 0; Last elapsed 0.1261ms
After, "r 1 false", Kestrel, client GC
Successfully processed 13891; RPS 8071; Errors 0; Last elapsed 0.1182ms
Successfully processed 21925; RPS 8032; Errors 0; Last elapsed 0.122ms
Successfully processed 29912; RPS 7985; Errors 0; Last elapsed 0.1411ms
Successfully processed 37975; RPS 8062; Errors 0; Last elapsed 0.1402ms
Successfully processed 46108; RPS 8131; Errors 0; Last elapsed 0.1257ms
Before, "r 100 false", Kestrel, client GC
Successfully processed 79559; RPS 47748; Errors 0; Last elapsed 2.2332ms
Successfully processed 127500; RPS 47927; Errors 0; Last elapsed 2.0807ms
Successfully processed 175485; RPS 47975; Errors 0; Last elapsed 1.9676ms
Successfully processed 223111; RPS 47607; Errors 0; Last elapsed 2.0324ms
Successfully processed 271552; RPS 48428; Errors 0; Last elapsed 1.7378ms
After, "r 100 false", Kestrel, client GC
Successfully processed 157536; RPS 93753; Errors 0; Last elapsed 1.0717ms
Successfully processed 249660; RPS 92102; Errors 0; Last elapsed 0.8804ms
Successfully processed 339840; RPS 90051; Errors 0; Last elapsed 0.5867ms
Successfully processed 432333; RPS 92486; Errors 0; Last elapsed 0.3346ms
Successfully processed 525563; RPS 93227; Errors 0; Last elapsed 1.3646ms
Before, "r 100 false", Kestrel, server GC
Successfully processed 110621; RPS 42381; Errors 0; Last elapsed 2.6166ms
Successfully processed 152748; RPS 42110; Errors 0; Last elapsed 2.2373ms
Successfully processed 195372; RPS 42606; Errors 0; Last elapsed 2.2102ms
Successfully processed 237832; RPS 42451; Errors 0; Last elapsed 2.1775ms
Successfully processed 280273; RPS 42424; Errors 0; Last elapsed 2.2575ms
After, "r 100 false", Kestrel, server GC
Successfully processed 232393; RPS 101467; Errors 0; Last elapsed 1.4772ms
Successfully processed 333288; RPS 100866; Errors 0; Last elapsed 0.4939ms
Successfully processed 434999; RPS 101684; Errors 0; Last elapsed 0.325ms
Successfully processed 537804; RPS 102797; Errors 0; Last elapsed 0.4198ms
Successfully processed 638216; RPS 100403; Errors 0; Last elapsed 1.3467ms
There's still a gap with the C gRPC client, but it's much smaller now. On this benchmark HttpClient is ~18% faster with one concurrent request:
After "r 1 false", Kestrel, server GC
Successfully processed 54093; RPS 8102; Errors 0; Last elapsed 0.116ms
Successfully processed 62136; RPS 8041; Errors 0; Last elapsed 0.1253ms
Successfully processed 70204; RPS 8066; Errors 0; Last elapsed 0.124ms
Successfully processed 78212; RPS 8007; Errors 0; Last elapsed 0.1398ms
Successfully processed 86269; RPS 8055; Errors 0; Last elapsed 0.1541ms
"c 1 false", Kestrel
Successfully processed 12240; RPS 6480; Errors 0; Last elapsed 0.1659ms
Successfully processed 18691; RPS 6449; Errors 0; Last elapsed 0.149ms
Successfully processed 25202; RPS 6510; Errors 0; Last elapsed 0.1602ms
Successfully processed 31615; RPS 6411; Errors 0; Last elapsed 0.148ms
Successfully processed 38051; RPS 6435; Errors 0; Last elapsed 0.1745ms
whereas it's ~18% slower with 100 concurrent requests:
After "r 100 false", Kestrel, server GC
Successfully processed 333318; RPS 103495; Errors 0; Last elapsed 0.4614ms
Successfully processed 438239; RPS 104921; Errors 0; Last elapsed 1.3381ms
Successfully processed 541034; RPS 102781; Errors 0; Last elapsed 0.7129ms
Successfully processed 645850; RPS 104676; Errors 0; Last elapsed 0.5231ms
Successfully processed 749534; RPS 103589; Errors 0; Last elapsed 0.3373ms
"c 100 false", Kestrel
Successfully processed 234346; RPS 126174; Errors 0; Last elapsed 0.8313ms
Successfully processed 361244; RPS 126825; Errors 0; Last elapsed 0.6312ms
Successfully processed 487486; RPS 126207; Errors 0; Last elapsed 0.993ms
Successfully processed 612871; RPS 125318; Errors 0; Last elapsed 0.6118ms
Successfully processed 740618; RPS 127681; Errors 0; Last elapsed 0.5951ms
I expect there's still room for improvements; I only fixed the most egregious issues showing up in traces, and a few smaller things I noticed along the way.
The changes are broken out in individual commits. Quick summaries:
- A "header serialization lock" was being employed and was a significant source of contention. This lock protected a shared header buffer. Rather than maintaining such a buffer on the Http2Connection, we instead just rent a pool buffer to write the headers into, which means we no longer need to serialize access to the shared buffer. The lock has been deleted.
- Use a custom async mutex that significantly reduces contention based on the access patterns employed in Http2Connection, while also enabling us to avoid allocation per operation.
- Changes how we handle fire-and-forget operations to employ a cheaper way of longing failures if they occur.
- Removed several unnecessary async methods, e.g. removed a CopyToAsync wrapping task in SendRequestBodyAsync, combined AcquireWriteLockAsync into StartWriteAsync, removed the GetCancelableWaiterTask method
- Stopped clearing an ArrayPool buffer when it was being returned; the zero'ing was showing up in traces, and everywhere else we were returning we weren't clearing.
- Added commonly employed gRPC response headers to known headers, and added a static lookup for interned content-type strings
- Avoided an unnecessary dictionary lookup