Optimize percent-encoded UTF8 processing in Uri by MihaZupan · Pull Request #32552 · dotnet/runtime

When unescaping percent-encoded non-ASCII characters, we currently:

  1. allocate a byte[] buffer
  2. decode the entire hex-encoded URI into bytes
  3. allocate a char[] buffer
  4. convert the bytes into chars via UTF8Encoding
  5. check whether any characters/bytes were skipped, by converting the chars to Runes, re-encoding those as UTF-8 bytes, and comparing the result against the original bytes
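
To make the cost of those five steps concrete, here is a rough Python sketch of the same shape of pipeline. It is an illustration only, not the .NET code: the function name `unescape_with_temp_buffers` is hypothetical, the input is assumed to consist solely of `%XX` triplets, and Python's `errors="ignore"` stands in for the decoder skipping invalid bytes.

```python
from typing import Tuple


def unescape_with_temp_buffers(text: str) -> Tuple[str, bool]:
    """Decode a run of '%XX' triplets using two whole-input temporary buffers.

    Returns the decoded string and a flag saying whether any bytes were skipped.
    Assumes (for this sketch) that the input contains nothing but '%XX' triplets.
    """
    if len(text) % 3 != 0 or any(text[k] != "%" for k in range(0, len(text), 3)):
        raise ValueError("expected only %XX triplets")

    # Steps 1-2: allocate a byte buffer and decode every hex pair into it.
    raw = bytearray(len(text) // 3)
    for k in range(len(raw)):
        raw[k] = int(text[3 * k + 1 : 3 * k + 3], 16)

    # Steps 3-4: a second allocation holds the decoded characters.
    # Invalid bytes are silently dropped here.
    chars = raw.decode("utf-8", errors="ignore")

    # Step 5: re-encode the characters and compare against the original bytes
    # to find out whether anything was dropped along the way.
    skipped = chars.encode("utf-8") != bytes(raw)
    return chars, skipped
```

Every call pays for a byte buffer, a character buffer, and a re-encoded copy, each proportional to the input length.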

This PR replaces that with a single pass that writes directly to the ValueStringBuilder, without temporary buffers.
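
As a rough illustration of the single-pass idea (not the actual implementation in this PR), the Python sketch below decodes each UTF-8 sequence as it is encountered and appends straight to the output. The names `unescape_single_pass`, `_read_escaped_byte`, and `_sequence_length` are hypothetical, the `out` list stands in for the ValueStringBuilder, and copying undecodable triplets verbatim is an assumption of the sketch. It also ignores Uri-specific rules such as reserved characters.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def _read_escaped_byte(text: str, i: int) -> int:
    """Return the byte encoded by '%XX' at text[i:i+3], or -1 if there is none."""
    if (i + 2 < len(text) and text[i] == "%"
            and text[i + 1] in HEX_DIGITS and text[i + 2] in HEX_DIGITS):
        return int(text[i + 1 : i + 3], 16)
    return -1


def _sequence_length(lead: int) -> int:
    """Expected UTF-8 sequence length for a lead byte, or 0 if it cannot lead."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def unescape_single_pass(text: str) -> str:
    out = []  # stands in for the ValueStringBuilder
    i = 0
    while i < len(text):
        value = _read_escaped_byte(text, i)
        if value < 0:            # not an escape: copy the character as-is
            out.append(text[i])
            i += 1
            continue
        if value < 0x80:         # escaped ASCII decodes to a single character
            out.append(chr(value))
            i += 3
            continue

        # Non-ASCII lead byte: gather only the continuation bytes it needs.
        length = _sequence_length(value)
        seq = bytearray([value])  # at most 4 bytes, not an input-sized buffer
        j = i + 3
        while len(seq) < length:
            nxt = _read_escaped_byte(text, j)
            if nxt < 0 or (nxt & 0xC0) != 0x80:
                break
            seq.append(nxt)
            j += 3

        decoded = None
        if length and len(seq) == length:
            try:
                decoded = seq.decode("utf-8")  # rejects overlongs and surrogates
            except UnicodeDecodeError:
                decoded = None

        if decoded is None:
            # Assumption: leave the triplet escaped, keeping its input casing.
            out.append(text[i : i + 3])
            i += 3
        else:
            out.append(decoded)
            i = j
    return "".join(out)
```

For example, `unescape_single_pass("%E4%B8%AD%E6%96%87")` yields `"中文"`, while an undecodable `"%ff"` comes back unchanged, lower-case hex included.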

This currently introduces a behavioral change: previously all hex characters were upper-cased, whereas now their input casing is preserved. Restoring the old behavior is a trivial change, at a small perf cost.

I should note that the existing upper-casing of hex is only done for non-ASCII characters. If the input contains only ASCII, the input casing is already preserved, so after this change the behavior is the same for ASCII and non-ASCII.

Perf improves significantly whenever this unescaping path is hit.
(The allocation win applies whenever the input contains at least one non-ASCII char.)

| Method | Toolchain | Mean | Ratio | Gen 0 | Allocated |
|---|---|---:|---:|---:|---:|
| NewUri_Chinese | \clean\CoreRun.exe | 11,644.9 ns | 1.57 | 1.2817 | 5384 B |
| NewUri_Chinese | \new\CoreRun.exe | 7,422.8 ns | 1.00 | 0.2136 | 920 B |
| UnescapeDataString_Chinese | \clean\CoreRun.exe | 9,514.7 ns | 2.24 | 1.0986 | 4664 B |
| UnescapeDataString_Chinese | \new\CoreRun.exe | 4,245.9 ns | 1.00 | 0.0763 | 344 B |
| UnescapeDataString_Chinese_Short | \clean\CoreRun.exe | 1,402.5 ns | 3.03 | 0.1545 | 656 B |
| UnescapeDataString_Chinese_Short | \new\CoreRun.exe | 462.7 ns | 1.00 | 0.0148 | 64 B |
| UnescapeDataString_Emoji | \clean\CoreRun.exe | 53,014.0 ns | 2.62 | 9.5215 | 40072 B |
| UnescapeDataString_Emoji | \new\CoreRun.exe | 20,259.3 ns | 1.00 | 0.9460 | 4024 B |

Updated benchmarks: #32552 (comment)