Optimize percent-encoded UTF8 processing in Uri by MihaZupan · Pull Request #32552 · dotnet/runtime

When unescaping percent-encoded non-ASCII input, we currently:

- allocate a byte[] buffer
- decode the entire hex-encoded URI into bytes
- allocate a char[] buffer
- convert the bytes into chars via Utf8Encoding
- analyze both buffers to see whether any characters/bytes were skipped, by converting the chars to Runes, re-encoding those as UTF-8 bytes, and comparing

This PR replaces that with a single pass that writes directly to the ValueStringBuilder, with no temporary buffers.
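As a rough illustration of the single-pass idea, here is a minimal sketch in Python (the PR itself is C#). The names `unescape_single_pass`, `_hex_byte`, and `_utf8_length` are hypothetical, the `out` list merely stands in for the ValueStringBuilder, and the sketch ignores Uri's reserved-character rules. It only shows decoding valid UTF-8 escape sequences as they are encountered and leaving invalid escapes untouched, with their original casing.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def _hex_byte(s, i):
    """Return the byte value of the escape '%XX' at s[i:i+3], or None if malformed."""
    if i + 2 < len(s) and s[i] == "%":
        pair = s[i + 1:i + 3]
        if all(c in HEX_DIGITS for c in pair):
            return int(pair, 16)
    return None


def _utf8_length(lead):
    """Expected sequence length for a UTF-8 lead byte, or 0 if it cannot start one."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def unescape_single_pass(s):
    out = []  # assumption: stands in for the ValueStringBuilder
    i = 0
    while i < len(s):
        first = _hex_byte(s, i)
        if first is None:
            # Not a well-formed escape: copy the character through unchanged.
            out.append(s[i])
            i += 1
            continue
        # Gather only as many escapes as this one sequence needs (at most 4 bytes).
        needed = _utf8_length(first)
        raw = bytes([first])
        j = i + 3
        while len(raw) < needed:
            nxt = _hex_byte(s, j)
            if nxt is None:
                break
            raw += bytes([nxt])
            j += 3
        try:
            if needed == 0 or len(raw) < needed:
                raise ValueError("incomplete or invalid lead byte")
            # Strict decoding rejects overlong forms and surrogates.
            out.append(raw.decode("utf-8"))
            i = j
        except ValueError:
            # Invalid sequence: keep this one escape exactly as written.
            out.append(s[i:i + 3])
            i += 3
    return "".join(out)
```

The only scratch storage here is the few bytes of the sequence currently being decoded, in contrast to the whole-input byte[] and char[] buffers listed above.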

This currently introduces a behavioral change: previously all hex characters were upper-cased, whereas now their input casing is preserved. Restoring the old behavior is a trivial change, at a small perf cost.

I should note that the existing upper-casing of hex only happens for non-ASCII characters. For ASCII-only input the input casing is already preserved, so after this change the behavior is the same for ASCII and non-ASCII.
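To make the casing difference concrete, keeping the old behavior would amount to upper-casing the hex digits of escapes that stay in the output. A minimal sketch in Python, assuming the casing question only concerns escapes that remain percent-encoded; `uppercase_escapes` is a hypothetical helper for illustration, not part of Uri:

```python
import re

# Matches one well-formed percent-escape such as "%e4" or "%E4".
_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def uppercase_escapes(s):
    """Upper-case the hex digits of every well-formed escape, leaving all else as-is."""
    return _ESCAPE.sub(lambda m: m.group(0).upper(), s)
```

With input-casing preserved, an undecodable `%e4%bd` would come out as `%e4%bd`; passing it through a step like this would instead yield `%E4%BD`.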

Perf improves significantly whenever this unescaping path is hit.
(The allocation win applies whenever the input contains at least one non-ASCII char.)

| Method | Toolchain | Mean | Ratio | Gen 0 | Allocated |
|---|---|---:|---:|---:|---:|
| NewUri_Chinese | \clean\CoreRun.exe | 11,644.9 ns | 1.57 | 1.2817 | 5384 B |
| NewUri_Chinese | \new\CoreRun.exe | 7,422.8 ns | 1.00 | 0.2136 | 920 B |
| UnescapeDataString_Chinese | \clean\CoreRun.exe | 9,514.7 ns | 2.24 | 1.0986 | 4664 B |
| UnescapeDataString_Chinese | \new\CoreRun.exe | 4,245.9 ns | 1.00 | 0.0763 | 344 B |
| UnescapeDataString_Chinese_Short | \clean\CoreRun.exe | 1,402.5 ns | 3.03 | 0.1545 | 656 B |
| UnescapeDataString_Chinese_Short | \new\CoreRun.exe | 462.7 ns | 1.00 | 0.0148 | 64 B |
| UnescapeDataString_Emoji | \clean\CoreRun.exe | 53,014.0 ns | 2.62 | 9.5215 | 40072 B |
| UnescapeDataString_Emoji | \new\CoreRun.exe | 20,259.3 ns | 1.00 | 0.9460 | 4024 B |
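For reading the numbers: the Ratio column is the clean mean divided by the new mean, and the allocation saving can be derived the same way from the Allocated column. A small Python sketch that recomputes both from the figures above (`RESULTS` and `summarize` are names introduced here for illustration only):

```python
# (clean mean ns, new mean ns, clean bytes, new bytes), copied from the table above.
RESULTS = {
    "NewUri_Chinese": (11644.9, 7422.8, 5384, 920),
    "UnescapeDataString_Chinese": (9514.7, 4245.9, 4664, 344),
    "UnescapeDataString_Chinese_Short": (1402.5, 462.7, 656, 64),
    "UnescapeDataString_Emoji": (53014.0, 20259.3, 40072, 4024),
}


def summarize(results):
    """Map each benchmark to (clean/new time ratio, percent of allocated bytes saved)."""
    summary = {}
    for name, (clean_ns, new_ns, clean_bytes, new_bytes) in results.items():
        ratio = round(clean_ns / new_ns, 2)
        saved_pct = round(100 * (1 - new_bytes / clean_bytes), 1)
        summary[name] = (ratio, saved_pct)
    return summary


for name, (ratio, saved_pct) in summarize(RESULTS).items():
    print(f"{name}: time ratio {ratio:.2f}, {saved_pct:.1f}% fewer bytes allocated")
```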

Updated benchmarks: #32552 (comment)