Avoid substring allocations in WebUtility.HtmlDecode by juliushardt · Pull Request #29402 · dotnet/corefx

Fixes #13960. In contrast to #27250, this PR focuses exclusively on the substring allocations and not on the StringBuilder/TextWriter part.

There are three substring allocations in WebUtility: L199, L203 and L234.

The first two can be avoided by simply changing a few API calls, since uint.TryParse now accepts ReadOnlySpan<char>. The third involves a bit more work because it requires modifying the HTML entity lookup table. @jamesqo described a possible solution in #13960; however, as he pointed out, it might have an initialization impact. The one implemented here is less flexible but comes without additional initialization costs.

The idea is that all supported HTML entity strings are 8 characters or fewer and ASCII-only, so each character fits in a single byte and the whole string can be squeezed into a UInt64, which serves as the key in the lookup table. Instructions for computing the key for possible future entries are included in the code comments. A disadvantage of this approach is that potential future HTML entity strings of 9 characters or more would require an additional code path.

That being said, HtmlDecode now runs up to 35% faster:

| Method | Mean | Error | StdDev | Gen 0 | Allocated |
|---|---:|---:|---:|---:|---:|
| TestOldImplementation | 2.804 us | 0.0558 us | 0.0933 us | 0.4425 | 1864 B |
| TestNewImplementation | 1.798 us | 0.0354 us | 0.0363 us | 0.2365 | 1000 B |

I think that the performance improvements justify the added complexity and decreased readability.
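For readers who want a feel for the two techniques described above, here is a minimal sketch. It is not the code from this PR: the class name `EntityKeySketch`, the helpers `TryPackKey`, `Pack` and `TryDecodeEntity`, the byte order of the key, and the four-entry table are all assumptions made for illustration. It shows numeric entities parsed straight from a span slice and named entities looked up through a UInt64 key, with no intermediate substring in either path.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Illustrative sketch only; every name here is hypothetical, not taken from the PR.
static class EntityKeySketch
{
    // Packs an ASCII entity name of 1 to 8 characters into a ulong, one byte per
    // character, first character in the most significant used byte (assumed layout).
    public static bool TryPackKey(ReadOnlySpan<char> entity, out ulong key)
    {
        key = 0;
        if (entity.Length == 0 || entity.Length > 8)
            return false; // 9+ characters would need a separate code path

        foreach (char c in entity)
        {
            if (c > 0x7F)
            {
                key = 0;
                return false; // not ASCII, so it cannot fit in one byte
            }
            key = (key << 8) | c;
        }
        return true;
    }

    // Convenience wrapper used to build the table; throws on unsupported names.
    public static ulong Pack(string entity)
    {
        if (!TryPackKey(entity, out ulong key))
            throw new ArgumentException("Entity must be 1-8 ASCII characters.", nameof(entity));
        return key;
    }

    // A tiny subset of the entity table, keyed by the packed name instead of a string.
    private static readonly Dictionary<ulong, char> s_table = new Dictionary<ulong, char>
    {
        [Pack("amp")] = '&',
        [Pack("lt")] = '<',
        [Pack("gt")] = '>',
        [Pack("thetasym")] = '\u03D1', // 8 characters, the longest length that fits
    };

    // Decodes the text between '&' and ';' without allocating a substring.
    public static bool TryDecodeEntity(ReadOnlySpan<char> entity, out uint codePoint)
    {
        codePoint = 0;

        // Numeric entity: parse the slice directly via the span overload of uint.TryParse.
        if (entity.Length > 1 && entity[0] == '#')
        {
            bool hex = entity[1] == 'x' || entity[1] == 'X';
            return hex
                ? uint.TryParse(entity.Slice(2), NumberStyles.AllowHexSpecifier, CultureInfo.InvariantCulture, out codePoint)
                : uint.TryParse(entity.Slice(1), NumberStyles.Integer, CultureInfo.InvariantCulture, out codePoint);
        }

        // Named entity: look up the packed key instead of a freshly allocated string.
        if (TryPackKey(entity, out ulong key) && s_table.TryGetValue(key, out char ch))
        {
            codePoint = ch;
            return true;
        }
        return false;
    }
}
```

A caller would slice the input at the `&` and `;` positions and pass that slice to `TryDecodeEntity`. The trade-off is the one noted above: anything longer than 8 characters is rejected by `TryPackKey` and would need its own path.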