Update Unicode data and optimize CharUnicodeInfo indexes by pentp · Pull Request #20983 · dotnet/coreclr
Category data is updated to the latest Unicode 11 data (#20589 somehow missed some changes).
I converted the Perl script from #20864 (comment) to C# and included it in the Tools folder.
Changed the index structure and encoding, reducing index size from 37KB to 23KB.
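To illustrate why the index layout affects table size, here is a minimal sketch of a two-level index that stores each distinct block of values only once. The 8-bit block size, the class name TwoLevelIndexSketch, and the Build/Lookup helpers are all assumptions made for this example; they do not reproduce the actual structure or encoding used in this PR.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch, not the layout used by CharUnicodeInfo.
static class TwoLevelIndexSketch
{
    const int BlockBits = 8;               // assumed block size: 256 entries
    const int BlockSize = 1 << BlockBits;

    // Splits a flat per-code-point table into blocks and stores each
    // distinct block once in level2; level1 maps block number -> block id.
    // flat.Length must be a multiple of BlockSize.
    public static (ushort[] Level1, byte[] Level2) Build(byte[] flat)
    {
        var level1 = new ushort[flat.Length / BlockSize];
        var level2 = new List<byte>();
        var seen = new Dictionary<string, ushort>();

        for (int block = 0; block < level1.Length; block++)
        {
            // Use the block's contents (as text) as the dictionary key.
            string key = Convert.ToBase64String(flat, block * BlockSize, BlockSize);
            if (!seen.TryGetValue(key, out ushort id))
            {
                id = (ushort)seen.Count;
                seen.Add(key, id);
                level2.AddRange(new ArraySegment<byte>(flat, block * BlockSize, BlockSize));
            }
            level1[block] = id;
        }
        return (level1, level2.ToArray());
    }

    // Two array reads: the high bits pick the block, the low bits pick the entry.
    public static byte Lookup(ushort[] level1, byte[] level2, int codePoint)
        => level2[(level1[codePoint >> BlockBits] << BlockBits) | (codePoint & (BlockSize - 1))];
}
```

Because many blocks of code points share identical values (for example, long runs of characters in the same category), the deduplicated tables end up smaller than the flat one. Changing the number of levels or the bits per level shifts that trade-off between table size and lookup cost.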
I verified that all data returned for every Unicode character is identical to the previous implementation using the latest Unicode data.
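A check of this kind can be sketched as an exhaustive loop over all code points that compares two lookup functions. The names ExhaustiveCheckSketch, FindFirstMismatch, and FrameworkCategory are hypothetical, and the sketch compares only the category, whereas the verification described above covered all returned data.

```csharp
using System;
using System.Globalization;

// Hypothetical sketch of an exhaustive old-vs-new comparison.
static class ExhaustiveCheckSketch
{
    // Returns the first code point where the two lookups disagree,
    // or -1 if they agree for every code point from U+0000 to U+10FFFF.
    public static int FindFirstMismatch(
        Func<int, UnicodeCategory> reference,
        Func<int, UnicodeCategory> candidate)
    {
        for (int cp = 0; cp <= 0x10FFFF; cp++)
        {
            if (reference(cp) != candidate(cp))
                return cp;
        }
        return -1;
    }

    // Category as reported by the public API. Surrogate code points cannot
    // be passed to char.ConvertFromUtf32, so they use the char overload.
    public static UnicodeCategory FrameworkCategory(int cp)
        => cp >= 0xD800 && cp <= 0xDFFF
            ? CharUnicodeInfo.GetUnicodeCategory((char)cp)
            : CharUnicodeInfo.GetUnicodeCategory(char.ConvertFromUtf32(cp), 0);
}
```

In practice, reference would wrap the previous implementation and candidate the new one, both built against the same Unicode data files.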
Category data lookup times improved from 3.4ns to 2.4ns (1.4x faster) for char-based lookups, and from 5.7ns to 4.4ns (1.3x faster) for string-based lookups. I tried this with different kinds of text.
For comparison, char.GetUnicodeCategory for Latin1 data takes 1.0ns.