[Python-Dev] PEP 393 Summer of Code Project

Antoine Pitrou solipsis at pitrou.net
Tue Aug 23 13:39:02 CEST 2011


>> Is it really extremely common to have strings that are mostly-ASCII but
>> not completely ASCII? I would agree that pure ASCII strings are
>> extremely common.
> Mostly ascii is pretty common for western-european languages (French, for
> instance, is probably 90 to 95% ascii). It's also a risk in english, when
> the writer "correctly" spells foreign words (résumé and the like).

I know - I still question whether it is "extremely common" (so much as to justify a special case).

Well, it's:

So I would say most unicode data out there is mostly-ASCII, even when it has Japanese characters in it. The rationale is that most unicode data processed by computers is structured.
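As a rough illustration of the "mostly-ASCII" point, here is a minimal sketch that measures the share of ASCII characters in a piece of text. The helper name ascii_fraction and the sample markup are made up for this example; they are not taken from any real data set or from CPython.

```python
def ascii_fraction(text: str) -> float:
    """Return the share of characters in `text` that are ASCII (below U+0080)."""
    if not text:
        return 1.0  # assumption: treat the empty string as trivially all-ASCII
    ascii_count = sum(1 for ch in text if ord(ch) < 0x80)
    return ascii_count / len(text)


# Hypothetical structured snippet: the markup is ASCII, the payload is Japanese.
sample = '<p class="x">日本語</p>'
print(ascii_fraction(sample))    # markup dominates, so the ratio stays high
print(ascii_fraction("résumé"))  # two non-ASCII characters out of six
```

Real ratios depend entirely on the data being processed, so this only shows how one might check the claim on a given corpus.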

This optimization was done when trying to improve the speed of text I/O.

In the PEP 393 approach, if the string has a two-byte representation, each character needs to be widened to two bytes, and likewise for four bytes. So three separate copies of the unrolled loop would be needed, one for each target size.

Do you have three copies of the UTF-8 decoder already, or do you use a stringlib-like approach?
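
To make the widening issue concrete, here is a small Python sketch of the idea, not the actual C decoder. It assumes the PEP 393 rule that the per-character width (1, 2 or 4 bytes) is chosen from the highest code point in the string. The names char_width and widen_ascii are hypothetical, and a single function parameterised on the width stands in for what would be three specialised loops in C.

```python
import sys


def char_width(text: str) -> int:
    """Bytes per character that a PEP 393-style representation would need."""
    highest = max(map(ord, text), default=0)
    if highest < 0x100:
        return 1  # Latin-1 range, one byte per character
    if highest < 0x10000:
        return 2  # Basic Multilingual Plane, two bytes per character
    return 4      # everything else, four bytes per character


def widen_ascii(chunk: bytes, width: int) -> bytes:
    """Copy a run of ASCII bytes into a buffer using `width` bytes per character."""
    out = bytearray()
    for byte in chunk:
        # Native byte order, as an in-memory representation would use.
        out += byte.to_bytes(width, sys.byteorder)
    return bytes(out)


print(char_width("résumé"), char_width("日本語"), char_width("\U0001F600"))
print(len(widen_ascii(b"abc", 1)), len(widen_ascii(b"abc", 2)), len(widen_ascii(b"abc", 4)))
```

In C the width cannot be a cheap run-time parameter inside an unrolled loop, which is why the choice is between three hand-written copies and a template-style (stringlib-like) expansion.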

Regards

Antoine.


