[Python-Dev] PEP 393 Summer of Code Project
Antoine Pitrou solipsis at pitrou.net
Tue Aug 23 13:39:02 CEST 2011
- Previous message: [Python-Dev] PEP 393 Summer of Code Project
- Next message: [Python-Dev] PEP 393 Summer of Code Project
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
>> Is it really extremely common to have strings that are mostly-ASCII but
>> not completely ASCII? I would agree that pure ASCII strings are
>> extremely common.
> Mostly ascii is pretty common for western-european languages (French, for
> instance, is probably 90 to 95% ascii). It's also a risk in english, when
> the writer "correctly" spells foreign words (résumé and the like).
I know - I still question whether it is "extremely common" (so much as to justify a special case).
Well, it's:
- all natural languages based on a variant of the latin alphabet
- but also, XML, JSON, HTML documents...
- and log files...
- in short, any kind of parsable format which is structurally ASCII but can contain arbitrary unicode
So I would say most unicode data out there is mostly-ASCII, even when it has Japanese characters in it. The rationale is that most unicode data processed by computers is structured.
This optimization was done when trying to improve the speed of text I/O.
In the PEP 393 approach, if the string has a two-byte representation, each character needs to be widened to two bytes, and likewise for four bytes. So three separate copies of the unrolled loop would be needed, one for each target size.
Do you have three copies of the UTF-8 decoder already, or do you use a stringlib-like approach?
Regards
Antoine.