[Python-Dev] PEP 393 Summer of Code Project

"Martin v. Löwis" martin at v.loewis.de
Tue Aug 23 12:20:28 CEST 2011


On 23.08.2011 11:46, Xavier Morel wrote:

> On 2011-08-23, at 10:55, Martin v. Löwis wrote:
>
>>> - “The UTF-8 decoding fast path for ASCII only characters was removed and replaced with a memcpy if the entire string is ASCII.” The fast path would still be useful for mostly-ASCII strings, which are extremely common (unless UTF-8 has become a no-op?).
>>
>> Is it really extremely common to have strings that are mostly-ASCII but not completely ASCII? I would agree that pure ASCII strings are extremely common.
>
> Mostly ASCII is pretty common for Western European languages (French, for instance, is probably 90 to 95% ASCII). It's also a risk in English, when the writer "correctly" spells foreign words (résumé and the like).

I know - I still question whether it is "extremely common" (so much as to justify a special case). I.e. on what application with what dataset would you gain what speedup, at the expense of what amount of extra lines, and potential slow-down for other datasets?

For the record, the optimization in question is the one where it masks a long word with 0x80808080L, to see whether it is completely ASCII, and then copies four characters in an unrolled fashion. It stops doing so when it sees a non-ASCII character, and returns to that mode when it gets to the next aligned memory address that stores only ASCII characters.
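To make the mechanism concrete, here is a minimal sketch of that kind of fast path. It is not the actual CPython code: the function name copy_ascii_prefix is hypothetical, it only handles the leading ASCII run rather than resuming after a non-ASCII character, and it reads each word through memcpy instead of relying on aligned addresses as the real loop does.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not CPython's implementation): copy the leading
   run of ASCII bytes from src to dst and return how many were copied. */
static size_t copy_ascii_prefix(const unsigned char *src, size_t len,
                                unsigned char *dst)
{
    size_t i = 0;

    /* Test four bytes at once: a set high bit in any byte of the word
       means at least one of the four is non-ASCII. */
    while (i + 4 <= len) {
        uint32_t word;
        memcpy(&word, src + i, sizeof word);  /* safe for unaligned src */
        if (word & 0x80808080u)
            break;
        /* Unrolled copy of the four ASCII bytes. */
        dst[i]     = src[i];
        dst[i + 1] = src[i + 1];
        dst[i + 2] = src[i + 2];
        dst[i + 3] = src[i + 3];
        i += 4;
    }

    /* Byte-wise tail: finish the run up to the first non-ASCII byte. */
    while (i < len && src[i] < 0x80) {
        dst[i] = src[i];
        i++;
    }
    return i;
}
```

A full decoder would fall back to the general UTF-8 path at the returned offset and re-enter this loop later.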

In the PEP 393 approach, if the string has a two-byte representation, each character needs to be widened to two bytes, and likewise for four bytes. So three separate copies of the unrolled loop would be needed, one for each target size.
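As a rough illustration of that duplication, the sketch below stamps out one unrolled widening loop per target width. The macro and function names are made up for this example and do not correspond to anything in the PEP 393 branch; the input is assumed to have been checked as ASCII already.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical macro: generates one unrolled copy loop per target
   character type. Each ASCII byte is widened to TYPE on assignment. */
#define DEFINE_WIDEN_ASCII(NAME, TYPE)                                 \
    static void NAME(const unsigned char *src, size_t n, TYPE *dst)    \
    {                                                                  \
        size_t i = 0;                                                  \
        for (; i + 4 <= n; i += 4) {                                   \
            dst[i]     = src[i];                                       \
            dst[i + 1] = src[i + 1];                                   \
            dst[i + 2] = src[i + 2];                                   \
            dst[i + 3] = src[i + 3];                                   \
        }                                                              \
        for (; i < n; i++)                                             \
            dst[i] = src[i];                                           \
    }

/* Three instantiations, one for each representation size. */
DEFINE_WIDEN_ASCII(widen_ascii_1byte, uint8_t)
DEFINE_WIDEN_ASCII(widen_ascii_2byte, uint16_t)
DEFINE_WIDEN_ASCII(widen_ascii_4byte, uint32_t)
```

Even when a macro hides the repetition in the source, the compiled code still contains three loops, which is the cost being weighed here.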

Regards, Martin


