[Python-Dev] PEP 393 decode() oddity

martin at v.loewis.de
Sun Mar 25 22:55:13 CEST 2012


Anyone can test:

$ ./python -m timeit -s 'enc = "latin1"; import codecs; d = codecs.getdecoder(enc); x = ("\u0020" * 100000).encode(enc)' 'd(x)'
10000 loops, best of 3: 59.4 usec per loop
$ ./python -m timeit -s 'enc = "latin1"; import codecs; d = codecs.getdecoder(enc); x = ("\u0080" * 100000).encode(enc)' 'd(x)'
10000 loops, best of 3: 28.4 usec per loop

The results are fairly stable (±0.1 µsec) from run to run. It looks like an odd result.

This is not surprising. When decoding Latin-1, the decoder needs to determine whether the string is pure ASCII or not, so that it can pick the right PEP 393 representation. If it is not pure ASCII, it must be all Latin-1 (a single byte can't decode to anything outside Latin-1).

For a pure ASCII string, the decoder scans the entire string looking for a non-ASCII character. Since there is none, the scan cannot stop early: it has to inspect every byte before it knows the string is pure ASCII.

In your example, the first character is already above 127, so the search for the maximum character can stop immediately, and the string needs to be traversed only once (the copy into the result) instead of twice (a full scan plus the copy). That matches the roughly 2:1 ratio of the timings above.
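
To illustrate, here is a minimal Python sketch of that classification step. The real code is C inside CPython and scans a machine word at a time; the function name and structure here are just for exposition:

def classify(data):
    # Return the representation a Latin-1 decode would pick under
    # PEP 393: 127 means pure ASCII, 255 means one-byte Latin-1.
    for byte in data:
        if byte > 127:
            # First non-ASCII byte: the answer is known, stop early.
            return 255
    # Pure ASCII: every byte had to be inspected to prove it.
    return 127

For b"\x80" * 100000 this returns after one byte; for b"\x20" * 100000 it has to walk all 100000 bytes.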

Try '\u0020' * 999999 + '\u0080', which is a non-ASCII string but should still take about the same time as the pure ASCII string, since the scan can only stop at the very last character.
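
Spelled out in the same form as the benchmarks above (timing output omitted), that is:

$ ./python -m timeit -s 'enc = "latin1"; import codecs; d = codecs.getdecoder(enc); x = ("\u0020" * 999999 + "\u0080").encode(enc)' 'd(x)'

Here the maximum-character search terminates only at the final byte, so the full classification pass is paid just as in the pure ASCII case.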

Regards, Martin


