[Python-Dev] PEP 393 decode() oddity
martin at v.loewis.de
Sun Mar 25 22:55:13 CEST 2012
Anyone can test.
$ ./python -m timeit -s 'enc = "latin1"; import codecs; d = codecs.getdecoder(enc); x = ("\u0020" * 100000).encode(enc)' 'd(x)'
10000 loops, best of 3: 59.4 usec per loop
$ ./python -m timeit -s 'enc = "latin1"; import codecs; d = codecs.getdecoder(enc); x = ("\u0080" * 100000).encode(enc)' 'd(x)'
10000 loops, best of 3: 28.4 usec per loop
The results are fairly stable (±0.1 µsec) from run to run. It looks odd.
This is not surprising. When decoding Latin-1, the decoder needs to determine whether the string is pure ASCII or not, so that it can pick the most compact PEP 393 representation. If it is not pure ASCII, the result must still be all Latin-1 (every byte decodes to a code point below 256, so it can't be non-Latin-1).
For a pure ASCII string, the search for a non-ASCII character cannot stop early: only after inspecting every single byte does the decoder know that there is none.
In your example, the first character is already above 127, so the search for the maximum character can stop immediately, and the decoder needs to scan the string only once.
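In pure-Python terms, the maximum-character search behaves roughly like this (a sketch of the idea, not CPython's actual C code):

```python
def find_maxchar(data: bytes) -> int:
    """Rough model of the maximum-character search done when
    decoding Latin-1 under PEP 393.

    Every Latin-1 byte maps to the code point of the same value,
    so the only question is whether any byte exceeds 127.  The scan
    can stop at the first such byte; a pure-ASCII input forces a
    full scan before the compact ASCII representation can be chosen.
    """
    for b in data:
        if b > 127:
            return 255   # non-ASCII found: stop early, use 1-byte repr
    return 127           # inspected every byte: pure ASCII
```

With `b"\x80" * 100000` the loop returns on the very first byte, while `b"\x20" * 100000` forces 100000 comparisons before the decoder can commit to the ASCII representation.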
Try '\u0020' * 999999 + '\u0080', which is a non-ASCII string but still takes the same time as the pure ASCII string.
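This can also be checked from within Python; a small illustrative script (the exact numbers are machine-dependent) timing all three cases side by side:

```python
import codecs
import timeit

d = codecs.getdecoder("latin1")

# Pure ASCII: the max-char scan must inspect every byte.
ascii_data = ("\u0020" * 100000).encode("latin1")
# Non-ASCII only at the end: the scan also runs to the end,
# so this should time about the same as the pure-ASCII case.
late_data = ("\u0020" * 99999 + "\u0080").encode("latin1")
# Non-ASCII up front: the scan stops at the first byte.
early_data = ("\u0080" * 100000).encode("latin1")

for name, data in [("ascii", ascii_data), ("late", late_data),
                   ("early", early_data)]:
    t = timeit.timeit(lambda: d(data), number=1000)
    print(f"{name}: {t / 1000 * 1e6:.1f} usec per call")
```

The "ascii" and "late" cases should come out close to each other, with "early" noticeably faster.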
Regards, Martin