PERF: Improved performance for .str.encode/decode by Winand · Pull Request #13008 · pandas-dev/pandas

I need such a patch to read huge SAS tables encoded in cp1251. I'm not experienced enough to judge whether this change really belongs in pandas, but it does give a noticeable speedup in certain situations.

This optimizes string encoding/decoding while leaving the default implementation in place for encodings that have CPython-optimized codecs
(see https://docs.python.org/3.4/library/codecs.html#standard-encodings).
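The idea can be sketched roughly as follows (a minimal standalone sketch, not the exact patch; `decode_all` is an illustrative helper, and the encoding tuples mirror the CPython fast-path codecs listed in the docs linked above): for optimized encodings, call `bytes.decode` directly, which dispatches straight to the C codec; for everything else, look the codec up once with `codecs.getdecoder` instead of once per element.

```python
import codecs

# Encodings with CPython-optimized direct paths (illustrative list;
# see https://docs.python.org/3/library/codecs.html#standard-encodings)
_cpython_optimized_encoders = (
    "utf-8", "utf8", "latin-1", "latin1", "iso-8859-1", "mbcs", "ascii",
)
_cpython_optimized_decoders = _cpython_optimized_encoders + ("utf-16", "utf-32")


def decode_all(values, encoding, errors="strict"):
    """Decode a sequence of bytes objects to str."""
    if encoding in _cpython_optimized_decoders:
        # bytes.decode goes straight to the optimized C codec
        f = lambda x: x.decode(encoding, errors)
    else:
        # look the codec up once, not once per element
        decoder = codecs.getdecoder(encoding)
        f = lambda x: decoder(x, errors)[0]
    return [f(x) for x in values]
```

For example, `decode_all([b"\xef\xf0\xe8\xe2\xe5\xf2"], "cp1251")` takes the cached-lookup branch, while `decode_all([b"abc"], "utf-8")` takes the direct `bytes.decode` path.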

string

import pandas as pd
s1 = pd.Series(pd.util.testing.makeStringIndex(k=100000)).astype('category')
encs = 'utf-8', 'utf-16', 'utf-32', 'latin1', 'iso-8859-1', 'mbcs', 'ascii', 'cp1251', 'cp1252'
for enc in encs:
    s2 = s1.str.encode(enc).astype('category')
    print(enc)
    %timeit s1.str.encode(enc)
    %timeit s2.str.decode(enc)

unicode

import pandas as pd
s1 = pd.Series(pd.util.testing.makeUnicodeIndex(k=100000)).astype('category')
encs = 'utf-8', 'utf-16', 'utf-32'
for enc in encs:
    s2 = s1.str.encode(enc).astype('category')
    print(enc)
    %timeit s1.str.encode(enc)
    %timeit s2.str.decode(enc)

[Image: %timeit benchmark results, "10 loops, best of 3: xxx ms per loop"]