PERF: Improved performance for .str.encode/decode by Winand · Pull Request #13008 · pandas-dev/pandas
I need such a patch to read huge SAS tables encoded in cp1251. I'm not experienced enough to determine whether this patch is really needed here, but it does give a speedup in certain situations.
Optimize string encoding/decoding; keep the default implementation for CPython-optimized encodings (see https://docs.python.org/3.4/library/codecs.html#standard-encodings).
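The idea behind the optimization can be sketched as follows: for encodings that CPython handles with a fast path, call `str.encode` directly; for everything else, look the codec up once with `codecs.getencoder` instead of paying the lookup cost per element. This is a minimal sketch of the technique, not the actual pandas patch; the helper name `encode_values` and the exact contents of the encoding list are assumptions for illustration.

```python
import codecs

# Encodings with a CPython fast path, where calling str.encode directly
# is cheapest (illustrative list; the exact set used by the patch may differ).
_cpython_optimized_encoders = (
    "utf-8", "utf8", "latin-1", "latin1", "iso-8859-1", "mbcs", "ascii",
)

def encode_values(values, encoding, errors="strict"):
    """Encode an iterable of str to bytes, picking the faster code path."""
    if encoding in _cpython_optimized_encoders:
        # str.encode dispatches straight to the optimized codec.
        return [v.encode(encoding, errors) for v in values]
    # Otherwise resolve the encoder once, outside the loop.
    encoder = codecs.getencoder(encoding)
    return [encoder(v, errors)[0] for v in values]
```

The decode path would mirror this with `bytes.decode` and `codecs.getdecoder`.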
string

```python
import pandas as pd

s1 = pd.Series(pd.util.testing.makeStringIndex(k=100000)).astype('category')
encs = 'utf-8', 'utf-16', 'utf-32', 'latin1', 'iso-8859-1', 'mbcs', 'ascii', 'cp1251', 'cp1252'
for enc in encs:
    s2 = s1.str.encode(enc).astype('category')
    print(enc)
    %timeit s1.str.encode(enc)
    %timeit s2.str.decode(enc)
```
unicode

```python
import pandas as pd

s1 = pd.Series(pd.util.testing.makeUnicodeIndex(k=100000)).astype('category')
encs = 'utf-8', 'utf-16', 'utf-32'
for enc in encs:
    s2 = s1.str.encode(enc).astype('category')
    print(enc)
    %timeit s1.str.encode(enc)
    %timeit s2.str.decode(enc)
```