PERF: use StringHashTable for strings by jreback · Pull Request #14859 · pandas-dev/pandas
xref #13745
Provides a modest speedup for all string hashing. The key point is that it releases the GIL on more operations where this is possible (mainly factorize).
This can easily be extended to .value_counts() and .duplicated() (for strings); that is left for a new issue.
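For readers unfamiliar with what factorize computes, here is a minimal pure-Python sketch of hash-table factorization. It is illustrative only: the name factorize_strings is hypothetical, and a plain dict stands in for the hash table. It does not reproduce the Cython StringHashTable from this PR, and being pure Python it does not release the GIL.

```python
def factorize_strings(values):
    """Return (codes, uniques) for a sequence of strings.

    Hypothetical helper: a dict plays the role of the hash table,
    mapping each distinct string to the integer code assigned at its
    first appearance.
    """
    table = {}    # string -> integer code
    codes = []    # one code per input element
    uniques = []  # distinct strings, in order of first appearance
    for value in values:
        code = table.get(value)
        if code is None:
            # First time this string is seen: assign the next code.
            code = len(uniques)
            table[value] = code
            uniques.append(value)
        codes.append(code)
    return codes, uniques


codes, uniques = factorize_strings(["b", "a", "b", "c", "a"])
print(codes)    # [0, 1, 0, 2, 1]
print(uniques)  # ['b', 'a', 'c']
```

Each element costs one hash lookup, so string hashing dominates the loop, which is why a faster string hash table helps factorize directly.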
In [9]: np.random.seed(1234)
In [10]: strings = tm.makeStringIndex(1000000)
In [11]: def f():
    ...:     for i in range(2):
    ...:         pd.factorize(strings)
    ...:
In [12]: @tm.test_parallel(num_threads=2)
    ...: def g():
    ...:     pd.factorize(strings)
    ...:
In [13]: %timeit f()
1 loop, best of 3: 685 ms per loop
In [14]: %timeit g()
1 loop, best of 3: 446 ms per loop
In [15]: strings = strings.take(np.random.randint(0,1000,size=len(strings)))
In [16]: strings.nunique()
Out[16]: 1000
In [17]: %timeit f()
1 loop, best of 3: 222 ms per loop
In [18]: %timeit g()
10 loops, best of 3: 190 ms per loop
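The comparison above relies on the pandas testing helpers tm.makeStringIndex and tm.test_parallel. As a rough sketch of the same serial-versus-threaded comparison using only public APIs, something like the following could be used. The string pool, the sizes, and the helper names serial, threaded and best_of are assumptions made for illustration, and the timings will depend on the pandas version, the string dtype in use, and the machine.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

rng = np.random.default_rng(1234)

# Assumed stand-in for tm.makeStringIndex: 200,000 strings drawn
# from a pool of 1,000 distinct values (sizes chosen arbitrarily).
pool = np.array([f"str_{i:05d}" for i in range(1000)], dtype=object)
strings = pd.Index(pool[rng.integers(0, 1000, size=200_000)])


def serial(n=2):
    """Factorize n times, one after another, in a single thread."""
    for _ in range(n):
        pd.factorize(strings)


def threaded(n=2):
    """Factorize n times concurrently, one call per thread."""
    with ThreadPoolExecutor(max_workers=n) as executor:
        futures = [executor.submit(pd.factorize, strings) for _ in range(n)]
        for future in futures:
            future.result()  # re-raise any exception from the worker


def best_of(func, repeat=3):
    """Return the fastest wall-clock time over `repeat` runs, in seconds."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


print(f"serial:   {best_of(serial) * 1e3:.0f} ms")
print(f"threaded: {best_of(threaded) * 1e3:.0f} ms")
```

If the hashing path releases the GIL, the threaded run should approach the time of a single factorize call; if it does not, the two timings should be roughly equal.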