PERF: use StringHashTable for strings by jreback · Pull Request #14859 · pandas-dev/pandas
xref #13745
Provides a modest speedup for all string hashing. More importantly, it releases the GIL on more operations where this is possible (mainly factorize).
This can easily be extended to .value_counts() and .duplicated() (for strings); that belongs in a new issue.
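For context, a minimal sketch of what `pd.factorize` does with strings: each distinct value gets an integer code in order of first appearance, which is the kind of lookup a string hash table serves. The sample values are made up for illustration and are not part of the benchmark.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, chosen only to illustrate the codes/uniques pair.
values = np.array(["b", "a", "b", "c", "a"], dtype=object)

# factorize returns integer codes plus the distinct values in first-seen order.
codes, uniques = pd.factorize(values)

# The codes index into uniques, so this reconstructs the original array.
roundtrip = uniques[codes]

print(codes)    # [0 1 0 2 1]
print(uniques)  # ['b' 'a' 'c']
```

The cost of this operation is dominated by hashing each string and probing the table, which is why the hash table implementation matters for the timings that follow.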
In [9]: np.random.seed(1234)
In [10]: strings = tm.makeStringIndex(1000000)
In [11]: def f():
    ...:     for i in range(2):
    ...:         pd.factorize(strings)
    ...:
In [12]: @tm.test_parallel(num_threads=2)
    ...: def g():
    ...:     pd.factorize(strings)
    ...:
In [13]: %timeit f()
1 loop, best of 3: 685 ms per loop
In [14]: %timeit g()
1 loop, best of 3: 446 ms per loop
In [15]: strings = strings.take(np.random.randint(0,1000,size=len(strings)))
In [16]: strings.nunique()
Out[16]: 1000
In [17]: %timeit f()
1 loop, best of 3: 222 ms per loop
In [18]: %timeit g()
10 loops, best of 3: 190 ms per loop
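The comparison above relies on pandas test helpers (`tm.makeStringIndex`, `tm.test_parallel`) that may not be available in every pandas version. A rough equivalent using only `threading` might look like the sketch below. The data set is a smaller synthetic stand-in, and the names `run_serial`, `run_threaded`, and `timed` are hypothetical helpers, not pandas API. Whether the threaded run is actually faster depends on the pandas build and on whether the string path releases the GIL.

```python
import threading
import time

import numpy as np
import pandas as pd

# Assumption: a smaller synthetic stand-in for the benchmark's string index,
# with at most 1000 distinct values as in the second half of the session.
rng = np.random.default_rng(1234)
strings = pd.Index(
    [f"key_{i}" for i in rng.integers(0, 1000, size=100_000)], dtype=object
)


def run_serial(n_calls=2):
    """Call pd.factorize n_calls times in a single thread (like f())."""
    return [pd.factorize(strings) for _ in range(n_calls)]


def run_threaded(n_threads=2):
    """Call pd.factorize once per thread (like g()), keeping each result."""
    results = [None] * n_threads

    def worker(slot):
        results[slot] = pd.factorize(strings)

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


def timed(func):
    """Return func's result together with its wall-clock duration in seconds."""
    start = time.perf_counter()
    out = func()
    return out, time.perf_counter() - start


serial_results, serial_seconds = timed(run_serial)
threaded_results, threaded_seconds = timed(run_threaded)
print(f"serial: {serial_seconds:.3f}s, threaded: {threaded_seconds:.3f}s")
```

Both variants do the same total work (two factorize calls), so any gap between the two timings reflects how much of the call runs without holding the GIL. A single run is noisy; `%timeit` or repeated measurements give a fairer picture.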