PERF: use StringHashTable for strings by jreback · Pull Request #14859 · pandas-dev/pandas

xref #13745

Provides a modest speedup for all string hashing. More importantly, it releases the GIL in more operations where that is possible (mainly factorize).

This can easily be extended to .value_counts() and .duplicated() (for strings); that belongs in a new issue.

In [9]: np.random.seed(1234)

In [10]: strings = tm.makeStringIndex(1000000)

In [11]: def f():
    ...:     for i in range(2):
    ...:         pd.factorize(strings)
    ...:         

In [12]: @tm.test_parallel(num_threads=2)
    ...: def g():
    ...:     pd.factorize(strings)
    ...:     

In [13]: %timeit f()
1 loop, best of 3: 685 ms per loop

In [14]: %timeit g()
1 loop, best of 3: 446 ms per loop

In [15]: strings = strings.take(np.random.randint(0,1000,size=len(strings)))

In [16]: strings.nunique()
Out[16]: 1000

In [17]: %timeit f()
1 loop, best of 3: 222 ms per loop

In [18]: %timeit g()
10 loops, best of 3: 190 ms per loop
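
The session above relies on pandas' internal test helpers (`tm.makeStringIndex`, `tm.test_parallel`), which may not be available in every pandas version. As a rough sketch of the same comparison using only public APIs, the snippet below runs `pd.factorize` twice serially and twice in two threads. The names `factorize_serial`, `factorize_threaded`, and `timed` are hypothetical helpers introduced for illustration, the data size is smaller than in the benchmark, and timings will vary by machine and pandas version, so no expected numbers are shown.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

rng = np.random.default_rng(1234)

# Low-cardinality case: 1000 distinct strings sampled at random.
pool_of_keys = np.array([f"key_{i:04d}" for i in range(1000)], dtype=object)
strings = pd.Index(pool_of_keys[rng.integers(0, 1000, size=200_000)])


def factorize_serial(n):
    """Call pd.factorize n times, one after another."""
    return [pd.factorize(strings) for _ in range(n)]


def factorize_threaded(n):
    """Call pd.factorize n times, each call in its own thread."""
    with ThreadPoolExecutor(max_workers=n) as executor:
        futures = [executor.submit(pd.factorize, strings) for _ in range(n)]
        return [future.result() for future in futures]


def timed(fn, n):
    """Return (elapsed seconds, result) for fn(n)."""
    start = time.perf_counter()
    result = fn(n)
    return time.perf_counter() - start, result


serial_seconds, serial_result = timed(factorize_serial, 2)
threaded_seconds, threaded_result = timed(factorize_threaded, 2)

print(f"serial:   {serial_seconds:.3f} s")
print(f"threaded: {threaded_seconds:.3f} s")
```

If the hashing path releases the GIL, the threaded run can overlap the two calls; if it does not, the two timings should be roughly equal.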