ENH/PERF: use mask in factorize for nullable dtypes by jorisvandenbossche · Pull Request #33064 · pandas-dev/pandas

This adds the option to pass a mask to HashTable.factorize (the implementation itself lives in HashTable._unique).

That allows e.g. IntegerArray to use its mask and avoid converting to object or float dtype (which also avoids a copy of the data), and gives a nice speed-up: instead of checking each value against na_value or NaN, only the mask is checked.

Small example (for easier testing I added a factorize2 that uses the mask, alongside the existing factorize; that is of course not meant to stay):

In [1]: a = np.random.randint(0, 10000, 100000)
   ...: mask = np.zeros(len(a), dtype=bool)
   ...: mask[np.random.randint(0, len(a), 1000)] = True
   ...: arr = pd.arrays.IntegerArray(a, mask)

In [2]: arr.factorize()
Out[2]:
(array([   0,    1,    2, ..., 3726, 2673, 5358]),
 [2532, 8355, 1885, 2253, 8517, 5615, 3146, 386, 9183, 6497,
  ...
  794, 8600, 823, 1541, 4373, 1205, 9605, 4576, 443, 2070]
 Length: 10000, dtype: Int64)

using the mask gives the same result

In [3]: arr.factorize2()
Out[3]:
(array([   0,    1,    2, ..., 3726, 2673, 5358]),
 [2532, 8355, 1885, 2253, 8517, 5615, 3146, 386, 9183, 6497,
  ...
  794, 8600, 823, 1541, 4373, 1205, 9605, 4576, 443, 2070]
 Length: 10000, dtype: Int64)
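The mask-based approach can be sketched in pure Python as follows. This is a simplified illustration only, not the actual Cython hash-table implementation; the function name and signature are made up for this sketch:

```python
import numpy as np

def factorize_with_mask(values, mask, na_sentinel=-1):
    # Simplified sketch: map each distinct non-masked value to a code,
    # and assign the sentinel wherever the mask marks a missing value.
    # No comparison of the values against na_value or NaN is needed.
    table = {}
    codes = np.empty(len(values), dtype=np.intp)
    uniques = []
    for i, (val, is_na) in enumerate(zip(values, mask)):
        if is_na:
            codes[i] = na_sentinel
            continue
        if val not in table:
            table[val] = len(uniques)
            uniques.append(val)
        codes[i] = table[val]
    return codes, np.asarray(uniques)

values = np.array([10, 20, 10, 30, 20])
mask = np.array([False, False, True, False, False])
codes, uniques = factorize_with_mask(values, mask)
# codes -> [0, 1, -1, 2, 1]; uniques -> [10, 20, 30]
```

Because missing-ness is decided purely by the boolean mask, the integer data never has to be converted or copied.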

on master we actually convert to object dtype with NaNs, which is very slow

In [4]: %timeit arr.factorize()
12.3 ms ± 849 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
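For reference, the object-dtype path amounts to roughly the following (`to_object_with_nan` is a hypothetical helper written for this sketch, not pandas code):

```python
import numpy as np

# Sketch of the object-dtype fallback: materialize masked positions as
# np.nan in an object array. This copies the data (every element becomes
# a boxed Python object) and forces element-by-element hashing of Python
# objects in the hash table, which is why it is slow.
def to_object_with_nan(values, mask):
    out = values.astype(object)  # copy: each element boxed as a Python int
    out[mask] = np.nan
    return out

obj = to_object_with_nan(np.array([1, 2, 3]), np.array([False, True, False]))
# obj -> array([1, nan, 3], dtype=object)
```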

on this PR, for comparison, I changed this to convert to float dtype with NaNs

(however this is not robust for large ints)
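To illustrate why the float conversion is not robust: float64 has a 53-bit significand, so distinct int64 values above 2**53 can collapse to the same float, which would make factorize wrongly treat them as equal:

```python
import numpy as np

# 2**53 and 2**53 + 1 are distinct as int64, but round to the same
# float64 value when cast, since float64 cannot represent 2**53 + 1.
a = np.array([2**53, 2**53 + 1], dtype=np.int64)
as_float = a.astype(np.float64)

distinct_as_int = a[0] != a[1]               # True
collapsed_as_float = as_float[0] == as_float[1]  # True
```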

In [4]: %timeit arr.factorize()
2.83 ms ± 51.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

on this PR using the mask is faster than both approaches above

In [5]: %timeit arr.factorize2()
771 µs ± 13.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Since this adds an extra if branch to factorize, it could impact existing cases that don't use the mask, but I didn't see any noticeable effect (after switching between branches and rebuilding a few times):

In [1]: a = np.random.randint(0, 10000, 100000)
   ...: mask = np.zeros(len(a), dtype=bool)
   ...: mask[np.random.randint(0, len(a), 1000)] = True
   ...:
   ...: a2 = a.copy().astype(float)
   ...: a2[mask] = np.nan

factorize on an integer array

In [4]: %timeit pd.factorize(a)
769 µs ± 31.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  <- PR
745 µs ± 39.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  <- master
779 µs ± 60.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  <- PR
759 µs ± 50 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  <- master

factorize on a float array with nans

In [5]: %timeit pd.factorize(a2)
2.12 ms ± 18.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <- PR
2.2 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <- master
2.2 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <- PR
2.13 ms ± 47.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)  <- master

So for the int array there seems to be a small slowdown (as expected), although it is still within the measurement noise. I would also say it is small enough (around 2-3%) to be acceptable, given the benefit of being able to use the mask for the nullable cases.