ENH/PERF: use mask in factorize for nullable dtypes by jorisvandenbossche · Pull Request #33064 · pandas-dev/pandas
This adds the option to use a mask in HashTable.factorize (the implementation itself is in HashTable._unique).
That allows e.g. IntegerArray to use its mask and avoid the need to convert to object dtype or float dtype (which also avoids a copy of the data), and gives a nice speed-up: instead of checking each value for na_value or NaN, we only check the mask.
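The idea behind the mask-based path can be sketched in pure Python/NumPy. This is a simplified illustration of the technique, not the actual hash-table code in this PR; masked_factorize is a hypothetical helper written only for this example:

```python
import numpy as np

def masked_factorize(values, mask):
    """Sketch of factorizing with a boolean mask: masked positions get
    code -1 directly, so no sentinel comparison (na_value / NaN) and no
    dtype conversion of the data is needed."""
    codes = np.empty(len(values), dtype=np.intp)
    uniques = []
    table = {}  # value -> code
    for i, (val, is_na) in enumerate(zip(values, mask)):
        if is_na:
            codes[i] = -1  # missing: decided by the mask alone
            continue
        code = table.get(val)
        if code is None:
            code = len(uniques)
            table[val] = code
            uniques.append(val)
        codes[i] = code
    return codes, np.asarray(uniques)

codes, uniques = masked_factorize(
    np.array([1, 2, 1, 3]), np.array([False, True, False, False])
)
# codes -> [0, -1, 0, 1], uniques -> [1, 3]
```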
Small example (for easier testing I just added a factorize2 method that uses the mask, alongside the existing factorize; factorize2 is of course not meant to stay):
In [1]: a = np.random.randint(0, 10000, 100000)
...: mask = np.zeros(len(a), dtype=bool)
...: mask[np.random.randint(0, len(a), 1000)] = True
...: arr = pd.arrays.IntegerArray(a, mask)
In [2]: arr.factorize()
Out[2]:
(array([   0,    1,    2, ..., 3726, 2673, 5358]),
[2532, 8355, 1885, 2253, 8517, 5615, 3146, 386, 9183, 6497,
...
794, 8600, 823, 1541, 4373, 1205, 9605, 4576, 443, 2070]
Length: 10000, dtype: Int64)
using mask gives the same result
In [3]: arr.factorize2()
Out[3]:
(array([ 0, 1, 2, ..., 3726, 2673, 5358]),
[2532, 8355, 1885, 2253, 8517, 5615, 3146, 386, 9183, 6497,
...
794, 8600, 823, 1541, 4373, 1205, 9605, 4576, 443, 2070]
Length: 10000, dtype: Int64)
on master we actually convert to object dtype with NaNs, which is very slow
In [4]: %timeit arr.factorize()
12.3 ms ± 849 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
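The object-dtype path can be sketched roughly as follows (a simplified illustration of the conversion, not the actual pandas code):

```python
import numpy as np

# Simplified illustration of the object-dtype fallback: every element is
# boxed into a Python object and missing positions are replaced by NaN,
# which is what makes this path slow (boxed hashing + sentinel checks).
a = np.array([10, 20, 30], dtype=np.int64)
mask = np.array([False, True, False])

vals = a.astype(object)  # copies and boxes each integer
vals[mask] = np.nan      # NaN sentinel stands in for the mask
```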
on this PR, for comparison, I changed this to convert to float dtype with NaNs (however, this is not robust for large ints)
In [4]: %timeit arr.factorize()
2.83 ms ± 51.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
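The reason the float conversion is not robust: float64 has a 53-bit significand, so integers above 2**53 cannot all be represented exactly and distinct values can collide after the cast. A quick check:

```python
import numpy as np

# float64 represents integers exactly only up to 2**53; beyond that,
# distinct int64 values can map to the same float, which would corrupt
# factorize results after a float-dtype conversion.
big = np.int64(2**53 + 1)
assert np.int64(np.float64(big)) != big            # round trip loses the value
assert np.float64(2**53) == np.float64(2**53 + 1)  # two ints, one float
```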
on this PR, using the mask is faster than both approaches above
In [5]: %timeit arr.factorize2()
771 µs ± 13.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Since this adds an extra if branch to factorize, it could impact existing cases that don't use the mask, but I didn't see any noticeable effect (after switching between branches and rebuilding a few times):
In [1]: a = np.random.randint(0, 10000, 100000)
...: mask = np.zeros(len(a), dtype=bool)
...: mask[np.random.randint(0, len(a), 1000)] = True
...:
...: a2 = a.copy().astype(float)
...: a2[mask] = np.nan
factorize on an integer array
In [4]: %timeit pd.factorize(a)
769 µs ± 31.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) <- PR
745 µs ± 39.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) <- master
779 µs ± 60.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) <- PR
759 µs ± 50 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) <- master
factorize on a float array with nans
In [5]: %timeit pd.factorize(a2)
2.12 ms ± 18.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) <- PR
2.2 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) <- master
2.2 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) <- PR
2.13 ms ± 47.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) <- master
So for the int array there seems to be a small slowdown (as expected), although it is still within the error bounds of the measurements. And I would say it is also small enough (around 2-3%) to be acceptable, given the benefit of being able to use the mask in the other cases.