ENH/PERF: enable column-wise reductions for EA-backed columns by jorisvandenbossche · Pull Request #32867 · pandas-dev/pandas
Currently, reductions on a DataFrame convert the full DataFrame to a single "interleaved" 2D array and then perform the operation on that array. That is the default path; when `numeric_only=True` is specified, the reduction is instead done block-wise.
Enabling column-wise reductions (or block-wise reductions for EAs):
- Gives better performance in common cases, since there is no need to create a new 2D array (which goes through object dtype for nullable integers).
- Ensures we use the reduction implementation of the EA itself, which can be more correct and more efficient than converting to an ndarray and using our generic nanops.
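To make the first point concrete, here is a small sketch of why interleaving is costly for nullable integers: a NumPy integer array cannot hold `pd.NA`, so converting an `Int64` extension array to an ndarray falls back to object dtype.

```python
import numpy as np
import pandas as pd

# A nullable-integer extension array with a missing value.
arr = pd.array([1, 2, pd.NA], dtype="Int64")

# Interleaving has to go through object dtype, because a NumPy int
# array cannot represent pd.NA.
print(np.asarray(arr).dtype)  # object
```

Reducing each column through the EA itself avoids this object-dtype round trip entirely.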
For illustration purposes, I added a `column_wise` keyword in this PR (not meant to be kept, just for testing), so we can compare a few cases:
```
In [9]: df_wide = pd.DataFrame(np.random.randint(1000, size=(1000, 100))).astype("Int64").copy()

In [15]: %timeit df_wide.mean()
9.68 ms ± 100 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [16]: %timeit df_wide.mean(numeric_only=True)
10.1 ms ± 345 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [17]: %timeit df_wide.mean(column_wise=True)
5.22 ms ± 29.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [18]: df_long = pd.DataFrame(np.random.randint(1000, size=(10000, 10))).astype("Int64").copy()

In [19]: %timeit df_long.mean()
7.77 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [20]: %timeit df_long.mean(numeric_only=True)
2.07 ms ± 31.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [21]: %timeit df_long.mean(column_wise=True)
1.04 ms ± 4.54 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
So I experimented with two approaches:
- First, iterating through the columns and calling the `_reduce` method of the underlying EA (this path is taken with the temporary keyword `column_wise=True`).
- Second, fixing the block-wise case for extension blocks (triggered by `numeric_only=True`) by changing it to also use the EA's `_reduce` (currently this fails because nanops functions are called on the EA).
The first gives better performance (it is simpler in implementation, as it does not involve the blocks), but requires some more new code (it makes less use of the existing machinery).
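The first approach can be sketched roughly as follows. Note that `_reduce` is a private method of the ExtensionArray interface; this snippet only illustrates the dispatch mechanism, it is not the PR's actual implementation.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": pd.array([1, 2, pd.NA], dtype="Int64"),
        "b": pd.array([4, 5, 6], dtype="Int64"),
    }
)

# Iterate over the columns and dispatch to each EA's own reduction
# implementation (private API, shown for illustration only).
result = {name: df[name].array._reduce("mean", skipna=True) for name in df.columns}
print(result)  # {'a': 1.5, 'b': 5.0}
```

Because the EA handles its own mask, the missing value in column `"a"` is skipped without any conversion to an object or float ndarray.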
Ideally, for EA columns, we should always use their own reduction implementation (and thus call `EA._reduce`), I think. So for both approaches, the question is how to trigger this behaviour.