Less blocks in groupby by WillAyd · Pull Request #29753 · pandas-dev/pandas

This isn't complete (2 tests fail), but the change is arguably controversial, so I'm putting it up for review.

With #29124 we could get rid of the DataFrame aggregation methods that go by blocks. This would hopefully make our code easier to grok and move towards only "one way of doing things" in groupby.
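To illustrate the trade-off, here is a toy sketch of aggregating a whole 2D block at once versus aggregating column by column. This is not the pandas implementation; `group_sum_blockwise` and `group_sum_columnwise` are hypothetical names, and plain NumPy stands in for the groupby internals. Both give the same result, but the column-wise version pays a fixed per-column overhead, which is the kind of cost that can add up on very wide frames.

```python
import numpy as np


def group_sum_blockwise(values: np.ndarray, codes: np.ndarray, ngroups: int) -> np.ndarray:
    """Sum rows per group for every column of a 2D block in a single call."""
    out = np.zeros((ngroups, values.shape[1]), dtype=float)
    # Unbuffered add: each row of `values` is added to the output row of its group.
    np.add.at(out, codes, values)
    return out


def group_sum_columnwise(values: np.ndarray, codes: np.ndarray, ngroups: int) -> np.ndarray:
    """Sum rows per group one column at a time (one call per column)."""
    out = np.zeros((ngroups, values.shape[1]), dtype=float)
    for j in range(values.shape[1]):
        # bincount with weights sums the column's values within each group code.
        out[:, j] = np.bincount(codes, weights=values[:, j], minlength=ngroups)
    return out


rng = np.random.default_rng(0)
values = rng.standard_normal((100, 50))  # 100 rows x 50 columns (illustrative sizes)
codes = rng.integers(0, 5, size=100)     # group code for each row

# Both strategies agree; only the number of calls differs.
assert np.allclose(
    group_sum_blockwise(values, codes, 5),
    group_sum_columnwise(values, codes, 5),
)
```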

This causes a definite regression for very wide DataFrames with a small number of rows. The new benchmark tests DataFrames of N rows x 10_000 columns, where N is one of 100, 1000 or 10_000; here are the results:

       before           after         ratio
     [6b3ba986]       [124a135a]
     <categorical-regression>       <easier-groupby>
     3.94±0.2ms        2.55±0.1s   646.23  groupby.GroupByWide.time_wide(100)
       38.3±6ms       2.49±0.02s    64.91  groupby.GroupByWide.time_wide(1000)

So this is drastically slower at 100 rows (about 650x) and still about 65x slower at 1,000 rows, but it has already reached a break-even point by the time we get to 10,000 rows.
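For anyone who wants to reproduce this without checking out the branch, the benchmark is roughly of the following shape. This is a sketch, not the exact code in the PR: only the class name, method name, row counts and column count come from the results above, while the `sum` aggregation, the 10 group keys and the random data are assumptions.

```python
import numpy as np
import pandas as pd


class GroupByWide:
    # asv calls setup() and each time_* method once per value in `params`.
    params = [100, 1_000, 10_000]
    param_names = ["nrows"]

    ncols = 10_000  # column count from the PR description

    def setup(self, nrows):
        # Wide frame: few rows, many columns of floats (random data is an assumption).
        self.df = pd.DataFrame(np.random.randn(nrows, self.ncols))
        # Assumed grouping: 10 distinct integer keys, kept outside the frame
        # so the frame stays at exactly `ncols` columns.
        self.keys = np.random.randint(0, 10, size=nrows)

    def time_wide(self, nrows):
        # Assumed aggregation; the PR text does not say which one is timed.
        self.df.groupby(self.keys).sum()
```

Outside of asv, calling `setup(100)` followed by `time_wide(100)` on an instance under `%timeit` gives a quick comparison between branches.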

The rest of the benchmarks don't cover wide DataFrames that well, but here are the results for posterity (I think these are noise):

       before           after         ratio
     [6b3ba986]       [124a135a]
     <categorical-regression>       <easier-groupby>
       862±30μs       1.15±0.1ms     1.34  groupby.GroupByMethods.time_dtype_as_field('int', 'cummin', 'direct')
        412±7μs         533±50μs     1.29  groupby.GroupByMethods.time_dtype_as_group('float', 'tail', 'direct')
       300±10μs         366±10μs     1.22  groupby.GroupByMethods.time_dtype_as_group('object', 'all', 'direct')
       375±10ms          455±6ms     1.21  groupby.GroupByMethods.time_dtype_as_group('float', 'skew', 'transformation')
        287±6μs         335±20μs     1.17  groupby.GroupByMethods.time_dtype_as_group('object', 'shift', 'direct')
        252±2ms         290±10ms     1.15  groupby.GroupByMethods.time_dtype_as_group('float', 'unique', 'direct')
        414±3μs         469±10μs     1.13  groupby.GroupByMethods.time_dtype_as_group('float', 'tail', 'transformation')
        398±9ms          450±8ms     1.13  groupby.GroupByMethods.time_dtype_as_group('float', 'skew', 'direct')

@jbrockmendel