Less blocks in groupby by WillAyd · Pull Request #29753 · pandas-dev/pandas
This isn't complete (2 tests fail) but this is arguably controversial, so putting up for review.
With #29124 we could get rid of the DataFrame aggregation methods that go by blocks. This would hopefully make our code easier to grok and move towards only "one way of doing things" in groupby.
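To make the trade-off concrete, here is a minimal sketch, assuming the non-block path aggregates one column at a time (the description above does not spell this out). The frame, the `key` grouper and the choice of `sum` are all hypothetical. It only shows that the two approaches agree on the result; the per-column route pays a fixed overhead once per column, which is where very wide frames would be expected to suffer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 6 rows, 4 float columns, 3 groups.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((6, 4)), columns=list("abcd"))
key = pd.Series([0, 1, 0, 1, 2, 2], name="key")

# Whole-frame aggregation (whatever path pandas takes internally).
whole = df.groupby(key).sum()

# Column-by-column aggregation, then reassembly into a frame.
# Assumption: this mimics a non-block path; it is not the PR's actual code.
per_column = pd.concat(
    {col: df[col].groupby(key).sum() for col in df.columns}, axis=1
)

# Both routes produce the same frame.
pd.testing.assert_frame_equal(whole, per_column)
```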
This causes a definite regression for very wide DataFrames with a small number of rows. The new benchmark tests frames of N rows x 10_000 columns, where N is one of 100, 1000 or 10_000. Here are the results:
      before         after     ratio
    [6b3ba986]    [124a135a]
    <categorical-regression>  <easier-groupby>
    3.94±0.2ms     2.55±0.1s    646.23  groupby.GroupByWide.time_wide(100)
      38.3±6ms    2.49±0.02s     64.91  groupby.GroupByWide.time_wide(1000)
So this is drastically slower at 100 rows, but we have already reached a break-even point by the time we get to 10,000 rows.
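For readers who have not seen the benchmark, a wide-frame asv benchmark along these lines might look roughly like the sketch below. Only the class name `GroupByWide`, the method name `time_wide`, the row counts and the 10_000-column width come from the description; the grouping key, the number of groups and the use of `sum` are assumptions made for illustration, not the benchmark as committed.

```python
import numpy as np
import pandas as pd


class GroupByWide:
    # asv convention: each value in `params` is passed to setup() and time_*().
    params = [100, 1_000, 10_000]
    param_names = ["nrows"]

    # Column count taken from the description above.
    ncols = 10_000

    def setup(self, nrows):
        rng = np.random.default_rng(0)
        self.df = pd.DataFrame(rng.standard_normal((nrows, self.ncols)))
        # Hypothetical grouping key with at most 10 distinct groups.
        self.df["key"] = rng.integers(0, 10, size=nrows)

    def time_wide(self, nrows):
        # Assumption: a simple reduction; the real benchmark may use another op.
        self.df.groupby("key").sum()
```

asv calls `setup` before timing, so building the frame is excluded from the measurement and only the groupby reduction is timed.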
The rest of the benchmarks don't cover wide DataFrames that well, but here are the results for posterity (I think these are noise):
      before         after     ratio
    [6b3ba986]    [124a135a]
    <categorical-regression>  <easier-groupby>
      862±30μs    1.15±0.1ms      1.34  groupby.GroupByMethods.time_dtype_as_field('int', 'cummin', 'direct')
       412±7μs      533±50μs      1.29  groupby.GroupByMethods.time_dtype_as_group('float', 'tail', 'direct')
      300±10μs      366±10μs      1.22  groupby.GroupByMethods.time_dtype_as_group('object', 'all', 'direct')
      375±10ms       455±6ms      1.21  groupby.GroupByMethods.time_dtype_as_group('float', 'skew', 'transformation')
       287±6μs      335±20μs      1.17  groupby.GroupByMethods.time_dtype_as_group('object', 'shift', 'direct')
       252±2ms      290±10ms      1.15  groupby.GroupByMethods.time_dtype_as_group('float', 'unique', 'direct')
       414±3μs      469±10μs      1.13  groupby.GroupByMethods.time_dtype_as_group('float', 'tail', 'transformation')
       398±9ms       450±8ms      1.13  groupby.GroupByMethods.time_dtype_as_group('float', 'skew', 'direct')