REF: Compute complete result_index upfront in groupby by rhshadrach · Pull Request #55738 · pandas-dev/pandas

Just a PoC at this point; I still need to move as many behavior changes out of here as possible.

The main change this makes is it moves the computation of unobserved groups upfront. Currently, we only include unobserved groups upfront if there is a single grouping (e.g. df.groupby('a')) but not two or more (e.g. df.groupby(['a', 'b'])). When there are two or more, we go through the groupby computations with only observed groups, and then tack on the unobserved groups at the end. By always including unobserved groups upfront, we can simplify the logic in the groupby code. Having unobserved groups sometimes included and sometimes not included upfront is also a footgun.
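To make "unobserved groups" concrete, here is a minimal sketch of the user-facing behavior using only the public groupby API. The frame and column names are made up for illustration, and the sketch says nothing about how the internals in this PR are implemented; it only shows which groups appear in the result for one versus two categorical groupings.

```python
import pandas as pd

# Hypothetical data: category "z" of column "a" never occurs in the rows.
df = pd.DataFrame(
    {
        "a": pd.Categorical(["x", "x", "y"], categories=["x", "y", "z"]),
        "b": pd.Categorical(["p", "q", "p"], categories=["p", "q"]),
        "v": [1, 2, 3],
    }
)

# Single grouping: the result has one row per category, including "z".
single = df.groupby("a", observed=False)["v"].sum()

# Two groupings: the result covers the full product of categories (3 * 2),
# with unobserved combinations filled with 0 for sum.
multi = df.groupby(["a", "b"], observed=False)["v"].sum()

# observed=True keeps only the combinations that occur in the data.
observed_only = df.groupby(["a", "b"], observed=True)["v"].sum()

print(single)
print(multi)
print(observed_only)
```

The results are the same regardless of when the unobserved groups are added internally; the change described above is about where in the computation they get included.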

In order to make this change, I found I needed to rework a bit of how NA values are handled in Grouping._codes_and_uniques. This in turn fixed some NA bugs.

This does add BaseGrouper.result_index_and_codes, which computes the (aggregated) result index, taking dropna and observed into account, along with the codes for the groups themselves.
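As a rough illustration of how dropna shapes the result index, the sketch below uses only the public API on a made-up frame. It does not call or depend on the new BaseGrouper method; it just shows the index that such a computation would need to produce in each case.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing grouping key.
df = pd.DataFrame({"key": ["a", np.nan, "a", "b"], "v": [1, 2, 3, 4]})

# dropna=True (the default): rows with a missing key are excluded,
# so the result index contains only "a" and "b".
dropped = df.groupby("key", dropna=True)["v"].sum()

# dropna=False: the missing key forms its own group, sorted last.
kept = df.groupby("key", dropna=False)["v"].sum()

print(dropped)
print(kept)
```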

ASVs: no performance regressions (with the standard 10% cutoff) other than groupby transform operations with multiple categorical groupings.

| Change   | Before [2b67593b] <test>   | After [e285742a] <gb_observed_pre>   |   Ratio | Benchmark (Parameter)                                                                                              |
|----------|----------------------------|--------------------------------------|---------|--------------------------------------------------------------------------------------------------------------------|
| +        | 308±4μs                    | 1.25±0ms                             |    4.08 | groupby.MultipleCategories.time_groupby_transform                                                                  |
| -        | 5.10±0.04ms                | 4.61±0.06ms                          |    0.91 | groupby.Apply.time_scalar_function_multi_col(5)                                                                    |
| -        | 239±4μs                    | 216±0.2μs                            |    0.9  | groupby.Categories.time_groupby_sort(False)                                                                        |
| -        | 60.1±2μs                   | 53.8±0.4μs                           |    0.9  | groupby.GroupByMethods.time_dtype_as_field('float', 'size', 'direct', 1, 'cython')                                 |
| -        | 241±5μs                    | 215±1μs                              |    0.89 | groupby.Categories.time_groupby_ordered_sort(False)                                                                |
| -        | 556±7μs                    | 488±60μs                             |    0.88 | arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 5.0, <built-in function mul>)       |
| -        | 235±5μs                    | 205±2μs                              |    0.87 | groupby.Categories.time_groupby_extra_cat_sort(False)                                                              |
| -        | 419±8μs                    | 356±10μs                             |    0.85 | arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 2, <built-in function le>)        |
| -        | 549±10μs                   | 461±50μs                             |    0.84 | arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function truediv>) |
| -        | 21.7±0.7μs                 | 18.1±0.06μs                          |    0.83 | groupby.GroupByMethods.time_dtype_as_field('uint', 'count', 'direct', 1, 'cython')                                 |
| -        | 22.5±0.3μs                 | 18.6±0.2μs                           |    0.83 | groupby.GroupByMethods.time_dtype_as_group('float', 'count', 'direct', 1, 'cython')                                |
| -        | 22.2±0.2μs                 | 18.4±0.8μs                           |    0.83 | groupby.GroupByMethods.time_dtype_as_group('uint', 'count', 'direct', 1, 'cython')                                 |
| -        | 22.3±0.8μs                 | 18.3±0.1μs                           |    0.82 | groupby.GroupByMethods.time_dtype_as_field('float', 'count', 'direct', 1, 'cython')                                |
| -        | 21.6±0.9μs                 | 17.8±0.3μs                           |    0.82 | groupby.GroupByMethods.time_dtype_as_field('int16', 'count', 'direct', 1, 'cython')                                |
| -        | 22.0±0.6μs                 | 18.1±0.3μs                           |    0.82 | groupby.GroupByMethods.time_dtype_as_group('int', 'count', 'direct', 1, 'cython')                                  |
| -        | 21.9±0.7μs                 | 17.8±0.3μs                           |    0.81 | groupby.GroupByMethods.time_dtype_as_field('int', 'count', 'direct', 1, 'cython')                                  |
| -        | 22.0±0.2μs                 | 17.9±0.07μs                          |    0.81 | groupby.GroupByMethods.time_dtype_as_group('int16', 'count', 'direct', 1, 'cython')                                |
| -        | 21.6±0.1μs                 | 17.0±0.06μs                          |    0.79 | groupby.GroupByMethods.time_dtype_as_group('object', 'count', 'direct', 1, 'cython')                               |
| -        | 520±50μs                   | 370±30μs                             |    0.71 | arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 2, <built-in function sub>)         |
| -        | 13.7±1ms                   | 1.80±0ms                             |    0.13 | groupby.MultipleCategories.time_groupby_nosort                                                                     |
| -        | 15.8±2ms                   | 1.73±0ms                             |    0.11 | groupby.MultipleCategories.time_groupby_extra_cat_nosort                                                           |
| -        | 13.7±0.2ms                 | 1.57±0.01ms                          |    0.11 | groupby.MultipleCategories.time_groupby_ordered_nosort                                                             |
| -        | 33.6±0.4ms                 | 1.40±0ms                             |    0.04 | groupby.MultipleCategories.time_groupby_sort                                                                       |
| -        | 35.5±1ms                   | 1.23±0ms                             |    0.03 | groupby.MultipleCategories.time_groupby_extra_cat_sort                                                             |
| -        | 33.3±0.8ms                 | 1.15±0ms                             |    0.03 | groupby.MultipleCategories.time_groupby_ordered_sort                                                               |

I haven't verified this, but I believe this would also make #55261 trivial to implement just by changing a few lines of result_index_and_ids.