BUG: Inconsistent types for groupby group names · Issue #58858 · pandas-dev/pandas

Pandas version checks

Reproducible Example

import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30], 'y': ['a', 'b', 'c']})

# group by atomic - names and keys are atomic
name, _ = next(iter(df.groupby('x')))
name  # 10 as expected
df.groupby('x').groups  # {10: [0], 20: [1], 30: [2]} as expected

# group by list of 2 - names and keys are 2-tuples
name, _ = next(iter(df.groupby(['x', 'y'])))
name  # (10, 'a') as expected
df.groupby(['x', 'y']).groups  # {(10, 'a'): [0], (20, 'b'): [1], (30, 'c'): [2]} as expected

# group by list of 1 - names are 1-tuples but keys are atomic!?
name, _ = next(iter(df.groupby(['x'])))
name  # (10,) as expected
df.groupby(['x']).groups  # {10: [0], 20: [1], 30: [2]} UNEXPECTED

Issue Description

In pandas 1.5.x, grouping by a single element yielded scalar group names and grouping by a list of more than one element yielded tuple names, but grouping by a length-1 list yielded scalar group names again, which was considered inconsistent (and I agree).

This was mostly resolved in 2.0 (see #42795): when iterating through groups, the group names are now 1-tuples when grouping by a length-1 list. However, the groups attribute still returns scalar keys instead of 1-tuples, which I consider to be a bug or at least in need of enhancement.
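To make the mismatch easier to see side by side, here is a minimal sketch that collects the names produced by iteration and the keys of the groups attribute for each kind of grouper. The helper name `names_and_keys` is made up for illustration and is not part of pandas; the result for `['x']` depends on the installed pandas version, so no output is shown for that case.

```python
import pandas as pd

df = pd.DataFrame({'x': [10, 20, 30], 'y': ['a', 'b', 'c']})


def names_and_keys(by):
    """Hypothetical helper: return (names from iteration, keys of .groups)."""
    gb = df.groupby(by)
    iter_names = [name for name, _ in gb]  # names yielded while iterating
    group_keys = list(gb.groups)           # keys of the .groups mapping
    return iter_names, group_keys


for by in ('x', ['x', 'y'], ['x']):
    iter_names, group_keys = names_and_keys(by)
    # The two lists agree for 'x' and ['x', 'y']; on the versions reported
    # in this issue they differ for ['x'].
    print(by, iter_names == group_keys, iter_names, group_keys)
```

On the versions listed in this report, I would expect only the `['x']` row to print `False`.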

Expected Behavior

I expect group names / keys to be 1-tuples when grouping by a length-1 list: group by a scalar, get scalar names; group by a list, get tuple names. (I would also propose that grouping by an empty list should return a single group containing the whole dataframe, with an empty tuple as the key, but that's out of scope for this issue!)

I expect the same group name / key behaviour whether iterating through groups or accessing groups via the groups attribute.
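As a rough illustration of the behaviour I have in mind, the sketch below wraps the groups attribute so that keys are always tuples when the grouper is a list. The function `normalized_groups` is a hypothetical user-side workaround, not a pandas API, and it assumes the grouped values are not themselves tuples (otherwise a scalar key and a 1-tuple key could not be told apart).

```python
import pandas as pd


def normalized_groups(frame, by):
    """Hypothetical workaround: tuple keys whenever `by` is a list."""
    groups = frame.groupby(by).groups
    if not isinstance(by, list):
        # Scalar grouper: keep scalar keys unchanged.
        return dict(groups)
    # List grouper: wrap any scalar key in a 1-tuple, leave tuples alone.
    return {
        (key if isinstance(key, tuple) else (key,)): labels
        for key, labels in groups.items()
    }


df = pd.DataFrame({'x': [10, 20, 30], 'y': ['a', 'b', 'c']})
print(list(normalized_groups(df, 'x')))         # scalar keys
print(list(normalized_groups(df, ['x'])))       # 1-tuple keys
print(list(normalized_groups(df, ['x', 'y'])))  # 2-tuple keys
```

Because keys that are already tuples are left untouched, the helper should keep working unchanged if the groups attribute starts returning 1-tuples itself.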

Installed Versions

latest release (2.2.2)

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.10.14.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-1064-azure
Version : #73~20.04.1-Ubuntu SMP Mon May 6 09:43:44 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 65.5.0
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

dev install (3.0.0.dev0+1024.gb16233155)

INSTALLED VERSIONS

commit : b162331
python : 3.10.14.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-1064-azure
Version : #73~20.04.1-Ubuntu SMP Mon May 6 09:43:44 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 3.0.0.dev0+1024.gb16233155
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 65.5.0
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None