BUG: GroupBy.describe produces inconsistent results for empty datasets · Issue #41575 · pandas-dev/pandas
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame(columns=['A', 'B', 'C'])
In [3]: df.groupby('A').B.describe()
ValueError                                Traceback (most recent call last)
----> 1 df.groupby('A').B.describe()
~/.../python3.8/site-packages/pandas/core/groupby/generic.py in describe(self, **kwargs)
674 if self.axis == 1:
675 return result.T
--> 676 return result.unstack()
677
678 def value_counts(
~/.../python3.8/site-packages/pandas/core/series.py in unstack(self, level, fill_value)
3827 from pandas.core.reshape.reshape import unstack
3828
-> 3829 return unstack(self, level, fill_value)
3830
3831 # ----------------------------------------------------------------------
~/.../python3.8/site-packages/pandas/core/reshape/reshape.py in unstack(obj, level, fill_value)
422 # Give nicer error messages when unstack a Series whose
423 # Index is not a MultiIndex.
--> 424 raise ValueError(
425 f"index must be a MultiIndex to unstack, {type(obj.index)} was passed"
426 )
ValueError: index must be a MultiIndex to unstack, <class 'pandas.core.indexes.base.Index'> was passed
In [4]: df.groupby('A').describe()
Out[4]: Series([], dtype: float64)
Problem description
SeriesGroupBy.describe raises a ValueError when called on an empty dataset, whereas DataFrameGroupBy.describe succeeds but returns an empty Series.
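To make the inconsistency easy to compare side by side, here is a minimal sketch that records what each call does without aborting on the exception. The helper name `observe` is hypothetical and not part of pandas. On pandas 1.2.4, going by the session above, the first call should report a raised ValueError and the second a returned Series; other versions may behave differently.

```python
import pandas as pd


def observe(call):
    """Run call() and report whether it raised or what type it returned.

    Hypothetical helper for this report only; not a pandas API.
    """
    try:
        result = call()
    except Exception as exc:  # deliberately broad: we only record the outcome
        return ("raised", type(exc).__name__)
    return ("returned", type(result).__name__)


# Same empty frame as in the reproduction above.
df = pd.DataFrame(columns=["A", "B", "C"])

series_outcome = observe(lambda: df.groupby("A").B.describe())
frame_outcome = observe(lambda: df.groupby("A").describe())

print("SeriesGroupBy.describe:   ", series_outcome)
print("DataFrameGroupBy.describe:", frame_outcome)
```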
Expected Output
I would expect both of these to return an empty DataFrame with the appropriate columns.
In [3]: df.groupby('A').B.describe()
Out[3]:
Empty DataFrame
Columns: [count, mean, std, min, 25%, 50%, 75%, max]
Index: []
In [4]: df.groupby('A').describe()
Out[4]:
Empty DataFrame
Columns: [(B, count), (B, mean), (B, std), (B, min), (B, 25%), (B, 50%), (B, 75%), (B, max), (C, count), (C, mean), (C, std), (C, min), (C, 25%), (C, 50%), (C, 75%), (C, max)]
Index: []
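For reference, the expected shapes above can be built by hand, which may be useful as a comparison target in a test. This is only a sketch: `expected_empty_describe` and `STATS` are hypothetical names, the float64 dtype is an assumption, and the code does not call `describe` at all, so it says nothing about how a fix inside pandas would be implemented.

```python
import pandas as pd

# Statistic labels taken from the expected output above.
STATS = ["count", "mean", "std", "min", "25%", "50%", "75%", "max"]


def expected_empty_describe(value_columns=None):
    """Build an empty frame with the column layout expected from describe().

    Hypothetical helper: with value_columns=None it mimics the SeriesGroupBy
    case (flat columns); otherwise it mimics the DataFrameGroupBy case
    (one block of statistics per value column).
    """
    if value_columns is None:
        columns = pd.Index(STATS)
    else:
        columns = pd.MultiIndex.from_product([value_columns, STATS])
    # dtype is an assumption; the report does not specify it.
    return pd.DataFrame(columns=columns, dtype="float64")


series_case = expected_empty_describe()
frame_case = expected_empty_describe(["B", "C"])

print(series_case)
print(frame_case.columns.tolist())
```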
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit : 2cb9652
python : 3.8.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.26-1rodete1-amd64
Version : #1 SMP Debian 5.10.26-1rodete1 (2021-04-12)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.2.4
numpy : 1.19.5
pytz : 2019.3
dateutil : 2.8.1
pip : 20.2.1
setuptools : 49.2.1
Cython : 0.29.13
pytest : 4.6.11
hypothesis : None
sphinx : 1.8.5
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.19.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 3.0.0
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.20
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None