BUG: Regression on pandas 1.5.0 using describe with Int64 dtype getting a TypeError · Issue #48778 · pandas-dev/pandas (original) (raw)
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd pd.DataFrame({'a': [1]}, dtype='Int64').describe()
Issue Description
With pandas version 1.5 describe
produces a TypeError exception when one of the columns is using a Int64 dtype and there is a single row.
import pandas as pd
pd.DataFrame({'a': [1]}, dtype='Int64').describe()
Traceback (most recent call last): File "", line 1, in File "PD15ENV/lib64/python3.9/site-packages/pandas/core/generic.py", line 10947, in describe return describe_ndframe( File "PD15ENV/lib64/python3.9/site-packages/pandas/core/describe.py", line 99, in describe_ndframe result = describer.describe(percentiles=percentiles) File "PD15ENV/lib64/python3.9/site-packages/pandas/core/describe.py", line 179, in describe ldesc.append(describe_func(series, percentiles)) File "PD15ENV/lib64/python3.9/site-packages/pandas/core/describe.py", line 246, in describe_numeric_1d return Series(d, index=stat_index, name=series.name, dtype=dtype) File "PD15ENV/lib64/python3.9/site-packages/pandas/core/series.py", line 471, in init data = sanitize_array(data, index, dtype, copy) File "PD15ENV/lib64/python3.9/site-packages/pandas/core/construction.py", line 623, in sanitize_array subarr = _try_cast(data, dtype, copy, raise_cast_failure) File "PD15ENV/lib64/python3.9/site-packages/pandas/core/construction.py", line 841, in _try_cast subarr = np.array(arr, dtype=dtype, copy=copy) TypeError: float() argument must be a string or a number, not 'NAType'
This is specially relevant when using combination of groupby
and describe
as it will produce a type error if some of the groups has a single element.
The reason behind this bug is that the stdev is not defined for a single element, so it returns pd.NA:
>>> pd.Series([1], dtype='Int64').std()
<NA>
>>> type(_)
<class 'pandas._libs.missing.NAType'>
And then describe tries to create a series with all the statistics converting to float:
describe.py
d = (
[series.count(), series.mean(), series.std(), series.min()]
+ series.quantile(percentiles).tolist()
+ [series.max()]
)
# GH#48340 - always return float on non-complex numeric data
dtype = float if is_numeric_dtype(series) and not is_complex_dtype(series) else None
return Series(d, index=stat_index, name=series.name, dtype=dtype)
I have found a quick and dirty solution adding at line 246 of describe.py:
if dtype is float:
d = [x if x is not NA else np.nan for x in d]
But it is more a workaround than a proper solution.
With pandas 1.4.4 this code was working as expected:
pd.DataFrame({'a': [1]}, dtype='Int64').describe() a count 1 mean 1.0 std min 1 25% 1 50% 1 75% 1 max 1
Expected Behavior
Describe for single row dataframes shall not fail:
pd.DataFrame({'a': [1]}, dtype='Int64').describe() a count 1 mean 1.0 std min 1 25% 1 50% 1 75% 1 max 1
Going a little bit further, the question is why std
returns pd.NA and not np.nan
when called with a single element series:
pd.Series([1, 2], dtype='Int64').std() 0.7071067811865476 type() <class 'numpy.float64'> pd.Series([1], dtype='Int64').std() type() <class 'pandas._libs.missing.NAType'>
That shows that depending of the number of elements might return different type, while for float dtype, always return the same:
pd.Series([1]).std() nan type(_) <class 'numpy.float64'>
Installed Versions
INSTALLED VERSIONS ------------------ commit : 87cfe4epython : 3.9.7.final.0 python-bits : 64 OS : Linux OS-release : 5.3.18-lp152.72-default Version : #1 SMP Wed Apr 14 10:13:15 UTC 2021 (013936d) machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8
pandas : 1.5.0
numpy : 1.23.3
pytz : 2022.2.1
dateutil : 2.8.2
setuptools : 65.4.0
pip : 22.2.2
Cython : None
pytest : 7.1.3
hypothesis : None
sphinx : 5.1.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.5.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.6.0
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.1
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None