BUG: Regression on pandas 1.5.0 using describe with Int64 dtype getting a TypeError · Issue #48778 · pandas-dev/pandas (original) (raw)

Pandas version checks

Reproducible Example

import pandas as pd pd.DataFrame({'a': [1]}, dtype='Int64').describe()

Issue Description

With pandas version 1.5 describe produces a TypeError exception when one of the columns is using a Int64 dtype and there is a single row.

import pandas as pd

pd.DataFrame({'a': [1]}, dtype='Int64').describe()

Traceback (most recent call last): File "", line 1, in File "PD15ENV/lib64/python3.9/site-packages/pandas/core/generic.py", line 10947, in describe return describe_ndframe( File "PD15ENV/lib64/python3.9/site-packages/pandas/core/describe.py", line 99, in describe_ndframe result = describer.describe(percentiles=percentiles) File "PD15ENV/lib64/python3.9/site-packages/pandas/core/describe.py", line 179, in describe ldesc.append(describe_func(series, percentiles)) File "PD15ENV/lib64/python3.9/site-packages/pandas/core/describe.py", line 246, in describe_numeric_1d return Series(d, index=stat_index, name=series.name, dtype=dtype) File "PD15ENV/lib64/python3.9/site-packages/pandas/core/series.py", line 471, in init data = sanitize_array(data, index, dtype, copy) File "PD15ENV/lib64/python3.9/site-packages/pandas/core/construction.py", line 623, in sanitize_array subarr = _try_cast(data, dtype, copy, raise_cast_failure) File "PD15ENV/lib64/python3.9/site-packages/pandas/core/construction.py", line 841, in _try_cast subarr = np.array(arr, dtype=dtype, copy=copy) TypeError: float() argument must be a string or a number, not 'NAType'

This is specially relevant when using combination of groupby and describe as it will produce a type error if some of the groups has a single element.

The reason behind this bug is that the stdev is not defined for a single element, so it returns pd.NA:

>>> pd.Series([1], dtype='Int64').std()
<NA>
>>> type(_)
<class 'pandas._libs.missing.NAType'>

And then describe tries to create a series with all the statistics converting to float:
describe.py

d = (
    [series.count(), series.mean(), series.std(), series.min()]
    + series.quantile(percentiles).tolist()
    + [series.max()]
)
# GH#48340 - always return float on non-complex numeric data
dtype = float if is_numeric_dtype(series) and not is_complex_dtype(series) else None
return Series(d, index=stat_index, name=series.name, dtype=dtype)

I have found a quick and dirty solution adding at line 246 of describe.py:

 if dtype is float:
     d = [x if x is not NA else np.nan for x in d]

But it is more a workaround than a proper solution.

With pandas 1.4.4 this code was working as expected:

pd.DataFrame({'a': [1]}, dtype='Int64').describe() a count 1 mean 1.0 std min 1 25% 1 50% 1 75% 1 max 1

Expected Behavior

Describe for single row dataframes shall not fail:

pd.DataFrame({'a': [1]}, dtype='Int64').describe() a count 1 mean 1.0 std min 1 25% 1 50% 1 75% 1 max 1

Going a little bit further, the question is why std returns pd.NA and not np.nan when called with a single element series:

pd.Series([1, 2], dtype='Int64').std() 0.7071067811865476 type() <class 'numpy.float64'> pd.Series([1], dtype='Int64').std() type() <class 'pandas._libs.missing.NAType'>

That shows that depending of the number of elements might return different type, while for float dtype, always return the same:

pd.Series([1]).std() nan type(_) <class 'numpy.float64'>

Installed Versions

INSTALLED VERSIONS ------------------ commit : 87cfe4epython : 3.9.7.final.0 python-bits : 64 OS : Linux OS-release : 5.3.18-lp152.72-default Version : #1 SMP Wed Apr 14 10:13:15 UTC 2021 (013936d) machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8

pandas : 1.5.0
numpy : 1.23.3
pytz : 2022.2.1
dateutil : 2.8.2
setuptools : 65.4.0
pip : 22.2.2
Cython : None
pytest : 7.1.3
hypothesis : None
sphinx : 5.1.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.5.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.6.0
numba : None
numexpr : None
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.9.1
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
tzdata : None