Calling sum with min_count on SeriesGroupBy with dtype Int64 gives large negative value rather than pd.NA · Issue #32861 · pandas-dev/pandas

import pandas as pd

test_df = pd.DataFrame({'foo': ['a'], 'bar': [1]})
test_df['bar'] = test_df['bar'].astype('Int64')
test_df.groupby('foo')['bar'].sum(min_count=2)

output:

foo
a -9223372036854775808
Name: bar, dtype: Int64

Problem description

Per the documentation, sum should return NA when there are fewer than min_count non-NA values. This works as expected when sum is called on the column directly:

test_df['bar'].sum(min_count=2)

output

nan

but gives what looks like an integer overflow (-9223372036854775808 is the minimum int64 value) when called after groupby.
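As a quick sanity check on the reported number, the sketch below compares it with the smallest signed 64-bit integer. The variable names are illustrative only. The match suggests that a missing-value marker may have been written into an integer buffer rather than an arithmetic overflow having occurred, but that interpretation is a guess and is not confirmed by anything in this report.

```python
import numpy as np

# Value printed by the groupby call in the report.
reported = -9223372036854775808

# Smallest value representable as a signed 64-bit integer.
int64_min = np.iinfo(np.int64).min

# True when the reported value is exactly the int64 minimum, i.e. -(2 ** 63).
is_int64_min = reported == int64_min == -(2 ** 63)
print(is_int64_min)
```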

I ran into this with real data when calling with min_count=1 on a dataframe where some of the values were missing, but I thought the minimal example above was clearer.
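Until the grouped call behaves as documented, one possible workaround is to compute the per-group sum and the per-group count of non-missing values separately, then mask the groups that have too few values. This is a minimal sketch and not the library's own implementation: the helper name `grouped_sum_min_count` and the example data are hypothetical, and the exact result dtype may differ between pandas versions.

```python
import pandas as pd


def grouped_sum_min_count(df, key, col, min_count):
    """Sum `col` per `key`; groups with fewer than `min_count`
    non-missing values get a missing result instead of a number.

    Hypothetical helper, not part of pandas.
    """
    grouped = df.groupby(key)[col]
    totals = grouped.sum()    # plain per-group sum, without min_count
    counts = grouped.count()  # number of non-missing values per group
    # Keep the sum only where enough values were present.
    return totals.where(counts >= min_count)


# Illustrative data: group "b" contains only a missing value.
example = pd.DataFrame({"foo": ["a", "a", "b"], "bar": [1, 2, None]})
example["bar"] = example["bar"].astype("Int64")

result = grouped_sum_min_count(example, "foo", "bar", min_count=1)
print(result)
```

With this data, group a keeps its sum of 3, while group b has no non-missing values and is masked.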

Expected Output

foo
a <NA>
Name: bar, dtype: Int64
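The expected behaviour can also be written as a small check, which may be handy for confirming whether a given pandas installation is affected. This is only a sketch: the function name `min_count_respected` is made up for illustration, and it tests nothing beyond the single-row example from this report.

```python
import pandas as pd


def min_count_respected():
    """Return True if the grouped sum yields a missing value when the
    group has fewer than `min_count` entries (hypothetical helper)."""
    df = pd.DataFrame({"foo": ["a"], "bar": [1]})
    df["bar"] = df["bar"].astype("Int64")
    result = df.groupby("foo")["bar"].sum(min_count=2)
    # Group "a" has one value but min_count is 2, so a missing result is expected.
    return bool(pd.isna(result["a"]))


print(min_count_respected())
```

On an installation that shows the behaviour described here, the function returns False, because the result is a large negative integer rather than a missing value.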

Output of pd.show_versions()


INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-74-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 1.0.3
numpy : 1.18.1
pytz : 2019.1
dateutil : 2.8.0
pip : 19.1.1
setuptools : 41.0.1
Cython : None
pytest : 5.3.4
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.3.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.5.0
pandas_datareader: None
bs4 : 4.8.1
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.3.3
matplotlib : 3.1.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.4
pyxlsb : None
s3fs : None
scipy : 1.3.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : None