groupby producing incorrect results applying functions to int64 columns due to internal cast to float · Issue #6620 · pandas-dev/pandas
dupe of #3707
groupby often casts int64 data to floats before applying functions to the grouped data, and then casts the result back to int64. This silently produces incorrect results for integers too large to be represented exactly as a 64-bit float (magnitude above 2**53).
Here is a simple example:
import pandas as pd
import numpy as np

label = np.array([1, 2, 2, 3], dtype=np.int)
data = np.array([1, 2, 3, 4], dtype=np.int64) + 24650000000000000
z = pd.DataFrame({'a': label, 'data': data})
f = z.groupby('a').first()
In [40]: print z
   a               data
0  1  24650000000000001
1  2  24650000000000002
2  2  24650000000000003
3  3  24650000000000004
In [41]: print f
                data
a
1  24650000000000000
2  24650000000000000
3  24650000000000004
The result is clearly incorrect and should be:
                data
a
1  24650000000000001
2  24650000000000002
3  24650000000000004
z.data and f.data are both numpy.int64, which masks the fact that z.data was converted to a float with a loss of precision during the groupby operation.
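The precision loss can be reproduced without pandas. The following is a minimal sketch, assuming the internal cast behaves like an ordinary int64 -> float64 -> int64 round trip; Python's built-in float is the same IEEE 754 double as numpy.float64. The helper name `round_trip` is made up for illustration and is not part of pandas.

```python
# Values from the example above. All of them exceed 2**53, the limit below
# which every integer is exactly representable as a 64-bit float.
data = [24650000000000000 + i for i in (1, 2, 3, 4)]


def round_trip(n):
    """Hypothetical helper: mimic an int -> float64 -> int cast."""
    return int(float(n))


for n in data:
    print(n, "->", round_trip(n))
```

At this magnitude adjacent doubles are 4 apart, so ...001 and ...002 both come back as ...000, while ...003 and ...004 come back as ...004. That matches the values groupby returned above.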
Tests of software using groupby on int64 will only show a problem if the tests include integers large enough to cause this loss of precision. Otherwise the tests will pass and the software will just quietly produce incorrect results in production.
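One way to make a test suite sensitive to this class of bug is to check that its integer test data includes values that do not survive a float64 round trip. This is only a sketch: `is_float64_safe` is a hypothetical helper name, and the sample values are chosen for illustration.

```python
def is_float64_safe(n):
    """Return True if integer n is unchanged by an int -> float64 -> int cast."""
    return int(float(n)) == n


small = [1, 2, 3, 4]                     # typical test data: hides the cast
large = [2**53 + 1, 24650000000000001]   # values that expose the cast

# Data made up only of "safe" integers cannot detect an internal float cast.
print(all(is_float64_safe(n) for n in small))
# At least one "unsafe" integer is needed for the test to be meaningful.
print(any(not is_float64_safe(n) for n in large))
```

A test that compares groupby output against exact integer results is only informative if at least one input fails this check.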