groupby producing incorrect results applying functions to int64 columns due to internal cast to float · Issue #6620 · pandas-dev/pandas

dupe of #3707

groupby often casts int64 data to floats before applying functions to the grouped data, and then casts the result back to an int64. This produces incorrect results.
Here is a simple example:

    import pandas as pd
    import numpy as np

    label = np.array([1, 2, 2, 3],dtype=np.int)
    data = np.array([1, 2, 3,4], dtype=np.int64) + 24650000000000000
    z = pd.DataFrame({'a':label,'data':data})
    f = z.groupby('a').first()

    In [40]: print z
       a               data
    0  1  24650000000000001
    1  2  24650000000000002
    2  2  24650000000000003
    3  3  24650000000000004

    In [41]: print f
                    data
    a
    1  24650000000000000
    2  24650000000000000
    3  24650000000000004

The result is clearly incorrect and should be:

                    data
    a
    1  24650000000000001
    2  24650000000000002
    3  24650000000000004

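z.data and f.data both have dtype numpy.int64, which masks the fact that z.data was converted to a float, with a loss of precision, during the groupby operation. A float64 has a 53-bit significand, so integers with magnitude above 2**53 (about 9.0e15) cannot all be represented exactly, and the values in this example are above that limit.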
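The precision loss can be reproduced without groupby at all. The following is a minimal sketch, assuming the internal cast behaves like a plain int64 → float64 → int64 round trip in NumPy; the names `round_tripped` and `changed` are illustrative and not part of pandas.

```python
import numpy as np

# Same values as in the report: all are above 2**53, where float64
# can no longer represent every integer exactly.
data = np.array([1, 2, 3, 4], dtype=np.int64) + 24650000000000000

# Assumption: emulate the suspected internal cast with an explicit
# round trip through float64 and back to int64.
round_tripped = data.astype(np.float64).astype(np.int64)

# True wherever the round trip altered the value.
changed = data != round_tripped

print(round_tripped)
print(changed)
```

At this magnitude adjacent float64 values are 4 apart, so only inputs that are already multiples of 4 survive the round trip. The dtype is int64 before and after, which is why checking the dtype alone does not reveal the problem.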
Tests of software using groupby on int64 will only show a problem if the tests include integers large enough to cause this loss of precision. Otherwise the tests will pass and the software will just quietly produce incorrect results in production.
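One way to make such a test sensitive to the problem is to use values above 2**53 and compare against a reference computed with plain Python integers, which never pass through a float. This is only a sketch: the `expected` dictionary and the `exact` flag are hypothetical names, and whether `exact` comes out true depends on the pandas version being tested.

```python
import numpy as np
import pandas as pd

labels = np.array([1, 2, 2, 3], dtype=np.int64)
# Values deliberately chosen above 2**53 so a float64 cast would be lossy.
data = np.array([1, 2, 3, 4], dtype=np.int64) + 24650000000000000
z = pd.DataFrame({"a": labels, "data": data})

# Reference result: first value seen per label, using Python ints only.
expected = {}
for label, value in zip(labels.tolist(), data.tolist()):
    expected.setdefault(label, value)

# Result under test: groupby(...).first() on the int64 column.
actual = z.groupby("a")["data"].first().to_dict()

# False here would indicate that precision was lost inside groupby.
exact = actual == expected
print(exact)
```

A test written this way fails loudly on an affected version instead of passing because the sample integers happened to be small.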