agg() function on groupby dataframe changes dtype of datetime64[ns] column to float64 if all items in a single group are NaT · Issue #12821 · pandas-dev/pandas (original) (raw)

The example below shows two variations of a dataframe which contains a date column set to datetime64[ns] format.

In the first example, there is a single missing (NaT) date. After groupby and agg(), the dtypes of all the columns in the aggregated dataframe are the same as the original dataframe, as expected (and as desired).

However, in the second example, there are several missing dates, arranged so that all the dates in one group are NaT. After the same groupby and agg() procedures, the dtype of the date column is changed to float64. This is undesired behaviour in my situation and I believe it is a bug.

# Introduce single missing values in the date column
print('Datafreme with single missing date value')
print('========================================')
phjTempDF = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
                          'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
                          'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
                          'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})

phjTempDF = phjTempDF.sort_values(['gender','age','date'])

phjTempDF.ix[1,'date'] = 'missing'

# Convert date to datetime64 format
phjTempDF['date'] = pd.to_datetime(phjTempDF['date'],errors='coerce')

print('\nWhole dataframe')
print('---------------')
print(phjzempdf)
print('\nOriginal types')

print('---------------')
print(phjTempDF.dtypes)

phjTempDF = phjTempDF.sort_values(['gender','age','id']).groupby(['gender','age']).agg({'date': 'first','id': 'first'}).reset_index(drop=False)

print('\nAggregated dataframe')
print('--------------------')
print(phjzempdf)
print('\nPost-aggregation types')

print('-----------------------')
print(phjTempDF.dtypes)

# Introduce multiple missing values in the date column (one group contains all missing values)
# Introduce single missing values in the date column
print('\n\nDataframe with multiple missing dates values')
print('============================================')
phjTempDF = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
                          'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
                          'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
                          'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})

phjTempDF = phjTempDF.sort_values(['gender','age','date'])

phjTempDF.ix[[1,2,5],'date'] = 'missing'

# Convert date to datetime64 format
phjTempDF['date'] = pd.to_datetime(phjTempDF['date'],errors='coerce')

print('\nWhole dataframe')
print('---------------')
print(phjzempdf)
print('\nOriginal types')#
print('---------------')
print(phjTempDF.dtypes)

phjTempDF = phjTempDF.sort_values(['gender','age','id']).groupby(['gender','age']).agg({'date': 'first','id': 'first'}).reset_index(drop=False)

print('\nAggregated dataframe')
print('--------------------')
print(phjzempdf)
print('\nPost-aggregation types')#
print('-----------------------')
print(phjTempDF.dtypes)

The expected output would be for the dtypes in the dataframe after aggregation to be the same as those in the original dataframe.

commit: None
python: 3.4.1.final.0
python-bits: 64
OS: Darwin
OS-release: 15.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.17.0
nose: 1.3.7
pip: 1.5.6
setuptools: 3.6
Cython: None
numpy: 1.10.1
scipy: None
statsmodels: None
IPython: 4.0.0
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: 2.3.0
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None