BUG: df.apply handles np.timedelta64 as timestamp, should be timedelta

I think there may be a bug with the row-wise handling of numpy.timedelta64 data types when using DataFrame.apply. As a check, the problem does not appear when using DataFrame.applymap. The problem may be related to #4532, but I'm unsure. I've included an example below.

This is only a minor problem for my use-case, which is cross-checking timestamps from a counter/timer card. I can easily work around the issue with DataFrame.itertuples etc.
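
The itertuples workaround mentioned above could look roughly like the sketch below. It is written against a recent pandas rather than 0.14.1. The frame mirrors the example later in this report, but it is built with `pd.Timestamp` and `pd.to_timedelta` instead of the `applymap` calls from the session, and the helper name `row_types_via_itertuples` is hypothetical.

```python
import pandas as pd

# Rebuild a small frame shaped like the one in this report.
# Assumption: the trailing "Z" of the original start time is omitted.
start = pd.Timestamp("2014-05-31 01:23:19.9600345")
elapsed_us = [30053400, 40053249, 50053098]

df = pd.DataFrame({"datetimes": start + pd.to_timedelta(elapsed_us, unit="us")})
df["differential_timedeltas"] = df["datetimes"] - df["datetimes"].shift()


def row_types_via_itertuples(frame):
    """Hypothetical helper: walk the rows with itertuples instead of apply(axis=1)."""
    return [
        (type(row.datetimes).__name__, type(row.differential_timedeltas).__name__)
        for row in frame.itertuples(index=False)
    ]


row_types = row_types_via_itertuples(df)
print(row_types)
```

Because `itertuples` reads each column separately, the timedelta column should keep a timedelta type in every row tuple instead of being reported as a timestamp.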

Thank you for your time and for making such a useful package!

Example

Version

Import and check versions.

$ date
Thu Jul 17 16:28:38 CDT 2014
$ conda update pandas
Fetching package metadata: ..
# All requested packages already installed.
# packages in environment at /Users/harrold/anaconda:
#
pandas                    0.14.1               np18py27_0  
$ ipython
Python 2.7.8 |Anaconda 2.0.1 (x86_64)| (default, Jul  2 2014, 15:36:00) 
Type "copyright", "credits" or "license" for more information.

IPython 2.1.0 -- An enhanced Interactive Python.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://binstar.org
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: from __future__ import print_function

In [2]: import numpy as np

In [3]: import pandas as pd

In [4]: pd.util.print_versions.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.8.final.0
python-bits: 64
OS: Darwin
OS-release: 11.4.2
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.14.1
nose: 1.3.3
Cython: 0.20.1
numpy: 1.8.1
scipy: 0.14.0
statsmodels: 0.5.0
IPython: 2.1.0
sphinx: 1.2.2
patsy: 0.2.1
scikits.timeseries: None
dateutil: 1.5
pytz: 2014.4
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.3.1
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.5
lxml: 3.3.5
bs4: 4.3.1
html5lib: 0.999
httplib2: 0.8
apiclient: 1.2
rpy2: None
sqlalchemy: 0.9.4
pymysql: None
psycopg2: None

Create test data

Using a subset of the original raw data as an example.

In [5]: datetime_start = np.datetime64(u'2014-05-31T01:23:19.9600345Z')

In [6]: timedeltas_elapsed = [30053400, 40053249, 50053098]

Compute datetimes from elapsed timedeltas, then create differential timedeltas from datetimes. All elements are either type numpy.datetime64 or numpy.timedelta64.

In [7]: df = pd.DataFrame(dict(datetimes = timedeltas_elapsed))

In [8]: df = df.applymap(lambda elt: np.timedelta64(elt, 'us'))

In [9]: df = df.applymap(lambda elt: np.datetime64(datetime_start + elt))

In [10]: df['differential_timedeltas'] = df['datetimes'] - df['datetimes'].shift()

In [11]: print(df)
                      datetimes  differential_timedeltas
0 2014-05-31 01:23:50.013434500                      NaT
1 2014-05-31 01:24:00.013283500          00:00:09.999849
2 2014-05-31 01:24:10.013132500          00:00:09.999849
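
For comparison, the same frame can probably be built without element-wise `applymap` calls. The following is only a sketch, assuming a recent pandas: `pd.Timestamp` and `pd.to_timedelta` are substitutions for the `np.datetime64` and `np.timedelta64` calls in the session above, and the trailing "Z" of the start time is omitted.

```python
import pandas as pd

# Assumption: pd.Timestamp and pd.to_timedelta stand in for the
# np.datetime64 / np.timedelta64 calls used in the session above.
datetime_start = pd.Timestamp("2014-05-31 01:23:19.9600345")
timedeltas_elapsed = [30053400, 40053249, 50053098]  # microseconds

# Convert the whole list at once, then offset it from the start time.
elapsed = pd.to_timedelta(timedeltas_elapsed, unit="us")
df = pd.DataFrame({"datetimes": datetime_start + elapsed})

# Row-to-row differences; the first row has no predecessor, so it is NaT.
df["differential_timedeltas"] = df["datetimes"] - df["datetimes"].shift()

print(df.dtypes)
```

Checking `df.dtypes` before calling `apply` confirms that the two columns really hold datetime and timedelta data, so any type change seen later comes from the row-wise handling and not from the input.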

Expected behavior

With element-wise handling using DataFrame.applymap, all elements are correctly identified as datetimes (timestamps) or timedeltas.

In [12]: print(df.applymap(lambda elt: type(elt)))
                          datetimes     differential_timedeltas
0  <class 'pandas.tslib.Timestamp'>  <type 'numpy.timedelta64'>
1  <class 'pandas.tslib.Timestamp'>  <type 'numpy.timedelta64'>
2  <class 'pandas.tslib.Timestamp'>  <type 'numpy.timedelta64'>

Bug

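With row-wise handling using DataFrame.apply, every non-missing element is reported as type pandas.tslib.Timestamp, including the elements of 'differential_timedeltas' (the missing first value is reported as pandas.tslib.NaTType). I expected 'differential_timedeltas' to be type numpy.timedelta64 or another type of timedelta, not a type of datetime (timestamp).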

In [13]: # For 'datetimes':

In [14]: print(df.apply(lambda row: type(row['datetimes']), axis=1))
0    <class 'pandas.tslib.Timestamp'>
1    <class 'pandas.tslib.Timestamp'>
2    <class 'pandas.tslib.Timestamp'>
dtype: object

In [15]: # For 'differential_timedeltas':

In [16]: print(df.apply(lambda row: type(row['differential_timedeltas']), axis=1))
0      <class 'pandas.tslib.NaTType'>
1    <class 'pandas.tslib.Timestamp'>
2    <class 'pandas.tslib.Timestamp'>
dtype: object
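
To recheck this report on another pandas version, the comparison above can be condensed into a short script. This is a sketch under assumptions, not the original session: the frame is built with `pd.Timestamp` and `pd.to_timedelta`, the helper names `types_seen_rowwise` and `types_seen_columnwise` are hypothetical, and the element-wise check iterates each column directly instead of calling `applymap`.

```python
import pandas as pd

# Same data as in the report, built in a vectorized way.
# Assumption: the trailing "Z" of the original start time is omitted.
start = pd.Timestamp("2014-05-31 01:23:19.9600345")
elapsed_us = [30053400, 40053249, 50053098]
df = pd.DataFrame({"datetimes": start + pd.to_timedelta(elapsed_us, unit="us")})
df["differential_timedeltas"] = df["datetimes"] - df["datetimes"].shift()


def types_seen_rowwise(frame, column):
    """Hypothetical helper: type names that apply(axis=1) hands to the callback."""
    return frame.apply(lambda row: type(row[column]).__name__, axis=1).tolist()


def types_seen_columnwise(frame, column):
    """Hypothetical helper: type names obtained by iterating the column itself."""
    return [type(value).__name__ for value in frame[column]]


# Report, per column, whether the two access paths agree on the element types.
for column in df.columns:
    rowwise = types_seen_rowwise(df, column)
    columnwise = types_seen_columnwise(df, column)
    status = "consistent" if rowwise == columnwise else "MISMATCH"
    print(column, status, rowwise, columnwise)
```

A "MISMATCH" line for 'differential_timedeltas' would indicate the behavior described in this report; the script does not assume which result a given pandas version produces.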