combine_first loses index type information with MultiIndices and different timezones

See the title and the example below: when the levels of two MultiIndexes carry different timezones, the combined result holds object-dtype levels instead of a DatetimeIndex. I believe this is because combining indices with different timezones first converts them to object dtype, then rebases all timestamps to UTC for comparison, and finally constructs a DatetimeIndex from the result. However, that last step does not seem to be applied to the individual levels of a MultiIndex. This is on the latest stable release, 0.18.1.

In [3]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:tz1, tz2 = 'America/New_York', 'UTC'
:
:from1, to1 = [pd.Timestamp('20160101', tz=tz1), pd.Timestamp('20160102', tz=tz1)], [pd.Timestamp('20160102', tz=tz1), pd.Timestamp('20160103', tz=tz1)]
:
:from2, to2 = [pd.Timestamp('20160103', tz=tz2), pd.Timestamp('20160104', tz=tz2)], [pd.Timestamp('20160104', tz=tz2), pd.Timestamp('20160105', tz=tz2)]
:
:index1 = pd.MultiIndex.from_arrays([from1, to1])
:df1 = pd.DataFrame([1, 2], index=index1)
:
:index2 = pd.MultiIndex.from_arrays([from2, to2])
:df2 = pd.DataFrame([1, 2], index=index2)
:
:result = df1.combine_first(df2)
:--

In [4]: df1.index.get_level_values(0)
Out[4]: DatetimeIndex(['2016-01-01 00:00:00-05:00', '2016-01-02 00:00:00-05:00'], dtype='datetime64[ns, America/New_York]', freq=None)

In [5]: df2.index.get_level_values(0)
Out[5]: DatetimeIndex(['2016-01-03', '2016-01-04'], dtype='datetime64[ns, UTC]', freq=None)

In [6]: result.index.get_level_values(0)
Out[6]: Index([2016-01-01 00:00:00-05:00, 2016-01-02 00:00:00-05:00, 2016-01-03 00:00:00+00:00, 2016-01-04 00:00:00+00:00], dtype='object')
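To make the suspected missing step concrete, here is a minimal sketch of rebasing an object-dtype index of mixed-timezone timestamps to UTC by hand. It is an illustration only: it uses the public pd.to_datetime with utc=True, not whatever pandas does internally, and the variable names mixed and rebased are made up for this example.

```python
import pandas as pd

# Object-dtype index holding timestamps in two different timezones,
# mirroring what result.index.get_level_values(0) looks like above.
mixed = pd.Index(
    [
        pd.Timestamp('20160101', tz='America/New_York'),
        pd.Timestamp('20160102', tz='America/New_York'),
        pd.Timestamp('20160103', tz='UTC'),
        pd.Timestamp('20160104', tz='UTC'),
    ],
    dtype=object,
)

# Rebase every timestamp to UTC and rebuild a proper DatetimeIndex.
rebased = pd.to_datetime(mixed, utc=True)

print(mixed.dtype)
print(rebased)
```

The rebased index should hold the same instants as the single-index result shown further down, with the New York midnights expressed as 05:00 UTC.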

It works correctly if the inputs have the same timezone:

In [12]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:tz1, tz2 = 'America/New_York', 'America/New_York'
:
:from1, to1 = [pd.Timestamp('20160101', tz=tz1), pd.Timestamp('20160102', tz=tz1)], [pd.Timestamp('20160102', tz=tz1), pd.Timestamp('20160103', tz=tz1)]
:
:from2, to2 = [pd.Timestamp('20160103', tz=tz2), pd.Timestamp('20160104', tz=tz2)], [pd.Timestamp('20160104', tz=tz2), pd.Timestamp('20160105', tz=tz2)]
:
:index1 = pd.MultiIndex.from_arrays([from1, to1])
:df1 = pd.DataFrame([1, 2], index=index1)
:
:index2 = pd.MultiIndex.from_arrays([from2, to2])
:df2 = pd.DataFrame([1, 2], index=index2)
:
:result = df1.combine_first(df2)
:
:--

In [13]: result.index.get_level_values(0)
Out[13]: DatetimeIndex(['2016-01-01 00:00:00-05:00', '2016-01-02 00:00:00-05:00', '2016-01-03 00:00:00-05:00', '2016-01-04 00:00:00-05:00'], dtype='datetime64[ns, America/New_York]', freq=None)
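Since the same-timezone case behaves as expected, one possible workaround is to convert every tz-aware level to a common timezone before calling combine_first. The sketch below assumes UTC as the common timezone; levels_to_utc is a hypothetical helper written for this report, not part of pandas.

```python
import pandas as pd


def levels_to_utc(df):
    """Return a copy of df whose tz-aware MultiIndex levels are in UTC.

    Hypothetical helper for illustration; not a pandas API.
    """
    arrays = []
    for i in range(df.index.nlevels):
        values = df.index.get_level_values(i)
        # Convert only tz-aware datetime levels; pass others through unchanged.
        if isinstance(values, pd.DatetimeIndex) and values.tz is not None:
            values = values.tz_convert('UTC')
        arrays.append(values)
    out = df.copy()
    out.index = pd.MultiIndex.from_arrays(arrays, names=df.index.names)
    return out


tz1, tz2 = 'America/New_York', 'UTC'

from1 = [pd.Timestamp('20160101', tz=tz1), pd.Timestamp('20160102', tz=tz1)]
to1 = [pd.Timestamp('20160102', tz=tz1), pd.Timestamp('20160103', tz=tz1)]
from2 = [pd.Timestamp('20160103', tz=tz2), pd.Timestamp('20160104', tz=tz2)]
to2 = [pd.Timestamp('20160104', tz=tz2), pd.Timestamp('20160105', tz=tz2)]

df1 = pd.DataFrame([1, 2], index=pd.MultiIndex.from_arrays([from1, to1]))
df2 = pd.DataFrame([1, 2], index=pd.MultiIndex.from_arrays([from2, to2]))

# Both inputs now share one timezone, as in the working example above.
result = levels_to_utc(df1).combine_first(levels_to_utc(df2))
level0 = result.index.get_level_values(0)
print(level0.dtype)
```

This changes the displayed timezone of the first frame's index (the instants stay the same), so it may not suit callers who need the original timezone preserved.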

Behavior is correct for single indices:

In [7]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:
:tz1, tz2 = 'America/New_York', 'UTC'
:
:index1 = [pd.Timestamp('20160101', tz=tz1), pd.Timestamp('20160102', tz=tz1)]
:index2 = [pd.Timestamp('20160103', tz=tz2), pd.Timestamp('20160104', tz=tz2)]
:
:df1 = pd.DataFrame([1, 2], index=index1)
:df2 = pd.DataFrame([1, 2], index=index2)
:
:result = df1.combine_first(df2)
:--

In [8]: df2.index
Out[8]: DatetimeIndex(['2016-01-03', '2016-01-04'], dtype='datetime64[ns, UTC]', freq=None)

In [9]: df1.index
Out[9]: DatetimeIndex(['2016-01-01 00:00:00-05:00', '2016-01-02 00:00:00-05:00'], dtype='datetime64[ns, America/New_York]', freq=None)

In [10]: result.index
Out[10]: DatetimeIndex(['2016-01-01 05:00:00+00:00', '2016-01-02 05:00:00+00:00', '2016-01-03 00:00:00+00:00', '2016-01-04 00:00:00+00:00'], dtype='datetime64[ns, UTC]', freq=None)

output of pd.show_versions()

In [1]: import pandas as pd

In [2]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-88-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.6
pip: 8.1.1
setuptools: 20.3
Cython: 0.22
numpy: 1.9.2
scipy: 0.17.0
statsmodels: 0.6.1.post1
xarray: None
IPython: 3.1.0
sphinx: None
patsy: 0.2.1
dateutil: 2.4.2
pytz: 2015.4
blosc: None
bottleneck: 1.0.0
tables: None
numexpr: 2.4.3
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None