Boolean masks on dataframes with indexes with timezones introduces NaNs · Issue #16889 · pandas-dev/pandas
Code Sample, a copy-pastable example if possible
import pandas as pd
import numpy as np

mask = np.array([False, True, True, False])

# Case 1: timezone-aware (UTC) index
idx = pd.date_range('20010101', '20020101')[:4].tz_localize('UTC')
df = pd.DataFrame({'a' : np.arange(4)}, index=idx).astype('float64')
fill = 2 * df
df.loc[mask, :] = fill.loc[mask, :]
print(df)

# Case 2: timezone-naive index
idx = pd.date_range('20010101', '20020101')[:4]
df = pd.DataFrame({'a' : np.arange(4)}, index=idx).astype('float64')
fill = 2 * df
df.loc[mask, :] = fill.loc[mask, :]
print(df)
Problem description
This is the output that I get:
                             a
2001-01-01 00:00:00+00:00    0
2001-01-02 00:00:00+00:00  NaN
2001-01-03 00:00:00+00:00  NaN
2001-01-04 00:00:00+00:00    3

            a
2001-01-01  0
2001-01-02  2
2001-01-03  4
2001-01-04  3
In the first case, the UTC timezone on the index seems to cause the masked assignment to introduce NaNs. The second case, with a timezone-naive index, is correct. This problem seems to have been introduced in pandas 0.17.1, whereas with 0.16 I get the output shown under Expected Output below.
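A possible workaround, sketched here under an assumption that this report does not confirm: if the NaNs come from aligning the right-hand side against the timezone-aware index, then assigning the raw NumPy values should avoid that alignment step. This has not been checked against pandas 0.17.1, so treat it as something to try rather than a known fix.

```python
import numpy as np
import pandas as pd

mask = np.array([False, True, True, False])
idx = pd.date_range('20010101', '20020101')[:4].tz_localize('UTC')
df = pd.DataFrame({'a': np.arange(4)}, index=idx).astype('float64')
fill = 2 * df

# Assumption: a plain ndarray on the right-hand side carries no index,
# so pandas assigns it positionally instead of aligning on the tz-aware index.
df.loc[mask, :] = fill.loc[mask, :].values

result = df['a'].tolist()
print(result)
```

The trade-off is that positional assignment drops the safety of label alignment, so the mask must select the same rows on both sides, as it does here.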
Expected Output
                           a
2001-01-01 00:00:00+00:00  0
2001-01-02 00:00:00+00:00  2
2001-01-03 00:00:00+00:00  4
2001-01-04 00:00:00+00:00  3

            a
2001-01-01  0
2001-01-02  2
2001-01-03  4
2001-01-04  3
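The expectation above can also be written as a self-checking script, which may be handy for testing other pandas versions. This is only a sketch: `masked_fill` is a hypothetical helper name, and it builds the index with `periods=4` rather than by slicing a year-long range, which should give the same four dates.

```python
import numpy as np
import pandas as pd


def masked_fill(tz=None):
    """Run the masked assignment from the report; tz=None gives a naive index."""
    mask = np.array([False, True, True, False])
    idx = pd.date_range('20010101', periods=4, tz=tz)
    df = pd.DataFrame({'a': np.arange(4)}, index=idx).astype('float64')
    fill = 2 * df
    df.loc[mask, :] = fill.loc[mask, :]
    return df


naive = masked_fill()
aware = masked_fill(tz='UTC')

# Compare raw values only, since the two indexes differ in tz-awareness.
# np.array_equal treats NaN as unequal, so stray NaNs make this False.
same = np.array_equal(aware['a'].values, naive['a'].values)
print(same)
```

On a version with the behaviour described in this report, `same` would be expected to be `False`, because the timezone-aware frame contains NaNs.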
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.17.1
nose: None
pip: 9.0.1
setuptools: 27.2.0
Cython: None
numpy: 1.10.2
scipy: 0.16.0
statsmodels: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.4.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
Jinja2: None