BUG: InvalidIndexError: Reindexing only valid with uniquely valued Index objects on to_datetime · Issue #39882 · pandas-dev/pandas

Problem description

I am experiencing a weird bug while trying to convert a list to datetime.
Here is a repro:

import pandas as pd

s = pd.Series(
    [
        None,
        pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT,
        pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT,
        pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT,
        pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT,
        pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT,
        pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT,
        pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT,
        pd.Timestamp('2016-06-07 20:07:42'), pd.NaT,
        pd.Timestamp('2016-05-04 20:09:22'), pd.Timestamp('2016-04-12 20:07:40'),
        pd.Timestamp('2016-03-30 20:10:39'),
        pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT, pd.NaT,
        pd.Timestamp('2015-12-18 20:14:06'), pd.Timestamp('2015-12-15 20:07:59'),
        pd.Timestamp('2015-11-17 20:09:31'), pd.Timestamp('2015-11-05 20:10:41'),
        pd.Timestamp('2015-10-29 20:12:12'),
        pd.NaT, pd.NaT, pd.NaT, pd.NaT,
        pd.Timestamp('2015-07-20 20:10:23'), pd.Timestamp('2015-06-16 20:07:24'),
        pd.Timestamp('2015-05-28 20:09:39'),
        pd.NaT, pd.NaT,
        pd.Timestamp('2015-04-15 20:12:54'),
        pd.NaT, pd.NaT, pd.NaT,
        pd.Timestamp('2015-02-20 20:12:07'), pd.Timestamp('2014-12-29 20:04:30'),
        pd.Timestamp('2014-12-15 20:09:24'),
        pd.NaT, pd.NaT,
        pd.Timestamp('2014-11-17 20:03:56'), pd.NaT,
        pd.Timestamp('2014-10-14 20:04:47'),
        pd.NaT, pd.NaT, pd.NaT, pd.NaT,
        pd.Timestamp('2014-08-21 20:06:15'), pd.NaT,
        pd.Timestamp('2012-07-26 00:10:29'),
    ],
    dtype="object",
)
pd.to_datetime(s, errors="coerce")

What I get is:

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

With a smaller example, the conversion works as expected:

import pandas as pd

s = pd.Series([None, pd.NaT, pd.Timestamp('2014-11-17 20:03:56')], dtype="object")
pd.to_datetime(s, errors="coerce")

gives me

0                   NaT
1                   NaT
2   2014-11-17 20:03:56
dtype: datetime64[ns]

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 7d32926
python : 3.7.4.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Tue Nov 10 00:10:30 PST 2020; root:xnu-6153.141.10~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.2
numpy : 1.19.4
pytz : 2020.4
dateutil : 2.8.1
pip : 20.2.4
setuptools : 40.8.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.19.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 0.8.5
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 2.0.0
pyxlsb : None
s3fs : 0.2.2
scipy : 1.5.4
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : None
numba : None

Here is what I found while stepping through the code with pdb:

> /Users/jfournier/.pyenv/versions/3.7.4/lib/python3.7/site-packages/pandas/core/tools/datetimes.py(803)to_datetime()
    801         cache_array = _maybe_cache(arg, format, cache, convert_listlike)
    802         if not cache_array.empty:
--> 803             result = arg.map(cache_array)
    804         else:
    805             values = convert_listlike(arg._values, format)

ipdb> cache_array
NaT                                   NaT
NaT                                   NaT
2016-06-07 20:07:42   2016-06-07 20:07:42
2016-05-04 20:09:22   2016-05-04 20:09:22
2016-04-12 20:07:40   2016-04-12 20:07:40
2016-03-30 20:10:39   2016-03-30 20:10:39
2015-12-18 20:14:06   2015-12-18 20:14:06
2015-12-15 20:07:59   2015-12-15 20:07:59
2015-11-17 20:09:31   2015-11-17 20:09:31
2015-11-05 20:10:41   2015-11-05 20:10:41
2015-10-29 20:12:12   2015-10-29 20:12:12
2015-07-20 20:10:23   2015-07-20 20:10:23
2015-06-16 20:07:24   2015-06-16 20:07:24
2015-05-28 20:09:39   2015-05-28 20:09:39
2015-04-15 20:12:54   2015-04-15 20:12:54
2015-02-20 20:12:07   2015-02-20 20:12:07
2014-12-29 20:04:30   2014-12-29 20:04:30
2014-12-15 20:09:24   2014-12-15 20:09:24
2014-11-17 20:03:56   2014-11-17 20:03:56
2014-10-14 20:04:47   2014-10-14 20:04:47
2014-08-21 20:06:15   2014-08-21 20:06:15
2012-07-26 00:10:29   2012-07-26 00:10:29
dtype: datetime64[ns]

It seems that the deduplication step treats None and NaT as distinct values. None then gets converted to NaT, so NaT ends up twice in the index of cache_array, and ultimately the self.is_unique check fails.
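
The suspected mechanism can be sketched by hand, without relying on the internals of any particular pandas version. This is only an illustration under assumptions: the names unique_inputs, cache_like and deduplicated are made up for the example and are not pandas internals, and dropping duplicated labels is just one conceivable repair, not necessarily what pandas does. The last lines show a possible workaround: passing cache=False to pd.to_datetime should skip the cache lookup entirely.

```python
import pandas as pd

ts = pd.Timestamp("2014-11-17 20:03:56")

# None and NaT are different Python objects, so a deduplication over
# object values can keep both of them as separate "unique" entries.
unique_inputs = [None, pd.NaT, ts]

# Converting the unique values turns None into NaT as well.
converted = pd.to_datetime(unique_inputs, errors="coerce")

# A cache keyed by the converted values now carries the NaT label twice
# (hypothetical stand-in for the cache_array shown in the pdb session).
cache_like = pd.Series(converted, index=converted)
print(cache_like.index.is_unique)

# One conceivable repair (an assumption, not the confirmed fix):
# keep only the first occurrence of each index label.
deduplicated = cache_like[~cache_like.index.duplicated()]
print(deduplicated.index.is_unique)

# Possible workaround: disable the cache so no lookup by label happens.
s = pd.Series([None] + [pd.NaT] * 60 + [ts], dtype="object")
result = pd.to_datetime(s, errors="coerce", cache=False)
print(result.tail(3))
```

The series used for the workaround is a shortened stand-in for the one in the repro above. Disabling the cache may be slower on inputs with many repeated date strings, so it is probably best limited to the affected calls.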