pandas (original) (raw)

import pandas as pd

dti = pd.date_range("2016-01-01", periods=3, unit="s") item = pd.Timestamp("2016-02-03 04:05:06.789")

ser = pd.Series(dti)

ser[0] = item

assert ser[0] == item # <- raises bc item is rounded

mask = [False, True, True] res = ser.where(mask, item)

assert res[0] == item # <- raises bc item is rounded

ser2 = ser.copy() ser2[0] = pd.NaT res2 = ser2.fillna(item)

assert res2[0] == item # <- raises bc item is rounded

Issue Description

The operations above all go through DatetimeArray._validate_setitem_value -> _validate_scalar -> _unbox_scalar. _unbox_scalar does a value.as_unit(self.unit), which uses the default round_ok=True, so casts the ms-unit item to s-unit, which truncates.

(Something similar happens if we use a vector vec = pd.date_range(item, periods=3, unit="ms") instead of scalar item, but that also has other issues that I'll address later)

The apparent solution here is to do the as_unit casting in _validate_scalar, bug pass round_ok=False. This fixes the cases above (the __setitem__ case correctly issues a PDEP6 warning). But it breaks test_resample_ohlc_result, test_resample_dst_anchor2, and test_clip_with_timestamps_and_oob_datetimes.

In the clip test, the change causes us to raise in EABackedBlock.where, fall back to coerce_to_target_dtype, which identifies new_dtype and M8[ns], tries to cast, and raises OutOfBoundsDatetime. There are a few non-crazy options here. I lean toward doing a PDEP6-like deprecation for cases that go through this path (where, mask, clip, fillna).

The other cases are raising a KeyError on a DatetimeIndex.get_loc call when doing something like ser.loc[:"4-15-2000"]. I need to look into this more closely, tentatively looks solvable.

Last I want to make sure I write down the really ugly bug in where/fillna if we use a vector:

vec = pd.date_range(item, periods=3, unit="ms")
mask = [False, True, True]

res = ser.where(mask, vec)
arr = res._values
>>> arr.dtype == arr._ndarray.dtype  # <- raises, yikes!

This goes though NDArrayBackedExtensionArray._where, which does value = self._validate_setitem_value(value) which fails to cast the unit of vec, so gets back value which is a M8[ms] ndarray instead of M8[s]. We then pass it to np.where which gives back res_values with M8[ms] dtype, and then call _from_backing_data which assumes that res_values.dtype == self._ndarray.dtype. That assumption is wrong in this case and we get a corrupted EA back.

I've got a branch addressing this, just want to make sure this is written down somewhere.

cc @MarcoGorelli to make sure we're in agreement that the cases above should not silently truncate.