REF: Back IntervalArray by array instead of Index by jbrockmendel · Pull Request #36310 · pandas-dev/pandas (original) (raw)

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

@jbrockmendel

The benefit I have in mind here is that we could back it by a single 2xN array and a) avoid the kludge needed to make __setitem__ atomic, b) do a view to get native types for e.g uniqueness checks, c) possibly share some methods with NDarrayBackedExtensionArray.

Also just in principle having EAs not depend on Index is preferable dependency-structure-wise.

cc @jschendel

@jbrockmendel

jreback

jreback

right = self.right.fillna(value=value.right)
from pandas import Index
left = Index(self.left).fillna(value=value.left)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

umm why do you need to coerce to .fillna?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ATM self.left is an ndarray which doesnt have fillna

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh

@jbrockmendel

AFAICT remaining test failures are on from arrow

@jbrockmendel

@jorisvandenbossche

It are our own tests (testing the conversion that is implemented in pandas itself), so that's something that need to be fixed in this PR.

Looking at the failure, it seems that setting a missing value in an IntervalArray does not/no longer set a missing value in both left and right arrays. But looking at the changes in __setitem__ in the diff, I don't directly see why that would be the case.

jorisvandenbossche

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For backwards compatibility, we could keep .left and .right returning an Index? (since the arrays are actually stored as ._left and ._right)

@jbrockmendel

@jbrockmendel

@TomAugspurger

Nice idea, +1 to changing the storage.

We shouldn't change IntervalArray.left from an index to ndarray without warning. I'm fine with just continuing to return an Index. Or if we really want to change it we can deprecate IntervalArray.left in favor of IntervalArray.left_array.

jreback

assert_index_equal(left.right, right.right, exact=exact, obj=f"{obj}.left")
if left._left.dtype.kind in ["m", "M"]:
# We have a DatetimeArray or Timed
# TODO: `exact` keyword?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe better to

kwargs = {}
if left._left.dtype.kind in ["m", "M"]:
    kwargs['check_freq'] = False
....
new_right = self.right.astype(dtype.subtype)
# We need to use Index rules for astype to prevent casting
# np.nan entries to int subtypes
new_left = Index(self._left).astype(dtype.subtype)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy=False ?

fill_value = self.left._na_value
from pandas import Index
fill_value = Index(self._left)._na_value

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't we get the fill value rather than doing this? if we need to do this add copy=False

@cache_readonly
def left(self) -> Index:
return Index(self._values.left)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy=False on these?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add this?

@jbrockmendel

i think comments have been addressed, LMK if i missed anything

jreback

_mask = None
_data: IntervalArray
_values: IntervalArray

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

umm, now i am confused, what is different about these?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without this, mypy thinks _values is ExtensionArray and has a bunch of new complaints since we access self._values.left below

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kk, should try to remove this at some point

@cache_readonly
def left(self) -> Index:
return Index(self._values.left)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add this?

@jbrockmendel

@jbrockmendel

@jreback

this is marked POC, but i assume want to merge it?

@jbrockmendel jbrockmendel changed the titlePOC: back IntervalArray by array instead of Index Back IntervalArray by array instead of Index

Sep 24, 2020

@jbrockmendel

this is marked POC, but i assume want to merge it?

Yes, updated the title.

@jbrockmendel

TomAugspurger

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good generally. Just one change needed in the whatsnew.

This should improve the performance of IntervalArray.__setitem__, right? That could be added in the release notes.

- `Styler` now allows direct CSS class name addition to individual data cells (:issue:`36159`)
- :meth:`Rolling.mean()` and :meth:`Rolling.sum()` use Kahan summation to calculate the mean to avoid numerical problems (:issue:`10319`, :issue:`11645`, :issue:`13254`, :issue:`32761`, :issue:`36031`)
- :meth:`DatetimeIndex.searchsorted`, :meth:`TimedeltaIndex.searchsorted`, :meth:`PeriodIndex.searchsorted`, and :meth:`Series.searchsorted` with datetimelike dtypes will now try to cast string arguments (listlike and scalar) to the matching datetimelike type (:issue:`36346`)
- :func:`pandas._testing.assert_datetime_array_equal` and :func:`pandas._testing.assert_timedelta_array_equal` now have a ``check_freq=True`` keyword that allows disabling the check for matching ``freq`` attribute (:issue:`36310`)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs should only reference public functions. How would users actually pass this through a public API? I suspect it's impossible, since check_freq in assert_frame_equal applies to the index rather than values of an array.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So assert_extension_array_equal could perhaps take kwargs and pass it through. But that's maybe not worth the effort.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so remove this note?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, unless this is user facing in some way (e.g. is assert_index_equal changed)?

if self.closed != other.closed:
return np.zeros(len(self), dtype=bool)
return (self.left == other.left) & (self.right == other.right)
return (self._left == other.left) & (self._right == other.right)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is other here known to be a particular type (like IntervalArray), or is it something like Union[IntervalArray, Series, Index,Interval]. If it's just IntervalArray it'd be a bit faster to compare with other._left and right._left.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at this point other can be an Interval or IntervalArray

@jbrockmendel

@jbrockmendel

@jbrockmendel

This should improve the performance of IntervalArray.setitem, right? That could be added in the release notes.

In [2]: ii = pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4]) 
In [3]: ia = ii._data                                                           
In [4]: val = ia[0]                                                             

In [5]: %timeit ia[-1] = val                                                    
38.4 µs ± 1.74 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)  # <-- master
23.9 µs ± 335 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)  # <-- PR

note added to whatsnew.

jreback

- `Styler` now allows direct CSS class name addition to individual data cells (:issue:`36159`)
- :meth:`Rolling.mean()` and :meth:`Rolling.sum()` use Kahan summation to calculate the mean to avoid numerical problems (:issue:`10319`, :issue:`11645`, :issue:`13254`, :issue:`32761`, :issue:`36031`)
- :meth:`DatetimeIndex.searchsorted`, :meth:`TimedeltaIndex.searchsorted`, :meth:`PeriodIndex.searchsorted`, and :meth:`Series.searchsorted` with datetimelike dtypes will now try to cast string arguments (listlike and scalar) to the matching datetimelike type (:issue:`36346`)
- :func:`pandas._testing.assert_datetime_array_equal` and :func:`pandas._testing.assert_timedelta_array_equal` now have a ``check_freq=True`` keyword that allows disabling the check for matching ``freq`` attribute (:issue:`36310`)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, unless this is user facing in some way (e.g. is assert_index_equal changed)?

@jbrockmendel

@jbrockmendel

@jorisvandenbossche jorisvandenbossche changed the titleBack IntervalArray by array instead of Index REF: Back IntervalArray by array instead of Index

Oct 1, 2020

jorisvandenbossche

# We have a DatetimeArray or TimedeltaArray
kwargs["check_freq"] = False
# TODO: `exact` keyword?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this TODO needs to be solved first? It was there before, but you now removed it? (so is not being ignored?)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think we're ok without the keyword, will remove the comment

from pandas.core.ops.array_ops import maybe_upcast_datetimelike_array
left = maybe_upcast_datetimelike_array(left)
left = extract_array(left, extract_numpy=True)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above we are first ensuring that the arrays passed to _simple_new are an index, and then we extract the array again. Is this roundtrip to index and back needed?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can avoid the roundtrip eventually, will be best accomplished by being stricter in what we pass to _simple_new

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, would be a nice follow-up

right._values[key] = value_right
self._right = right
self._left[key] = value_left
self._right[key] = value_right # TODO: needs tests for not breaking views

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the un-xfail-ed test doing that?
(or if not, can you add a test now?)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, will remove comment

@jbrockmendel

@jbrockmendel

@jbrockmendel

jreback

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comment

new_right = self.right.astype(dtype.subtype)
# We need to use Index rules for astype to prevent casting
# np.nan entries to int subtypes
new_left = Index(self._left, copy=False).astype(dtype.subtype)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could add copy=False to .astype (not sure how much any of this matters though)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merge is ok now and can see if this matters on followup

_mask = None
_data: IntervalArray
_values: IntervalArray

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kk, should try to remove this at some point

jreback

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request

Nov 2, 2020

@jbrockmendel

Labels