Proposal: New Index type for binned data (IntervalIndex)

Design

The idea is to have a natural representation of the grids that ubiquitously appear in simulations and measurements of physical systems. Instead of referencing a single value, a grid cell references a range of values, based on the chosen discretization. Typically, cell boundaries would be specified by floating point numbers. In one dimension, a grid cell corresponds to an interval, the name we use here.

The key feature of IntervalIndex is that looking up an indexer should return all intervals in which the indexer's values fall. FloatIndex is a poor substitute, because of floating point precision issues, and because I don't want to label values by a single point.

An IntervalIndex is uniquely identified by two properties: intervals, an ndarray of shape (len(idx), 2) giving the bounds of each interval, and closed ('left' or 'right'), indicating which side of each interval is closed. Other useful properties for IntervalIndex would include left, right and mid, which should return arrays (indexes?) corresponding to the left, right or mid-points of each interval.
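A minimal sketch of how left, right and mid could be derived from the (len(idx), 2) array described above. It uses plain NumPy only; the variable `intervals` is a stand-in for the proposed property, not an existing pandas attribute.

```python
import numpy as np

# Stand-in for the proposed `intervals` property (an assumption for
# illustration): one row per interval, columns are (left, right) bounds.
intervals = np.array([(0.0, 1.0), (1.0, 2.0), (2.0, 4.0)])

left = intervals[:, 0]        # left bound of each interval
right = intervals[:, 1]       # right bound of each interval
mid = 0.5 * (left + right)    # mid-point of each interval

print(left, right, mid)
```

Because all three are simple views or elementwise expressions on one array, they stay consistent with each other by construction.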

The constructor should allow the optional keyword argument breaks (an array of length len(idx) + 1) to be specified instead of intervals.
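One way the breaks argument could be turned into the intervals array is to pair each break with its successor. The helper name `intervals_from_breaks` is hypothetical and only illustrates the conversion.

```python
import numpy as np

def intervals_from_breaks(breaks):
    """Hypothetical helper: pair adjacent breaks into (left, right) rows."""
    breaks = np.asarray(breaks)
    if breaks.ndim != 1:
        raise ValueError("breaks must be one-dimensional")
    # n + 1 breaks yield n intervals: (b[0], b[1]), (b[1], b[2]), ...
    return np.column_stack([breaks[:-1], breaks[1:]])

intervals = intervals_from_breaks([0, 1, 2])
print(intervals)
```

Note that breaks can only describe contiguous intervals, since each interior break is shared by two neighbours.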

It's not entirely obvious what idx.values should be (idx.mid? strings like '(0, 1]'? an array of tuples or Interval objects?). I think the most useful choice for cross compatibility would probably be an ndarray like idx.mid.

IntervalIndex should support mathematical operations (e.g., idx + 1), which are calculated by vectorizing the operation over the breaks.
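As a rough illustration of vectorizing an operation over the breaks, the sketch below applies `+ 1` to a plain NumPy array of breaks and rebuilds the interval bounds. The variable names are illustrative, not part of any existing API.

```python
import numpy as np

breaks = np.array([0.0, 1.0, 2.0])   # describes (0, 1] and (1, 2]

# idx + 1 would amount to shifting every break by one.
shifted_breaks = breaks + 1

# Rebuild the (left, right) rows from the shifted breaks.
shifted_intervals = np.column_stack([shifted_breaks[:-1], shifted_breaks[1:]])
print(shifted_intervals)
```

For a non-contiguous index, the same elementwise operation could presumably be applied to the (len(idx), 2) intervals array instead. Operations that reverse the ordering, such as multiplying by a negative number, would need extra care to keep the index monotonic.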

Examples

An example already in pandas that should be an IntervalIndex is the levels property of the Categorical returned by cut, which is currently an object array of strings:

>>> pd.cut([], [0, 5, 10]).levels
Index([u'(0, 5]', u'(5, 10]'], dtype='object')

Example usage:

# should be equivalent to pd.cut([], [0, 1, 2]).levels
>>> idx = IntervalIndex(intervals=[(0, 1), (1, 2)])
>>> idx2 = IntervalIndex(breaks=[0, 1, 2])  # equivalent
>>> idx
IntervalIndex([(0, 1), (1, 2)], closed='right')
>>> idx.left
np.array([0, 1])
>>> idx.right
np.array([1, 2])
>>> idx.mid
np.array([0.5, 1.5])
>>> s = pd.Series([1, 2], idx)
>>> s
(0, 1]    1
(1, 2]    2
dtype: int64
>>> s.loc[1]
1
>>> s.loc[0.5]
1
>>> s.loc[0]
KeyError

Implementation

An IntervalIndex would be a monotonic and non-overlapping one-dimensional array of intervals. It is not required to be contiguous. A scalar Interval would correspond to a contiguous interval between start and stop values (e.g., given by integers, floating point numbers or datetimes).
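The monotonic and non-overlapping constraints could be checked with two vectorized comparisons, as in the sketch below. The function name is hypothetical, and the requirement that each interval has left < right (no empty intervals) is an assumption not stated in the proposal.

```python
import numpy as np

def is_valid_interval_array(intervals):
    """Hypothetical check: sorted, non-overlapping, gaps allowed."""
    intervals = np.asarray(intervals).reshape(-1, 2)
    left, right = intervals[:, 0], intervals[:, 1]
    # Assumption: every interval is non-empty.
    well_formed = np.all(left < right)
    # Each interval must end at or before the start of the next one;
    # a strict gap is fine because contiguity is not required.
    ordered = np.all(right[:-1] <= left[1:])
    return bool(well_formed and ordered)

print(is_valid_interval_array([(0, 1), (1, 2)]))   # contiguous
print(is_valid_interval_array([(0, 1), (3, 4)]))   # gap between intervals
print(is_valid_interval_array([(0, 2), (1, 3)]))   # overlapping
```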

For index lookups, I propose to do a binary search (np.searchsorted) on idx.left. If we add the constraint that all intervals must have a fixed width, we could calculate the bin using a formula in constant time, but I'm not sure the loss in flexibility would be worth the speedup.

IntervalIndex should play nicely when used as the levels for a Categorical variable (#7217), but it is not the same as a CategoricalIndex (#7629). For example, an IntervalIndex should not allow for redundant values. To represent redundant or non-contiguous intervals, you would need to use a Categorical or CategoricalIndex which uses an IntervalIndex for the levels. Calling df.reset_index() on a DataFrame with an IntervalIndex would create a new Categorical column.

Note: I'm not entirely sure if this design doc belongs here or on the mailing list (I'm happy to post it there if requested).

Here is the comment where I brought this up previously: #5460 (comment)

CC @hugadams -- I expect IntervalIndex would be very handy for your pyuvvis.