Partial Selection on MultiIndex: The need for empty slice support & dict indexing · Issue #4036 · pandas-dev/pandas (original) (raw)

related #4036, #4116
from SO with 0.14.0 multi-index slicers: http://stackoverflow.com/questions/24126542/pandas-multi-index-slices-for-level-names/24126676#24126676

Here's the example from there:

In [11]: midx = pd.MultiIndex.from_product([list(range(3)),['a','b','c'],pd.date_range('20130101',periods=3)],names=['numbers','letters','dates'])

In [12]: midx.names.index('letters')
Out[12]: 1

In [13]: midx.names.index('dates')
Out[13]: 2

Here's a complete example

In [18]: df = DataFrame(np.random.randn(len(midx),1),index=midx)

In [19]: df
Out[19]: 
                                   0
numbers letters dates               
0       a       2013-01-01  0.261092
                2013-01-02 -1.267770
                2013-01-03  0.008230
        b       2013-01-01 -1.515866
                2013-01-02  0.351942
                2013-01-03 -0.245463
        c       2013-01-01 -0.253103
                2013-01-02 -0.385411
                2013-01-03 -1.740821
1       a       2013-01-01 -0.108325
                2013-01-02 -0.212350
                2013-01-03  0.021097
        b       2013-01-01 -1.922214
                2013-01-02 -1.769003
                2013-01-03 -0.594216
        c       2013-01-01 -0.419775
                2013-01-02  1.511700
                2013-01-03  0.994332
2       a       2013-01-01 -0.020299
                2013-01-02 -0.749474
                2013-01-03 -1.478558
        b       2013-01-01 -1.357671
                2013-01-02  0.161185
                2013-01-03 -0.658246
        c       2013-01-01 -0.564796
                2013-01-02 -0.333106
                2013-01-03 -2.814611

This is your dict of level names -> slices

In [20]: slicers = { 'numbers' : slice(0,1), 'dates' : slice('20130102','20130103') }

This creates an indexer that is empty (selects everything)

In [21]: indexer = [ slice(None) ] * len(df.index.levels)

Add in your slicers

In [22]: for n, idx in slicers.items():
              indexer[df.index.names.index(n)] = idx

And select (this has to be a tuple, but was a list to start as we had to modify it)

In [23]: df.loc[tuple(indexer),:]
Out[23]: 
                                   0
numbers letters dates               
0       a       2013-01-02 -1.267770
                2013-01-03  0.008230
        b       2013-01-02  0.351942
                2013-01-03 -0.245463
        c       2013-01-02 -0.385411
                2013-01-03 -1.740821
1       a       2013-01-02 -0.212350
                2013-01-03  0.021097
        b       2013-01-02 -1.769003
                2013-01-03 -0.594216
        c       2013-01-02  1.511700
                2013-01-03  0.994332

I use hierarchical indices regularly with pandas DataFrames and Series objects. It is invaluable to be able to partially select subsets of rows based on a set of arbitrary index values, and retain the index information for subsequent groupby operations etc.

I am looking for an elegant way to pass an ordered tuple (with possibly empty slices) or an arbitrary dict of {index_level_name:value,...} pairs to select rows matching the passed index:value pairs. Note: I am aware that with Boolean indexing on data columns and nested np.logical_and() statements you can construct such a Boolean select index. I'm looking for an elegant solution using indexes & levels to avoid repeatedly using df.reset_index and building Boolean arrays. Also, df.xs() does not work in every situation (see below) and does not exist for Series with MultiIndex.

To explain this lets create a DataFrame with 5 index levels:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: print pd.version 0.10.0

In [4]: # Generate Test DataFrame ...: NUM_ROWS = 100000 ...:

In [5]: NUM_COLS = 10

In [6]: col_names = ['A'+num for num in map(str,np.arange(NUM_COLS).tolist())]

In [7]: index_cols = col_names[:5]

In [8]: # Set DataFrame to have 5 level Hierarchical Index. ...: # The dtype does not matter try str or np.int64 same results. ...: # Sort the index! ...: df = pd.DataFrame(np.random.randint(5, size=(NUM_ROWS,NUM_COLS)), dtype=np.int64, columns=col_names) ...:

In [9]: df = df.set_index(index_cols).sort_index()

In [10]: df Out[10]: <class 'pandas.core.frame.DataFrame'> MultiIndex: 100000 entries, (0, 0, 0, 0, 0) to (4, 4, 4, 4, 4) Data columns: A5 100000 non-null values A6 100000 non-null values A7 100000 non-null values A8 100000 non-null values A9 100000 non-null values dtypes: int64(5)

In [11]: df.index.names Out[11]: ['A0', 'A1', 'A2', 'A3', 'A4']

Now index on every level and we get back the rows we want :) I love that I get back the complete index too because it may be useful later.

In [12]: df.ix[(0,1,2,3,4)] Out[12]: A5 A6 A7 A8 A9 A0 A1 A2 A3 A4
0 1 2 3 4 1 3 1 1 3 4 4 3 4 2 4 4 0 2 3 1 3 ... 4 1 1 3 4 3 4 0 1 2 4 1

Now if we index on the first 4 levels we get back something different, a data frame with the first 4 index levels dropped. It would be nice to have the option to keep all index levels even though they are repetitive (like above).

In [13]: df.ix[(0,1,2,3)] Out[13]: <class 'pandas.core.frame.DataFrame'> Int64Index: 144 entries, 0 to 4 Data columns: A5 144 non-null values A6 144 non-null values A7 144 non-null values A8 144 non-null values A9 144 non-null values dtypes: int64(5)

Now comes the tricky part. What if I only want to index on the first and last 2 index levels, and want everything from the 3rd level? Empty slicing is not supported.

In [14]: df.ix[(0,1,:,3,4)] File "", line 1 df.ix[(0,1,:,3,4)] ^ SyntaxError: invalid syntax

In [15]: df.ix[(0,1,slice(None),3,4)]

IndexingError Traceback (most recent call last) in () ----> 1 df.ix[(0,1,slice(None),3,4)]

C:\Python27\lib\site-packages\pandas\core\indexing.pyc in getitem(self, key) 30 pass 31 ---> 32 return self._getitem_tuple(key) 33 else: 34 return self._getitem_axis(key, axis=0)

C:\Python27\lib\site-packages\pandas\core\indexing.pyc in _getitem_tuple(self, tup) 212 for i, key in enumerate(tup): 213 if i >= self.obj.ndim: --> 214 raise IndexingError('Too many indexers') 215 216 if _is_null_slice(key):

IndexingError: Too many indexers

df.xs can somewhat help here but its useless for MultiIndex on a series. And it drops the indexed levels leaving you unsure to what fixed index levels you have drilled to. :(

In [16]: df.xs((0,2,3),level=df.index.names[::2]) Out[16]: <class 'pandas.core.frame.DataFrame'> MultiIndex: 805 entries, (0, 0) to (4, 4) Data columns: A5 805 non-null values A6 805 non-null values A7 805 non-null values A8 805 non-null values A9 805 non-null values dtypes: int64(5)

Interestingly df.xs() is not consistant, because you cannot explicitly index on every level giving it the list of all level names:

In [17]: df.xs((0,1,2,3,4), level=df.index.names)

AttributeError Traceback (most recent call last) in () ----> 1 df.xs((0,1,2,3,4), level=df.index.names)

C:\Python27\lib\site-packages\pandas\core\frame.pyc in xs(self, key, axis, level, copy) 2233 labels = self._get_axis(axis) 2234 if level is not None: -> 2235 loc, new_ax = labels.get_loc_level(key, level=level) 2236 2237 if not copy and not isinstance(loc, slice):

C:\Python27\lib\site-packages\pandas\core\index.pyc in get_loc_level(self, key, level) 2193 2194 result = loc if result is None else result & loc -> 2195 return result, _drop_levels(result, level) 2196 2197 level = self._get_level_number(level)

C:\Python27\lib\site-packages\pandas\core\index.pyc in _drop_levels(indexer, levels) 2177 levels = [self._get_level_number(i) for i in levels] 2178 for i in sorted(levels, reverse=True): -> 2179 new_index = new_index.droplevel(i) 2180 return new_index 2181

AttributeError: 'Int64Index' object has no attribute 'droplevel'

However df.xs without the level attribute on all index levels works as expected...

In [18]: df.xs((0,1,2,3,4))
Out[18]:                 A5  A6  A7  A8  A9
A0 A1 A2 A3 A4                    
0  1  2  3  4    1   3   1   1   3
            4    4   3   4   2   4
            4    0   2   3   1   3
           ...
            4    1   1   3   4   3
            4    0   1   2   4   1

Thoughts:
One (somewhat limiting) solution could be allowing df.ix[(0,1,3,:,4)] to take an empty slice for an index level and return the data frame indexed on only the the passed index levels that are known. Today this capability does not exist, although an ordered partial list of index levels works.

The next and more general approach could be to pass a dict of df.ix[{index_level:value}] pairs and return the rows where the specified index levels equal the passed values. Unspecified levels are not filtered down and we have the option to return all index levels.