Advanced Indexing — pandas 0.18.1 documentation (original) (raw)

MultiIndex / Advanced Indexing

This section covers indexing with a MultiIndex and more advanced indexing features.

See the Indexing and Selecting Data for general indexing documentation.

Warning

Whether a copy or a reference is returned for a setting operation, may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a View versus Copy

Warning

In 0.15.0 Index has internally been refactored to no longer sub-class ndarraybut instead subclass PandasObject, similarly to the rest of the pandas objects. This should be a transparent change with only very limited API implications (See the Internal Refactoring)

See the cookbook for some advanced strategies

Hierarchical indexing (MultiIndex)

Hierarchical / Multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).

In this section, we will show what exactly we mean by “hierarchical” indexing and how it integrates with the all of the pandas indexing functionality described above and in prior sections. Later, when discussing group by and pivoting and reshaping data, we’ll show non-trivial applications to illustrate how it aids in structuring data for analysis.

See the cookbook for some advanced strategies

Creating a MultiIndex (hierarchical index) object

The MultiIndex object is the hierarchical analogue of the standardIndex object which typically stores the axis labels in pandas objects. You can think of MultiIndex an array of tuples where each tuple is unique. AMultiIndex can be created from a list of arrays (usingMultiIndex.from_arrays), an array of tuples (usingMultiIndex.from_tuples), or a crossed set of iterables (usingMultiIndex.from_product). The Index constructor will attempt to return a MultiIndex when it is passed a list of tuples. The following examples demo different ways to initialize MultiIndexes.

In [1]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], ...: ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']] ...:

In [2]: tuples = list(zip(*arrays))

In [3]: tuples Out[3]: [('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')]

In [4]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

In [5]: index Out[5]: MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']], labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]], names=[u'first', u'second'])

In [6]: s = pd.Series(np.random.randn(8), index=index)

In [7]: s Out[7]: first second bar one 0.469112 two -0.282863 baz one -1.509059 two -1.135632 foo one 1.212112 two -0.173215 qux one 0.119209 two -1.044236 dtype: float64

When you want every pairing of the elements in two iterables, it can be easier to use the MultiIndex.from_product function:

In [8]: iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]

In [9]: pd.MultiIndex.from_product(iterables, names=['first', 'second']) Out[9]: MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']], labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]], names=[u'first', u'second'])

As a convenience, you can pass a list of arrays directly into Series or DataFrame to construct a MultiIndex automatically:

In [10]: arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']), ....: np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])] ....:

In [11]: s = pd.Series(np.random.randn(8), index=arrays)

In [12]: s Out[12]: bar one -0.861849 two -2.104569 baz one -0.494929 two 1.071804 foo one 0.721555 two -0.706771 qux one -1.039575 two 0.271860 dtype: float64

In [13]: df = pd.DataFrame(np.random.randn(8, 4), index=arrays)

In [14]: df Out[14]: 0 1 2 3 bar one -0.424972 0.567020 0.276232 -1.087401 two -0.673690 0.113648 -1.478427 0.524988 baz one 0.404705 0.577046 -1.715002 -1.039268 two -0.370647 -1.157892 -1.344312 0.844885 foo one 1.075770 -0.109050 1.643563 -1.469388 two 0.357021 -0.674600 -1.776904 -0.968914 qux one -1.294524 0.413738 0.276662 -0.472035 two -0.013960 -0.362543 -0.006154 -0.923061

All of the MultiIndex constructors accept a names argument which stores string names for the levels themselves. If no names are provided, None will be assigned:

In [15]: df.index.names Out[15]: FrozenList([None, None])

This index can back any axis of a pandas object, and the number of levelsof the index is up to you:

In [16]: df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)

In [17]: df Out[17]: first bar baz foo qux
second one two one two one two one
A 0.895717 0.805244 -1.206412 2.565646 1.431256 1.340309 -1.170299
B 0.410835 0.813850 0.132003 -0.827317 -0.076467 -1.187678 1.130127
C -1.413681 1.607920 1.024180 0.569605 0.875906 -2.211372 0.974466

first
second two
A -0.226169
B -1.436737
C -2.006747

In [18]: pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6]) Out[18]: first bar baz foo
second one two one two one two first second
bar one -0.410001 -0.078638 0.545952 -1.219217 -1.226825 0.769804 two -1.281247 -0.727707 -0.121306 -0.097883 0.695775 0.341734 baz one 0.959726 -1.110336 -0.619976 0.149748 -0.732339 0.687738 two 0.176444 0.403310 -0.154951 0.301624 -2.179861 -1.369849 foo one -0.954208 1.462696 -1.743161 -0.826591 -0.345352 1.314232 two 0.690579 0.995761 2.396780 0.014871 3.357427 -0.317441

We’ve “sparsified” the higher levels of the indexes to make the console output a bit easier on the eyes.

It’s worth keeping in mind that there’s nothing preventing you from using tuples as atomic labels on an axis:

In [19]: pd.Series(np.random.randn(8), index=tuples) Out[19]: (bar, one) -1.236269 (bar, two) 0.896171 (baz, one) -0.487602 (baz, two) -0.082240 (foo, one) -2.182937 (foo, two) 0.380396 (qux, one) 0.084844 (qux, two) 0.432390 dtype: float64

The reason that the MultiIndex matters is that it can allow you to do grouping, selection, and reshaping operations as we will describe below and in subsequent areas of the documentation. As you will see in later sections, you can find yourself working with hierarchically-indexed data without creating aMultiIndex explicitly yourself. However, when loading data from a file, you may wish to generate your own MultiIndex when preparing the data set.

Note that how the index is displayed by be controlled using themulti_sparse option in pandas.set_printoptions:

In [20]: pd.set_option('display.multi_sparse', False)

In [21]: df Out[21]: first bar bar baz baz foo foo qux
second one two one two one two one
A 0.895717 0.805244 -1.206412 2.565646 1.431256 1.340309 -1.170299
B 0.410835 0.813850 0.132003 -0.827317 -0.076467 -1.187678 1.130127
C -1.413681 1.607920 1.024180 0.569605 0.875906 -2.211372 0.974466

first qux
second two
A -0.226169
B -1.436737
C -2.006747

In [22]: pd.set_option('display.multi_sparse', True)

Reconstructing the level labels

The method get_level_values will return a vector of the labels for each location at a particular level:

In [23]: index.get_level_values(0) Out[23]: Index([u'bar', u'bar', u'baz', u'baz', u'foo', u'foo', u'qux', u'qux'], dtype='object', name=u'first')

In [24]: index.get_level_values('second') Out[24]: Index([u'one', u'two', u'one', u'two', u'one', u'two', u'one', u'two'], dtype='object', name=u'second')

Basic indexing on axis with MultiIndex

One of the important features of hierarchical indexing is that you can select data by a “partial” label identifying a subgroup in the data. Partialselection “drops” levels of the hierarchical index in the result in a completely analogous way to selecting a column in a regular DataFrame:

In [25]: df['bar'] Out[25]: second one two A 0.895717 0.805244 B 0.410835 0.813850 C -1.413681 1.607920

In [26]: df['bar', 'one'] Out[26]: A 0.895717 B 0.410835 C -1.413681 Name: (bar, one), dtype: float64

In [27]: df['bar']['one'] Out[27]: A 0.895717 B 0.410835 C -1.413681 Name: one, dtype: float64

In [28]: s['qux'] Out[28]: one -1.039575 two 0.271860 dtype: float64

See Cross-section with hierarchical index for how to select on a deeper level.

Note

The repr of a MultiIndex shows ALL the defined levels of an index, even if the they are not actually used. When slicing an index, you may notice this. For example:

original multi-index

In [29]: df.columns Out[29]: MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']], labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]], names=[u'first', u'second'])

sliced

In [30]: df[['foo','qux']].columns Out[30]: MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']], labels=[[2, 2, 3, 3], [0, 1, 0, 1]], names=[u'first', u'second'])

This is done to avoid a recomputation of the levels in order to make slicing highly performant. If you want to see the actual used levels.

In [31]: df[['foo','qux']].columns.values Out[31]: array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')], dtype=object)

for a specific level

In [32]: df[['foo','qux']].columns.get_level_values(0) Out[32]: Index([u'foo', u'foo', u'qux', u'qux'], dtype='object', name=u'first')

To reconstruct the multiindex with only the used levels

In [33]: pd.MultiIndex.from_tuples(df[['foo','qux']].columns.values) Out[33]: MultiIndex(levels=[[u'foo', u'qux'], [u'one', u'two']], labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

Data alignment and using reindex

Operations between differently-indexed objects having MultiIndex on the axes will work as you expect; data alignment will work the same as an Index of tuples:

In [34]: s + s[:-2] Out[34]: bar one -1.723698 two -4.209138 baz one -0.989859 two 2.143608 foo one 1.443110 two -1.413542 qux one NaN two NaN dtype: float64

In [35]: s + s[::2] Out[35]: bar one -1.723698 two NaN baz one -0.989859 two NaN foo one 1.443110 two NaN qux one -2.079150 two NaN dtype: float64

reindex can be called with another MultiIndex or even a list or array of tuples:

In [36]: s.reindex(index[:3]) Out[36]: first second bar one -0.861849 two -2.104569 baz one -0.494929 dtype: float64

In [37]: s.reindex([('foo', 'two'), ('bar', 'one'), ('qux', 'one'), ('baz', 'one')]) Out[37]: foo two -0.706771 bar one -0.861849 qux one -1.039575 baz one -0.494929 dtype: float64

Advanced indexing with hierarchical index

Syntactically integrating MultiIndex in advanced indexing with .loc/.ix is a bit challenging, but we’ve made every effort to do so. for example the following works as you would expect:

In [38]: df = df.T

In [39]: df Out[39]: A B C first second
bar one 0.895717 0.410835 -1.413681 two 0.805244 0.813850 1.607920 baz one -1.206412 0.132003 1.024180 two 2.565646 -0.827317 0.569605 foo one 1.431256 -0.076467 0.875906 two 1.340309 -1.187678 -2.211372 qux one -1.170299 1.130127 0.974466 two -0.226169 -1.436737 -2.006747

In [40]: df.loc['bar'] Out[40]: A B C second
one 0.895717 0.410835 -1.413681 two 0.805244 0.813850 1.607920

In [41]: df.loc['bar', 'two'] Out[41]: A 0.805244 B 0.813850 C 1.607920 Name: (bar, two), dtype: float64

“Partial” slicing also works quite nicely.

In [42]: df.loc['baz':'foo'] Out[42]: A B C first second
baz one -1.206412 0.132003 1.024180 two 2.565646 -0.827317 0.569605 foo one 1.431256 -0.076467 0.875906 two 1.340309 -1.187678 -2.211372

You can slice with a ‘range’ of values, by providing a slice of tuples.

In [43]: df.loc[('baz', 'two'):('qux', 'one')] Out[43]: A B C first second
baz two 2.565646 -0.827317 0.569605 foo one 1.431256 -0.076467 0.875906 two 1.340309 -1.187678 -2.211372 qux one -1.170299 1.130127 0.974466

In [44]: df.loc[('baz', 'two'):'foo'] Out[44]: A B C first second
baz two 2.565646 -0.827317 0.569605 foo one 1.431256 -0.076467 0.875906 two 1.340309 -1.187678 -2.211372

Passing a list of labels or tuples works similar to reindexing:

In [45]: df.ix[[('bar', 'two'), ('qux', 'one')]] Out[45]: A B C first second
bar two 0.805244 0.813850 1.607920 qux one -1.170299 1.130127 0.974466

Using slicers

New in version 0.14.0.

In 0.14.0 we added a new way to slice multi-indexed objects. You can slice a multi-index by providing multiple indexers.

You can provide any of the selectors as if you are indexing by label, see Selection by Label, including slices, lists of labels, labels, and boolean indexers.

You can use slice(None) to select all the contents of that level. You do not need to specify all the_deeper_ levels, they will be implied as slice(None).

As usual, both sides of the slicers are included as this is label indexing.

Warning

You should specify all axes in the .loc specifier, meaning the indexer for the index and for the columns. There are some ambiguous cases where the passed indexer could be mis-interpreted as indexing both axes, rather than into say the MuliIndex for the rows.

You should do this:

df.loc[(slice('A1','A3'),.....),:]

rather than this:

df.loc[(slice('A1','A3'),.....)]

Warning

You will need to make sure that the selection axes are fully lexsorted!

In [46]: def mklbl(prefix,n): ....: return ["%s%s" % (prefix,i) for i in range(n)] ....:

In [47]: miindex = pd.MultiIndex.from_product([mklbl('A',4), ....: mklbl('B',2), ....: mklbl('C',4), ....: mklbl('D',2)]) ....:

In [48]: micolumns = pd.MultiIndex.from_tuples([('a','foo'),('a','bar'), ....: ('b','foo'),('b','bah')], ....: names=['lvl0', 'lvl1']) ....:

In [49]: dfmi = pd.DataFrame(np.arange(len(miindex)*len(micolumns)).reshape((len(miindex),len(micolumns))), ....: index=miindex, ....: columns=micolumns).sort_index().sort_index(axis=1) ....:

In [50]: dfmi Out[50]: lvl0 a b
lvl1 bar foo bah foo A0 B0 C0 D0 1 0 3 2 D1 5 4 7 6 C1 D0 9 8 11 10 D1 13 12 15 14 C2 D0 17 16 19 18 D1 21 20 23 22 C3 D0 25 24 27 26 ... ... ... ... ... A3 B1 C0 D1 229 228 231 230 C1 D0 233 232 235 234 D1 237 236 239 238 C2 D0 241 240 243 242 D1 245 244 247 246 C3 D0 249 248 251 250 D1 253 252 255 254

[64 rows x 4 columns]

Basic multi-index slicing using slices, lists, and labels.

In [51]: dfmi.loc[(slice('A1','A3'),slice(None), ['C1','C3']),:] Out[51]: lvl0 a b
lvl1 bar foo bah foo A1 B0 C1 D0 73 72 75 74 D1 77 76 79 78 C3 D0 89 88 91 90 D1 93 92 95 94 B1 C1 D0 105 104 107 106 D1 109 108 111 110 C3 D0 121 120 123 122 ... ... ... ... ... A3 B0 C1 D1 205 204 207 206 C3 D0 217 216 219 218 D1 221 220 223 222 B1 C1 D0 233 232 235 234 D1 237 236 239 238 C3 D0 249 248 251 250 D1 253 252 255 254

[24 rows x 4 columns]

You can use a pd.IndexSlice to have a more natural syntax using : rather than using slice(None)

In [52]: idx = pd.IndexSlice

In [53]: dfmi.loc[idx[:,:,['C1','C3']],idx[:,'foo']] Out[53]: lvl0 a b lvl1 foo foo A0 B0 C1 D0 8 10 D1 12 14 C3 D0 24 26 D1 28 30 B1 C1 D0 40 42 D1 44 46 C3 D0 56 58 ... ... ... A3 B0 C1 D1 204 206 C3 D0 216 218 D1 220 222 B1 C1 D0 232 234 D1 236 238 C3 D0 248 250 D1 252 254

[32 rows x 2 columns]

It is possible to perform quite complicated selections using this method on multiple axes at the same time.

In [54]: dfmi.loc['A1',(slice(None),'foo')] Out[54]: lvl0 a b lvl1 foo foo B0 C0 D0 64 66 D1 68 70 C1 D0 72 74 D1 76 78 C2 D0 80 82 D1 84 86 C3 D0 88 90 ... ... ... B1 C0 D1 100 102 C1 D0 104 106 D1 108 110 C2 D0 112 114 D1 116 118 C3 D0 120 122 D1 124 126

[16 rows x 2 columns]

In [55]: dfmi.loc[idx[:,:,['C1','C3']],idx[:,'foo']] Out[55]: lvl0 a b lvl1 foo foo A0 B0 C1 D0 8 10 D1 12 14 C3 D0 24 26 D1 28 30 B1 C1 D0 40 42 D1 44 46 C3 D0 56 58 ... ... ... A3 B0 C1 D1 204 206 C3 D0 216 218 D1 220 222 B1 C1 D0 232 234 D1 236 238 C3 D0 248 250 D1 252 254

[32 rows x 2 columns]

Using a boolean indexer you can provide selection related to the values.

In [56]: mask = dfmi[('a','foo')]>200

In [57]: dfmi.loc[idx[mask,:,['C1','C3']],idx[:,'foo']] Out[57]: lvl0 a b lvl1 foo foo A3 B0 C1 D1 204 206 C3 D0 216 218 D1 220 222 B1 C1 D0 232 234 D1 236 238 C3 D0 248 250 D1 252 254

You can also specify the axis argument to .loc to interpret the passed slicers on a single axis.

In [58]: dfmi.loc(axis=0)[:,:,['C1','C3']] Out[58]: lvl0 a b
lvl1 bar foo bah foo A0 B0 C1 D0 9 8 11 10 D1 13 12 15 14 C3 D0 25 24 27 26 D1 29 28 31 30 B1 C1 D0 41 40 43 42 D1 45 44 47 46 C3 D0 57 56 59 58 ... ... ... ... ... A3 B0 C1 D1 205 204 207 206 C3 D0 217 216 219 218 D1 221 220 223 222 B1 C1 D0 233 232 235 234 D1 237 236 239 238 C3 D0 249 248 251 250 D1 253 252 255 254

[32 rows x 4 columns]

Furthermore you can set the values using these methods

In [59]: df2 = dfmi.copy()

In [60]: df2.loc(axis=0)[:,:,['C1','C3']] = -10

In [61]: df2 Out[61]: lvl0 a b
lvl1 bar foo bah foo A0 B0 C0 D0 1 0 3 2 D1 5 4 7 6 C1 D0 -10 -10 -10 -10 D1 -10 -10 -10 -10 C2 D0 17 16 19 18 D1 21 20 23 22 C3 D0 -10 -10 -10 -10 ... ... ... ... ... A3 B1 C0 D1 229 228 231 230 C1 D0 -10 -10 -10 -10 D1 -10 -10 -10 -10 C2 D0 241 240 243 242 D1 245 244 247 246 C3 D0 -10 -10 -10 -10 D1 -10 -10 -10 -10

[64 rows x 4 columns]

You can use a right-hand-side of an alignable object as well.

In [62]: df2 = dfmi.copy()

In [63]: df2.loc[idx[:,:,['C1','C3']],:] = df2*1000

In [64]: df2 Out[64]: lvl0 a b
lvl1 bar foo bah foo A0 B0 C0 D0 1 0 3 2 D1 5 4 7 6 C1 D0 9000 8000 11000 10000 D1 13000 12000 15000 14000 C2 D0 17 16 19 18 D1 21 20 23 22 C3 D0 25000 24000 27000 26000 ... ... ... ... ... A3 B1 C0 D1 229 228 231 230 C1 D0 233000 232000 235000 234000 D1 237000 236000 239000 238000 C2 D0 241 240 243 242 D1 245 244 247 246 C3 D0 249000 248000 251000 250000 D1 253000 252000 255000 254000

[64 rows x 4 columns]

Cross-section

The xs method of DataFrame additionally takes a level argument to make selecting data at a particular level of a MultiIndex easier.

In [65]: df Out[65]: A B C first second
bar one 0.895717 0.410835 -1.413681 two 0.805244 0.813850 1.607920 baz one -1.206412 0.132003 1.024180 two 2.565646 -0.827317 0.569605 foo one 1.431256 -0.076467 0.875906 two 1.340309 -1.187678 -2.211372 qux one -1.170299 1.130127 0.974466 two -0.226169 -1.436737 -2.006747

In [66]: df.xs('one', level='second') Out[66]: A B C first
bar 0.895717 0.410835 -1.413681 baz -1.206412 0.132003 1.024180 foo 1.431256 -0.076467 0.875906 qux -1.170299 1.130127 0.974466

using the slicers (new in 0.14.0)

In [67]: df.loc[(slice(None),'one'),:] Out[67]: A B C first second
bar one 0.895717 0.410835 -1.413681 baz one -1.206412 0.132003 1.024180 foo one 1.431256 -0.076467 0.875906 qux one -1.170299 1.130127 0.974466

You can also select on the columns with xs(), by providing the axis argument

In [68]: df = df.T

In [69]: df.xs('one', level='second', axis=1) Out[69]: first bar baz foo qux A 0.895717 -1.206412 1.431256 -1.170299 B 0.410835 0.132003 -0.076467 1.130127 C -1.413681 1.024180 0.875906 0.974466

using the slicers (new in 0.14.0)

In [70]: df.loc[:,(slice(None),'one')] Out[70]: first bar baz foo qux second one one one one A 0.895717 -1.206412 1.431256 -1.170299 B 0.410835 0.132003 -0.076467 1.130127 C -1.413681 1.024180 0.875906 0.974466

xs() also allows selection with multiple keys

In [71]: df.xs(('one', 'bar'), level=('second', 'first'), axis=1) Out[71]: first bar second one A 0.895717 B 0.410835 C -1.413681

using the slicers (new in 0.14.0)

In [72]: df.loc[:,('bar','one')] Out[72]: A 0.895717 B 0.410835 C -1.413681 Name: (bar, one), dtype: float64

New in version 0.13.0.

You can pass drop_level=False to xs() to retain the level that was selected

In [73]: df.xs('one', level='second', axis=1, drop_level=False) Out[73]: first bar baz foo qux second one one one one A 0.895717 -1.206412 1.431256 -1.170299 B 0.410835 0.132003 -0.076467 1.130127 C -1.413681 1.024180 0.875906 0.974466

versus the result with drop_level=True (the default value)

In [74]: df.xs('one', level='second', axis=1, drop_level=True) Out[74]: first bar baz foo qux A 0.895717 -1.206412 1.431256 -1.170299 B 0.410835 0.132003 -0.076467 1.130127 C -1.413681 1.024180 0.875906 0.974466

Advanced reindexing and alignment

The parameter level has been added to the reindex and align methods of pandas objects. This is useful to broadcast values across a level. For instance:

In [75]: midx = pd.MultiIndex(levels=[['zero', 'one'], ['x','y']], ....: labels=[[1,1,0,0],[1,0,1,0]]) ....:

In [76]: df = pd.DataFrame(np.random.randn(4,2), index=midx)

In [77]: df Out[77]: 0 1 one y 1.519970 -0.493662 x 0.600178 0.274230 zero y 0.132885 -0.023688 x 2.410179 1.450520

In [78]: df2 = df.mean(level=0)

In [79]: df2 Out[79]: 0 1 zero 1.271532 0.713416 one 1.060074 -0.109716

In [80]: df2.reindex(df.index, level=0) Out[80]: 0 1 one y 1.060074 -0.109716 x 1.060074 -0.109716 zero y 1.271532 0.713416 x 1.271532 0.713416

aligning

In [81]: df_aligned, df2_aligned = df.align(df2, level=0)

In [82]: df_aligned Out[82]: 0 1 one y 1.519970 -0.493662 x 0.600178 0.274230 zero y 0.132885 -0.023688 x 2.410179 1.450520

In [83]: df2_aligned Out[83]: 0 1 one y 1.060074 -0.109716 x 1.060074 -0.109716 zero y 1.271532 0.713416 x 1.271532 0.713416

Swapping levels with swaplevel()

The swaplevel function can switch the order of two levels:

In [84]: df[:5] Out[84]: 0 1 one y 1.519970 -0.493662 x 0.600178 0.274230 zero y 0.132885 -0.023688 x 2.410179 1.450520

In [85]: df[:5].swaplevel(0, 1, axis=0) Out[85]: 0 1 y one 1.519970 -0.493662 x one 0.600178 0.274230 y zero 0.132885 -0.023688 x zero 2.410179 1.450520

Reordering levels with reorder_levels()

The reorder_levels function generalizes the swaplevel function, allowing you to permute the hierarchical index levels in one step:

In [86]: df[:5].reorder_levels([1,0], axis=0) Out[86]: 0 1 y one 1.519970 -0.493662 x one 0.600178 0.274230 y zero 0.132885 -0.023688 x zero 2.410179 1.450520

The need for sortedness with MultiIndex

Caveat emptor: the present implementation of MultiIndex requires that the labels be sorted for some of the slicing / indexing routines to work correctly. You can think about breaking the axis into unique groups, where at the hierarchical level of interest, each distinct group shares a label, but no two have the same label. However, the MultiIndex does not enforce this:you are responsible for ensuring that things are properly sorted. There is an important new method sort_index to sort an axis within a MultiIndexso that its labels are grouped and sorted by the original ordering of the associated factor at that level. Note that this does not necessarily mean the labels will be sorted lexicographically!

In [87]: import random; random.shuffle(tuples)

In [88]: s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))

In [89]: s Out[89]: qux one 0.206053 two -0.251905 foo one -2.213588 two 1.063327 baz two 1.266143 one 0.299368 bar one -0.863838 two 0.408204 dtype: float64

In [90]: s.sort_index(level=0) Out[90]: bar one -0.863838 two 0.408204 baz one 0.299368 two 1.266143 foo one -2.213588 two 1.063327 qux one 0.206053 two -0.251905 dtype: float64

In [91]: s.sort_index(level=1) Out[91]: bar one -0.863838 baz one 0.299368 foo one -2.213588 qux one 0.206053 bar two 0.408204 baz two 1.266143 foo two 1.063327 qux two -0.251905 dtype: float64

Note, you may also pass a level name to sort_index if the MultiIndex levels are named.

In [92]: s.index.set_names(['L1', 'L2'], inplace=True)

In [93]: s.sort_index(level='L1') Out[93]: L1 L2 bar one -0.863838 two 0.408204 baz one 0.299368 two 1.266143 foo one -2.213588 two 1.063327 qux one 0.206053 two -0.251905 dtype: float64

In [94]: s.sort_index(level='L2') Out[94]: L1 L2 bar one -0.863838 baz one 0.299368 foo one -2.213588 qux one 0.206053 bar two 0.408204 baz two 1.266143 foo two 1.063327 qux two -0.251905 dtype: float64

Some indexing will work even if the data are not sorted, but will be rather inefficient and will also return a copy of the data rather than a view:

In [95]: s['qux'] Out[95]: L2 one 0.206053 two -0.251905 dtype: float64

In [96]: s.sort_index(level=1)['qux'] Out[96]: L2 one 0.206053 two -0.251905 dtype: float64

On higher dimensional objects, you can sort any of the other axes by level if they have a MultiIndex:

In [97]: df.T.sort_index(level=1, axis=1) Out[97]: zero one zero one x x y y 0 2.410179 0.600178 0.132885 1.519970 1 1.450520 0.274230 -0.023688 -0.493662

The MultiIndex object has code to explicitly check the sort depth. Thus, if you try to index at a depth at which the index is not sorted, it will raise an exception. Here is a concrete example to illustrate this:

In [98]: tuples = [('a', 'a'), ('a', 'b'), ('b', 'a'), ('b', 'b')]

In [99]: idx = pd.MultiIndex.from_tuples(tuples)

In [100]: idx.lexsort_depth Out[100]: 2

In [101]: reordered = idx[[1, 0, 3, 2]]

In [102]: reordered.lexsort_depth Out[102]: 1

In [103]: s = pd.Series(np.random.randn(4), index=reordered)

In [104]: s.ix['a':'a'] Out[104]: a b -1.048089 a -0.025747 dtype: float64

However:

s.ix[('a', 'b'):('b', 'a')] Traceback (most recent call last) ... KeyError: Key length (3) was greater than MultiIndex lexsort depth (2)

Take Methods

Similar to numpy ndarrays, pandas Index, Series, and DataFrame also provides the take method that retrieves elements along a given axis at the given indices. The given indices must be either a list or an ndarray of integer index positions. take will also accept negative integers as relative positions to the end of the object.

In [105]: index = pd.Index(np.random.randint(0, 1000, 10))

In [106]: index Out[106]: Int64Index([214, 502, 712, 567, 786, 175, 993, 133, 758, 329], dtype='int64')

In [107]: positions = [0, 9, 3]

In [108]: index[positions] Out[108]: Int64Index([214, 329, 567], dtype='int64')

In [109]: index.take(positions) Out[109]: Int64Index([214, 329, 567], dtype='int64')

In [110]: ser = pd.Series(np.random.randn(10))

In [111]: ser.iloc[positions] Out[111]: 0 -0.179666 9 1.824375 3 0.392149 dtype: float64

In [112]: ser.take(positions) Out[112]: 0 -0.179666 9 1.824375 3 0.392149 dtype: float64

For DataFrames, the given indices should be a 1d list or ndarray that specifies row or column positions.

In [113]: frm = pd.DataFrame(np.random.randn(5, 3))

In [114]: frm.take([1, 4, 3]) Out[114]: 0 1 2 1 -1.237881 0.106854 -1.276829 4 0.629675 -1.425966 1.857704 3 0.979542 -1.633678 0.615855

In [115]: frm.take([0, 2], axis=1) Out[115]: 0 2 0 0.595974 0.601544 1 -1.237881 -1.276829 2 -0.767101 1.499591 3 0.979542 0.615855 4 0.629675 1.857704

It is important to note that the take method on pandas objects are not intended to work on boolean indices and may return unexpected results.

In [116]: arr = np.random.randn(10)

In [117]: arr.take([False, False, True, True]) Out[117]: array([-1.1935, -1.1935, 0.6775, 0.6775])

In [118]: arr[[0, 1]] Out[118]: array([-1.1935, 0.6775])

In [119]: ser = pd.Series(np.random.randn(10))

In [120]: ser.take([False, False, True, True]) Out[120]: 0 0.233141 0 0.233141 1 -0.223540 1 -0.223540 dtype: float64

In [121]: ser.ix[[0, 1]] Out[121]: 0 0.233141 1 -0.223540 dtype: float64

Finally, as a small note on performance, because the take method handles a narrower range of inputs, it can offer performance that is a good deal faster than fancy indexing.

Index Types

We have discussed MultiIndex in the previous sections pretty extensively. DatetimeIndex and PeriodIndexare shown here. TimedeltaIndex are here.

In the following sub-sections we will highlite some other index types.

CategoricalIndex

New in version 0.16.1.

We introduce a CategoricalIndex, a new type of index object that is useful for supporting indexing with duplicates. This is a container around a Categorical (introduced in v0.15.0) and allows efficient indexing and storage of an index with a large number of duplicated elements. Prior to 0.16.1, setting the index of a DataFrame/Series with a category dtype would convert this to regular object-based Index.

In [122]: df = pd.DataFrame({'A': np.arange(6), .....: 'B': list('aabbca')}) .....:

In [123]: df['B'] = df['B'].astype('category', categories=list('cab'))

In [124]: df Out[124]: A B 0 0 a 1 1 a 2 2 b 3 3 b 4 4 c 5 5 a

In [125]: df.dtypes Out[125]: A int64 B category dtype: object

In [126]: df.B.cat.categories Out[126]: Index([u'c', u'a', u'b'], dtype='object')

Setting the index, will create create a CategoricalIndex

In [127]: df2 = df.set_index('B')

In [128]: df2.index Out[128]: CategoricalIndex([u'a', u'a', u'b', u'b', u'c', u'a'], categories=[u'c', u'a', u'b'], ordered=False, name=u'B', dtype='category')

Indexing with __getitem__/.iloc/.loc/.ix works similarly to an Index with duplicates. The indexers MUST be in the category or the operation will raise.

In [129]: df2.loc['a'] Out[129]: A B
a 0 a 1 a 5

These PRESERVE the CategoricalIndex

In [130]: df2.loc['a'].index Out[130]: CategoricalIndex([u'a', u'a', u'a'], categories=[u'c', u'a', u'b'], ordered=False, name=u'B', dtype='category')

Sorting will order by the order of the categories

In [131]: df2.sort_index() Out[131]: A B
c 4 a 0 a 1 a 5 b 2 b 3

Groupby operations on the index will preserve the index nature as well

In [132]: df2.groupby(level=0).sum() Out[132]: A B
c 4 a 6 b 5

In [133]: df2.groupby(level=0).sum().index Out[133]: CategoricalIndex([u'c', u'a', u'b'], categories=[u'c', u'a', u'b'], ordered=False, name=u'B', dtype='category')

Reindexing operations, will return a resulting index based on the type of the passed indexer, meaning that passing a list will return a plain-old-Index; indexing with a Categorical will return a CategoricalIndex, indexed according to the categories of the PASSED Categorical dtype. This allows one to arbitrarly index these even with values NOT in the categories, similarly to how you can reindex ANY pandas index.

In [134]: df2.reindex(['a','e']) Out[134]: A B
a 0.0 a 1.0 a 5.0 e NaN

In [135]: df2.reindex(['a','e']).index Out[135]: Index([u'a', u'a', u'a', u'e'], dtype='object', name=u'B')

In [136]: df2.reindex(pd.Categorical(['a','e'],categories=list('abcde'))) Out[136]: A B
a 0.0 a 1.0 a 5.0 e NaN

In [137]: df2.reindex(pd.Categorical(['a','e'],categories=list('abcde'))).index Out[137]: CategoricalIndex([u'a', u'a', u'a', u'e'], categories=[u'a', u'b', u'c', u'd', u'e'], ordered=False, name=u'B', dtype='category')

Warning

Reshaping and Comparison operations on a CategoricalIndex must have the same categories or a TypeError will be raised.

In [9]: df3 = pd.DataFrame({'A' : np.arange(6), 'B' : pd.Series(list('aabbca')).astype('category')})

In [11]: df3 = df3.set_index('B')

In [11]: df3.index Out[11]: CategoricalIndex([u'a', u'a', u'b', u'b', u'c', u'a'], categories=[u'a', u'b', u'c'], ordered=False, name=u'B', dtype='category')

In [12]: pd.concat([df2, df3] TypeError: categories must match existing categories when appending

Int64Index and RangeIndex

Warning

Indexing on an integer-based Index with floats has been clarified in 0.18.0, for a summary of the changes, see here.

Int64Index is a fundamental basic index in pandas. This is an Immutable array implementing an ordered, sliceable set. Prior to 0.18.0, the Int64Index would provide the default index for all NDFrame objects.

RangeIndex is a sub-class of Int64Index added in version 0.18.0, now providing the default index for all NDFrame objects.RangeIndex is an optimized version of Int64Index that can represent a monotonic ordered set. These are analagous to python range types.

Float64Index

Note

As of 0.14.0, Float64Index is backed by a native float64 dtype array. Prior to 0.14.0, Float64Index was backed by an object dtype array. Using a float64 dtype in the backend speeds up arithmetic operations by about 30x and boolean indexing operations on theFloat64Index itself are about 2x as fast.

New in version 0.13.0.

By default a Float64Index will be automatically created when passing floating, or mixed-integer-floating values in index creation. This enables a pure label-based slicing paradigm that makes [],ix,loc for scalar indexing and slicing work exactly the same.

In [138]: indexf = pd.Index([1.5, 2, 3, 4.5, 5])

In [139]: indexf Out[139]: Float64Index([1.5, 2.0, 3.0, 4.5, 5.0], dtype='float64')

In [140]: sf = pd.Series(range(5), index=indexf)

In [141]: sf Out[141]: 1.5 0 2.0 1 3.0 2 4.5 3 5.0 4 dtype: int64

Scalar selection for [],.ix,.loc will always be label based. An integer will match an equal float index (e.g. 3 is equivalent to 3.0)

In [142]: sf[3] Out[142]: 2

In [143]: sf[3.0] Out[143]: 2

In [144]: sf.ix[3] Out[144]: 2

In [145]: sf.ix[3.0] Out[145]: 2

In [146]: sf.loc[3] Out[146]: 2

In [147]: sf.loc[3.0] Out[147]: 2

The only positional indexing is via iloc

In [148]: sf.iloc[3] Out[148]: 3

A scalar index that is not found will raise KeyError

Slicing is ALWAYS on the values of the index, for [],ix,loc and ALWAYS positional with iloc

In [149]: sf[2:4] Out[149]: 2.0 1 3.0 2 dtype: int64

In [150]: sf.ix[2:4] Out[150]: 2.0 1 3.0 2 dtype: int64

In [151]: sf.loc[2:4] Out[151]: 2.0 1 3.0 2 dtype: int64

In [152]: sf.iloc[2:4] Out[152]: 3.0 2 4.5 3 dtype: int64

In float indexes, slicing using floats is allowed

In [153]: sf[2.1:4.6] Out[153]: 3.0 2 4.5 3 dtype: int64

In [154]: sf.loc[2.1:4.6] Out[154]: 3.0 2 4.5 3 dtype: int64

In non-float indexes, slicing using floats will raise a TypeError

In [1]: pd.Series(range(5))[3.5] TypeError: the label [3.5] is not a proper indexer for this index type (Int64Index)

In [1]: pd.Series(range(5))[3.5:4.5] TypeError: the slice start [3.5] is not a proper indexer for this index type (Int64Index)

Warning

Using a scalar float indexer for .iloc has been removed in 0.18.0, so the following will raise a TypeError

In [3]: pd.Series(range(5)).iloc[3.0] TypeError: cannot do positional indexing on <class 'pandas.indexes.range.RangeIndex'> with these indexers [3.0] of <type 'float'>

Further the treatment of .ix with a float indexer on a non-float index, will be label based, and thus coerce the index.

In [155]: s2 = pd.Series([1, 2, 3], index=list('abc'))

In [156]: s2 Out[156]: a 1 b 2 c 3 dtype: int64

In [157]: s2.ix[1.0] = 10

In [158]: s2 Out[158]: a 1 b 2 c 3 1.0 10 dtype: int64

Here is a typical use-case for using this type of indexing. Imagine that you have a somewhat irregular timedelta-like indexing scheme, but the data is recorded as floats. This could for example be millisecond offsets.

In [159]: dfir = pd.concat([pd.DataFrame(np.random.randn(5,2), .....: index=np.arange(5) * 250.0, .....: columns=list('AB')), .....: pd.DataFrame(np.random.randn(6,2), .....: index=np.arange(4,10) * 250.1, .....: columns=list('AB'))]) .....:

In [160]: dfir Out[160]: A B 0.0 0.997289 -1.693316 250.0 -0.179129 -1.598062 500.0 0.936914 0.912560 750.0 -1.003401 1.632781 1000.0 -0.724626 0.178219 1000.4 0.310610 -0.108002 1250.5 -0.974226 -1.147708 1500.6 -2.281374 0.760010 1750.7 -0.742532 1.533318 2000.8 2.495362 -0.432771 2250.9 -0.068954 0.043520

Selection operations then will always work on a value basis, for all selection operators.

In [161]: dfir[0:1000.4] Out[161]: A B 0.0 0.997289 -1.693316 250.0 -0.179129 -1.598062 500.0 0.936914 0.912560 750.0 -1.003401 1.632781 1000.0 -0.724626 0.178219 1000.4 0.310610 -0.108002

In [162]: dfir.loc[0:1001,'A'] Out[162]: 0.0 0.997289 250.0 -0.179129 500.0 0.936914 750.0 -1.003401 1000.0 -0.724626 1000.4 0.310610 Name: A, dtype: float64

In [163]: dfir.loc[1000.4] Out[163]: A 0.310610 B -0.108002 Name: 1000.4, dtype: float64

You could then easily pick out the first 1 second (1000 ms) of data then.

In [164]: dfir[0:1000] Out[164]: A B 0.0 0.997289 -1.693316 250.0 -0.179129 -1.598062 500.0 0.936914 0.912560 750.0 -1.003401 1.632781 1000.0 -0.724626 0.178219

Of course if you need integer based selection, then use iloc

In [165]: dfir.iloc[0:5] Out[165]: A B 0.0 0.997289 -1.693316 250.0 -0.179129 -1.598062 500.0 0.936914 0.912560 750.0 -1.003401 1.632781 1000.0 -0.724626 0.178219