advanced indexing — pandas 2.2.3 documentation

MultiIndex / advanced indexing#

This section covers indexing with a MultiIndex and other advanced indexing features.

See Indexing and Selecting Data for general indexing documentation.

Warning

Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a View versus Copy.

See the cookbook for some advanced strategies.

Hierarchical indexing (MultiIndex)#

Hierarchical (multi-level) indexing opens the door to sophisticated data analysis and manipulation, especially for working with higher-dimensional data. In essence, it lets you store and manipulate data with an arbitrary number of dimensions in lower-dimensional data structures like Series (1d) and DataFrame (2d).

In this section, we will show what exactly we mean by “hierarchical” indexing and how it integrates with all of the pandas indexing functionality described above and in prior sections. Later, when discussing group by and pivoting and reshaping data, we’ll show non-trivial applications to illustrate how it aids in structuring data for analysis.

Creating a MultiIndex (hierarchical index) object#

The MultiIndex object is the hierarchical analogue of the standard Index object which typically stores the axis labels in pandas objects. You can think of MultiIndex as an array of tuples where each tuple is unique. A MultiIndex can be created from a list of arrays (using MultiIndex.from_arrays()), an array of tuples (using MultiIndex.from_tuples()), a crossed set of iterables (using MultiIndex.from_product()), or a DataFrame (using MultiIndex.from_frame()). The Index constructor will attempt to return a MultiIndex when it is passed a list of tuples. The following examples demonstrate different ways to initialize MultiIndexes.

In [1]: arrays = [
   ...:     ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ...:     ["one", "two", "one", "two", "one", "two", "one", "two"],
   ...: ]
   ...:

In [2]: tuples = list(zip(*arrays))

In [3]: tuples Out[3]: [('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')]

In [4]: index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])

In [5]: index Out[5]: MultiIndex([('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')], names=['first', 'second'])

In [6]: s = pd.Series(np.random.randn(8), index=index)

In [7]: s
Out[7]:
first  second
bar    one       0.469112
       two      -0.282863
baz    one      -1.509059
       two      -1.135632
foo    one       1.212112
       two      -0.173215
qux    one       0.119209
       two      -1.044236
dtype: float64

When you want every pairing of the elements in two iterables, it can be easier to use the MultiIndex.from_product() method:

In [8]: iterables = [["bar", "baz", "foo", "qux"], ["one", "two"]]

In [9]: pd.MultiIndex.from_product(iterables, names=["first", "second"]) Out[9]: MultiIndex([('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')], names=['first', 'second'])

You can also construct a MultiIndex from a DataFrame directly, using the method MultiIndex.from_frame(). This is a complementary method to MultiIndex.to_frame().

In [10]: df = pd.DataFrame(
   ....:     [["bar", "one"], ["bar", "two"], ["foo", "one"], ["foo", "two"]],
   ....:     columns=["first", "second"],
   ....: )
   ....:

In [11]: pd.MultiIndex.from_frame(df) Out[11]: MultiIndex([('bar', 'one'), ('bar', 'two'), ('foo', 'one'), ('foo', 'two')], names=['first', 'second'])
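The complementary relationship between from_frame() and to_frame() can be checked with a round trip. This is a minimal sketch, assuming a small hand-written frame; the variable names frame, mi, and round_trip are illustrative and not part of the original examples.

```python
import pandas as pd

# Illustrative two-column frame; the column names become the level names.
frame = pd.DataFrame(
    [["bar", "one"], ["bar", "two"], ["foo", "one"], ["foo", "two"]],
    columns=["first", "second"],
)

mi = pd.MultiIndex.from_frame(frame)

# to_frame(index=False) returns the levels as ordinary columns
# on a default RangeIndex, mirroring the shape of the input frame.
round_trip = mi.to_frame(index=False)

print(list(mi.names))
print(list(round_trip.columns))

# Rebuilding the index from the round-tripped frame gives the same labels.
rebuilt = pd.MultiIndex.from_frame(round_trip)
print(rebuilt.equals(mi))
```

Passing index=False is a choice made here so the result lines up with the original frame; by default to_frame() uses the MultiIndex itself as the index of the returned DataFrame.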

As a convenience, you can pass a list of arrays directly into Series or DataFrame to construct a MultiIndex automatically:

In [12]: arrays = [
   ....:     np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
   ....:     np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
   ....: ]
   ....:

In [13]: s = pd.Series(np.random.randn(8), index=arrays)

In [14]: s Out[14]: bar one -0.861849 two -2.104569 baz one -0.494929 two 1.071804 foo one 0.721555 two -0.706771 qux one -1.039575 two 0.271860 dtype: float64

In [15]: df = pd.DataFrame(np.random.randn(8, 4), index=arrays)

In [16]: df Out[16]: 0 1 2 3 bar one -0.424972 0.567020 0.276232 -1.087401 two -0.673690 0.113648 -1.478427 0.524988 baz one 0.404705 0.577046 -1.715002 -1.039268 two -0.370647 -1.157892 -1.344312 0.844885 foo one 1.075770 -0.109050 1.643563 -1.469388 two 0.357021 -0.674600 -1.776904 -0.968914 qux one -1.294524 0.413738 0.276662 -0.472035 two -0.013960 -0.362543 -0.006154 -0.923061

All of the MultiIndex constructors accept a names argument which stores string names for the levels themselves. If no names are provided, None will be assigned:

In [17]: df.index.names Out[17]: FrozenList([None, None])
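As a small sketch of how the names argument behaves, the following compares an index built without names to one built with them, and attaches names afterwards with set_names(). The arrays and variable names are made up for illustration.

```python
import pandas as pd

arrays = [["bar", "bar", "baz"], ["one", "two", "one"]]

unnamed = pd.MultiIndex.from_arrays(arrays)
named = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])

print(list(unnamed.names))
print(list(named.names))

# Names can also be attached later; set_names returns a new index
# and leaves the original untouched.
renamed = unnamed.set_names(["first", "second"])
print(list(renamed.names) == list(named.names))
print(list(unnamed.names))
```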

This index can back any axis of a pandas object, and the number of levels of the index is up to you:

In [18]: df = pd.DataFrame(np.random.randn(3, 8), index=["A", "B", "C"], columns=index)

In [19]: df
Out[19]:
first        bar                 baz  ...       foo       qux
second       one       two       one  ...       two       one       two
A       0.895717  0.805244 -1.206412  ...  1.340309 -1.170299 -0.226169
B       0.410835  0.813850  0.132003  ... -1.187678  1.130127 -1.436737
C      -1.413681  1.607920  1.024180  ... -2.211372  0.974466 -2.006747

[3 rows x 8 columns]

In [20]: pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])
Out[20]:
first              bar                 baz                 foo
second             one       two       one       two       one       two
first second
bar   one    -0.410001 -0.078638  0.545952 -1.219217 -1.226825  0.769804
      two    -1.281247 -0.727707 -0.121306 -0.097883  0.695775  0.341734
baz   one     0.959726 -1.110336 -0.619976  0.149748 -0.732339  0.687738
      two     0.176444  0.403310 -0.154951  0.301624 -2.179861 -1.369849
foo   one    -0.954208  1.462696 -1.743161 -0.826591 -0.345352  1.314232
      two     0.690579  0.995761  2.396780  0.014871  3.357427 -0.317441

We’ve “sparsified” the higher levels of the indexes to make the console output a bit easier on the eyes. Note that how the index is displayed can be controlled using the multi_sparse option in pandas.set_option():

In [21]: with pd.option_context("display.multi_sparse", False):
   ....:     df
   ....:
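To make the effect of the option visible, the sketch below captures the text representation of a small Series with and without sparsification. The Series and the names sparse_repr and dense_repr are illustrative assumptions, not part of the original example.

```python
import pandas as pd

index = pd.MultiIndex.from_product([["bar", "baz"], ["one", "two"]])
s = pd.Series([1, 2, 3, 4], index=index)

# Default display: repeated outer labels are blanked out.
sparse_repr = repr(s)

# Inside the context manager every row shows its full key.
with pd.option_context("display.multi_sparse", False):
    dense_repr = repr(s)

print(sparse_repr)
print(dense_repr)
```

Using option_context keeps the change local to the with block; pd.set_option("display.multi_sparse", False) would change it for the rest of the session.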

It’s worth keeping in mind that there’s nothing preventing you from using tuples as atomic labels on an axis:

In [22]: pd.Series(np.random.randn(8), index=tuples) Out[22]: (bar, one) -1.236269 (bar, two) 0.896171 (baz, one) -0.487602 (baz, two) -0.082240 (foo, one) -2.182937 (foo, two) 0.380396 (qux, one) 0.084844 (qux, two) 0.432390 dtype: float64

The reason that the MultiIndex matters is that it can allow you to do grouping, selection, and reshaping operations as we will describe below and in subsequent areas of the documentation. As you will see in later sections, you can find yourself working with hierarchically-indexed data without creating a MultiIndex explicitly yourself. However, when loading data from a file, you may wish to generate your own MultiIndex when preparing the data set.
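The difference between tuples as atomic labels and a true MultiIndex can be sketched as follows. The example relies on the tupleize_cols parameter of the Index constructor, which is not mentioned in the text above, and the data is made up for illustration.

```python
import pandas as pd

tuples = [("bar", "one"), ("bar", "two"), ("baz", "one")]

# By default the Index constructor turns a list of tuples into a MultiIndex.
hierarchical = pd.Index(tuples)

# tupleize_cols=False keeps each tuple as a single, atomic label.
flat = pd.Index(tuples, tupleize_cols=False)

print(type(hierarchical).__name__, hierarchical.nlevels)
print(type(flat).__name__, flat.nlevels)

# Partial selection on the first element relies on the hierarchical structure.
s = pd.Series([1, 2, 3], index=hierarchical)
print(s["bar"].tolist())
```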

Reconstructing the level labels#

The method get_level_values() will return a vector of the labels for each location at a particular level:

In [23]: index.get_level_values(0) Out[23]: Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

In [24]: index.get_level_values("second") Out[24]: Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second')

Basic indexing on axis with MultiIndex#

One of the important features of hierarchical indexing is that you can select data by a “partial” label identifying a subgroup in the data. Partial selection “drops” levels of the hierarchical index in the result in a completely analogous way to selecting a column in a regular DataFrame:

In [25]: df["bar"] Out[25]: second one two A 0.895717 0.805244 B 0.410835 0.813850 C -1.413681 1.607920

In [26]: df["bar", "one"] Out[26]: A 0.895717 B 0.410835 C -1.413681 Name: (bar, one), dtype: float64

In [27]: df["bar"]["one"] Out[27]: A 0.895717 B 0.410835 C -1.413681 Name: one, dtype: float64

In [28]: s["qux"] Out[28]: one -1.039575 two 0.271860 dtype: float64

See Cross-section with hierarchical index for how to select on a deeper level.

Defined levels#

The MultiIndex keeps all the defined levels of an index, even if they are not actually used. When slicing an index, you may notice this. For example:

In [29]: df.columns.levels # original MultiIndex Out[29]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

In [30]: df[["foo","qux"]].columns.levels # sliced Out[30]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

This is done to avoid a recomputation of the levels in order to make slicing highly performant. If you want to see only the labels that are actually used, you can use the to_numpy() method to view the remaining tuples, or the get_level_values() method for a single level.

In [31]: df[["foo", "qux"]].columns.to_numpy() Out[31]: array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')], dtype=object)

To see the used labels for a specific level:

In [32]: df[["foo", "qux"]].columns.get_level_values(0) Out[32]: Index(['foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

To reconstruct the MultiIndex with only the used levels, the remove_unused_levels() method may be used.

In [33]: new_mi = df[["foo", "qux"]].columns.remove_unused_levels()

In [34]: new_mi.levels Out[34]: FrozenList([['foo', 'qux'], ['one', 'two']])
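The behaviour described in this section can be reproduced with deterministic data. This is a minimal sketch; the frame contents and the names subset and trimmed are illustrative.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["bar", "baz", "foo", "qux"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame(np.arange(8).reshape(1, 8), columns=columns)

subset = df[["foo", "qux"]]

# levels still lists every label defined on the original index ...
print(list(subset.columns.levels[0]))
# ... while only 'foo' and 'qux' are actually present.
print(list(subset.columns.get_level_values(0).unique()))

# remove_unused_levels rebuilds the levels from the labels in use.
trimmed = subset.columns.remove_unused_levels()
print(list(trimmed.levels[0]))
```

The trimmed index holds the same tuples as before; only the stored levels change.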

Data alignment and using `reindex`#

Operations between differently-indexed objects having MultiIndex on the axes will work as you expect; data alignment will work the same as an Index of tuples:

In [35]: s + s[:-2] Out[35]: bar one -1.723698 two -4.209138 baz one -0.989859 two 2.143608 foo one 1.443110 two -1.413542 qux one NaN two NaN dtype: float64

In [36]: s + s[::2] Out[36]: bar one -1.723698 two NaN baz one -0.989859 two NaN foo one 1.443110 two NaN qux one -2.079150 two NaN dtype: float64

The reindex() method of Series/DataFrames can be called with another MultiIndex, or even a list or array of tuples:

In [37]: s.reindex(index[:3]) Out[37]: first second bar one -0.861849 two -2.104569 baz one -0.494929 dtype: float64

In [38]: s.reindex([("foo", "two"), ("bar", "one"), ("qux", "one"), ("baz", "one")]) Out[38]: foo two -0.706771 bar one -0.861849 qux one -1.039575 baz one -0.494929 dtype: float64
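Because the examples above use random data, the following sketch repeats the same two operations with fixed values so the results are easy to follow. The Series here is an illustrative stand-in.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([["bar", "baz"], ["one", "two"]])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

# Keys missing from either operand produce NaN in the result.
total = s + s.iloc[:-1]
print(total)

# reindex accepts a list of tuples and returns rows in the requested order.
picked = s.reindex([("baz", "two"), ("bar", "one")])
print(picked.tolist())
```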

Advanced indexing with hierarchical index#

Syntactically integrating MultiIndex in advanced indexing with .loc is a bit challenging, but we’ve made every effort to do so. In general, MultiIndex keys take the form of tuples. For example, the following works as you would expect:

In [39]: df = df.T

In [40]: df
Out[40]:
                     A         B         C
first second
bar   one     0.895717  0.410835 -1.413681
      two     0.805244  0.813850  1.607920
baz   one    -1.206412  0.132003  1.024180
      two     2.565646 -0.827317  0.569605
foo   one     1.431256 -0.076467  0.875906
      two     1.340309 -1.187678 -2.211372
qux   one    -1.170299  1.130127  0.974466
      two    -0.226169 -1.436737 -2.006747

In [41]: df.loc[("bar", "two")] Out[41]: A 0.805244 B 0.813850 C 1.607920 Name: (bar, two), dtype: float64

Note that df.loc['bar', 'two'] would also work in this example, but this shorthand notation can lead to ambiguity in general.

If you also want to index a specific column with .loc, you must use a tuple like this:

In [42]: df.loc[("bar", "two"), "A"] Out[42]: 0.8052440253863785

You don’t have to specify all levels of the MultiIndex: you can pass only the first elements of the tuple. For example, you can use “partial” indexing to get all elements with bar in the first level as follows:

In [43]: df.loc["bar"]
Out[43]:
               A         B         C
second
one     0.895717  0.410835 -1.413681
two     0.805244  0.813850  1.607920

This is a shortcut for the slightly more verbose notation df.loc[('bar',),] (equivalent to df.loc['bar',] in this example).

“Partial” slicing also works quite nicely.

In [44]: df.loc["baz":"foo"] Out[44]: A B C first second
baz one -1.206412 0.132003 1.024180 two 2.565646 -0.827317 0.569605 foo one 1.431256 -0.076467 0.875906 two 1.340309 -1.187678 -2.211372

You can slice with a ‘range’ of values, by providing a slice of tuples.

In [45]: df.loc[("baz", "two"):("qux", "one")] Out[45]: A B C first second
baz two 2.565646 -0.827317 0.569605 foo one 1.431256 -0.076467 0.875906 two 1.340309 -1.187678 -2.211372 qux one -1.170299 1.130127 0.974466

In [46]: df.loc[("baz", "two"):"foo"] Out[46]: A B C first second
baz two 2.565646 -0.827317 0.569605 foo one 1.431256 -0.076467 0.875906 two 1.340309 -1.187678 -2.211372

Passing a list of labels or tuples works similarly to reindexing:

In [47]: df.loc[[("bar", "two"), ("qux", "one")]] Out[47]: A B C first second
bar two 0.805244 0.813850 1.607920 qux one -1.170299 1.130127 0.974466

Note

It is important to note that tuples and lists are not treated identically in pandas when it comes to indexing. Whereas a tuple is interpreted as one multi-level key, a list is used to specify several keys. Or in other words, tuples go horizontally (traversing levels), lists go vertically (scanning levels).

Importantly, a list of tuples indexes several complete MultiIndex keys, whereas a tuple of lists refers to several values within a level:

In [48]: s = pd.Series(
   ....:     [1, 2, 3, 4, 5, 6],
   ....:     index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]),
   ....: )
   ....:

In [49]: s.loc[[("A", "c"), ("B", "d")]] # list of tuples Out[49]: A c 1 B d 5 dtype: int64

In [50]: s.loc[(["A", "B"], ["c", "d"])] # tuple of lists Out[50]: A c 1 d 2 B c 4 d 5 dtype: int64

Using slicers#

You can slice a MultiIndex by providing multiple indexers.

You can provide any of the selectors as if you are indexing by label, see Selection by Label, including slices, lists of labels, labels, and boolean indexers.

You can use slice(None) to select all the contents of that level. You do not need to specify all the deeper levels; they will be implied as slice(None).

As usual, both sides of the slicers are included as this is label indexing.

Warning

You should specify all axes in the .loc specifier, meaning the indexer for the index and for the columns. There are some ambiguous cases where the passed indexer could be misinterpreted as indexing both axes, rather than into say the MultiIndex for the rows.

You should do this:

df.loc[(slice("A1", "A3"), ...), :] # noqa: E999

You should not do this:

df.loc[(slice("A1", "A3"), ...)] # noqa: E999

In [51]: def mklbl(prefix, n):
   ....:     return ["%s%s" % (prefix, i) for i in range(n)]
   ....:

In [52]: miindex = pd.MultiIndex.from_product(
   ....:     [mklbl("A", 4), mklbl("B", 2), mklbl("C", 4), mklbl("D", 2)]
   ....: )
   ....:

In [53]: micolumns = pd.MultiIndex.from_tuples(
   ....:     [("a", "foo"), ("a", "bar"), ("b", "foo"), ("b", "bah")], names=["lvl0", "lvl1"]
   ....: )
   ....:

In [54]: dfmi = (
   ....:     pd.DataFrame(
   ....:         np.arange(len(miindex) * len(micolumns)).reshape(
   ....:             (len(miindex), len(micolumns))
   ....:         ),
   ....:         index=miindex,
   ....:         columns=micolumns,
   ....:     )
   ....:     .sort_index()
   ....:     .sort_index(axis=1)
   ....: )
   ....:

In [55]: dfmi Out[55]: lvl0 a b
lvl1 bar foo bah foo A0 B0 C0 D0 1 0 3 2 D1 5 4 7 6 C1 D0 9 8 11 10 D1 13 12 15 14 C2 D0 17 16 19 18 ... ... ... ... ... A3 B1 C1 D1 237 236 239 238 C2 D0 241 240 243 242 D1 245 244 247 246 C3 D0 249 248 251 250 D1 253 252 255 254

[64 rows x 4 columns]

Basic MultiIndex slicing using slices, lists, and labels.

In [56]: dfmi.loc[(slice("A1", "A3"), slice(None), ["C1", "C3"]), :] Out[56]: lvl0 a b
lvl1 bar foo bah foo A1 B0 C1 D0 73 72 75 74 D1 77 76 79 78 C3 D0 89 88 91 90 D1 93 92 95 94 B1 C1 D0 105 104 107 106 ... ... ... ... ... A3 B0 C3 D1 221 220 223 222 B1 C1 D0 233 232 235 234 D1 237 236 239 238 C3 D0 249 248 251 250 D1 253 252 255 254

[24 rows x 4 columns]
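The same pattern on a smaller frame may be easier to trace by hand. The sketch below uses an illustrative three-level index, passes an indexer for both axes as recommended in the warning above, and checks that deeper levels left out of the tuple behave like slice(None).

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([["A0", "A1", "A2"], ["B0", "B1"], ["C0", "C1"]])
df = pd.DataFrame({"value": np.arange(len(index))}, index=index)

# Both axes are given explicitly: a tuple of per-level selectors for the
# rows, and ":" for the columns.
rows = (slice("A1", "A2"), slice(None), ["C1"])
selected = df.loc[rows, :]
print(selected)

# The deepest level is omitted here, so it is implied as slice(None).
implied = df.loc[(slice("A1", "A2"), slice(None)), :]
explicit = df.loc[(slice("A1", "A2"), slice(None), slice(None)), :]
print(implied.equals(explicit))
```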

You can use pandas.IndexSlice to facilitate a more natural syntax using :, rather than using slice(None).

In [57]: idx = pd.IndexSlice

In [58]: dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]] Out[58]: lvl0 a b lvl1 foo foo A0 B0 C1 D0 8 10 D1 12 14 C3 D0 24 26 D1 28 30 B1 C1 D0 40 42 ... ... ... A3 B0 C3 D1 220 222 B1 C1 D0 232 234 D1 236 238 C3 D0 248 250 D1 252 254

[32 rows x 2 columns]

It is possible to perform quite complicated selections using this method on multiple axes at the same time.

In [59]: dfmi.loc["A1", (slice(None), "foo")] Out[59]: lvl0 a b lvl1 foo foo B0 C0 D0 64 66 D1 68 70 C1 D0 72 74 D1 76 78 C2 D0 80 82 ... ... ... B1 C1 D1 108 110 C2 D0 112 114 D1 116 118 C3 D0 120 122 D1 124 126

[16 rows x 2 columns]

In [60]: dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]] Out[60]: lvl0 a b lvl1 foo foo A0 B0 C1 D0 8 10 D1 12 14 C3 D0 24 26 D1 28 30 B1 C1 D0 40 42 ... ... ... A3 B0 C3 D1 220 222 B1 C1 D0 232 234 D1 236 238 C3 D0 248 250 D1 252 254

[32 rows x 2 columns]

Using a boolean indexer you can provide selection related to the values.

In [61]: mask = dfmi[("a", "foo")] > 200

In [62]: dfmi.loc[idx[mask, :, ["C1", "C3"]], idx[:, "foo"]] Out[62]: lvl0 a b lvl1 foo foo A3 B0 C1 D1 204 206 C3 D0 216 218 D1 220 222 B1 C1 D0 232 234 D1 236 238 C3 D0 248 250 D1 252 254

You can also specify the axis argument to .loc to interpret the passed slicers on a single axis.

In [63]: dfmi.loc(axis=0)[:, :, ["C1", "C3"]] Out[63]: lvl0 a b
lvl1 bar foo bah foo A0 B0 C1 D0 9 8 11 10 D1 13 12 15 14 C3 D0 25 24 27 26 D1 29 28 31 30 B1 C1 D0 41 40 43 42 ... ... ... ... ... A3 B0 C3 D1 221 220 223 222 B1 C1 D0 233 232 235 234 D1 237 236 239 238 C3 D0 249 248 251 250 D1 253 252 255 254

[32 rows x 4 columns]

Furthermore, you can set the values using the following methods.

In [64]: df2 = dfmi.copy()

In [65]: df2.loc(axis=0)[:, :, ["C1", "C3"]] = -10

In [66]: df2 Out[66]: lvl0 a b
lvl1 bar foo bah foo A0 B0 C0 D0 1 0 3 2 D1 5 4 7 6 C1 D0 -10 -10 -10 -10 D1 -10 -10 -10 -10 C2 D0 17 16 19 18 ... ... ... ... ... A3 B1 C1 D1 -10 -10 -10 -10 C2 D0 241 240 243 242 D1 245 244 247 246 C3 D0 -10 -10 -10 -10 D1 -10 -10 -10 -10

[64 rows x 4 columns]
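Assignment through slicers can also be written with pandas.IndexSlice. The following is a small sketch on illustrative data; it works on a copy so the source frame stays unchanged.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([["A0", "A1"], ["B0", "B1"]])
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=index, columns=["x", "y"])

idx = pd.IndexSlice
updated = df.copy()

# Assign a scalar to every row whose second level is "B1".
updated.loc[idx[:, "B1"], :] = -1
print(updated)
```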

The right-hand side of the assignment can also be an alignable object.

In [67]: df2 = dfmi.copy()

In [68]: df2.loc[idx[:, :, ["C1", "C3"]], :] = df2 * 1000

In [69]: df2 Out[69]: lvl0 a b
lvl1 bar foo bah foo A0 B0 C0 D0 1 0 3 2 D1 5 4 7 6 C1 D0 9000 8000 11000 10000 D1 13000 12000 15000 14000 C2 D0 17 16 19 18 ... ... ... ... ... A3 B1 C1 D1 237000 236000 239000 238000 C2 D0 241 240 243 242 D1 245 244 247 246 C3 D0 249000 248000 251000 250000 D1 253000 252000 255000 254000

[64 rows x 4 columns]

Cross-section#

The xs() method of DataFrame additionally takes a level argument to make selecting data at a particular level of a MultiIndex easier.

In [70]: df Out[70]: A B C first second
bar one 0.895717 0.410835 -1.413681 two 0.805244 0.813850 1.607920 baz one -1.206412 0.132003 1.024180 two 2.565646 -0.827317 0.569605 foo one 1.431256 -0.076467 0.875906 two 1.340309 -1.187678 -2.211372 qux one -1.170299 1.130127 0.974466 two -0.226169 -1.436737 -2.006747

In [71]: df.xs("one", level="second") Out[71]: A B C first
bar 0.895717 0.410835 -1.413681 baz -1.206412 0.132003 1.024180 foo 1.431256 -0.076467 0.875906 qux -1.170299 1.130127 0.974466

The same selection using slicers:

In [72]: df.loc[(slice(None), "one"), :] Out[72]: A B C first second
bar one 0.895717 0.410835 -1.413681 baz one -1.206412 0.132003 1.024180 foo one 1.431256 -0.076467 0.875906 qux one -1.170299 1.130127 0.974466

You can also select on the columns with xs, by providing the axis argument.

In [73]: df = df.T

In [74]: df.xs("one", level="second", axis=1) Out[74]: first bar baz foo qux A 0.895717 -1.206412 1.431256 -1.170299 B 0.410835 0.132003 -0.076467 1.130127 C -1.413681 1.024180 0.875906 0.974466

The same selection using slicers:

In [75]: df.loc[:, (slice(None), "one")] Out[75]: first bar baz foo qux second one one one one A 0.895717 -1.206412 1.431256 -1.170299 B 0.410835 0.132003 -0.076467 1.130127 C -1.413681 1.024180 0.875906 0.974466

xs also allows selection with multiple keys.

In [76]: df.xs(("one", "bar"), level=("second", "first"), axis=1) Out[76]: first bar second one A 0.895717 B 0.410835 C -1.413681

The same selection using slicers:

In [77]: df.loc[:, ("bar", "one")] Out[77]: A 0.895717 B 0.410835 C -1.413681 Name: (bar, one), dtype: float64

You can pass drop_level=False to xs to retain the level that was selected.

In [78]: df.xs("one", level="second", axis=1, drop_level=False) Out[78]: first bar baz foo qux second one one one one A 0.895717 -1.206412 1.431256 -1.170299 B 0.410835 0.132003 -0.076467 1.130127 C -1.413681 1.024180 0.875906 0.974466

Compare the above with the result using drop_level=True (the default value).

In [79]: df.xs("one", level="second", axis=1, drop_level=True) Out[79]: first bar baz foo qux A 0.895717 -1.206412 1.431256 -1.170299 B 0.410835 0.132003 -0.076467 1.130127 C -1.413681 1.024180 0.875906 0.974466
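The relationship between xs() and the slicer form can be checked directly. This sketch uses a small illustrative frame and droplevel() to remove the selected level from the slicer result so the two can be compared.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=index, columns=["A", "B"])

via_xs = df.xs("one", level="second")

# The slicer form keeps the selected level, so drop it to match xs.
via_loc = df.loc[(slice(None), "one"), :].droplevel("second")
print(via_xs.equals(via_loc))

# drop_level=False makes xs keep the level, like the raw slicer result.
kept = df.xs("one", level="second", drop_level=False)
print(kept.index.nlevels)
```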

Advanced reindexing and alignment#

Using the parameter level in the reindex() and align() methods of pandas objects is useful to broadcast values across a level. For instance:

In [80]: midx = pd.MultiIndex(
   ....:     levels=[["zero", "one"], ["x", "y"]], codes=[[1, 1, 0, 0], [1, 0, 1, 0]]
   ....: )
   ....:

In [81]: df = pd.DataFrame(np.random.randn(4, 2), index=midx)

In [82]: df Out[82]: 0 1 one y 1.519970 -0.493662 x 0.600178 0.274230 zero y 0.132885 -0.023688 x 2.410179 1.450520

In [83]: df2 = df.groupby(level=0).mean()

In [84]: df2 Out[84]: 0 1 one 1.060074 -0.109716 zero 1.271532 0.713416

In [85]: df2.reindex(df.index, level=0) Out[85]: 0 1 one y 1.060074 -0.109716 x 1.060074 -0.109716 zero y 1.271532 0.713416 x 1.271532 0.713416

The same broadcast when aligning:

In [86]: df_aligned, df2_aligned = df.align(df2, level=0)

In [87]: df_aligned Out[87]: 0 1 one y 1.519970 -0.493662 x 0.600178 0.274230 zero y 0.132885 -0.023688 x 2.410179 1.450520

In [88]: df2_aligned Out[88]: 0 1 one y 1.060074 -0.109716 x 1.060074 -0.109716 zero y 1.271532 0.713416 x 1.271532 0.713416
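A common use of this broadcasting is subtracting a group statistic from each row. The sketch below uses fixed, illustrative values so the arithmetic can be checked by hand.

```python
import pandas as pd

midx = pd.MultiIndex.from_product([["one", "zero"], ["x", "y"]])
df = pd.DataFrame({"value": [1.0, 3.0, 10.0, 30.0]}, index=midx)

# One row per label of the outer level.
means = df.groupby(level=0).mean()

# Repeat each group mean for every row of that group in the original index.
broadcast = means.reindex(df.index, level=0)
print(broadcast["value"].tolist())

# Deviation of each row from its group mean.
deviation = df - broadcast
print(deviation["value"].tolist())
```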

Swapping levels with `swaplevel`#

The swaplevel() method can switch the order of two levels:

In [89]: df[:5] Out[89]: 0 1 one y 1.519970 -0.493662 x 0.600178 0.274230 zero y 0.132885 -0.023688 x 2.410179 1.450520

In [90]: df[:5].swaplevel(0, 1, axis=0) Out[90]: 0 1 y one 1.519970 -0.493662 x one 0.600178 0.274230 y zero 0.132885 -0.023688 x zero 2.410179 1.450520

Reordering levels with `reorder_levels`#

The reorder_levels() method generalizes the swaplevel method, allowing you to permute the hierarchical index levels in one step:

In [91]: df[:5].reorder_levels([1, 0], axis=0) Out[91]: 0 1 y one 1.519970 -0.493662 x one 0.600178 0.274230 y zero 0.132885 -0.023688 x zero 2.410179 1.450520
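With only two levels, swaplevel and reorder_levels give the same result. A three-level index shows the difference; the level names outer, middle, and inner below are illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["a", "b"], ["x", "y"], [1, 2]], names=["outer", "middle", "inner"]
)
s = pd.Series(range(8), index=index)

# swaplevel exchanges exactly two levels.
swapped = s.swaplevel("outer", "inner")
print(list(swapped.index.names))

# reorder_levels takes the complete new order, so any permutation is possible.
reordered = s.reorder_levels(["inner", "outer", "middle"])
print(list(reordered.index.names))

# Only the level order changes; the rows stay in their original positions.
print(reordered.tolist() == s.tolist())
```

Neither method sorts the result, so a call to sort_index() may be needed afterwards if sorted order matters.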

Renaming names of an `Index` or `MultiIndex`#

The rename() method is used to rename the labels of a MultiIndex, and is typically used to rename the columns of a DataFrame. The columns argument of rename allows a dictionary to be specified that includes only the columns you wish to rename.

In [92]: df.rename(columns={0: "col0", 1: "col1"}) Out[92]: col0 col1 one y 1.519970 -0.493662 x 0.600178 0.274230 zero y 0.132885 -0.023688 x 2.410179 1.450520

This method can also be used to rename specific labels of the main index of the DataFrame.

In [93]: df.rename(index={"one": "two", "y": "z"}) Out[93]: 0 1 two z 1.519970 -0.493662 x 0.600178 0.274230 zero z 0.132885 -0.023688 x 2.410179 1.450520

The rename_axis() method is used to rename the name of an Index or MultiIndex. In particular, the names of the levels of a MultiIndex can be specified, which is useful if reset_index() is later used to move the values from the MultiIndex to a column.

In [94]: df.rename_axis(index=["abc", "def"]) Out[94]: 0 1 abc def
one y 1.519970 -0.493662 x 0.600178 0.274230 zero y 0.132885 -0.023688 x 2.410179 1.450520

Note that the columns of a DataFrame are an index, so that using rename_axis with the columns argument will change the name of that index.

In [95]: df.rename_axis(columns="Cols").columns Out[95]: RangeIndex(start=0, stop=2, step=1, name='Cols')

Both rename and rename_axis support specifying a dictionary, Series or a mapping function to map labels/names to new values.

When working with an Index object directly, rather than via a DataFrame, Index.set_names() can be used to change the names.

In [96]: mi = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["x", "y"])

In [97]: mi.names Out[97]: FrozenList(['x', 'y'])

In [98]: mi2 = mi.rename("new name", level=0)

In [99]: mi2 Out[99]: MultiIndex([(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')], names=['new name', 'y'])

You cannot set the names of the MultiIndex via a level.

In [100]: mi.levels[0].name = "name via level"

RuntimeError                              Traceback (most recent call last)
Cell In[100], line 1
----> 1 mi.levels[0].name = "name via level"

File ~/work/pandas/pandas/pandas/core/indexes/base.py:1690, in Index.name(self, value)
   1686 @name.setter
   1687 def name(self, value: Hashable) -> None:
   1688     if self._no_setting_name:
   1689         # Used in MultiIndex.levels to avoid silently ignoring name updates.
-> 1690         raise RuntimeError(
   1691             "Cannot set name on a level of a MultiIndex. Use "
   1692             "'MultiIndex.set_names' instead."
   1693         )
   1694     maybe_extract_name(value, None, type(self))
   1695     self._name = value

RuntimeError: Cannot set name on a level of a MultiIndex. Use 'MultiIndex.set_names' instead.

Use Index.set_names() instead.
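A minimal sketch of set_names(), renaming either a single level or all levels at once. The new names are arbitrary examples.

```python
import pandas as pd

mi = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["x", "y"])

# Rename a single level; the original index is left untouched.
renamed_one = mi.set_names("new name", level=0)

# Rename every level at once by passing a list.
renamed_all = mi.set_names(["outer", "inner"])

print(list(mi.names))
print(list(renamed_one.names))
print(list(renamed_all.names))
```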

Sorting a `MultiIndex`#

For MultiIndex-ed objects to be indexed and sliced effectively, they need to be sorted. As with any index, you can use sort_index().

In [101]: import random

In [102]: random.shuffle(tuples)

In [103]: s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))

In [104]: s Out[104]: baz one 0.206053 bar one -0.251905 baz two -2.213588 qux two 1.063327 bar two 1.266143 qux one 0.299368 foo two -0.863838 one 0.408204 dtype: float64

In [105]: s.sort_index() Out[105]: bar one -0.251905 two 1.266143 baz one 0.206053 two -2.213588 foo one 0.408204 two -0.863838 qux one 0.299368 two 1.063327 dtype: float64

In [106]: s.sort_index(level=0) Out[106]: bar one -0.251905 two 1.266143 baz one 0.206053 two -2.213588 foo one 0.408204 two -0.863838 qux one 0.299368 two 1.063327 dtype: float64

In [107]: s.sort_index(level=1) Out[107]: bar one -0.251905 baz one 0.206053 foo one 0.408204 qux one 0.299368 bar two 1.266143 baz two -2.213588 foo two -0.863838 qux two 1.063327 dtype: float64

You may also pass a level name to sort_index if the MultiIndex levels are named.

In [108]: s.index = s.index.set_names(["L1", "L2"])

In [109]: s.sort_index(level="L1") Out[109]: L1 L2 bar one -0.251905 two 1.266143 baz one 0.206053 two -2.213588 foo one 0.408204 two -0.863838 qux one 0.299368 two 1.063327 dtype: float64

In [110]: s.sort_index(level="L2") Out[110]: L1 L2 bar one -0.251905 baz one 0.206053 foo one 0.408204 qux one 0.299368 bar two 1.266143 baz two -2.213588 foo two -0.863838 qux two 1.063327 dtype: float64

On higher dimensional objects, you can sort any of the other axes by level if they have a MultiIndex:

In [111]: df.T.sort_index(level=1, axis=1) Out[111]: one zero one zero x x y y 0 0.600178 2.410179 1.519970 0.132885 1 0.274230 1.450520 -0.493662 -0.023688

Indexing will work even if the data are not sorted, but will be rather inefficient (and show a PerformanceWarning). It will also return a copy of the data rather than a view:

In [112]: dfm = pd.DataFrame(
   .....:     {"jim": [0, 0, 1, 1], "joe": ["x", "x", "z", "y"], "jolie": np.random.rand(4)}
   .....: )
   .....:

In [113]: dfm = dfm.set_index(["jim", "joe"])

In [114]: dfm
Out[114]:
            jolie
jim joe
0   x    0.490671
    x    0.120248
1   z    0.537020
    y    0.110968

In [115]: dfm.loc[(1, 'z')]
Out[115]:
           jolie
jim joe
1   z    0.53702

Furthermore, if you try to index something that is not fully lexsorted, this can raise:

In [116]: dfm.loc[(0, 'y'):(1, 'z')]

UnsortedIndexError                        Traceback (most recent call last)
Cell In[116], line 1
----> 1 dfm.loc[(0, 'y'):(1, 'z')]

File ~/work/pandas/pandas/pandas/core/indexing.py:1191, in _LocationIndexer.__getitem__(self, key)
   1189 maybe_callable = com.apply_if_callable(key, self.obj)
   1190 maybe_callable = self._check_deprecated_callable_usage(key, maybe_callable)
-> 1191 return self._getitem_axis(maybe_callable, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1411, in _LocIndexer._getitem_axis(self, key, axis)
   1409 if isinstance(key, slice):
   1410     self._validate_key(key, axis)
-> 1411     return self._get_slice_axis(key, axis=axis)
   1412 elif com.is_bool_indexer(key):
   1413     return self._getbool_axis(key, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1443, in _LocIndexer._get_slice_axis(self, slice_obj, axis)
   1440     return obj.copy(deep=False)
   1442 labels = obj._get_axis(axis)
-> 1443 indexer = labels.slice_indexer(slice_obj.start, slice_obj.stop, slice_obj.step)
   1445 if isinstance(indexer, slice):
   1446     return self.obj._slice(indexer, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6662, in Index.slice_indexer(self, start, end, step)
   6618 def slice_indexer(
   6619     self,
   6620     start: Hashable | None = None,
   6621     end: Hashable | None = None,
   6622     step: int | None = None,
   6623 ) -> slice:
   6624     """
   6625     Compute the slice indexer for input labels and step.
   6626
   (...)
   6660     slice(1, 3, None)
   6661     """
-> 6662     start_slice, end_slice = self.slice_locs(start, end, step=step)
   6664     # return a slice
   6665     if not is_scalar(start_slice):

File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2904, in MultiIndex.slice_locs(self, start, end, step)
   2852 """
   2853 For an ordered MultiIndex, compute the slice locations for input
   2854 labels.
   (...)
   2900     sequence of such.
   2901 """
   2902 # This function adds nothing to its parent implementation (the magic
   2903 # happens in get_slice_bound method), but it adds meaningful doc.
-> 2904 return super().slice_locs(start, end, step)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6879, in Index.slice_locs(self, start, end, step)
   6877 start_slice = None
   6878 if start is not None:
-> 6879     start_slice = self.get_slice_bound(start, "left")
   6880 if start_slice is None:
   6881     start_slice = 0

File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2848, in MultiIndex.get_slice_bound(self, label, side)
   2846 if not isinstance(label, tuple):
   2847     label = (label,)
-> 2848 return self._partial_tup_index(label, side=side)

File ~/work/pandas/pandas/pandas/core/indexes/multi.py:2908, in MultiIndex._partial_tup_index(self, tup, side)
   2906 def _partial_tup_index(self, tup: tuple, side: Literal["left", "right"] = "left"):
   2907     if len(tup) > self._lexsort_depth:
-> 2908         raise UnsortedIndexError(
   2909             f"Key length ({len(tup)}) was greater than MultiIndex lexsort depth "
   2910             f"({self._lexsort_depth})"
   2911         )
   2913     n = len(tup)
   2914     start, end = 0, len(self)

UnsortedIndexError: 'Key length (2) was greater than MultiIndex lexsort depth (1)'

The is_monotonic_increasing property of a MultiIndex shows whether the index is sorted:

In [117]: dfm.index.is_monotonic_increasing Out[117]: False

In [118]: dfm = dfm.sort_index()

In [119]: dfm
Out[119]:
            jolie
jim joe
0   x    0.490671
    x    0.120248
1   y    0.110968
    z    0.537020

In [120]: dfm.index.is_monotonic_increasing Out[120]: True

And now selection works as expected.

In [121]: dfm.loc[(0, "y"):(1, "z")]
Out[121]:
            jolie
jim joe
1   y    0.110968
    z    0.537020
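The check-then-sort pattern above can be wrapped in a small helper. This is only a sketch: safe_slice is a hypothetical function written for illustration, not a pandas API, and the jolie values are fixed rather than random.

```python
import pandas as pd

dfm = pd.DataFrame(
    {"jim": [0, 0, 1, 1], "joe": ["x", "x", "z", "y"], "jolie": [0.1, 0.2, 0.3, 0.4]}
).set_index(["jim", "joe"])


def safe_slice(frame, start, stop):
    """Slice by label, sorting the index first if it is not monotonic."""
    if not frame.index.is_monotonic_increasing:
        frame = frame.sort_index()
    return frame.loc[start:stop]


# Slicing the unsorted frame with full-depth keys raises UnsortedIndexError.
raised = False
try:
    dfm.loc[(0, "y"):(1, "z")]
except pd.errors.UnsortedIndexError as exc:
    raised = True
    print("unsorted:", exc)

result = safe_slice(dfm, (0, "y"), (1, "z"))
print(result)
```

Sorting on every call has a cost, so for repeated slicing it is usually better to sort the frame once and keep the sorted copy.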

Take methods#

Similar to NumPy ndarrays, pandas Index, Series, and DataFrame also provide the take() method, which retrieves elements along a given axis at the given indices. The given indices must be either a list or an ndarray of integer index positions. take also accepts negative integers as positions relative to the end of the object.

In [122]: index = pd.Index(np.random.randint(0, 1000, 10))

In [123]: index Out[123]: Index([214, 502, 712, 567, 786, 175, 993, 133, 758, 329], dtype='int64')

In [124]: positions = [0, 9, 3]

In [125]: index[positions] Out[125]: Index([214, 329, 567], dtype='int64')

In [126]: index.take(positions) Out[126]: Index([214, 329, 567], dtype='int64')

In [127]: ser = pd.Series(np.random.randn(10))

In [128]: ser.iloc[positions] Out[128]: 0 -0.179666 9 1.824375 3 0.392149 dtype: float64

In [129]: ser.take(positions) Out[129]: 0 -0.179666 9 1.824375 3 0.392149 dtype: float64
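Negative positions are mentioned above but not shown. The sketch below uses an illustrative Series with string labels to make clear that take works on positions, not labels.

```python
import pandas as pd

ser = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# Positions, not labels: 0 is the first element, -1 the last.
picked = ser.take([0, -1, 2])
print(picked.tolist())
print(list(picked.index))

# For integer positions, take matches iloc.
print(picked.equals(ser.iloc[[0, -1, 2]]))
```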

For DataFrames, the given indices should be a 1d list or ndarray that specifies row or column positions.

In [130]: frm = pd.DataFrame(np.random.randn(5, 3))

In [131]: frm.take([1, 4, 3]) Out[131]: 0 1 2 1 -1.237881 0.106854 -1.276829 4 0.629675 -1.425966 1.857704 3 0.979542 -1.633678 0.615855

In [132]: frm.take([0, 2], axis=1) Out[132]: 0 2 0 0.595974 0.601544 1 -1.237881 -1.276829 2 -0.767101 1.499591 3 0.979542 0.615855 4 0.629675 1.857704

It is important to note that the take method on pandas objects is not intended to work on boolean indices and may return unexpected results.

In [133]: arr = np.random.randn(10)

In [134]: arr.take([False, False, True, True]) Out[134]: array([-1.1935, -1.1935, 0.6775, 0.6775])

In [135]: arr[[0, 1]] Out[135]: array([-1.1935, 0.6775])

In [136]: ser = pd.Series(np.random.randn(10))

In [137]: ser.take([False, False, True, True]) Out[137]: 0 0.233141 0 0.233141 1 -0.223540 1 -0.223540 dtype: float64

In [138]: ser.iloc[[0, 1]] Out[138]: 0 0.233141 1 -0.223540 dtype: float64
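If the selection starts out as a boolean mask, one possible approach is to convert the mask to integer positions before calling take, or to use boolean indexing directly. The sketch below assumes a small made-up Series; `np.flatnonzero` is used here only as one convenient way to obtain the positions of the True entries.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.5, -0.5, 2.0, -3.0])
mask = (ser > 0).to_numpy()

# Convert the boolean mask to integer positions before calling take.
positions = np.flatnonzero(mask)
taken = ser.take(positions)

# Boolean indexing selects the same rows without going through take.
selected = ser[mask]

print(taken)
print(selected)
```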

Finally, as a small note on performance, because the take method handles a narrower range of inputs, it can offer performance that is a good deal faster than fancy indexing.

In [139]: arr = np.random.randn(10000, 5)

In [140]: indexer = np.arange(10000)

In [141]: np.random.shuffle(indexer)

In [142]: %timeit arr[indexer] .....: %timeit arr.take(indexer, axis=0) .....: 262 us +- 15.4 us per loop (mean +- std. dev. of 7 runs, 1,000 loops each) 75.7 us +- 3.63 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)

In [143]: ser = pd.Series(arr[:, 0])

In [144]: %timeit ser.iloc[indexer] .....: %timeit ser.take(indexer) .....: 141 us +- 6.06 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each) 140 us +- 7.41 us per loop (mean +- std. dev. of 7 runs, 10,000 loops each)

Index types#

We have discussed MultiIndex fairly extensively in the previous sections. DatetimeIndex and PeriodIndex are covered in the time series documentation, and TimedeltaIndex is covered in the time deltas documentation.

In the following sub-sections we will highlight some other index types.

CategoricalIndex#

CategoricalIndex is a type of index that is useful for supporting indexing with duplicates. It is a container around a Categorical and allows efficient indexing and storage of an index with a large number of duplicated elements.

In [145]: from pandas.api.types import CategoricalDtype

In [146]: df = pd.DataFrame({"A": np.arange(6), "B": list("aabbca")})

In [147]: df["B"] = df["B"].astype(CategoricalDtype(list("cab")))

In [148]: df Out[148]: A B 0 0 a 1 1 a 2 2 b 3 3 b 4 4 c 5 5 a

In [149]: df.dtypes Out[149]: A int64 B category dtype: object

In [150]: df["B"].cat.categories Out[150]: Index(['c', 'a', 'b'], dtype='object')

Setting the index will create a CategoricalIndex.

In [151]: df2 = df.set_index("B")

In [152]: df2.index Out[152]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')

Indexing with __getitem__/.iloc/.loc works similarly to an Index with duplicates. The indexers must be in the categories or the operation will raise a KeyError.

In [153]: df2.loc["a"] Out[153]: A B
a 0 a 1 a 5
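The KeyError case can be sketched as follows. The frame is rebuilt here so the snippet is self-contained, and the label "e" is just an arbitrary value that is not among the categories.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": np.arange(6), "B": list("aabbca")})
df["B"] = df["B"].astype(CategoricalDtype(list("cab")))
df2 = df.set_index("B")

# "e" is not one of the categories ('c', 'a', 'b'), so the lookup fails.
try:
    df2.loc["e"]
    raised = False
except KeyError:
    raised = True

print(raised)
```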

The CategoricalIndex is preserved after indexing:

In [154]: df2.loc["a"].index Out[154]: CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')

Sorting the index will sort by the order of the categories (recall that we created the index with CategoricalDtype(list('cab')), so the sorted order is cab).

In [155]: df2.sort_index() Out[155]: A B
c 4 a 0 a 1 a 5 b 2 b 3

Groupby operations on the index will preserve the index nature as well.

In [156]: df2.groupby(level=0, observed=True).sum() Out[156]: A B
c 4 a 6 b 5

In [157]: df2.groupby(level=0, observed=True).sum().index Out[157]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')

Reindexing operations will return a resulting index based on the type of the passed indexer. Passing a list will return a plain-old Index; indexing with a Categorical will return a CategoricalIndex, indexed according to the categories of the passed Categorical dtype. This allows one to arbitrarily index these even with values not in the categories, similarly to how you can reindex any pandas index.

In [158]: df3 = pd.DataFrame( .....: {"A": np.arange(3), "B": pd.Series(list("abc")).astype("category")} .....: ) .....:

In [159]: df3 = df3.set_index("B")

In [160]: df3 Out[160]: A B
a 0 b 1 c 2

In [161]: df3.reindex(["a", "e"]) Out[161]: A B
a 0.0 e NaN

In [162]: df3.reindex(["a", "e"]).index Out[162]: Index(['a', 'e'], dtype='object', name='B')

In [163]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe"))) Out[163]: A B
a 0.0 e NaN

In [164]: df3.reindex(pd.Categorical(["a", "e"], categories=list("abe"))).index Out[164]: CategoricalIndex(['a', 'e'], categories=['a', 'b', 'e'], ordered=False, dtype='category', name='B')

Warning

Reshaping and comparison operations on CategoricalIndex objects require the same categories on both sides, or a TypeError will be raised.

In [165]: df4 = pd.DataFrame({"A": np.arange(2), "B": list("ba")})

In [166]: df4["B"] = df4["B"].astype(CategoricalDtype(list("ab")))

In [167]: df4 = df4.set_index("B")

In [168]: df4.index Out[168]: CategoricalIndex(['b', 'a'], categories=['a', 'b'], ordered=False, dtype='category', name='B')

In [169]: df5 = pd.DataFrame({"A": np.arange(2), "B": list("bc")})

In [170]: df5["B"] = df5["B"].astype(CategoricalDtype(list("bc")))

In [171]: df5 = df5.set_index("B")

In [172]: df5.index Out[172]: CategoricalIndex(['b', 'c'], categories=['b', 'c'], ordered=False, dtype='category', name='B')

In [173]: pd.concat([df4, df5]) Out[173]: A B
b 0 a 1 b 0 c 1
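The concat call above returns a result rather than raising. The comparison case from the warning can be sketched separately; the two indexes below mirror df4.index and df5.index but are built directly (the names `left` and `right` are illustrative), and the behaviour shown is what is expected for pandas 2.x.

```python
import pandas as pd

left = pd.CategoricalIndex(["b", "a"], categories=["a", "b"])
right = pd.CategoricalIndex(["b", "c"], categories=["b", "c"])

# Element-wise comparison requires both sides to share the same categories.
try:
    left == right
    raised = False
except TypeError as err:
    raised = True
    print(err)
```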

RangeIndex#

RangeIndex is a sub-class of Index that provides the default index for all DataFrame and Series objects. RangeIndex is an optimized version of Index that can represent a monotonic ordered set. These are analogous to Python range types. A RangeIndex will always have an int64 dtype.

In [174]: idx = pd.RangeIndex(5)

In [175]: idx Out[175]: RangeIndex(start=0, stop=5, step=1)

RangeIndex is the default index for all DataFrame and Series objects:

In [176]: ser = pd.Series([1, 2, 3])

In [177]: ser.index Out[177]: RangeIndex(start=0, stop=3, step=1)

In [178]: df = pd.DataFrame([[1, 2], [3, 4]])

In [179]: df.index Out[179]: RangeIndex(start=0, stop=2, step=1)

In [180]: df.columns Out[180]: RangeIndex(start=0, stop=2, step=1)

A RangeIndex behaves similarly to an Index with an int64 dtype. Operations on a RangeIndex whose result cannot be represented by a RangeIndex, but should have an integer dtype, return an Index with an int64 dtype. For example:

In [181]: idx[[0, 2]] Out[181]: Index([0, 2], dtype='int64')
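As a rough sketch of the distinction, a slice of a RangeIndex can still be described by start, stop and step, while an unevenly spaced selection cannot. The positions [0, 1, 3] below are chosen deliberately so that no range can represent them.

```python
import pandas as pd

idx = pd.RangeIndex(5)

# Like a Python range, a RangeIndex is described by start, stop and step.
print(idx.start, idx.stop, idx.step)

# A slice is still a range, so the result stays a RangeIndex.
evens = idx[::2]
print(type(evens).__name__, list(evens))

# Unevenly spaced positions cannot be a range, so a plain Index is returned.
picked = idx[[0, 1, 3]]
print(type(picked).__name__, picked.dtype)
```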

IntervalIndex#

IntervalIndex, together with its own dtype, IntervalDtype, and the Interval scalar type, allows first-class support in pandas for interval notation.

The IntervalIndex allows some unique indexing and is also used as a return type for the categories in cut() and qcut().

Indexing with an `IntervalIndex`#

An IntervalIndex can be used in Series and in DataFrame as the index.

In [182]: df = pd.DataFrame( .....: {"A": [1, 2, 3, 4]}, index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4]) .....: ) .....:

In [183]: df Out[183]: A (0, 1] 1 (1, 2] 2 (2, 3] 3 (3, 4] 4

Label based indexing via .loc along the edges of an interval works as you would expect, selecting that particular interval.

In [184]: df.loc[2] Out[184]: A 2 Name: (1, 2], dtype: int64

In [185]: df.loc[[2, 3]] Out[185]: A (1, 2] 2 (2, 3] 3
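The reason df.loc[2] selects (1, 2] rather than (2, 3] is that the intervals are closed on the right by default. A small sketch of that membership rule, using the Interval scalar and the same breaks as above (the names `iv` and `index` are illustrative):

```python
import pandas as pd

iv = pd.Interval(1, 2)  # closed on the right by default, i.e. (1, 2]

print(2 in iv)    # the right edge belongs to the interval
print(1 in iv)    # the left edge does not
print(1.5 in iv)  # an interior point

index = pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4])

# get_loc gives the position of the interval that contains the label.
print(index.get_loc(2))
```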

If you select a label contained within an interval, this will also select the interval.

In [186]: df.loc[2.5] Out[186]: A 3 Name: (2, 3], dtype: int64

In [187]: df.loc[[2.5, 3.5]] Out[187]: A (2, 3] 3 (3, 4] 4

Selecting using an Interval will only return exact matches.

In [188]: df.loc[pd.Interval(1, 2)] Out[188]: A 2 Name: (1, 2], dtype: int64

Trying to select an Interval that is not exactly contained in the IntervalIndex will raise a KeyError.

In [189]: df.loc[pd.Interval(0.5, 2.5)]

KeyError Traceback (most recent call last) Cell In[189], line 1 ----> 1 df.loc[pd.Interval(0.5, 2.5)]

File ~/work/pandas/pandas/pandas/core/indexing.py:1431, in _LocIndexer._getitem_axis(self, key, axis) 1429 # fall thru to straight lookup 1430 self._validate_key(key, axis) -> 1431 return self._get_label(key, axis=axis)

File ~/work/pandas/pandas/pandas/core/indexing.py:1381, in _LocIndexer._get_label(self, label, axis) 1379 def _get_label(self, label, axis: AxisInt): 1380 # GH#5567 this will fail if the label is not present in the axis. -> 1381 return self.obj.xs(label, axis=axis)

File ~/work/pandas/pandas/pandas/core/generic.py:4301, in NDFrame.xs(self, key, axis, level, drop_level) 4299 new_index = index[loc] 4300 else: -> 4301 loc = index.get_loc(key) 4303 if isinstance(loc, np.ndarray): 4304 if loc.dtype == np.bool_:

File ~/work/pandas/pandas/pandas/core/indexes/interval.py:678, in IntervalIndex.get_loc(self, key) 676 matches = mask.sum() 677 if matches == 0: --> 678 raise KeyError(key) 679 if matches == 1: 680 return mask.argmax()

KeyError: Interval(0.5, 2.5, closed='right')

Selecting all Intervals that overlap a given Interval can be performed using the overlaps() method to create a boolean indexer.

In [190]: idxr = df.index.overlaps(pd.Interval(0.5, 2.5))

In [191]: idxr Out[191]: array([ True, True, True, False])

In [192]: df[idxr] Out[192]: A (0, 1] 1 (1, 2] 2 (2, 3] 3

Binning data with `cut` and `qcut`#

cut() and qcut() both return a Categorical object, and the bins they create are stored as an IntervalIndex in its .categories attribute.

In [193]: c = pd.cut(range(4), bins=2)

In [194]: c Out[194]: [(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]] Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]]

In [195]: c.categories Out[195]: IntervalIndex([(-0.003, 1.5], (1.5, 3.0]], dtype='interval[float64, right]')

cut() also accepts an IntervalIndex for its bins argument, which enables a useful pandas idiom. First, we call cut() with some data and bins set to a fixed number, to generate the bins. Then, we pass the values of .categories as the bins argument in subsequent calls to cut(), supplying new data which will be binned into the same bins.

In [196]: pd.cut([0, 3, 5, 1], bins=c.categories) Out[196]: [(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]] Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]]

Any value which falls outside all bins will be assigned a NaN value.
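One way to detect such out-of-range values programmatically is to inspect the resulting Categorical. The sketch below repeats the two cut() calls from above so that it is self-contained; `binned` is an illustrative name.

```python
import pandas as pd

c = pd.cut(range(4), bins=2)
binned = pd.cut([0, 3, 5, 1], bins=c.categories)

# A code of -1 marks a value that fell outside every bin.
print(binned.codes)

# isna() gives the same information as a boolean array.
print(binned.isna())
```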

Generating ranges of intervals#

If we need intervals on a regular frequency, we can use the interval_range() function to create an IntervalIndex using various combinations of start, end, and periods. The default frequency for interval_range is 1 for numeric intervals, and calendar day for datetime-like intervals:

In [197]: pd.interval_range(start=0, end=5) Out[197]: IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]], dtype='interval[int64, right]')

In [198]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4) Out[198]: IntervalIndex([(2017-01-01 00:00:00, 2017-01-02 00:00:00], (2017-01-02 00:00:00, 2017-01-03 00:00:00], (2017-01-03 00:00:00, 2017-01-04 00:00:00], (2017-01-04 00:00:00, 2017-01-05 00:00:00]], dtype='interval[datetime64[ns], right]')

In [199]: pd.interval_range(end=pd.Timedelta("3 days"), periods=3) Out[199]: IntervalIndex([(0 days 00:00:00, 1 days 00:00:00], (1 days 00:00:00, 2 days 00:00:00], (2 days 00:00:00, 3 days 00:00:00]], dtype='interval[timedelta64[ns], right]')

The freq parameter can be used to specify non-default frequencies, and can utilize a variety of frequency aliases with datetime-like intervals:

In [200]: pd.interval_range(start=0, periods=5, freq=1.5) Out[200]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, 7.5]], dtype='interval[float64, right]')

In [201]: pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4, freq="W") Out[201]: IntervalIndex([(2017-01-01 00:00:00, 2017-01-08 00:00:00], (2017-01-08 00:00:00, 2017-01-15 00:00:00], (2017-01-15 00:00:00, 2017-01-22 00:00:00], (2017-01-22 00:00:00, 2017-01-29 00:00:00]], dtype='interval[datetime64[ns], right]')

In [202]: pd.interval_range(start=pd.Timedelta("0 days"), periods=3, freq="9h") Out[202]: IntervalIndex([(0 days 00:00:00, 0 days 09:00:00], (0 days 09:00:00, 0 days 18:00:00], (0 days 18:00:00, 1 days 03:00:00]], dtype='interval[timedelta64[ns], right]')

Additionally, the closed parameter can be used to specify which side(s) the intervals are closed on. Intervals are closed on the right side by default.

In [203]: pd.interval_range(start=0, end=4, closed="both") Out[203]: IntervalIndex([[0, 1], [1, 2], [2, 3], [3, 4]], dtype='interval[int64, both]')

In [204]: pd.interval_range(start=0, end=4, closed="neither") Out[204]: IntervalIndex([(0, 1), (1, 2), (2, 3), (3, 4)], dtype='interval[int64, neither]')

Specifying start, end, and periods will generate a range of evenly spaced intervals from start to end inclusively, with periods number of elements in the resulting IntervalIndex:

In [205]: pd.interval_range(start=0, end=6, periods=4) Out[205]: IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]], dtype='interval[float64, right]')

In [206]: pd.interval_range(pd.Timestamp("2018-01-01"), pd.Timestamp("2018-02-28"), periods=3) Out[206]: IntervalIndex([(2018-01-01 00:00:00, 2018-01-20 08:00:00], (2018-01-20 08:00:00, 2018-02-08 16:00:00], (2018-02-08 16:00:00, 2018-02-28 00:00:00]], dtype='interval[datetime64[ns], right]')

Miscellaneous indexing FAQ#

Integer indexing#

Label-based indexing with integer axis labels is a thorny topic. It has been discussed heavily on mailing lists and among various members of the scientific Python community. In pandas, our general viewpoint is that labels matter more than integer locations. Therefore, with an integer axis index, only label-based indexing is possible with the standard tools like .loc. In the following code, s[-1] raises a KeyError, and df.loc[-2:] treats -2 as a label rather than as a position:

In [207]: s = pd.Series(range(5))

In [208]: s[-1]

ValueError Traceback (most recent call last) File ~/work/pandas/pandas/pandas/core/indexes/range.py:413, in RangeIndex.get_loc(self, key) 412 try: --> 413 return self._range.index(new_key) 414 except ValueError as err:

ValueError: -1 is not in range

The above exception was the direct cause of the following exception:

KeyError Traceback (most recent call last) Cell In[208], line 1 ----> 1 s[-1]

File ~/work/pandas/pandas/pandas/core/series.py:1121, in Series.getitem(self, key) 1118 return self._values[key] 1120 elif key_is_scalar: -> 1121 return self._get_value(key) 1123 # Convert generator to list before going through hashable part 1124 # (We will iterate through the generator there to check for slices) 1125 if is_iterator(key):

File ~/work/pandas/pandas/pandas/core/series.py:1237, in Series._get_value(self, label, takeable) 1234 return self._values[label] 1236 # Similar to Index.get_value, but we do not fall back to positional -> 1237 loc = self.index.get_loc(label) 1239 if is_integer(loc): 1240 return self._values[loc]

File ~/work/pandas/pandas/pandas/core/indexes/range.py:415, in RangeIndex.get_loc(self, key) 413 return self._range.index(new_key) 414 except ValueError as err: --> 415 raise KeyError(key) from err 416 if isinstance(key, Hashable): 417 raise KeyError(key)

KeyError: -1

In [209]: df = pd.DataFrame(np.random.randn(5, 4))

In [210]: df Out[210]: 0 1 2 3 0 -0.435772 -1.188928 -0.808286 -0.284634 1 -1.815703 1.347213 -0.243487 0.514704 2 1.162969 -0.287725 -0.179734 0.993962 3 -0.212673 0.909872 -0.733333 -0.349893 4 0.456434 -0.306735 0.553396 0.166221

In [211]: df.loc[-2:] Out[211]: 0 1 2 3 0 -0.435772 -1.188928 -0.808286 -0.284634 1 -1.815703 1.347213 -0.243487 0.514704 2 1.162969 -0.287725 -0.179734 0.993962 3 -0.212673 0.909872 -0.733333 -0.349893 4 0.456434 -0.306735 0.553396 0.166221

This deliberate decision was made to prevent ambiguities and subtle bugs (many users reported finding bugs when the API change was made to stop “falling back” on position-based indexing).
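When position-based access is what is actually wanted, .iloc states that intent explicitly. A minimal sketch, reusing the shapes of the objects above (the random values do not matter here):

```python
import numpy as np
import pandas as pd

s = pd.Series(range(5))
df = pd.DataFrame(np.random.randn(5, 4))

# .iloc is purely positional, so negative positions count from the end.
last = s.iloc[-1]
tail = df.iloc[-2:]

print(last)
print(tail)
```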

Non-monotonic indexes require exact matches#

If the index of a Series or DataFrame is monotonically increasing or decreasing, then the bounds of a label-based slice can be outside the range of the index, much like slice indexing a normal Python list. Monotonicity of an index can be tested with the is_monotonic_increasing() and is_monotonic_decreasing() attributes.

In [212]: df = pd.DataFrame(index=[2, 3, 3, 4, 5], columns=["data"], data=list(range(5)))

In [213]: df.index.is_monotonic_increasing Out[213]: True

There are no rows labeled 0 or 1, but the slice still returns rows 2, 3 (both of them), and 4:

In [214]: df.loc[0:4, :] Out[214]: data 2 0 3 1 3 2 4 3

The slice bounds are outside the index, so an empty DataFrame is returned:

In [215]: df.loc[13:15, :] Out[215]: Empty DataFrame Columns: [data] Index: []

On the other hand, if the index is not monotonic, then both slice bounds must be unique members of the index.

In [216]: df = pd.DataFrame(index=[2, 3, 1, 4, 3, 5], columns=["data"], data=list(range(6)))

In [217]: df.index.is_monotonic_increasing Out[217]: False

This works because 2 and 4 are both in the index:

In [218]: df.loc[2:4, :] Out[218]: data 2 0 3 1 1 2 4 3

This fails because 0 is not in the index:

In [219]: df.loc[0:4, :]

KeyError Traceback (most recent call last) File ~/work/pandas/pandas/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key) 3804 try: -> 3805 return self._engine.get_loc(casted_key) 3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:191, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:234, in pandas._libs.index.IndexEngine._get_loc_duplicates()

File index.pyx:242, in pandas._libs.index.IndexEngine._maybe_get_bool_indexer()

File index.pyx:134, in pandas._libs.index._unpack_bool_indexer()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError Traceback (most recent call last) Cell In[219], line 1 ----> 1 df.loc[0:4, :]

File ~/work/pandas/pandas/pandas/core/indexing.py:1184, in _LocationIndexer.getitem(self, key) 1182 if self._is_scalar_access(key): 1183 return self.obj._get_value(*key, takeable=self._takeable) -> 1184 return self._getitem_tuple(key) 1185 else: 1186 # we by definition only have the 0th axis 1187 axis = self.axis or 0

File ~/work/pandas/pandas/pandas/core/indexing.py:1377, in _LocIndexer._getitem_tuple(self, tup) 1374 if self._multi_take_opportunity(tup): 1375 return self._multi_take(tup) -> 1377 return self._getitem_tuple_same_dim(tup)

File ~/work/pandas/pandas/pandas/core/indexing.py:1020, in _LocationIndexer._getitem_tuple_same_dim(self, tup) 1017 if com.is_null_slice(key): 1018 continue -> 1020 retval = getattr(retval, self.name)._getitem_axis(key, axis=i) 1021 # We should never have retval.ndim < self.ndim, as that should 1022 # be handled by the _getitem_lowerdim call above. 1023 assert retval.ndim == self.ndim

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6804, in Index.get_slice_bound(self, label, side) 6801 return self._searchsorted_monotonic(label, side) 6802 except ValueError: 6803 # raise the original KeyError -> 6804 raise err 6806 if isinstance(slc, np.ndarray): 6807 # get_loc may return a boolean array, which 6808 # is OK as long as they are representable by a slice. 6809 assert is_bool_dtype(slc.dtype)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6798, in Index.get_slice_bound(self, label, side) 6796 # we need to look up the label 6797 try: -> 6798 slc = self.get_loc(label) 6799 except KeyError as err: 6800 try:

File ~/work/pandas/pandas/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key) 3807 if isinstance(casted_key, slice) or ( 3808 isinstance(casted_key, abc.Iterable) 3809 and any(isinstance(x, slice) for x in casted_key) 3810 ): 3811 raise InvalidIndexError(key) -> 3812 raise KeyError(key) from err 3813 except TypeError: 3814 # If we have a listlike key, _check_indexing_error will raise 3815 # InvalidIndexError. Otherwise we fall through and re-raise 3816 # the TypeError. 3817 self._check_indexing_error(key)

KeyError: 0

This fails because 3 is not a unique label:

In [220]: df.loc[2:3, :]

KeyError Traceback (most recent call last) Cell In[220], line 1 ----> 1 df.loc[2:3, :]

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6885, in Index.slice_locs(self, start, end, step) 6883 end_slice = None 6884 if end is not None: -> 6885 end_slice = self.get_slice_bound(end, "right") 6886 if end_slice is None: 6887 end_slice = len(self)

File ~/work/pandas/pandas/pandas/core/indexes/base.py:6812, in Index.get_slice_bound(self, label, side) 6810 slc = lib.maybe_booleans_to_slice(slc.view("u1")) 6811 if isinstance(slc, np.ndarray): -> 6812 raise KeyError( 6813 f"Cannot get {side} slice bound for non-unique " 6814 f"label: {repr(original_label)}" 6815 ) 6817 if isinstance(slc, slice): 6818 if side == "left":

KeyError: 'Cannot get right slice bound for non-unique label: 3'
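If the original row order is not important, one possible way around both errors is to sort the index first, which makes it monotonic again. This is only a sketch of that option; kind="stable" is passed so that the two rows labelled 3 keep their relative order.

```python
import pandas as pd

df = pd.DataFrame(index=[2, 3, 1, 4, 3, 5], columns=["data"], data=list(range(6)))

sorted_df = df.sort_index(kind="stable")
print(sorted_df.index.is_monotonic_increasing)

# A bound that is missing from the index is acceptable once the index is monotonic.
out_of_range = sorted_df.loc[0:4, :]

# A duplicated label is also acceptable as a bound.
dup_bound = sorted_df.loc[2:3, :]

print(out_of_range)
print(dup_bound)
```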

Index.is_monotonic_increasing and Index.is_monotonic_decreasing only check that an index is weakly monotonic. To check for strict monotonicity, you can combine one of those with the is_unique() attribute.

In [221]: weakly_monotonic = pd.Index(["a", "b", "c", "c"])

In [222]: weakly_monotonic Out[222]: Index(['a', 'b', 'c', 'c'], dtype='object')

In [223]: weakly_monotonic.is_monotonic_increasing Out[223]: True

In [224]: weakly_monotonic.is_monotonic_increasing & weakly_monotonic.is_unique Out[224]: False

Endpoints are inclusive#

Compared with standard Python sequence slicing in which the slice endpoint is not inclusive, label-based slicing in pandas is inclusive. The primary reason for this is that it is often not possible to easily determine the “successor” or next element after a particular label in an index. For example, consider the following Series:

In [225]: s = pd.Series(np.random.randn(6), index=list("abcdef"))

In [226]: s Out[226]: a -0.101684 b -0.734907 c -0.130121 d -0.476046 e 0.759104 f 0.213379 dtype: float64

Suppose we wished to slice from c to e. Using integers, this would be accomplished as follows:

In [227]: s[2:5] Out[227]: c -0.130121 d -0.476046 e 0.759104 dtype: float64

However, if you only had c and e, determining the next element in the index can be somewhat complicated. For example, the following does not work:

In [228]: s.loc['c':'e' + 1]

TypeError Traceback (most recent call last) Cell In[228], line 1 ----> 1 s.loc['c':'e' + 1]

TypeError: can only concatenate str (not "int") to str

A very common use case is to limit a time series to start and end at two specific dates. To enable this, we made the design choice to make label-based slicing include both endpoints:

In [229]: s.loc["c":"e"] Out[229]: c -0.130121 d -0.476046 e 0.759104 dtype: float64
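The time series use case mentioned above can be sketched in the same way. The dates and the names `ts` and `window` are made up for illustration; both endpoint dates are included in the result.

```python
import pandas as pd

ts = pd.Series(range(10), index=pd.date_range("2024-01-01", periods=10, freq="D"))

# Both the start date and the end date are part of the selection.
window = ts.loc["2024-01-03":"2024-01-05"]
print(window)
```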

This is most definitely a “practicality beats purity” sort of thing, but it is something to watch out for if you expect label-based slicing to behave exactly in the way that standard Python integer slicing works.

Indexing potentially changes underlying Series dtype#

Different indexing operations can potentially change the dtype of a Series.

In [230]: series1 = pd.Series([1, 2, 3])

In [231]: series1.dtype Out[231]: dtype('int64')

In [232]: res = series1.reindex([0, 4])

In [233]: res.dtype Out[233]: dtype('float64')

In [234]: res Out[234]: 0 1.0 4 NaN dtype: float64

In [235]: series2 = pd.Series([True])

In [236]: series2.dtype Out[236]: dtype('bool')

In [237]: res = series2.reindex_like(series1)

In [238]: res.dtype Out[238]: dtype('O')

In [239]: res Out[239]: 0 True 1 NaN 2 NaN dtype: object

This is because the (re)indexing operations above silently insert NaNs and the dtype changes accordingly. This can cause some issues when using numpy ufuncs such as numpy.logical_and.

See GH 2388 for a more detailed discussion.
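Where a sensible default exists for the missing labels, one possible way to avoid the dtype change is to pass fill_value to reindex so that no NaN is inserted. The fill values 0 and False below are assumptions made for the sake of the example, not general recommendations.

```python
import pandas as pd

series1 = pd.Series([1, 2, 3])
series2 = pd.Series([True])

# Filling with an integer keeps the int64 dtype.
kept_int = series1.reindex([0, 4], fill_value=0)
print(kept_int.dtype)

# Filling with a boolean keeps the bool dtype.
kept_bool = series2.reindex(series1.index, fill_value=False)
print(kept_bool.dtype)
```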
