API: capabilities of df.set_index · Issue #24046 · pandas-dev/pandas

This is coming out of a discussion that has stalled #22225 (which is about adding .set_index to Series, see #21684). The discussion has shifted away from what capabilities a putative Series.set_index should have, and towards what capabilities df.set_index currently has.

The main issue (for @jreback) is that df.set_index takes arrays:

@jreback: There were several attempts to have DataFrame.set_index take an array as well, but these never got off the ground.

@h-vetinari: I'm not sure when, but they certainly did get off the ground:

>>> import pandas as pd
>>> import numpy as np
>>> pd.__version__
'0.23.4'
>>>
>>> df = pd.DataFrame(np.random.randint(0, 10, (4, 4)), columns=list('abcd'))
>>> df.set_index(['a',          # label
...               df.index,     # Index
...               df.b ** 2,    # Series
...               df.b.values,  # ndarray
...               list('ABCD'), # list
...               'c'])         # label again
              b  d
a   b      c
0 0 0  2 A 1  0  2
8 1 1  4 B 4  1  4
3 2 25 5 C 8  5  5
0 3 9  7 D 2  3  7
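
For anyone wanting to reproduce the mixed-key call deterministically, here is a minimal sketch. The fixed column values are an assumption (the session above uses random integers), and it assumes a pandas version in which set_index accepts this mix of keys, as 0.23.4 does above. It only shows where the level names come from and which columns are dropped.

```python
import pandas as pd

# Fixed values instead of np.random.randint (an assumption, for reproducibility).
df = pd.DataFrame(
    {"a": [0, 8, 3, 0], "b": [0, 1, 5, 3], "c": [1, 4, 8, 2], "d": [2, 4, 5, 7]}
)

# Same mix of keys as above: label, Index, Series, ndarray, list, label.
result = df.set_index(["a", df.index, df.b ** 2, df.b.values, list("ABCD"), "c"])

# Levels are named only when the key is a column label or a named Series/Index.
level_names = list(result.index.names)

# Column labels used as keys are dropped from the columns by default (drop=True).
remaining_columns = list(result.columns)
```

The ndarray and list levels carry no name, which is why the header row of the output shows only a, b and c.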

Further on:

@jreback: @h-vetinari you are confusing the purpose of .set_axis. [...] The problem with .set_index on a DataFrame with an array is that it technically can work with an array and not keys. (meaning it's not unambiguous)

I don't think I am confusing them. If I want to set the .index-attribute of a Series/DataFrame, then using .set_index is the most reasonable name by far. If anything, set_axis should be a superset of set_index (and a putative set_columns), that just switches between the two based on the axis-kwarg.

More than that, the current capabilities of df.set_index are a proper superset of df.set_axis(axis=0)**, in that keys may consist solely of a Series/Index/ndarray/list etc., in which case the values simply become the new index:

>>> df.set_index(pd.Index(df.a))  # same result as Series directly below
>>> df.set_index(df.a) 
   a  b  c  d
a
0  0  0  1  2
8  8  1  4  4
3  3  5  8  5
0  0  3  2  7
>>> df.set_index(df.a.values)  # same result as list directly below
>>> df.set_index([[0, 8, 3, 0]])
   a  b  c  d
0  0  0  1  2
8  8  1  4  4
3  3  5  8  5
0  0  3  2  7
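
The overlap with set_axis can be checked directly. This is a sketch under two assumptions: fixed column values matching the output above, and a pandas version in which df.set_axis(labels, axis=0) returns a new object (in 0.23.x it still defaulted to operating in place).

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [0, 8, 3, 0], "b": [0, 1, 5, 3], "c": [1, 4, 8, 2], "d": [2, 4, 5, 7]}
)
labels = [0, 8, 3, 0]

via_set_axis = df.set_axis(labels, axis=0)   # assumes a version that returns a new frame
via_nested_list = df.set_index([labels])     # a list must be wrapped in another list
via_ndarray = df.set_index(df.a.values)      # an ndarray can be passed directly

# All three routes should produce the same frame with the same index values.
same = via_set_axis.equals(via_nested_list) and via_set_axis.equals(via_ndarray)
```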

** There is one caveat: lists (and only lists, out of all containers) need to be wrapped in another list, i.e. df.set_index([[0, 8, 3, 0]]) instead of df.set_index([0, 8, 3, 0]). This is the heart of the ambiguity that @jreback mentioned above, because a flat list is interpreted as a list of column keys.
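
To make the caveat concrete, here is a small sketch with fixed column values (an assumption, chosen to match the output above). A flat list is looked up as column labels, so a flat list of values fails unless those values happen to be column names.

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [0, 8, 3, 0], "b": [0, 1, 5, 3], "c": [1, 4, 8, 2], "d": [2, 4, 5, 7]}
)

# Flat list: every element is treated as a column key.
by_keys = df.set_index(["a", "b"])

# Nested list: the inner list is treated as one array of index values.
by_values = df.set_index([[0, 8, 3, 0]])

# Flat list of values: 0, 8 and 3 are looked up as column labels and not found.
try:
    df.set_index([0, 8, 3, 0])
    flat_list_error = None
except KeyError as exc:
    flat_list_error = exc
```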

Summing up:

- set_index is the most natural name for setting the .index-attribute.
- df.set_index should be able to process list-likes (as it currently does; this is the source of the ambiguity in the list case).
- df.set_axis should be able to do everything that df.set_index does, and just switch between operating on index/columns based on the axis-kwarg (after all, index and columns are the two axes of a DF).
  - It could be considered to add a method set_columns on a DataFrame.
  - The axis-kwarg of set_axis should just switch between the behaviour of set_index (i.e. dealing with keys and array-likes) and set_columns.
- Series.set_index should support the same signature as df.set_index, with the exception of the drop-keyword (which only makes sense for column labels).
- For Series, the set_index and set_axis methods should be exactly the same.
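
To illustrate the last two points, here is a rough sketch of what a Series.set_index could look like if it simply borrowed df.set_index. The helper series_set_index and the temporary column name "_values" are hypothetical and not part of pandas; the comparison with Series.set_axis assumes a pandas version in which set_axis returns a new object.

```python
import pandas as pd

def series_set_index(s, arrays, append=False):
    """Hypothetical helper (not a pandas API) emulating a Series.set_index.

    `arrays` is a list of array-likes, one per index level, so a bare list
    of values is never mistaken for a list of column keys. There is no
    `drop` keyword, since a Series has no column labels to drop.
    """
    # Go through a one-column frame and let DataFrame.set_index do the work.
    frame = s.to_frame(name="_values")
    out = frame.set_index(list(arrays), append=append)["_values"]
    out.name = s.name  # restore the original name (may be None)
    return out

s = pd.Series([1.5, 2.5, 3.5], name="x")

relabelled = series_set_index(s, [list("abc")])                 # single level
two_levels = series_set_index(s, [list("abc"), [10, 20, 30]])   # MultiIndex
appended = series_set_index(s, [list("abc")], append=True)      # keep the old index as a level

# With a single array, the result should match Series.set_axis.
same_as_set_axis = relabelled.equals(s.set_axis(list("abc")))
```

Requiring a list of arrays is only one possible design; it avoids the flat-list ambiguity at the cost of a slightly clunkier call for the single-level case.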

Since I can't tag @pandas-dev/pandas-core, here are a few individual tags: @jreback @TomAugspurger @jorisvandenbossche @gfyoung @WillAyd @jbrockmendel @jschendel @toobaz.

EDIT: Forgot to add an xref from @jreback:

@h-vetinari we had quite some discussion about this: #14829 and never reached resolution. This is an API question.

In that issue, there's discussion largely around .rename, and how to make that method more consistent. Also discussed was potentially introducing .relabel, as well as .set_columns.