API: capabilities of df.set_index · Issue #24046 · pandas-dev/pandas
This is coming out of a discussion that has stalled #22225 (which is about adding .set_index to Series, see #21684). The discussion has shifted away from what capabilities a putative Series.set_index should have, to what capabilities df.set_index has currently.
The main issue (for @jreback) is that df.set_index takes arrays:
@jreback: There were several attempts to have DataFrame.set_index take an array as well, but these never got off the ground.
@h-vetinari: I'm not sure when, but they certainly did get off the ground:
>>> import pandas as pd
>>> import numpy as np
>>> pd.__version__
'0.23.4'
>>>
>>> df = pd.DataFrame(np.random.randint(0, 10, (4, 4)), columns=list('abcd'))
>>> df.set_index(['a',           # label
...               df.index,      # Index
...               df.b ** 2,     # Series
...               df.b.values,   # ndarray
...               list('ABCD'),  # list
...               'c'])          # label again
              b  d
a   b      c
0 0 0  2 A 1  0  2
8 1 1  4 B 4  1  4
3 2 25 5 C 8  5  5
0 3 9  7 D 2  3  7
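For a reproducible variant of the session above, the sketch below swaps the random data for fixed values (the same values that appear in the later examples in this issue) and inspects the structure of the result instead of printing the frame. It assumes a recent pandas release; the variable name `mixed` is only illustrative.

```python
import pandas as pd

# Fixed data instead of np.random.randint, so the result is reproducible.
df = pd.DataFrame({"a": [0, 8, 3, 0], "b": [0, 1, 5, 3],
                   "c": [1, 4, 8, 2], "d": [2, 4, 5, 7]})

mixed = df.set_index(["a",           # column label
                      df.index,      # Index
                      df.b ** 2,     # Series (keeps its name, 'b')
                      df.b.values,   # ndarray (unnamed level)
                      list("ABCD"),  # list (unnamed level)
                      "c"])          # column label again

print(mixed.index.nlevels)      # one index level per key
print(list(mixed.index.names))  # labels and the Series contribute names
print(list(mixed.columns))      # columns passed as labels are dropped
```

Only the keys given as column labels ('a' and 'c') are removed from the columns; the Series, ndarray and list keys leave the frame's columns untouched.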
Further on:
@jreback: @h-vetinari you are confusing the purpose of .set_axis. [...] The problem with .set_index on a DataFrame with an array is that it technically can work with an array and not keys. (meaning its not unambiguous)
I don't think I am confusing them. If I want to set the .index attribute of a Series/DataFrame, then using .set_index is the most reasonable name by far. If anything, set_axis should be a superset of set_index (and a putative set_columns), that just switches between the two based on the axis kwarg.
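As a minimal sketch of how the two methods overlap today, the snippet below passes the same array of labels to both and compares the results. It assumes a recent pandas release in which set_axis returns a new object; the data and variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 8, 3, 0], "b": [0, 1, 5, 3],
                   "c": [1, 4, 8, 2], "d": [2, 4, 5, 7]})

new_labels = df["a"].to_numpy()  # an array of labels, not a column key

via_set_axis = df.set_axis(new_labels, axis=0)
via_set_index = df.set_index(new_labels)

# Both calls replace the row labels and leave the columns alone.
print(via_set_axis.index.equals(via_set_index.index))
print(list(via_set_axis.columns) == list(via_set_index.columns))
```

The difference only shows up when the argument is a column label such as 'a': set_index looks the label up in the columns, whereas set_axis always treats its argument as the new labels themselves.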
More than that, the current capabilities of df.set_index are a proper superset of df.set_axis(axis=0)**, in that it's possible to fill keys with only Series/Index/ndarray/list etc.:
>>> df.set_index(pd.Index(df.a)) # same result as Series directly below
>>> df.set_index(df.a)
   a  b  c  d
a
0  0  0  1  2
8  8  1  4  4
3  3  5  8  5
0  0  3  2  7
>>> df.set_index(df.a.values) # same result as list directly below
>>> df.set_index([[0, 8, 3, 0]])
   a  b  c  d
0  0  0  1  2
8  8  1  4  4
3  3  5  8  5
0  0  3  2  7
** There is one caveat, in that lists (and only lists, out of all containers) need to be wrapped in another list, i.e. df.set_index([[0, 8, 3, 0]]) instead of df.set_index([0, 8, 3, 0]). This is the heart of the ambiguity that @jreback mentioned above (because a list is interpreted as a list of column keys).
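The list caveat can be reproduced with a short sketch. It assumes a recent pandas release and the same fixed data as the examples above; the names `bare_list_error` and `wrapped` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 8, 3, 0], "b": [0, 1, 5, 3],
                   "c": [1, 4, 8, 2], "d": [2, 4, 5, 7]})

# A bare list is read as a list of column keys; 0, 8 and 3 are not
# column labels of df, so the lookup fails.
try:
    df.set_index([0, 8, 3, 0])
    bare_list_error = None
except KeyError as err:
    bare_list_error = err
print(type(bare_list_error).__name__)

# Wrapping the list once more makes it a single array-like key.
wrapped = df.set_index([[0, 8, 3, 0]])
print(wrapped.index.tolist())
```

If the frame happened to have integer column labels 0, 8 and 3, the bare list would instead be resolved as column keys, which is exactly why the two readings cannot be told apart from the argument alone.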
Summing up:
- set_index is the most natural name for setting the .index attribute.
- df.set_index should be able to process list-likes (as it currently does; this is the source of the ambiguity of the list case).
- df.set_axis should be able to do everything that df.set_index does, and just switch between operating on index/columns based on the axis kwarg (after all, index and columns are the two axes of a DF).
  - It could be considered to add a method set_columns on a DataFrame.
  - The axis kwarg of set_axis should just switch between the behaviour of set_index (i.e. dealing with keys and array-likes) and set_columns.
- Series.set_index should support the same signature as df.set_index, with the exception of the drop keyword (which only makes sense for column labels).
  - For Series, the set_index and set_axis methods should be exactly the same.
Since I can't tag @pandas-dev/pandas-core, here are a few individual tags: @jreback @TomAugspurger @jorisvandenbossche @gfyoung @WillAyd @jbrockmendel @jschendel @toobaz.
EDIT: Forgot to add an xref from @jreback:
@h-vetinari we had quite some discussion about this: #14829 and never reached resolution. This is an API question.
In that issue, there's discussion largely around .rename, and how to make that method more consistent. Also discussed was potentially introducing .relabel, as well as .set_columns.