ENH: Preserve .names in df.set_index(df.index) by qwhelan · Pull Request #6459 · pandas-dev/pandas
closes #6452.
This causes a slight change in behavior in @jseabold's second example. Previously, df.set_index(df.index) would convert a MultiIndex into an Index of tuples:
In [7]: import statsmodels.api as sm
In [8]: data = sm.datasets.grunfeld.load_pandas().data
In [9]: data = data.set_index(['firm', 'year'])
In [10]: data.set_index(data.index).index.names
Out[10]: [None]
In [11]: data
Out[11]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 220 entries, (General Motors, 1935.0) to (American Steel, 1954.0)
Data columns (total 3 columns):
invest 220 non-null values
value 220 non-null values
capital 220 non-null values
dtypes: float64(3)
In [13]: data.set_index(data.index)
Out[13]:
<class 'pandas.core.frame.DataFrame'>
Index: 220 entries, (General Motors, 1935.0) to (American Steel, 1954.0)
Data columns (total 3 columns):
invest 220 non-null values
value 220 non-null values
capital 220 non-null values
dtypes: float64(3)
This change makes it so the index remains a MultiIndex.
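For readers without statsmodels installed, the before/after behavior can be sketched on a tiny frame. This is only an illustration under stated assumptions: the frame below is a made-up stand-in for the Grunfeld data (the values are arbitrary), `to_flat_index()` is used merely to emulate the old tuple-Index result, and the final behavior shown is what a recent pandas release is expected to do rather than the exact code in this patch.

```python
import pandas as pd

# Made-up stand-in for the Grunfeld data; the numeric values are arbitrary.
df = pd.DataFrame(
    {
        "firm": ["General Motors", "General Motors", "American Steel"],
        "year": [1935.0, 1936.0, 1954.0],
        "invest": [1.0, 2.0, 3.0],
    }
).set_index(["firm", "year"])

# Emulate the old result: a flat Index of tuples with no level names.
flat = df.index.to_flat_index()
print(type(flat).__name__, list(flat.names))

# Behavior this PR aims for: the MultiIndex and its names survive.
result = df.set_index(df.index)
print(type(result.index).__name__, list(result.index.names))
```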
this is not the way to fix this. instead, catch the case of a passed Index and treat it like a Series rather than the ndarray case
Treating it like a Series doesn't fix the issue described above. What should df.set_index([df.index, df.index]) return when the index is a MultiIndex? Currently (and in this patch), this would create an Index of two pairs of tuples.
The more correct behavior, in my opinion, would be to return a 4-level MultiIndex. This would require either treating the MultiIndex as a DataFrame here or modifying from_arrays to detect this case (presumably undesirable).
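The 4-level outcome argued for above can be sketched as follows. Assumptions for this illustration: the small frame is made up, and the second copy of the index is given different level names (`firm_2`, `year_2`, both hypothetical) purely to keep the resulting names distinct; it reflects how a recent pandas release is expected to expand each MultiIndex into its levels, not necessarily the implementation in this PR.

```python
import pandas as pd

# Made-up frame with a 2-level MultiIndex.
df = pd.DataFrame(
    {"firm": ["A", "A", "B"], "year": [1935, 1936, 1935], "invest": [1.0, 2.0, 3.0]}
).set_index(["firm", "year"])

# Second copy of the same index, renamed so the level names stay distinct
# (the new names are hypothetical, chosen only for this example).
second = df.index.set_names(["firm_2", "year_2"])

# Passing two MultiIndexes: each one contributes its own levels.
result = df.set_index([df.index, second])
print(result.index.nlevels, list(result.index.names))
```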
@jreback This newest commit is what I'm thinking. Let me know if there's a more elegant way to get the columns out of a MultiIndex.
@jreback, thanks for the suggestions. Most recent commit has those changes.
looks good. can you add a note to release.rst and v0.14.0.txt, both in the API sections, that references this issue and gives a short explanation (prob just a 1-liner; if you think more is warranted you can do this in v0.14.0.txt with an example, but only if it's not clear what the change does)
@jreback Added to notes and rebased. Sorry for the delay - I've been sick the last few days.
df = pd.util.testing.makeDataFrame()
df.index.name = 'name'
assert df.set_index(df.index).index.names == ['name']
use self.assertEquals for these rather than a bare assert
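A rough sketch of what that review suggestion could look like as a standalone test. Assumptions: the class and method names are hypothetical, the DataFrame is built by hand instead of with makeDataFrame, and `assertEqual` is used because `assertEquals` was a deprecated alias that Python 3.12 removed.

```python
import unittest

import pandas as pd


class TestSetIndexPreservesNames(unittest.TestCase):  # hypothetical test class
    def test_single_index_name(self):
        # Hand-built frame with a named, single-level index.
        df = pd.DataFrame(
            {"a": [1, 2, 3]},
            index=pd.Index(["x", "y", "z"], name="name"),
        )
        # A unittest assertion reports both values on failure,
        # unlike a bare assert.
        self.assertEqual(list(df.set_index(df.index).index.names), ["name"])
```

The class can be run with `python -m unittest` or any unittest-compatible runner.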
@jreback Added an example to whatsnew. Let me know if you'd like this branch squashed/rebased.
looks good
pls squash down to 1-2 commits and good to merge
Preserve .names in df.set_index(df.index)
Check that df.set_index(df.index) doesn't convert a MultiIndex to an Index
Handle general case of df.set_index([df.index,...])
Cleanup
Add to release notes
Add equality checks
Fix issue on 2.6
Add example to whatsnew
Alright, squashed and rebased.
jreback added a commit that referenced this pull request
ENH: Preserve .names in df.set_index(df.index)