ER/DOC: Sorting in multi-index columns: misleading error message, unclear docs · Issue #4370 · pandas-dev/pandas
related #739
Have a look at this example:
import pandas as pd
import numpy as np
from StringIO import StringIO
print "Pandas version %s\n\n" % pd.__version__

data1 = """idx,metric
0,2.1
1,2.5
2,3"""

data2 = """idx,metric
0,2.7
1,2.2
2,2.8"""

df1 = pd.read_csv(StringIO(data1))
df2 = pd.read_csv(StringIO(data2))
concatenated = pd.concat([df1, df2], ignore_index=True)
merged = concatenated.groupby("idx").agg([np.mean, np.std])

print merged
print merged.sort('metric')
and its output:
$ python test.py
Pandas version 0.11.0
     metric
       mean       std
idx
0      2.40  0.424264
1      2.35  0.212132
2      2.90  0.141421
Traceback (most recent call last):
  File "test.py", line 22, in <module>
    print merged.sort('metric')
  File "/***/Python-2.7.3/lib/python2.7/site-packages/pandas/core/frame.py", line 3098, in sort
    inplace=inplace)
  File "/***/Python-2.7.3/lib/python2.7/site-packages/pandas/core/frame.py", line 3153, in sort_index
    % str(by))
ValueError: Cannot sort by duplicate column metric
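For anyone reproducing this today, a quick look at the column index shows what the label metric actually refers to. This is only a sketch, assuming a recent pandas on Python 3; the variable names csv1, csv2, frames, column_keys and sub_frame are illustrative and not part of the original script.

```python
import io

import pandas as pd

# Same data as in the report, written as plain strings.
csv1 = "idx,metric\n0,2.1\n1,2.5\n2,3\n"
csv2 = "idx,metric\n0,2.7\n1,2.2\n2,2.8\n"

frames = [pd.read_csv(io.StringIO(text)) for text in (csv1, csv2)]
merged = pd.concat(frames, ignore_index=True).groupby("idx").agg(["mean", "std"])

# Each column of the aggregated frame is keyed by a (top level, sub-level) tuple.
column_keys = merged.columns.tolist()
print(column_keys)

# Selecting only the top-level label returns a DataFrame with both
# sub-columns, not a single Series, so it cannot serve as one sort key.
sub_frame = merged["metric"]
print(type(sub_frame).__name__, list(sub_frame.columns))
```

The tuples printed first are the keys that a sort call needs in order to pick exactly one column.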
The problem here is not that there is a duplicate column metric, as the error message states. The problem is that metric is only the top level of the column index and still has two sub-levels (mean and std), so it does not name a single column. The solution in this case is to use
merged.sort([('metric', 'mean')])
for sorting by the mean of the metric. It took me quite a while to figure this out. First of all, the error message should be clearer in this case. Second, maybe I just missed it, but I could not find the solution in the docs, only in a thread on StackOverflow. It looks like the error message above is the result of an over-generalized condition around https://github.com/pydata/pandas/blob/v0.12.0rc1/pandas/core/frame.py#L3269
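DataFrame.sort no longer exists in current pandas, so the same fix is sketched here with DataFrame.sort_values. This assumes a recent pandas on Python 3 and is not the code from the original report; the names by_mean and by_std are illustrative.

```python
import io

import pandas as pd

data1 = "idx,metric\n0,2.1\n1,2.5\n2,3\n"
data2 = "idx,metric\n0,2.7\n1,2.2\n2,2.8\n"

df1 = pd.read_csv(io.StringIO(data1))
df2 = pd.read_csv(io.StringIO(data2))
concatenated = pd.concat([df1, df2], ignore_index=True)
merged = concatenated.groupby("idx").agg(["mean", "std"])

# Name the column by its full (top level, sub-level) tuple so that
# exactly one column is selected as the sort key.
by_mean = merged.sort_values(("metric", "mean"))
print(by_mean)

# The same pattern works for the other sub-level.
by_std = merged.sort_values(("metric", "std"), ascending=False)
print(by_std)
```

Passing the tuple directly, or inside a list as in the original fix, selects a single column, which is what the sort needs.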