ER/DOC: Sorting in multi-index columns: misleading error message, unclear docs

related #739

Have a look at this example:

import pandas as pd
import numpy as np
from StringIO import StringIO
print "Pandas version %s\n\n" % pd.__version__

data1 = """idx,metric
0,2.1
1,2.5
2,3"""

data2 = """idx,metric
0,2.7
1,2.2
2,2.8"""

df1 = pd.read_csv(StringIO(data1))
df2 = pd.read_csv(StringIO(data2))
concatenated = pd.concat([df1, df2], ignore_index=True)
merged = concatenated.groupby("idx").agg([np.mean, np.std])

print merged
print merged.sort('metric')

and its output:

$ python test.py 
Pandas version 0.11.0


     metric          
       mean       std
idx                  
0      2.40  0.424264
1      2.35  0.212132
2      2.90  0.141421
Traceback (most recent call last):
  File "test.py", line 22, in <module>
    print merged.sort('metric')
  File "/***/Python-2.7.3/lib/python2.7/site-packages/pandas/core/frame.py", line 3098, in sort
    inplace=inplace)
  File "/***/Python-2.7.3/lib/python2.7/site-packages/pandas/core/frame.py", line 3153, in sort_index
    % str(by))
ValueError: Cannot sort by duplicate column metric

The error message is misleading: there is no duplicate column named metric. The actual problem is that the columns form a multi-index, so metric still has two sub-levels (mean and std) and does not identify a single column. The solution in this case is to use

merged.sort([('metric', 'mean')])

to sort by the mean of the metric. It took me quite a while to figure this out. First, the error message should be clearer in this case. Second, perhaps I overlooked it, but I could not find the solution in the docs, only in a thread on StackOverflow. The error message above appears to result from an over-generalized condition around https://github.com/pydata/pandas/blob/v0.12.0rc1/pandas/core/frame.py#L3269
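For readers on a newer setup, here is a minimal sketch of the same workaround. It assumes Python 3 and a pandas release in which `DataFrame.sort` has been replaced by `DataFrame.sort_values`, so it is not the exact code for the 0.11.0 version reported above. The aggregations are passed as the strings "mean" and "std" instead of the NumPy functions, which is a substitution on my part; the variable names `columns`, `by_mean`, and `order` are illustrative only.

```python
from io import StringIO

import pandas as pd

data1 = "idx,metric\n0,2.1\n1,2.5\n2,3"
data2 = "idx,metric\n0,2.7\n1,2.2\n2,2.8"

df1 = pd.read_csv(StringIO(data1))
df2 = pd.read_csv(StringIO(data2))
concatenated = pd.concat([df1, df2], ignore_index=True)

# Aggregating with a list of functions yields two-level (multi-index) columns.
merged = concatenated.groupby("idx").agg(["mean", "std"])

# Each column is addressed by a full tuple, one element per column level.
columns = merged.columns.tolist()

# Sort by the complete key ('metric', 'mean') rather than the top level alone.
by_mean = merged.sort_values(by=[("metric", "mean")])
order = by_mean.index.tolist()

print(columns)
print(by_mean)
```

Printing `merged.columns` first is a quick way to see which tuples are valid sort keys when the column structure is not obvious.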