BUG: different behaviors of sort_index() and sort_index(level=0) · Issue #13431 · pandas-dev/pandas

Inspired by some bug reports around MultiIndex sortedness (http://stackoverflow.com/questions/31427466/ensuring-lexicographical-sort-in-pandas-multiindex, #10651, #9212), I found that sort_index() sometimes fails to make a MultiIndex ready for slicing, while sort_index(level=0) and sortlevel() both succeed.

In [119]: pd.__version__
Out[119]: u'0.18.1'

In [120]: df = pd.DataFrame({'col1': ['b','d','b','a'], 'col2': [3,1,1,2], 'data':['one','two','three','four']})

In [121]: df2 = df.set_index(['col1','col2'])

In [122]: df2.index.set_levels(['b','d','a'], level='col1', inplace=True)

In [123]: df2.index.set_labels([0,1,0,2], level='col1', inplace=True)

In [124]: df2.sortlevel()
Out[124]:
            data
col1 col2
b    1     three
     3       one
d    1       two
a    2      four

In [125]: df2.sort_index()
Out[125]:
            data
col1 col2
a    2      four
b    1     three
     3       one
d    1       two

In [126]: df2.sort_index(level=0)
Out[126]:
            data
col1 col2
b    1     three
     3       one
d    1       two
a    2      four

While df2.sort_index() does give a visually lexicographically sorted output, it DOES NOT support slicing.

df2.sort_index().loc['b':'d']

KeyError Traceback (most recent call last)

a lot of lines omitted here.

/Users/yimengzh/miniconda/envs/cafferc3/lib/python2.7/site-packages/pandas/indexes/multi.py in _partial_tup_index(self, tup, side)
   1488                 raise KeyError('Key length (%d) was greater than MultiIndex'
   1489                                ' lexsort depth (%d)' %
-> 1490                                (len(tup), self.lexsort_depth))
   1491
   1492         n = len(tup)

KeyError: 'Key length (1) was greater than MultiIndex lexsort depth (0)'
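To make the "lexsort depth (0)" in the error concrete, here is a minimal pure-Python sketch. It does not call pandas; it assumes that lexsort depth is computed on the integer labels rather than on the level values, and that col2 has levels [1, 2, 3]. The function name and the label arrays are hypothetical and written out by hand from the outputs above.

```python
def lexsort_depth(label_arrays):
    """Largest k such that the rows are sorted by the first k label arrays."""
    for k in range(len(label_arrays), 0, -1):
        rows = list(zip(*label_arrays[:k]))
        if all(a <= b for a, b in zip(rows, rows[1:])):
            return k
    return 0

# Assumed levels: col1 -> ['b', 'd', 'a'], col2 -> [1, 2, 3].
# Row order after df2.sort_index(): (a,2), (b,1), (b,3), (d,1)
after_plain_sort = [[2, 0, 0, 1], [1, 0, 2, 0]]
# Row order after df2.sort_index(level=0): (b,1), (b,3), (d,1), (a,2)
after_level_sort = [[0, 0, 1, 2], [0, 2, 0, 1]]

print(lexsort_depth(after_plain_sort))  # 0
print(lexsort_depth(after_level_sort))  # 2
```

Under these assumptions, the visually sorted frame has depth 0 because its first label array starts with 2 and then drops to 0, which matches the message in the traceback.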

So I have two questions.

  1. Is this the intended behavior? I thought level=0 and level=None were synonyms, but they are not. Looking at the code (https://github.com/pydata/pandas/blob/4de83d25d751d8ca102867b2d46a5547c01d7248/pandas/core/frame.py#L3245-L3247), there is indeed special processing when level is not None.
  2. What does "lexicographically sorted" mean? I think it should mean sorted in terms of the integer labels (positions within each level), not the level values themselves. Is this what "lexicographically sorted" means in the doc for Advanced Indexing? If so, then I think sort_index(level=0) is correct, yet sort_index() is not.
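The two candidate meanings of "sorted" in the second question can be illustrated with a small pure-Python model of the index from the report. This is only a sketch: it does not call pandas, and the variable and function names are hypothetical.

```python
# First index level as stored after set_levels / set_labels in the report.
levels = ['b', 'd', 'a']   # unique values, in stored (unsorted) order
labels = [0, 1, 0, 2]      # per-row integer positions into `levels`
col2 = [3, 1, 1, 2]        # second index level, kept as plain values here

rows = list(zip(labels, col2))

# Ordering by integer label, i.e. by position within `levels`.
by_label = sorted(rows)
# Ordering by the decoded level value.
by_value = sorted(rows, key=lambda r: (levels[r[0]], r[1]))

def decode(pairs):
    """Replace each integer label with the level value it points to."""
    return [(levels[label], c) for label, c in pairs]

print(decode(by_label))  # [('b', 1), ('b', 3), ('d', 1), ('a', 2)]
print(decode(by_value))  # [('a', 2), ('b', 1), ('b', 3), ('d', 1)]
```

The label ordering reproduces Out[124] and Out[126], and the value ordering reproduces Out[125], which suggests (but does not prove) that the two code paths sort on different keys.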

Thanks.

output of pd.show_versions()

In [130]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.23.4
numpy: 1.11.0
scipy: 0.17.0
statsmodels: None
xarray: None
IPython: 4.2.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None