API: unified sorting · Issue #8239 · pandas-dev/pandas

originally #5190
xref #9816
xref #3942

This issue proposes a unified API for the Series and DataFrame sorting methods. Panels are not addressed (yet), but a unified API should be easy to extend to them. Related issues: #2094, #5190, #6847, #7121, #2615. As discussion proceeds, this post will be edited.

For reference, the 0.14.1 signatures are:

Series.sort(axis=0, ascending=True, kind='quicksort', na_position='last', inplace=True)
Series.sort_index(ascending=True)
Series.sortlevel(level=0, ascending=True, sort_remaining=True)

DataFrame.sort(columns=None, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
DataFrame.sort_index(axis=0, by=None, ascending=True, inplace=False, kind='quicksort', na_position='last')
DataFrame.sortlevel(level=0, axis=0, ascending=True, inplace=False, sort_remaining=True)

Proposed unified signature for Series.sort and DataFrame.sort (except Series version retains current inplace=True):

def sort(self, by=None, axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True):
     """Sort by labels (along either axis), by the values in column(s) or both.

     If both, labels take precedence over columns. If neither is specified,
     behavior is object-dependent: series = on values, dataframe = on index.

     Parameters
     ----------
     by : column name or list of column names
         if not None, sort on values in specified column name; perform nested
         sort if list of column names specified. this argument ignored by series
     axis : {0, 1}
         sort index/rows (0) or columns (1); for Series, only 0 can be specified
     level : int or level name or list of ints or list of level names
         if not None, sort on values in specified index level(s)
     ascending : bool or list of bool
         Sort ascending vs. descending. Specify list for multiple sort orders.
     inplace : bool
         if True, perform operation in-place (without creating new instance)
     kind : {'quicksort', 'mergesort', 'heapsort'}
         Choice of sorting algorithm. See np.sort for more information.
         'mergesort' is the only stable algorithm. For data frames, this option is
         only applied when sorting on a single column or label.
     na_position : {'first', 'last'}
         'first' puts NaNs at the beginning, 'last' puts NaNs at the end
     sort_remaining : bool
         if true and sorting by level and index is multilevel, sort by other levels
         too (in order) after sorting by specified level
     """

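To make the interaction of `by`, `ascending`, and `na_position` concrete, here is a minimal pure-Python sketch of a nested sort over rows stored as dicts. It is not pandas code: `sort_rows`, `_is_missing`, and the sample data are hypothetical, and the sketch assumes that missing values are positioned per key regardless of sort direction, which is only one possible reading of the docstring above.

```python
import math


def _is_missing(value):
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def sort_rows(rows, by, ascending=True, na_position="last"):
    """Hypothetical helper: nested sort of dict rows on one or more keys."""
    keys = [by] if isinstance(by, str) else list(by)
    if isinstance(ascending, bool):
        orders = [ascending] * len(keys)
    else:
        orders = list(ascending)
    if len(orders) != len(keys):
        raise ValueError("ascending must match the number of keys in by")
    if na_position not in ("first", "last"):
        raise ValueError("na_position must be 'first' or 'last'")

    result = list(rows)
    # list.sort is stable, so sorting on the least significant key first and
    # the most significant key last produces a nested sort.
    for key, asc in reversed(list(zip(keys, orders))):
        present = [r for r in result if not _is_missing(r[key])]
        missing = [r for r in result if _is_missing(r[key])]
        present.sort(key=lambda r: r[key], reverse=not asc)
        if na_position == "first":
            result = missing + present
        else:
            result = present + missing
    return result


rows = [
    {"name": "a", "x": 2, "y": 1.0},
    {"name": "b", "x": 1, "y": float("nan")},
    {"name": "c", "x": 2, "y": 3.0},
    {"name": "d", "x": 1, "y": 2.0},
]
ordered = sort_rows(rows, by=["x", "y"], ascending=[True, False])
print([r["name"] for r in ordered])  # ['d', 'b', 'c', 'a']
```

Within each value of `x`, rows are ordered by descending `y`, and the row with a missing `y` lands after the others because `na_position` defaults to `'last'`.
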
The sort_index signatures change too and sort_columns is created:

Series.sort_index(level=0, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True)
DataFrame.sort_index(level=0, axis=0, by=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True) # by is DEPRECATED, see change 7

DataFrame.sort_columns(by=None, level=0, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True) # or maybe level=None
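
The `level` and `sort_remaining` arguments in these signatures can be illustrated without pandas by treating a multilevel index as a list of tuples. The function `sort_labels` below is a hypothetical sketch, not the pandas implementation; it assumes integer level positions only, and that `sort_remaining=True` breaks ties using the remaining levels in their original order.

```python
def sort_labels(labels, level=0, ascending=True, sort_remaining=True):
    """Hypothetical helper: sort tuple labels by one or more level positions."""
    levels = [level] if isinstance(level, int) else list(level)
    nlevels = len(labels[0]) if labels else 0
    if sort_remaining:
        # Append the levels that were not named explicitly, in original order.
        levels += [i for i in range(nlevels) if i not in levels]
    # sorted() is stable, so with sort_remaining=False ties keep input order.
    return sorted(
        labels,
        key=lambda label: tuple(label[i] for i in levels),
        reverse=not ascending,
    )


labels = [("b", 2), ("a", 2), ("b", 1), ("a", 1)]
print(sort_labels(labels, level=1))
# [('a', 1), ('b', 1), ('a', 2), ('b', 2)]
print(sort_labels(labels, level=1, sort_remaining=False))
# [('b', 1), ('a', 1), ('b', 2), ('a', 2)]
```

With `sort_remaining=False`, labels that tie on level 1 stay in their input order; with the default, level 0 is used as a tiebreaker.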

Proposed changes:

  1. make inplace=False the default (changes Series.sort); tentative, possibly not until 1.0
  2. new by argument to accept column-name/list-of-column-names in first position
    • deprecate columns keyword of DataFrame.sort, replaced with by (df.sort signature would need to retain columns keyword until finally removed but it's not shown in proposal)
    • don't allow tuples to access levels of multi-index (columns arg of DataFrame.sort allows tuples); use new level argument instead
    • don't swap order of by/axis in DataFrame.sort_index (see change 7)
    • Series ignores this argument, but it also ignores axis, so by gets first position for the sake of working with DataFrames
  3. new level argument to accept integer/level-name/list-of-ints/list-of-level-names for sorting (multi)index by particular level(s)
    • replaces tuple behavior of columns arg of DataFrame.sort
    • add level argument to sort_index in first position so level(s) of multilevel index can be specified; this makes sort_index==sortlevel (see change 8)
    • also adds sort_remaining arg to handle multi-level indexes
  4. new method DataFrame.sort_columns==sort(axis=1) (see syntax below)
  5. deprecate Series.order since change 1 makes Series.sort equivalent (?)
  6. add inplace, kind, and na_position arguments to Series.sort_index (to match DataFrame.sort_index); by and axis args are not added since they don't make sense for series
  7. deprecate and eventually remove by argument from DataFrame.sort_index since it makes sort_index equivalent to sort
  8. deprecate sortlevel since change 3b makes sort_index equivalent
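
Change 2 implies a transition period in which `DataFrame.sort` accepts both `columns` and `by`. One common way to handle such a keyword rename is a small shim that maps the old name onto the new one and emits a warning. The sketch below uses only the standard library; `resolve_by` is a hypothetical name, and the choice of `FutureWarning` is an assumption rather than something the proposal specifies.

```python
import warnings


def resolve_by(by=None, columns=None):
    """Hypothetical shim: map the deprecated `columns` keyword onto `by`."""
    if columns is not None:
        if by is not None:
            raise TypeError("pass either 'by' or the deprecated 'columns', not both")
        warnings.warn(
            "the 'columns' keyword is deprecated, use 'by' instead",
            FutureWarning,
            stacklevel=2,
        )
        by = columns
    return by


# New-style call: passes through silently.
print(resolve_by(by=["a", "b"]))  # ['a', 'b']

# Old-style call: same result, plus a FutureWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(resolve_by(columns=["a", "b"]))  # ['a', 'b']
    print(caught[0].category.__name__)  # FutureWarning
```

The same pattern would apply to the `by` argument of `DataFrame.sort_index` (change 7), with the warning pointing users to `sort` instead.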

Notes:

Syntax:

Comments welcome.