API: unified sorting · Issue #8239 · pandas-dev/pandas
originally #5190
xref #9816
xref #3942
This issue is for creating a unified API to Series & DataFrame sorting methods. Panels are not addressed (yet) but a unified API should be easy to extend to them. Related are #2094, #5190, #6847, #7121, #2615. As discussion proceeds, this post will be edited.
For reference, the 0.14.1 signatures are:
    Series.sort(axis=0, ascending=True, kind='quicksort', na_position='last', inplace=True)
    Series.sort_index(ascending=True)
    Series.sortlevel(level=0, ascending=True, sort_remaining=True)

    DataFrame.sort(columns=None, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
    DataFrame.sort_index(axis=0, by=None, ascending=True, inplace=False, kind='quicksort', na_position='last')
    DataFrame.sortlevel(level=0, axis=0, ascending=True, inplace=False, sort_remaining=True)
Proposed unified signature for `Series.sort` and `DataFrame.sort` (except that the Series version retains the current `inplace=True`):
    def sort(self, by=None, axis=0, level=None, ascending=True, inplace=False,
             kind='quicksort', na_position='last', sort_remaining=True):
        """Sort by labels (along either axis), by the values in column(s) or both.

        If both, labels take precedence over columns. If neither is specified,
        behavior is object-dependent: series = on values, dataframe = on index.

        Parameters
        ----------
        by : column name or list of column names
            if not None, sort on values in specified column name; perform nested
            sort if list of column names specified. this argument ignored by series
        axis : {0, 1}
            sort index/rows (0) or columns (1); for Series, only 0 can be specified
        level : int or level name or list of ints or list of level names
            if not None, sort on values in specified index level(s)
        ascending : bool or list of bool
            Sort ascending vs. descending. Specify list for multiple sort orders.
        inplace : bool
            if True, perform operation in-place (without creating new instance)
        kind : {'quicksort', 'mergesort', 'heapsort'}
            Choice of sorting algorithm. See np.sort for more information.
            'mergesort' is the only stable algorithm. For data frames, this option is
            only applied when sorting on a single column or label.
        na_position : {'first', 'last'}
            'first' puts NaNs at the beginning, 'last' puts NaNs at the end
        sort_remaining : bool
            if True and sorting by level and index is multilevel, sort by other levels
            too (in order) after sorting by specified level
        """
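The precedence rule in the docstring can be sketched in plain Python. The helper below, `resolve_sort_keys`, is hypothetical and not part of pandas. It only returns the ordered list of things a call would sort on, under three assumptions: `level` keys always come before `by` keys, `by` is ignored for series, and a data frame with neither argument falls back to level 0.

```python
def resolve_sort_keys(kind_of_object, by=None, level=None):
    """Return an ordered list of (source, key) pairs describing the sort.

    Hypothetical illustration of the proposed semantics, not pandas code.
    kind_of_object is either "series" or "dataframe".
    """
    def as_list(arg):
        # A scalar name becomes a one-element list; None becomes empty.
        if arg is None:
            return []
        return arg if isinstance(arg, list) else [arg]

    # Labels (index levels) take precedence over columns.
    keys = [("level", lv) for lv in as_list(level)]

    # `by` only makes sense for data frames; a series ignores it.
    if kind_of_object == "dataframe":
        keys += [("column", col) for col in as_list(by)]

    # Neither given: series sort on values, data frames on the index.
    if not keys:
        if kind_of_object == "series":
            keys = [("values", None)]
        else:
            keys = [("level", 0)]
    return keys
```

Under these assumptions, `resolve_sort_keys("dataframe", by=["A", "B"], level="spam")` lists the `spam` level first and the two columns after it, which is the nested order described in the docstring.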
The `sort_index` signatures change too and `sort_columns` is created:
    Series.sort_index(level=0, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True)
    DataFrame.sort_index(level=0, axis=0, by=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True)  # by is DEPRECATED, see change 7

    DataFrame.sort_columns(by=None, level=0, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True)  # or maybe level=None
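Every signature above carries `ascending` and `na_position`. As a rough sketch of how the two options interact, the pure-Python function below sorts a list of floats. `toy_sort_values` is a hypothetical name, and the sketch assumes that NaN placement is decided by `na_position` alone, independent of `ascending`.

```python
import math


def _is_nan(value):
    # Only floats can be NaN; anything else counts as present.
    return isinstance(value, float) and math.isnan(value)


def toy_sort_values(values, ascending=True, na_position="last"):
    """Sort values, placing NaNs at the start or the end.

    Hypothetical illustration, not the pandas implementation.
    """
    if na_position not in ("first", "last"):
        raise ValueError("na_position must be 'first' or 'last'")
    missing = [v for v in values if _is_nan(v)]
    present = sorted((v for v in values if not _is_nan(v)),
                     reverse=not ascending)
    if na_position == "first":
        return missing + present
    return present + missing
```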
Proposed changes:
1. make `inplace=False` the default (changes `Series.sort`); maybe, possibly in 1.0
2. new `by` argument to accept column-name/list-of-column-names in first position
   a. deprecate the `columns` keyword of `DataFrame.sort`, replaced with `by` (the df.sort signature would need to retain the columns keyword until it is finally removed, but that is not shown in the proposal)
   b. don't allow tuples to access levels of a multi-index (the `columns` arg of `DataFrame.sort` allows tuples); use the new `level` argument instead
   c. don't swap the order of `by`/`axis` in `DataFrame.sort_index` (see change 7)
   d. this argument is ignored by series, but `axis` is too, so for the sake of working with dataframes it gets first position
3. new `level` argument to accept integer/level-name/list-of-ints/list-of-level-names for sorting a (multi)index by particular level(s)
   a. replaces the tuple behavior of the `columns` arg of `DataFrame.sort`
   b. add a `level` argument to `sort_index` in first position so level(s) of a multilevel index can be specified; this makes `sort_index` == `sortlevel` (see change 8)
   c. also adds a `sort_remaining` arg to handle multi-level indexes
4. new method `DataFrame.sort_columns` == `sort(axis=1)` (see syntax below)
5. deprecate `Series.order` since change 1 makes `Series.sort` equivalent (?)
6. add `inplace`, `kind`, and `na_position` arguments to `Series.sort_index` (to match `DataFrame.sort_index`); `by` and `axis` args are not added since they don't make sense for series
7. deprecate and eventually remove the `by` argument from `DataFrame.sort_index` since it makes `sort_index` equivalent to `sort`
8. deprecate `sortlevel` since change 3b makes `sort_index` equivalent
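The `columns` to `by` rename in the list above implies a transition period in which both keywords are accepted. One common way to handle that is a small shim that warns and forwards the old keyword. The function below is a hypothetical sketch of that pattern, not code from pandas; the name `resolve_by`, the warning text, and the choice of `FutureWarning` are all assumptions.

```python
import warnings


def resolve_by(by=None, columns=None):
    """Return the effective `by` value, accepting the old `columns` keyword.

    Hypothetical helper illustrating a keyword deprecation.
    """
    if columns is not None:
        if by is not None:
            # Refuse ambiguous calls rather than silently picking one.
            raise TypeError("pass either 'by' or the deprecated 'columns', not both")
        warnings.warn(
            "the 'columns' keyword is deprecated, use 'by' instead",
            FutureWarning,
            stacklevel=2,
        )
        by = columns
    return by
```

A real `DataFrame.sort` would call something like this at the top of the method and then continue with the resolved `by`.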
Notes:
- default behavior of `sort` is still object-dependent: for series, sorts by values and for data frames, sorts by index
- new `level` arg makes `sort_index` and `sortlevel` equivalent. if sortlevel is retained:
  - should rename `sortlevel` to `sort_level` for naming conventions
  - `Series.sortlevel` should have an `inplace` argument added
  - maybe don't add `level` and `sort_remaining` args to `sort_index` so it's not equivalent to `sort_level` (intentionally limiting sort_index seems like a bad idea though)
- it's unclear if the default should be `level=None` for `sort_columns`. probably not since level=None falls back to level=0 anyway
- both `by` and `axis` arguments should be ignored by `Series.sort`
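If the equivalences in these notes hold, the other methods could be thin wrappers that forward to `sort`. The class below is a hypothetical stand-in, not a pandas class: `sort` only records the arguments it receives, so the forwarding can be checked without any data.

```python
class ToyFrame:
    """Hypothetical stand-in that records how `sort` is called."""

    def __init__(self):
        self.calls = []

    def sort(self, by=None, axis=0, level=None, ascending=True):
        # Record the resolved arguments instead of sorting anything.
        self.calls.append(
            {"by": by, "axis": axis, "level": level, "ascending": ascending}
        )
        return self

    def sort_index(self, level=0, axis=0, ascending=True):
        # Proposed sort_index: sort on labels only, level in first position.
        return self.sort(axis=axis, level=level, ascending=ascending)

    def sort_level(self, level=0, axis=0, ascending=True):
        # If retained and renamed, it would duplicate sort_index.
        return self.sort_index(level=level, axis=axis, ascending=ascending)

    def sort_columns(self, by=None, level=0, ascending=True):
        # Proposed sort_columns: same as sort with axis fixed to 1.
        return self.sort(by=by, axis=1, level=level, ascending=ascending)
```

With this layout, `sort_index('spam')`, `sort_level('spam')`, and `sort(level='spam')` record identical calls, which is the redundancy the notes point out.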
Syntax:
- dataframes:
  - `sort()` == `sort(level=0)` == `sort_index()` == `sortlevel()`
    * without columns or level specified, defaults to current behavior of sort on index
  - `sort(['A','B'])`
    * since columns are specified, default index sort should not occur; sorting only happens using columns 'A' and 'B'
  - `sort(level='spam')` == `sort_index('spam')` == `sortlevel('spam')`
    * sort occurs on row index named 'spam' or level of multi-index named 'spam'
  - `sort(['A','B'], level='spam')`
    * `level` controls here even though columns are specified, so sort happens along row index named 'spam' first, then nested sort occurs using columns 'A' and 'B'
  - `sort(axis=1)` == `sort(axis=1, level=0)` == `sort_columns()`
    * since data frames default to sort on index, leaving level=None is the same as level=0
  - `sort(['A','B'], axis=1)` == `sort_columns(['A','B'])`
    * as with preceding example, level=None becomes level=0 in sort_columns
  - `sort(['A','B'], axis=1, level='spam')` == `sort_columns(['A','B'], level='spam')`
    * `axis` controls `level`, so sort will be on columns named 'A' and 'B' in column index named 'spam'
- series:
  - `sort()` == `order()` -- sorts on values
  - with `level` specified, sorts on index/named index/level of multi-index:
    * `sort(level=0)` == `sort_index()` == `sortlevel()`
    * `sort(level='spam')` == `sort_index('spam')` == `sortlevel('spam')`
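The first, second, and fourth data frame cases can be tried on toy data without pandas. In the sketch below a "frame" is just a list of `(index_label, row_dict)` pairs with a single-level row index assumed to be named 'spam'. `toy_sort` is a hypothetical function that mimics the proposed semantics for `axis=0` only.

```python
def toy_sort(rows, by=None, level=None):
    """Sort (index_label, row_dict) pairs following the proposed rules.

    Hypothetical illustration: the index is used when `level` is given or
    when no columns are given; columns in `by` break ties afterwards.
    """
    by = [] if by is None else by
    use_index = level is not None or not by

    def key(item):
        label, values = item
        index_part = (label,) if use_index else ()
        return index_part + tuple(values[col] for col in by)

    # sorted() is stable, so rows with equal keys keep their input order.
    return sorted(rows, key=key)


rows = [
    ("b", {"A": 1, "B": 2}),
    ("a", {"A": 2, "B": 1}),
    ("a", {"A": 1, "B": 3}),
    ("b", {"A": 1, "B": 1}),
]

by_index = toy_sort(rows)                            # like sort()
by_columns = toy_sort(rows, by=["A", "B"])           # like sort(['A','B'])
nested = toy_sort(rows, by=["A", "B"], level="spam") # like sort(['A','B'], level='spam')
```

Here `by_columns` orders the index labels as b, b, a, a because only the column values matter, while `nested` groups the two 'a' rows first and sorts by columns within each label.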
Comments welcome.