BUG: Mergesort is unstable when ascending=False · Issue #6399 · pandas-dev/pandas (original) (raw)
The Issue
When using DataFrame.sort_by(kind='mergesort')
, sorting is supposed to be stable. Unfortunately, that is not the case when ascending=False
.
Create a DataFrame where the first column is already in descending order.
In [96]: df = pd.DataFrame([[2, 'first'], [2, 'second'], [1, 'a'], [1, 'b']], columns=['sort_col', 'order'])
In [97]: df Out[97]: sort_col order 0 2 first 1 2 second 2 1 a 3 1 b
[4 rows x 2 columns]
Look at the 'order' column. Clearly not stable.
In [98]: df.sort_index(by='sort_col', kind='mergesort', ascending=False) Out[98]: sort_col order 1 2 second 0 2 first 3 1 b 2 1 a
[4 rows x 2 columns]
More Info
Inside the sort_by()
source code, argsort()
is called on the sorted column. Then, ascending
is checked and if it is False, the indexes are simply reversed. Here is the relevant code snippet in pandas/core/frame.py
:
k = self[by].values ... indexer = k.argsort(kind=kind) ... if not ascending: indexer = indexer[::-1]
Since numpy always sorts in ascending order, this actually guarantees sorting is always unstable! Check this out:
In [110]: df['sort_col'].values.argsort(kind='mergesort')[::-1] Out[110]: array([1, 0, 3, 2])
Clearly, simply reversing the indices doesn't work. We need to sort a reversed k
, reverse the indices, and then subtract the indices from the highest index so they correspond to the original k
:
In [112]: 3 - df['sort_col'].values[::-1].argsort(kind='mergesort')[::-1] Out[112]: array([0, 1, 2, 3])
The workaround in my code is to stable sort descending is reverse the DataFrame, sort ascending, and reverse again.
What is the best way to fix this? This is probably an easy fix, but I've never contributed to pandas, so I need to set up my fork and make sure I can run tests before working on a pull request.
My Versions
Here are my versions:
In [75]: show_versions()
INSTALLED VERSIONS
Python: 3.3.3.final.0 OS: Linux Release: 3.12.9-2-ARCH Processor: byteorder: little LC_ALL: None LANG: en_US.UTF-8
pandas: 0.13.0 Cython: 0.20 Numpy: 1.8.0 Scipy: 0.13.0 statsmodels: Not installed patsy: Not installed scikits.timeseries: Not installed dateutil: 2.2 pytz: 2013.9 bottleneck: Not installed PyTables: Not Installed numexpr: Not Installed matplotlib: 1.3.1 openpyxl: 1.8.2 xlrd: 0.9.2 xlwt: Not installed xlsxwriter: Not installed sqlalchemy: Not installed lxml: Not installed bs4: Not installed html5lib: Not installed bigquery: Not installed apiclient: Not installed