BUG: Series.argsort() incorrect handling of NaNs · Issue #12694 · pandas-dev/pandas

It appears that Series.argsort() is implemented as s.dropna().argsort().reindex(s.index, fill_value=-1), which is equivalent to np.argsort(s.dropna()).reindex(s.index, fill_value=-1).

There are two problems with this:
(a) Since the result represents integer positions into the original Series s, it should not be indexed by s.index. It should either be a Series with index [0, 1, ...] or, more likely, simply a NumPy array.
(b) Because NaNs are effectively removed before np.argsort() is called, the returned positions refer to the NaN-free Series and are no longer valid for the original Series, resulting in the nonsensical results shown in [9] and [10] below.
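To illustrate point (b) without relying on the behavior of Series.argsort() in any particular pandas version, here is a minimal sketch that calls np.argsort() on the NaN-free values directly. The names dropped, pos, right and wrong are only for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([200, 100, 400, 500, np.nan, 300], index=list("abcdef"))
dropped = s.dropna()

# Positions that sort the NaN-free values; they index into `dropped`, not `s`.
pos = np.argsort(dropped.values)

# Applied to the series they were computed from, the positions sort correctly.
right = dropped.iloc[pos]

# Applied to the original series, every position at or past the NaN is off by one.
wrong = s.iloc[pos]
```

Here right is ordered b, a, f, c, d, while wrong picks up the NaN at 'e' in the slot where 'f' belongs.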

Note that the NumPy documentation says: "As of NumPy 1.4.0 argsort works with real/complex arrays containing nan values." So it's not obvious to me that there's much to be gained from doing anything more elaborate than np.argsort(s.values).
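As a possible workaround in the meantime, the positions can be computed on the underlying array, where NumPy sorts NaN to the end. This is only a sketch of what I would expect, not a proposed implementation. The second half shows one way to exclude NaNs while still returning positions that are valid for the original Series (valid_pos and order_valid are illustrative names).

```python
import numpy as np
import pandas as pd

s = pd.Series([200, 100, 400, 500, np.nan, 300], index=list("abcdef"))

# Positions that sort the raw values; np.argsort places NaN last.
order = np.argsort(s.values)
sorted_s = s.iloc[order]

# If NaNs should be excluded, sort only the non-NaN values and map the
# resulting positions back to positions in the original series.
valid_pos = np.flatnonzero(s.notnull().values)
order_valid = valid_pos[np.argsort(s.values[valid_pos])]
sorted_valid = s.iloc[order_valid]
```

In both cases the positions are used with .iloc, so there is no ambiguity between labels and positions.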

Python 3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 20:20:57) [MSC v.1600 64 bit (AMD64)]

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: s = pd.Series([200, 100, 400, 500, np.nan, 300], index=list('abcdef'))

In [4]: s
Out[4]:
a    200.0
b    100.0
c    400.0
d    500.0
e      NaN
f    300.0
dtype: float64

In [5]: s.argsort()
Out[5]:
a    1
b    0
c    4
d    2
e   -1
f    3
dtype: int64

In [6]: s.dropna().argsort().reindex(s.index, fill_value=-1)
Out[6]:
a    1
b    0
c    4
d    2
e   -1
f    3
dtype: int64

In [33]: np.argsort(s.dropna()).reindex(s.index, fill_value=-1)
Out[33]:
a    1
b    0
c    4
d    2
e   -1
f    3
dtype: int64

In [34]: np.argsort(s.dropna().values)
Out[34]: array([1, 0, 4, 2, 3], dtype=int64)

In [7]: np.argsort(s.values)
Out[7]: array([1, 0, 5, 2, 3, 4], dtype=int64)

In [8]: s[np.argsort(s.values)]  # desired result
Out[8]:
b    100.0
a    200.0
f    300.0
c    400.0
d    500.0
e      NaN
dtype: float64

In [9]: s[s.argsort()]  # nonsensical result
Out[9]:
b    100.0
a    200.0
e      NaN
c    400.0
f    300.0
d    500.0
dtype: float64

In [10]: s[s.argsort().values]  # nonsensical result
Out[10]:
b    100.0
a    200.0
e      NaN
c    400.0
f    300.0
d    500.0
dtype: float64

In [11]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.0
nose: 1.3.7
pip: 8.1.1
setuptools: 20.3.1
Cython: None
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: None
patsy: 0.4.1
dateutil: 2.5.1
pytz: 2016.2
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.5
matplotlib: 1.5.1
openpyxl: 2.3.4
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.4.1
html5lib: 1.0b8
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: None