BUG: Series.argsort() incorrect handling of NaNs · Issue #12694 · pandas-dev/pandas
It appears that Series.argsort() is implemented as s.dropna().argsort().reindex(s.index, fill_value=-1), which is equivalent to np.argsort(s.dropna()).reindex(s.index, fill_value=-1).
There are two problems with this:
(a) Since the result represents integer indices into the original series s, the result should not have the same index as s.index: it should either be a Series with index [0, 1, ...] or, more likely, simply be a NumPy array;
(b) Because NaNs are effectively removed before np.argsort() is called, the resulting indices are no longer valid positions in the original Series, which produces the nonsensical results shown in [9] and [10] below.
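To make problem (b) concrete without depending on any particular pandas version, here is a minimal NumPy-only sketch that emulates the behavior described above. The names mask and emulated are illustrative, and the scatter-with-minus-one step is an assumption based on the description in this report, not a copy of the pandas source.

```python
import numpy as np

values = np.array([200, 100, 400, 500, np.nan, 300])

# Emulate the described behavior (an assumption, not the pandas source):
# argsort only the non-NaN values, write the result back into the
# non-NaN slots, and fill the NaN slot with -1.
mask = ~np.isnan(values)
emulated = np.full(values.shape, -1, dtype=np.intp)
emulated[mask] = np.argsort(values[mask])

# The entries are positions in the 5-element NaN-free array,
# not positions in the 6-element original.
print(emulated.tolist())  # [1, 0, 4, 2, -1, 3]

# Used as positions into the original array they give an unsorted result:
# 4 points at the NaN, and -1 wraps around to the last element (300).
scrambled = values[emulated]
print(scrambled)
```

The scrambled order (100, 200, NaN, 400, 300, 500) matches the values shown in [9] and [10].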
Note that: "As of NumPy 1.4.0 argsort works with real/complex arrays containing nan values." So it's not obvious to me that there's much to be gained from doing anything beyond a plain np.argsort(s.values).
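As a quick check of the NumPy behavior quoted above, the following sketch applies np.argsort directly to an array containing a NaN. It uses the same values as the session in this report; the variable names are illustrative.

```python
import numpy as np

values = np.array([200, 100, 400, 500, np.nan, 300])

# np.argsort accepts NaN and orders it after all other values.
order = np.argsort(values)
print(order.tolist())  # [1, 0, 5, 2, 3, 4]

# These are valid positions into the original array, so indexing with
# them gives the values in ascending order with NaN at the end.
sorted_values = values[order]
print(sorted_values)
```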
Python 3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 20:20:57) [MSC v.1600 64 bit (AMD64)]
Type "copyright", "credits" or "license" for more information.
IPython 4.1.2 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: s = pd.Series([200, 100, 400, 500, np.nan, 300], index=list('abcdef'))
In [4]: s
Out[4]:
a 200.0
b 100.0
c 400.0
d 500.0
e NaN
f 300.0
dtype: float64
In [5]: s.argsort()
Out[5]:
a 1
b 0
c 4
d 2
e -1
f 3
dtype: int64
In [6]: s.dropna().argsort().reindex(s.index, fill_value=-1)
Out[6]:
a 1
b 0
c 4
d 2
e -1
f 3
dtype: int64
In [33]: np.argsort(s.dropna()).reindex(s.index, fill_value=-1)
Out[33]:
a 1
b 0
c 4
d 2
e -1
f 3
dtype: int64
In [34]: np.argsort(s.dropna().values)
Out[34]: array([1, 0, 4, 2, 3], dtype=int64)
In [7]: np.argsort(s.values)
Out[7]: array([1, 0, 5, 2, 3, 4], dtype=int64)
In [8]: s[np.argsort(s.values)] # desired result
Out[8]:
b 100.0
a 200.0
f 300.0
c 400.0
d 500.0
e NaN
dtype: float64
In [9]: s[s.argsort()] # nonsensical result
Out[9]:
b 100.0
a 200.0
e NaN
c 400.0
f 300.0
d 500.0
dtype: float64
In [10]: s[s.argsort().values] # nonsensical result
Out[10]:
b 100.0
a 200.0
e NaN
c 400.0
f 300.0
d 500.0
dtype: float64
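One way to get the desired ordering from [8] while side-stepping the ambiguity between labels and positions is sketched below. It assumes a pandas version that provides Series.to_numpy() (newer than the 0.18.0 used in this report) and is a workaround, not a proposed fix for Series.argsort() itself.

```python
import numpy as np
import pandas as pd

s = pd.Series([200, 100, 400, 500, np.nan, 300], index=list("abcdef"))

# Positions that sort the underlying values; NumPy places NaN last.
order = np.argsort(s.to_numpy())

# .iloc makes it explicit that these are positions, not index labels.
by_position = s.iloc[order]
print(by_position.index.tolist())  # ['b', 'a', 'f', 'c', 'd', 'e']

# sort_values gives the same ordering directly (NaN last by default).
by_sort = s.sort_values()
print(by_sort.index.tolist())  # ['b', 'a', 'f', 'c', 'd', 'e']
```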
In [11]: pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.4.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
pandas: 0.18.0
nose: 1.3.7
pip: 8.1.1
setuptools: 20.3.1
Cython: None
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: None
patsy: 0.4.1
dateutil: 2.5.1
pytz: 2016.2
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.5
matplotlib: 1.5.1
openpyxl: 2.3.4
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.4.1
html5lib: 1.0b8
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.8
boto: None