df.sort_values() not respecting na_position with categoricals · Issue #22556 · pandas-dev/pandas

Problem description

DataFrame.sort_values() appears not to respect the na_position parameter when sorting by a categorical series:

    import numpy as np
    import pandas as pd

    c = pd.Categorical(['A', np.nan, 'B'], categories=['A','B'], ordered=True)
    df = pd.DataFrame({'c': c})

    # NaN sorted first, as requested
    df.sort_values(by='c', na_position='first')
         c
    1  NaN
    0    A
    2    B

    # NaN still sorted first: na_position='last' is ignored
    df.sort_values(by='c', na_position='last')
         c
    1  NaN
    0    A
    2    B

Unexpectedly, the NaNs always come first regardless of na_position.
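The report above can be condensed into a small self-checking script. This is only a sketch: the helper name `sorted_index` is made up for illustration, and the expected orderings are the ones the `na_position` parameter is documented to produce, not output taken from the issue.

```python
import numpy as np
import pandas as pd


def sorted_index(na_position):
    """Return the row labels after sorting the example frame by its categorical column.

    `sorted_index` is a hypothetical helper, not part of pandas.
    """
    c = pd.Categorical(["A", np.nan, "B"], categories=["A", "B"], ordered=True)
    df = pd.DataFrame({"c": c})
    return df.sort_values(by="c", na_position=na_position).index.tolist()


# Orderings that na_position should produce for this frame.
expected = {"first": [1, 0, 2], "last": [0, 2, 1]}

for na_position, want in expected.items():
    got = sorted_index(na_position)
    status = "ok" if got == want else "BUG"
    print(f"na_position={na_position!r}: got {got}, expected {want} -> {status}")
```

On a pandas version affected by this issue, the `na_position='last'` line would be expected to report `BUG`, since the NaN row stays at the top.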

Additional information

Sorting the categorical directly with c.sort_values() (that is, Categorical.sort_values()) works as expected:

    c.sort_values(na_position='first')
    [NaN, A, B]
    Categories (2, object): [A < B]

    c.sort_values(na_position='last')
    [A, B, NaN]
    Categories (2, object): [A < B]

Strangely, df.sort_values() does seem to respect na_position if you sort by more than one column (even the same column listed twice):

    df.sort_values(by=['c','c'], na_position='first')
         c
    1  NaN
    0    A
    2    B

    df.sort_values(by=['c','c'], na_position='last')
         c
    0    A
    2    B
    1  NaN
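Besides listing the column twice, one possible workaround is to sort the column on its own and then reorder the frame by the resulting index. The sketch below assumes that sorting the single column respects `na_position`, as reported above, and that the index labels are unique; `sort_by_categorical` is a hypothetical helper name, not a pandas function.

```python
import numpy as np
import pandas as pd


def sort_by_categorical(df, column, na_position="last"):
    """Sort df by one column via Series.sort_values, then reorder the rows.

    Hypothetical helper. Assumes unique index labels, because the rows
    are selected by label with .loc.
    """
    order = df[column].sort_values(na_position=na_position).index
    return df.loc[order]


c = pd.Categorical(["A", np.nan, "B"], categories=["A", "B"], ordered=True)
df = pd.DataFrame({"c": c})

print(sort_by_categorical(df, "c", na_position="last"))
```

This avoids the single-column DataFrame.sort_values() call entirely, at the cost of an extra label-based lookup.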

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: None
pip: 10.0.1
setuptools: 40.0.0
Cython: None
numpy: 1.15.0
scipy: None
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None