df.sort_values() not respecting na_position with categoricals · Issue #22556 · pandas-dev/pandas
Problem description
DataFrame.sort_values() appears not to respect the na_position parameter when sorting by a categorical series:
    import numpy as np
    import pandas as pd

    c = pd.Categorical(['A', np.nan, 'B'], categories=['A', 'B'], ordered=True)
    df = pd.DataFrame({'c': c})

    df.sort_values(by='c', na_position='first')
         c
    1  NaN
    0    A
    2    B

    df.sort_values(by='c', na_position='last')
         c
    1  NaN
    0    A
    2    B
Unexpectedly, the NaNs always come first regardless of na_position.
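To check whether a given pandas installation shows this behaviour, the reproduction can be turned into a small script that reports where the NaN row lands. This is only a sketch: the variable names `na_first_ok` and `na_last_ok` are made up for illustration, and the second flag will be False on an affected version and True otherwise.

```python
import numpy as np
import pandas as pd

# Same data as in the report: one missing value in an ordered categorical.
c = pd.Categorical(["A", np.nan, "B"], categories=["A", "B"], ordered=True)
df = pd.DataFrame({"c": c})

first = df.sort_values(by="c", na_position="first")
last = df.sort_values(by="c", na_position="last")

# Illustrative flags (not part of pandas): does the NaN row sit where requested?
na_first_ok = bool(first["c"].isna().iloc[0])
na_last_ok = bool(last["c"].isna().iloc[-1])

print("pandas", pd.__version__)
print("na_position='first' respected:", na_first_ok)
print("na_position='last' respected: ", na_last_ok)
```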
Additional information
Calling sort_values() directly on the categorical works as expected:
    c.sort_values(na_position='first')
    [NaN, A, B]
    Categories (2, object): [A < B]

    c.sort_values(na_position='last')
    [A, B, NaN]
    Categories (2, object): [A < B]
Strangely, df.sort_values() does seem to respect na_position if you sort by more than one column (even the same column):
    df.sort_values(by=['c','c'], na_position='first')
         c
    1  NaN
    0    A
    2    B

    df.sort_values(by=['c','c'], na_position='last')
         c
    0    A
    2    B
    1  NaN
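Besides passing the column twice, one possible workaround is to sort on the category codes and place the missing rows explicitly. The sketch below assumes the column has a categorical dtype, where pandas encodes missing values with code -1; `sort_by_categorical` is a hypothetical helper written for this example, not a pandas function.

```python
import numpy as np
import pandas as pd


def sort_by_categorical(df, col, na_position="last"):
    """Hypothetical helper: sort df by a categorical column, placing NaNs explicitly."""
    if na_position not in ("first", "last"):
        raise ValueError("na_position must be 'first' or 'last'")
    # .cat.codes gives each value's position in the category order; NaN is -1.
    # This raises AttributeError if the column is not categorical.
    codes = df[col].cat.codes.to_numpy()
    # A stable sort keeps the original row order among equal codes,
    # and puts the -1 (missing) rows at the front.
    order = np.argsort(codes, kind="stable")
    if na_position == "last":
        n_missing = int((codes == -1).sum())
        # Move the leading block of missing rows to the end.
        order = np.concatenate([order[n_missing:], order[:n_missing]])
    return df.iloc[order]


c = pd.Categorical(["A", np.nan, "B"], categories=["A", "B"], ordered=True)
df = pd.DataFrame({"c": c})

print(sort_by_categorical(df, "c", na_position="first"))
print(sort_by_categorical(df, "c", na_position="last"))
```

Because the helper sorts on integer codes, it follows the declared category order (A < B here) rather than lexical order, which matches what sort_values() does for ordered categoricals.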
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.7.0.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.23.4
pytest: None
pip: 10.0.1
setuptools: 40.0.0
Cython: None
numpy: 1.15.0
scipy: None
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None