DataFrame sort_values and multiple "by" columns fails to order NaT correctly (since v0.19) (original) (raw)
Code Used to Demonstrate Issue
import pandas as pd import numpy as np import datetime as dt import random
print 'Pandas Version', pd.version
now = dt.datetime.today() d1 = [now + dt.timedelta(days=random.randint(0, 30)) for i in range(0,3)] * 2 d2 = [now + dt.timedelta(days=random.randint(0, 30)) for i in range(0,6)] d1[3] = np.nan
print "\nSort on Dates" df_t = pd.DataFrame({'a': d1, 'b':d2}) print "column type:", df_t.a.dtype print "\npresorted:" print df_t print "\nsort by=['a', 'b'], na_position='last':" print df_t.sort_values(by=['a', 'b'], na_position='last') print "\nsort by=['a'], na_position='last':" print df_t.sort_values(by=['a'], na_position='last')
Problem description
Considering sort_values(by=['a', 'b'], na_position='last')
In v0.19.2 this works as expected:
- primary sort on a, putting NaT last
- then secondary sort on b
In v0.20.2 this does not work as expected, with NaT in 'a' appearing first. Sorting only on 'a' does put NaT last.
Expected Output - as observed for v0.19
Sort on Dates
column type: datetime64[ns]
presorted:
a b
0 2017-07-25 12:14:47.705 2017-07-13 12:14:47.705
1 2017-07-06 12:14:47.705 2017-08-01 12:14:47.705
2 2017-07-24 12:14:47.705 2017-07-29 12:14:47.705
3 NaT 2017-07-16 12:14:47.705
4 2017-07-06 12:14:47.705 2017-07-29 12:14:47.705
5 2017-07-24 12:14:47.705 2017-07-13 12:14:47.705
sort by=['a', 'b'], na_position='last':
a b
4 2017-07-06 12:14:47.705 2017-07-29 12:14:47.705
1 2017-07-06 12:14:47.705 2017-08-01 12:14:47.705
5 2017-07-24 12:14:47.705 2017-07-13 12:14:47.705
2 2017-07-24 12:14:47.705 2017-07-29 12:14:47.705
0 2017-07-25 12:14:47.705 2017-07-13 12:14:47.705
3 NaT 2017-07-16 12:14:47.705
sort by=['a'], na_position='last':
a b
1 2017-07-06 12:14:47.705 2017-08-01 12:14:47.705
4 2017-07-06 12:14:47.705 2017-07-29 12:14:47.705
2 2017-07-24 12:14:47.705 2017-07-29 12:14:47.705
5 2017-07-24 12:14:47.705 2017-07-13 12:14:47.705
0 2017-07-25 12:14:47.705 2017-07-13 12:14:47.705
3 NaT 2017-07-16 12:14:47.705
Bad Output - observed for v0.20
Sort on Dates
column type: datetime64[ns]
presorted:
a b
0 2017-08-03 12:13:02.654 2017-07-31 12:13:02.654
1 2017-07-09 12:13:02.654 2017-07-15 12:13:02.654
2 2017-07-27 12:13:02.654 2017-07-17 12:13:02.654
3 NaT 2017-07-15 12:13:02.654
4 2017-07-09 12:13:02.654 2017-07-13 12:13:02.654
5 2017-07-27 12:13:02.654 2017-07-11 12:13:02.654
sort by=['a', 'b'], na_position='last':
a b
3 NaT 2017-07-15 12:13:02.654
4 2017-07-09 12:13:02.654 2017-07-13 12:13:02.654
1 2017-07-09 12:13:02.654 2017-07-15 12:13:02.654
5 2017-07-27 12:13:02.654 2017-07-11 12:13:02.654
2 2017-07-27 12:13:02.654 2017-07-17 12:13:02.654
0 2017-08-03 12:13:02.654 2017-07-31 12:13:02.654
sort by=['a'], na_position='last':
a b
1 2017-07-09 12:13:02.654 2017-07-15 12:13:02.654
4 2017-07-09 12:13:02.654 2017-07-13 12:13:02.654
2 2017-07-27 12:13:02.654 2017-07-17 12:13:02.654
5 2017-07-27 12:13:02.654 2017-07-11 12:13:02.654
0 2017-08-03 12:13:02.654 2017-07-31 12:13:02.654
3 NaT 2017-07-15 12:13:02.654
Output of pd.show_versions()
INSTALLED VERSIONS ------------------ commit: None python: 2.7.13.final.0 python-bits: 64 OS: Windows OS-release: 7 machine: AMD64 processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None
pandas: 0.20.2
pytest: 2.8.5
pip: 8.1.1
setuptools: 35.0.2
Cython: 0.23.4
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 4.1.1
sphinx: 1.4
patsy: 0.4.1
dateutil: 2.5.0
pytz: 2016.3
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: 0.9999999
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None
None