DataFrame sort_values and multiple "by" columns fails to order NaT correctly (since v0.19) (original) (raw)

Code Used to Demonstrate Issue

import pandas as pd import numpy as np import datetime as dt import random

print 'Pandas Version', pd.version

now = dt.datetime.today() d1 = [now + dt.timedelta(days=random.randint(0, 30)) for i in range(0,3)] * 2 d2 = [now + dt.timedelta(days=random.randint(0, 30)) for i in range(0,6)] d1[3] = np.nan

print "\nSort on Dates" df_t = pd.DataFrame({'a': d1, 'b':d2}) print "column type:", df_t.a.dtype print "\npresorted:" print df_t print "\nsort by=['a', 'b'], na_position='last':" print df_t.sort_values(by=['a', 'b'], na_position='last') print "\nsort by=['a'], na_position='last':" print df_t.sort_values(by=['a'], na_position='last')

Problem description

Considering sort_values(by=['a', 'b'], na_position='last')
In v0.19.2 this works as expected:

In v0.20.2 this does not work as expected, with NaT in 'a' appearing first. Sorting only on 'a' does put NaT last.

Expected Output - as observed for v0.19

Sort on Dates
column type: datetime64[ns]

presorted:
                        a                       b
0 2017-07-25 12:14:47.705 2017-07-13 12:14:47.705
1 2017-07-06 12:14:47.705 2017-08-01 12:14:47.705
2 2017-07-24 12:14:47.705 2017-07-29 12:14:47.705
3                     NaT 2017-07-16 12:14:47.705
4 2017-07-06 12:14:47.705 2017-07-29 12:14:47.705
5 2017-07-24 12:14:47.705 2017-07-13 12:14:47.705

sort by=['a', 'b'], na_position='last':
                        a                       b
4 2017-07-06 12:14:47.705 2017-07-29 12:14:47.705
1 2017-07-06 12:14:47.705 2017-08-01 12:14:47.705
5 2017-07-24 12:14:47.705 2017-07-13 12:14:47.705
2 2017-07-24 12:14:47.705 2017-07-29 12:14:47.705
0 2017-07-25 12:14:47.705 2017-07-13 12:14:47.705
3                     NaT 2017-07-16 12:14:47.705

sort by=['a'], na_position='last':
                        a                       b
1 2017-07-06 12:14:47.705 2017-08-01 12:14:47.705
4 2017-07-06 12:14:47.705 2017-07-29 12:14:47.705
2 2017-07-24 12:14:47.705 2017-07-29 12:14:47.705
5 2017-07-24 12:14:47.705 2017-07-13 12:14:47.705
0 2017-07-25 12:14:47.705 2017-07-13 12:14:47.705
3                     NaT 2017-07-16 12:14:47.705

Bad Output - observed for v0.20

Sort on Dates
column type: datetime64[ns]

presorted:
                        a                       b
0 2017-08-03 12:13:02.654 2017-07-31 12:13:02.654
1 2017-07-09 12:13:02.654 2017-07-15 12:13:02.654
2 2017-07-27 12:13:02.654 2017-07-17 12:13:02.654
3                     NaT 2017-07-15 12:13:02.654
4 2017-07-09 12:13:02.654 2017-07-13 12:13:02.654
5 2017-07-27 12:13:02.654 2017-07-11 12:13:02.654

sort by=['a', 'b'], na_position='last':
                        a                       b
3                     NaT 2017-07-15 12:13:02.654
4 2017-07-09 12:13:02.654 2017-07-13 12:13:02.654
1 2017-07-09 12:13:02.654 2017-07-15 12:13:02.654
5 2017-07-27 12:13:02.654 2017-07-11 12:13:02.654
2 2017-07-27 12:13:02.654 2017-07-17 12:13:02.654
0 2017-08-03 12:13:02.654 2017-07-31 12:13:02.654

sort by=['a'], na_position='last':
                        a                       b
1 2017-07-09 12:13:02.654 2017-07-15 12:13:02.654
4 2017-07-09 12:13:02.654 2017-07-13 12:13:02.654
2 2017-07-27 12:13:02.654 2017-07-17 12:13:02.654
5 2017-07-27 12:13:02.654 2017-07-11 12:13:02.654
0 2017-08-03 12:13:02.654 2017-07-31 12:13:02.654
3                     NaT 2017-07-15 12:13:02.654

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 2.7.13.final.0 python-bits: 64 OS: Windows OS-release: 7 machine: AMD64 processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None

pandas: 0.20.2
pytest: 2.8.5
pip: 8.1.1
setuptools: 35.0.2
Cython: 0.23.4
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 4.1.1
sphinx: 1.4
patsy: 0.4.1
dateutil: 2.5.0
pytz: 2016.3
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: 0.9999999
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None
None