DataFrame .duplicated() / .drop_duplicates() flagging unique rows as duplicated in 0.17.0 · Issue #11668 · pandas-dev/pandas

DataFrame.duplicated() and .drop_duplicates() are flagging rows as duplicates when they are in fact distinct.

This was the smallest dataset I could make to recreate the issue, but I've seen this issue on DataFrames of any size:

import pandas as pd

# Parentheses join the two adjacent string literals into one JSON document.
json_data = (
    '{"Col1":{"0":"S2#OaGwWII","1":")A9$rw3W_I","2":"2Ra+_RWII","3":"2RA`4kRWII","4":"2R=K_RWII"},'
    '"Col2":{"0":141105144406,"1":141107294517,"2":141106133624,"3":141108219194,"4":141106133614}}'
)
df = pd.read_json(json_data)
print(df)

         Col1          Col2
0  S2#OaGwWII  141105144406
1  )A9$rw3W_I  141107294517
2   2Ra+_RWII  141106133624
3  2RA`4kRWII  141108219194
4   2R=K_RWII  141106133614

df.duplicated(keep=False)

0    False
1    False
2     True
3    False
4     True
dtype: bool

It also seems to depend on row order / column order; this behavior can be changed by shuffling / sampling rows or columns, e.g.:

df[['Col2', 'Col1']].duplicated(keep=False)

0    False
1    False
2    False
3    False
4    False
dtype: bool
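For anyone who wants to confirm that these rows really are distinct without relying on pandas' own duplicate detection, here is a minimal cross-check sketch. The helper name `duplicated_reference` is made up for illustration and is not part of pandas. It simply counts whole-row tuples in plain Python, and the frame is built directly from the values above rather than through `read_json`.

```python
from collections import Counter

import pandas as pd


def duplicated_reference(frame):
    """Hypothetical helper: pure-Python stand-in for frame.duplicated(keep=False)."""
    # Turn every row into a hashable tuple of its cell values.
    rows = [tuple(row) for row in frame.itertuples(index=False)]
    counts = Counter(rows)
    # A row counts as duplicated only if the identical tuple occurs more than once.
    return pd.Series([counts[row] > 1 for row in rows], index=frame.index)


df = pd.DataFrame({
    "Col1": ["S2#OaGwWII", ")A9$rw3W_I", "2Ra+_RWII", "2RA`4kRWII", "2R=K_RWII"],
    "Col2": [141105144406, 141107294517, 141106133624, 141108219194, 141106133614],
})

expected = duplicated_reference(df)
print(expected.tolist())  # [False, False, False, False, False]

# Compare against pandas itself, in both column orders.
print((df.duplicated(keep=False) == expected).all())
print((df[["Col2", "Col1"]].duplicated(keep=False) == expected).all())
```

The last two lines should both print True on a version without the problem. A False there would point at the same mismatch reported in this issue.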

I only see this behavior on 0.17.0, while 0.16.2 is fine. More details about each environment are below:

pandas 0.17.0 / python 3.4.3 (failing)

>>> pd.util.print_versions.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.18-400.1.1.el5
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US

pandas: 0.17.0
nose: 1.3.4
pip: 7.1.2
setuptools: 18.4
Cython: 0.23.4
numpy: 1.10.1
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 3.2.1
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.4
xlwt: None
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.4.0
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: 0.6.6.None
psycopg2: 2.6 (dt dec pq3 ext)

pandas 0.17.0 / python 3.5 (failing)

>>> pd.util.print_versions.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.0.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.18-348.18.1.el5
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C

pandas: 0.17.0
nose: 1.3.7
pip: 7.1.2
setuptools: 18.4
Cython: 0.23.4
numpy: 1.10.1
scipy: 0.16.0
statsmodels: 0.6.1
IPython: 4.0.0
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.5.0
openpyxl: 1.8.5
xlrd: 0.9.4
xlwt: None
xlsxwriter: 0.7.7
lxml: 3.4.4
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: 0.6.7.None
psycopg2: None

pandas 0.16.2 / python 3.5 (passing)

>>> pd.util.print_versions.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.0.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.18-348.18.1.el5
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C

pandas: 0.16.2
nose: None
Cython: None
numpy: 1.10.1
scipy: None
statsmodels: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.7
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None