DataFrame .duplicated() / .drop_duplicates() flagging unique rows as duplicated in 0.17.1 · Issue #11864 · pandas-dev/pandas
DataFrame.duplicated() flags rows as duplicates when they are in fact distinct. This happens with large DataFrames when calling duplicated(keep=False):
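import numpy as np
import pandas as pd
# The three Series have different lengths, so the shorter columns are padded with NaN.
df = pd.DataFrame({'a': pd.Series(range(1, 100000)),
                   'b': pd.Series(range(10, 1000000)),
                   'c': pd.Series(3 * list(range(2, 200000, 2)))})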
df.head()
np.sum(df.duplicated())
Out[]: 0
np.sum(df.duplicated(keep=False))
Out[]: 110
Changing column order results in different (but still incorrect) behavior.
np.sum(df[['c','b','a']].duplicated(keep=False))
Out[]: 2138
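One way to cross-check results like these is to rebuild the keep=False mask from the two one-sided masks and compare. The following is only a minimal sketch: the helper name `duplicated_any` and the small example frame are hypothetical and not part of the original report, and it assumes `keep='first'` and `keep='last'` themselves behave correctly on the data in question.

```python
import pandas as pd


def duplicated_any(df):
    """Flag every row that has at least one identical row elsewhere in df.

    keep='first' flags all but the first occurrence of each group and
    keep='last' flags all but the last, so their union flags every member
    of a group that contains two or more rows.
    """
    return df.duplicated(keep="first") | df.duplicated(keep="last")


# Small hypothetical frame: rows 0, 1 and 4 are identical, rows 2 and 3 are unique.
small = pd.DataFrame({"a": [1, 1, 2, 3, 1], "b": ["x", "x", "y", "z", "x"]})

expected = duplicated_any(small)
actual = small.duplicated(keep=False)

# Rows where the two approaches disagree; an empty result means they agree.
mismatches = small[expected != actual]
print(len(mismatches))
```

Applied to the large frame from the report, any row in `mismatches` would point at an inconsistency between the keep variants. In particular, if `df.duplicated()` flags nothing, `df.duplicated(keep=False)` should flag nothing either.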
Tested on 0.17.1. Environment details are provided below:
>> pd.util.print_versions.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
pandas: 0.17.1
nose: None
pip: 7.1.2
setuptools: 18.5
Cython: None
numpy: 1.10.1
scipy: 0.16.0
statsmodels: None
IPython: 4.0.0
sphinx: 1.3.3
patsy: None
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.9
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext)
Jinja2: None
This looks like the same kind of problem described in #11668, though the specific examples provided in that issue work properly in 0.17.1.