drop_duplicates slow for data frame of boolean columns · Issue #12963 · pandas-dev/pandas

While trying to find the fastest way to count unique rows in a data frame, I stumbled on an issue where dropping duplicates from a data frame of booleans is quite slow. @jreback asked me to open a ticket.

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: N = 100000

In [4]: np.random.seed(1234)

In [5]: df = pd.DataFrame({str(i) : [bool(x) for x in np.random.randint(0,2,size=N)] for i in range(500)})

In [6]: %timeit df.drop_duplicates()
1 loop, best of 3: 3.01 s per loop

In [7]: %timeit df.astype(int).drop_duplicates()
1 loop, best of 3: 1.4 s per loop
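
For reference, since the original goal was counting unique rows: a minimal sketch of the workaround implied by the timings above, assuming an integer cast of the boolean frame is acceptable (the choice of np.uint8 here is mine, not from the report):

import numpy as np
import pandas as pd

np.random.seed(1234)
N = 100000
df = pd.DataFrame({str(i): np.random.randint(0, 2, size=N).astype(bool)
                   for i in range(500)})

# Hypothetical workaround: cast the boolean frame to a small integer dtype
# before deduplicating, then count the remaining rows.
n_unique_rows = len(df.astype(np.uint8).drop_duplicates())
print(n_unique_rows)
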
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Darwin
OS-release: 15.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0
nose: 1.3.7
pip: 8.1.1
setuptools: 20.3
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
xarray: None
IPython: 4.1.2
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.1
pytz: 2016.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.39.0