PERF: isin() is slower for categorical data than for integers · Issue #20003 · pandas-dev/pandas

Problem description

For long Series with many categories, `Series.isin()` is slower on categorical data than on int64 data. If the categories are built from strings, the slowdown is even larger.

import pandas as pd
import numpy as np

N = 3000000
Ncats = 100

cats = pd.Series(['abcdef%d'%_ for _ in range(Ncats)])

df = pd.DataFrame({
    'A': np.random.randn(N),
    'B': np.random.randn(N),
    'C': np.random.randint(0, Ncats, N),
})
df['D'] = cats.loc[df['C'].values].values
df['E'] = df['C'].astype('category')
df['F'] = df['D'].astype('category')

sel_codes = [1, 2]
sel_cats = cats.loc[sel_codes].values

%timeit inds = df.C.isin(sel_codes)  # int64
%timeit inds = df.E.isin(sel_codes)  # category based on int64
%timeit inds = df.D.isin(sel_cats)   # object / string
%timeit inds = df.F.isin(sel_cats)   # category based on string

On my machine:

6.25 ms ± 412 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
28.7 ms ± 2.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
104 ms ± 4.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
142 ms ± 6.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
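Outside IPython, the `%timeit` measurements above can be approximated with the standard `timeit` module. The following is only a sketch of the same comparison: the helper name `bench_isin`, its parameters, and the smaller default row count are assumptions made for illustration and are not part of the original report.

```python
import timeit

import numpy as np
import pandas as pd


def bench_isin(n_rows=300000, n_cats=100, sel_codes=(1, 2), repeat=3):
    """Hypothetical helper: time Series.isin() for the four dtypes in the report.

    Returns a dict mapping a dtype label to the best wall-clock time in seconds.
    """
    sel_codes = list(sel_codes)
    cats = pd.Series(['abcdef%d' % i for i in range(n_cats)])
    codes = np.random.randint(0, n_cats, n_rows, dtype=np.int64)

    ints = pd.Series(codes)
    strings = pd.Series(cats.values[codes])
    sel_cats = list(cats.values[sel_codes])

    columns = {
        'int64': (ints, sel_codes),
        'category (int64)': (ints.astype('category'), sel_codes),
        'object': (strings, sel_cats),
        'category (str)': (strings.astype('category'), sel_cats),
    }

    results = {}
    for label, (ser, values) in columns.items():
        # Take the minimum over several single runs to reduce noise.
        timings = timeit.repeat(lambda: ser.isin(values), number=1, repeat=repeat)
        results[label] = min(timings)
    return results


if __name__ == '__main__':
    for label, seconds in bench_isin().items():
        print('%-18s %8.2f ms' % (label, seconds * 1000))
```

Absolute numbers will differ by machine and pandas version, so only the relative ordering of the four cases is meaningful.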

Interestingly, when there are many values to compare against, the int64-based categorical is no longer slower than plain int64, e.g. for

sel_codes = range(90)
sel_cats = cats.loc[sel_codes].values

%timeit inds = df.C.isin(sel_codes)  # int64
%timeit inds = df.E.isin(sel_codes)  # category based on int64
%timeit inds = df.D.isin(sel_cats)   # object / string
%timeit inds = df.F.isin(sel_cats)   # category based on string

the timings are:

441 ms ± 61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
422 ms ± 68.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
147 ms ± 7.95 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
171 ms ± 2.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
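Because a categorical column stores small integer codes plus a short list of categories, one possible workaround is to test membership on the categories only and then map that result through the codes. The sketch below is an illustration under that assumption, not a description of how pandas implements `isin()`; the helper name `isin_via_codes` is hypothetical, and it does not try to reproduce the behaviour of `isin()` when `values` itself contains NaN.

```python
import numpy as np
import pandas as pd


def isin_via_codes(ser, values):
    """Hypothetical helper: isin() for a categorical Series via its codes."""
    # Membership test over the (few) categories instead of every row.
    cat_mask = ser.cat.categories.isin(values)
    # Missing values have code -1; the extra trailing False slot means
    # index -1 resolves to False.
    lookup = np.append(cat_mask, False)
    codes = ser.cat.codes.values
    return pd.Series(lookup[codes], index=ser.index)


# Small usage example with one missing value.
s = pd.Series(['a', 'b', None, 'c', 'a']).astype('category')
mask = isin_via_codes(s, ['a', 'c'])
```

The per-row work here is a single integer array lookup, so its cost should not depend on whether the categories are strings or integers.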

P.S. I'm not sure whether performance issues like this are worth filing.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 17.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 28.8.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: 0.7.1
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.1.0
openpyxl: None
xlrd: None
xlwt: 1.3.0
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0b10
sqlalchemy: 1.1.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None