PERF: isin() is slower for categorical data than for integers · Issue #20003 · pandas-dev/pandas
Problem description
For a long Series with many categories, 'Series.isin()' is slower on categorical data than on int64 data. If the categories are built from strings, the performance degradation is even larger.
import pandas as pd
import numpy as np

N = 3000000
Ncats = 100

cats = pd.Series(['abcdef%d' % i for i in range(Ncats)])

df = pd.DataFrame({
    'A': np.random.randn(N),
    'B': np.random.randn(N),
    'C': np.random.randint(0, Ncats, N),
})
df['D'] = cats.loc[df['C'].values].values
df['E'] = df['C'].astype('category')
df['F'] = df['D'].astype('category')

sel_codes = [1, 2]
sel_cats = cats.loc[sel_codes].values

%timeit inds = df.C.isin(sel_codes)  # int64
%timeit inds = df.E.isin(sel_codes)  # category based on int64
%timeit inds = df.D.isin(sel_cats)   # object / string
%timeit inds = df.F.isin(sel_cats)   # category based on string
On my machine:
6.25 ms ± 412 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
28.7 ms ± 2.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
104 ms ± 4.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
142 ms ± 6.88 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
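For context on why this is surprising: a categorical column stores one small integer code per row, so in principle membership could be decided once per category (100 checks here) and then looked up on the codes. Below is a minimal sketch of that idea. It is not how pandas implements `isin`, `isin_via_codes` is a hypothetical helper name, and the sketch treats missing values as non-members. Whether it is actually faster would need to be measured.

```python
import numpy as np
import pandas as pd


def isin_via_codes(series, values):
    """Membership test for a categorical Series done on its integer codes.

    Hypothetical helper for illustration only; not part of pandas.
    """
    cat = series.cat
    # Test each category once (Ncats checks instead of N).
    wanted = cat.categories.isin(values)
    # Positions of the matching categories are exactly the codes to look for.
    wanted_codes = np.flatnonzero(wanted)
    # Missing values have code -1, which never appears in wanted_codes,
    # so this sketch always reports them as non-members.
    mask = np.isin(cat.codes.to_numpy(), wanted_codes)
    return pd.Series(mask, index=series.index, name=series.name)


# Small version of the setup above, used to check the result against Series.isin.
rng = np.random.default_rng(0)
cats = pd.Series(['abcdef%d' % i for i in range(100)])
codes = rng.integers(0, 100, 10_000)
s = pd.Series(cats.loc[codes].values).astype('category')
sel = cats.loc[[1, 2]].values

assert isin_via_codes(s, sel).equals(s.isin(sel))
```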
Interestingly, when there are many values to compare against, the gap mostly closes, and the int64-based categorical is even marginally faster than plain int64, e.g. for
sel_codes = range(90)
sel_cats = cats.loc[sel_codes].values

%timeit inds = df.C.isin(sel_codes)  # int64
%timeit inds = df.E.isin(sel_codes)  # category based on int64
%timeit inds = df.D.isin(sel_cats)   # object / string
%timeit inds = df.F.isin(sel_cats)   # category based on string
the timings are:
441 ms ± 61 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
422 ms ± 68.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
147 ms ± 7.95 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
171 ms ± 2.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
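To reproduce both experiments outside IPython (where `%timeit` is not available), something like the following script could be used. It is only a sketch: `bench` is a hypothetical helper, `N` is reduced so the script finishes quickly, columns A and B are omitted because they are not timed, and it reports the best of a few repeats rather than the mean ± std. dev. that `%timeit` prints, so the numbers are not directly comparable with those above.

```python
import timeit

import numpy as np
import pandas as pd

N = 300_000   # assumption: 10x smaller than in the report, to keep the run short
Ncats = 100

cats = pd.Series(['abcdef%d' % i for i in range(Ncats)])
df = pd.DataFrame({'C': np.random.randint(0, Ncats, N)})
df['D'] = cats.loc[df['C'].values].values
df['E'] = df['C'].astype('category')
df['F'] = df['D'].astype('category')


def bench(sel_codes, repeat=3, number=3):
    """Return the best per-call time in seconds of isin() for each column."""
    sel_codes = list(sel_codes)
    sel_cats = cats.loc[sel_codes].values
    cases = {
        'C (int64)': (df['C'], sel_codes),
        'E (category from int64)': (df['E'], sel_codes),
        'D (object / string)': (df['D'], sel_cats),
        'F (category from string)': (df['F'], sel_cats),
    }
    results = {}
    for label, (col, values) in cases.items():
        # Bind col and values as defaults so each timer uses its own pair.
        timer = timeit.Timer(lambda col=col, values=values: col.isin(values))
        results[label] = min(timer.repeat(repeat=repeat, number=number)) / number
    return results


few = bench([1, 2])       # two values to compare against
many = bench(range(90))   # ninety values to compare against
for label in few:
    print(f'{label:26s} few: {few[label] * 1e3:8.2f} ms   '
          f'many: {many[label] * 1e3:8.2f} ms')
```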
P.S. I'm not sure whether performance issues like this are worth filing.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Darwin
OS-release: 17.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 28.8.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: 0.7.1
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.1.0
openpyxl: None
xlrd: None
xlwt: 1.3.0
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 1.0b10
sqlalchemy: 1.1.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None