DataFrame groupby.first() is much slower than groupby.nth(0) on categorical dtypes · Issue #25397 · pandas-dev/pandas
groupby.first() is much slower than groupby.nth(0) for categorical columns in a very specific way.
Consider the example below, where a dataframe has two columns, c1 and c2. The number of unique values in the c1 column (regardless of its datatype) increases the runtime of first when, and only when, c2 is a categorical column.
import pandas as pd, numpy as np, timeit

def test(N_CATEGORIES, cat_cols=['c2']):
    # Creates dataframe df with columns c1 and c2; every column listed in
    # cat_cols is cast to categorical.
    # Times how long grouping by c1 and calling nth(0) and first takes.
    global df
    print(N_CATEGORIES)
    df = pd.DataFrame({
        'c1': np.arange(0, 10000) % N_CATEGORIES,
        'c2': np.arange(0, 10000)
    })
    for col in cat_cols:
        df[col] = df[col].astype('category')
    # globals=globals() lets timeit see df without depending on the script's file name
    t_nth = timeit.timeit("x1 = df.groupby(['c1']).nth(0, dropna='all')", globals=globals(), number=1)
    t_first = timeit.timeit("x2 = df.groupby(['c1']).first()", globals=globals(), number=1)
    return t_nth, t_first

test_N_categories = [1, 10, 100, 1000, 10000]

# Test when column c2 is categorical
results_c2_cat = pd.DataFrame([test(N, cat_cols=['c2']) for N in test_N_categories], index=test_N_categories, columns=['nth', 'first'])

# Test when column c2 is not categorical
results_c2_not_cat = pd.DataFrame([test(N, cat_cols=[]) for N in test_N_categories], index=test_N_categories, columns=['nth', 'first'])

print(results_c2_cat)
print(results_c2_not_cat)
The results are (times in seconds):

print(results_c2_cat)
            nth     first
1      0.004204  0.010677
10     0.005890  0.015701
100    0.005305  0.052130
1000   0.004365  0.581036
10000  0.004358  2.847709

print(results_c2_not_cat)
            nth     first
1      0.003479  0.001110
10     0.003027  0.000993
100    0.003297  0.001089
1000   0.003297  0.001057
10000  0.003952  0.001382
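To make the scaling easier to read, the reported timings can be turned into first/nth ratios. This is only a small sketch that re-enters the numbers from the categorical run above; the names reported and ratio are illustrative and not part of the original script.

```python
import pandas as pd

# Timings (seconds) copied from the run where c2 is categorical.
reported = pd.DataFrame(
    {
        "nth": [0.004204, 0.005890, 0.005305, 0.004365, 0.004358],
        "first": [0.010677, 0.015701, 0.052130, 0.581036, 2.847709],
    },
    index=[1, 10, 100, 1000, 10000],
)

# How many times slower first() was than nth(0) for each number of groups.
ratio = reported["first"] / reported["nth"]
print(ratio.round(1))
```

The ratio stays in the low single digits for 1 and 10 groups and climbs into the hundreds for 10000 groups, which is the growth described in this report.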
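As shown, when column c2 is categorical, the runtime of first grows very rapidly as a function of the number of unique values in column c1 (and thus the "width" of the groupby): from about 0.01 s with 1 group to about 2.8 s with 10000 groups, while nth(0) stays at roughly 0.004 to 0.006 s throughout. When c2 is not categorical, first stays near 0.001 s regardless of the number of groups.
first and last both suffer from this problem.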
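Until the slowdown is addressed, one possible workaround is to run first on the categorical's integer codes and rebuild the categorical afterwards. The sketch below is an assumption about what might help, not a confirmed fix: first_via_codes is a hypothetical helper, and it has not been benchmarked against the timings in this report. It keeps the semantics of first (first non-null value per group), which nth(0) does not when the data contains missing values.

```python
import numpy as np
import pandas as pd

def first_via_codes(df, by, cat_col):
    """Hypothetical helper: groupby(by).first() with cat_col handled via its codes."""
    cat = df[cat_col].cat
    # Swap the categorical for float codes; code -1 (missing) becomes NaN
    # so that first() still skips missing values.
    codes = cat.codes.astype("float64").where(cat.codes != -1)
    tmp = df.assign(**{cat_col: codes})
    out = tmp.groupby(by).first()
    # Rebuild the categorical column from the selected codes.
    selected = out[cat_col].fillna(-1).astype("int64").to_numpy()
    out[cat_col] = pd.Categorical.from_codes(selected, dtype=df[cat_col].dtype)
    return out

# Usage on a small frame shaped like the one in this report.
df_small = pd.DataFrame({"c1": np.arange(20) % 4, "c2": np.arange(20)})
df_small["c2"] = df_small["c2"].astype("category")
result = first_via_codes(df_small, "c1", "c2")
print(result)
```

The helper handles a single categorical column; a frame with several categorical columns would need the same conversion applied to each of them.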
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.24.1
pytest: None
pip: 19.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.15.3
scipy: 1.1.0
pyarrow: 0.12.0
xarray: 0.11.0
IPython: None
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.6
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.8
feather: 0.4.0
matplotlib: 3.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.2.6
bs4: 4.6.3
html5lib: None
sqlalchemy: 1.2.17
pymysql: None
psycopg2: 2.7.7 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None