DataFrame groupby.first() is much slower than groupby.nth(0) on categorical dtypes · Issue #25397 · pandas-dev/pandas (original) (raw)

groupby.first() is much slower than groupby.nth(0) for categorical columns in a very specific way.

Consider carefully the example below, where a dataframe has two columns c1 and c2. The number of unique values in the c1 column (regardless of its datatype) increases the runtime of first when and only when c2 is a categorical column.

import pandas as pd, numpy as np, timeit

def test(N_CATEGORIES, cat_cols = ['c2']): # creates dataframe df with categorical column c1, optional categorical column c2. # Times how long grouping by c1 and calling nth(0) and first takes global df print(N_CATEGORIES) df = pd.DataFrame({ 'c1':np.arange(0,10000)%N_CATEGORIES, 'c2':np.arange(0,10000) }) for col in cat_cols: df[col] = df[col].astype('category') t_nth = timeit.timeit("x1 =df.groupby(['c1']).nth(0, dropna='all')", setup="from main import df", number=1) t_first = timeit.timeit("x2 = df.groupby(['c1']).first()", setup="from main import df", number=1) return t_nth, t_first

test_N_categories = [1,10,100,1000,10000]

Test when column c2 is categorical

results_c2_cat = pd.DataFrame([test(N, cat_cols=['c2']) for N in test_N_categories], index=test_N_categories, columns=['nth','first'])

Test when column c2 is not categorical

results_c2_not_cat = pd.DataFrame([test(N, cat_cols=[]) for N in test_N_categories], index=test_N_categories, columns=['nth','first'])

print(results_c2_cat) print(results_c2_not_cat)

The results are

print(results_c2_cat) nth first 1 0.004204 0.010677 10 0.005890 0.015701 100 0.005305 0.052130 1000 0.004365 0.581036 10000 0.004358 2.847709 print(results_c2_not_cat) nth first 1 0.003479 0.001110 10 0.003027 0.000993 100 0.003297 0.001089 1000 0.003297 0.001057 10000 0.003952 0.001382

As shown, when the column c2 is categorical, the runtime of first grows very rapidly as a function of the number of unique values in the column c1 (and thus the "width" of the groupby).

first and last both suffer from this problem.

(see also #19283 and #19598)

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.24.1
pytest: None
pip: 19.0.1
setuptools: 39.0.1
Cython: None
numpy: 1.15.3
scipy: 1.1.0
pyarrow: 0.12.0
xarray: 0.11.0
IPython: None
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.6
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.8
feather: 0.4.0
matplotlib: 3.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.2.6
bs4: 4.6.3
html5lib: None
sqlalchemy: 1.2.17
pymysql: None
psycopg2: 2.7.7 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None