BUG: very slow groupby(col1)[col2].value_counts() for columns of type 'category' · Issue #46202 · pandas-dev/pandas

Reproducible Example

EXAMPLE 1

import numpy as np
import pandas as pd
import time

t0 = time.time()
col1_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 20)) for _ in range(700000)]
col2_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 10)) for _ in range(860)]

col1_values = np.random.choice(col1_possible_values, size=13000000, replace=True)
col2_values = np.random.choice(col2_possible_values, size=13000000, replace=True)

sample_df = pd.DataFrame(zip(col1_values, col2_values), columns=["col1", "col2"])
print(time.time()-t0)

t0 = time.time()
processed_df = sample_df.groupby("col1")["col2"].value_counts().unstack()
print(time.time()-t0)

EXAMPLE 2

import numpy as np
import pandas as pd
import time

t0 = time.time()
col1_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 20)) for _ in range(700000)]
col2_possible_values = ["".join(np.random.choice(list("ABCDEFGHIJKLMNOPRSTUVWXYZ"), 10)) for _ in range(860)]

col1_values = np.random.choice(col1_possible_values, size=13000000, replace=True)
col2_values = np.random.choice(col2_possible_values, size=13000000, replace=True)

sample_df = pd.DataFrame(zip(col1_values, col2_values), columns=["col1", "col2"])
sample_df['col2'] = sample_df['col2'].astype('category')
print(time.time()-t0)

t0 = time.time()
processed_df = sample_df.groupby("col1")["col2"].value_counts().unstack()
print(time.time()-t0)

Issue Description

I am using a 2022 MacBook Pro with an M1 Pro, pandas 1.4.1, and numpy 1.22.2.

I noticed a significant performance drop when performing

sample_df.groupby("col1")["col2"].value_counts().unstack()

when col2, or both col1 and col2, are of type 'category' instead of the default type 'object'. The operation in EXAMPLE 1 runs in around 25 seconds on my machine (similar on a 2019 MacBook Pro with an Intel processor). In EXAMPLE 2 I let the operation run for more than 20 minutes and it still had not finished (on the 2019 MacBook Pro with the Intel processor, the running times for EXAMPLE 1 and EXAMPLE 2 are similar).
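A minimal sketch of a possible workaround (my assumption, not part of the original report): casting col2 back to object for just this operation should route the groupby through the same code path as EXAMPLE 1, at the cost of the memory savings the category dtype provides.

# Possible workaround sketch (assumption): avoid the categorical code path
# by counting on an object-typed copy of col2.
processed_df = (
    sample_df.assign(col2=sample_df["col2"].astype(object))
    .groupby("col1")["col2"]
    .value_counts()
    .unstack()
)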

Moreover, on the M1 the peak memory usage in EXAMPLE 1 is around 8-10 GB, while in EXAMPLE 2 it well exceeds 30 GB.
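A plausible contributor to both the slowdown and the memory gap (my reading of the pandas 1.4 groupby code, so treat it as an assumption): for categorical values, SeriesGroupBy.value_counts falls back to a Python-level per-group apply of Series.value_counts, and Series.value_counts on a categorical reports every category, including zero counts. The intermediate result before unstack can therefore grow towards 700,000 groups × 860 categories ≈ 6.0e8 rows, rather than just the observed pairs. A tiny self-contained illustration of the zero-count behaviour:

import pandas as pd

# With object dtype, only observed (group, value) pairs are counted; with
# category dtype every category appears in every group, zero counts included.
df = pd.DataFrame({"g": ["a", "a", "b"], "v": ["x", "y", "x"]})
print(df.groupby("g")["v"].value_counts().size)  # 3 observed pairs
df["v"] = df["v"].astype("category")
print(df.groupby("g")["v"].value_counts().size)  # 4 = 2 groups x 2 categories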

Expected Behavior

On the M1, EXAMPLE 1 and EXAMPLE 2 should perform roughly the same, as they do on an Intel processor.

Installed Versions

INSTALLED VERSIONS

commit : 06d2301
python : 3.9.10.final.0
python-bits : 64
OS : Darwin
OS-release : 21.3.0
Version : Darwin Kernel Version 21.3.0: Wed Jan 5 21:37:58 PST 2022; root:xnu-8019.80.24~20/RELEASE_ARM64_T6000
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.4.1
numpy : 1.22.2
pytz : 2021.3
dateutil : 2.8.2
pip : 22.0.3
setuptools : 60.9.3
Cython : 0.29.28
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.3
jinja2 : 3.0.3
IPython : 8.0.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.5.1
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 7.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None