groupby().first() much slower with a str column present in the data. · Issue #19283 · pandas-dev/pandas

(I copied this code from a Jupyter notebook, so `%time` below is an IPython magic.)

    import pandas as pd
    import sys

    pd.options.display.max_rows = 10
    print('pandas version', pd.__version__)
    print('python version', sys.version)

    # pandas version 0.22.0
    # python version 3.6.3 |Anaconda custom (64-bit)| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)]

    # 100,000 rows with random float keys, so nearly every row is its own group
    msgs = pd.DataFrame({
        'orderid': pd.np.random.random_sample(size=100000),
        'qty': pd.np.random.random_sample(size=100000),
    })
    msgs['date'] = '1900-01-01'
    msgs['textcol'] = 'lorem ipsum etc'
    msgs.info()

Without textcol in the data, first() takes 59 ms:

    g = msgs[['date', 'orderid', 'qty']].groupby(['date', 'orderid'])
    %time orders = g.first()
    orders.info(null_counts=True)

With textcol in the data, first() takes 10.6 s:

    g = msgs.groupby(['date', 'orderid'])
    %time orders = g.first()
    orders.info(null_counts=True)

Problem description

I find that the presence of a text column in a DataFrame's data (i.e. a column that is not one of the groupby keys) slows down groupby().first() by two orders of magnitude in version 0.22.0, but not in 0.21.1. The operation takes 59 ms without the text column in the data and 10.6 s with it. (The problem is not limited to this kind of made-up data; I discovered it in my work after upgrading pandas.)
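For anyone who wants to reproduce the comparison outside a notebook, here is a minimal sketch of the same measurement as a plain script. The helper names `build_frame` and `time_first` are hypothetical (not part of pandas), `time.perf_counter` stands in for the `%time` magic, and NumPy is imported directly instead of going through `pd.np`, on the assumption that the script may also be run on newer pandas versions where `pd.np` is gone. Timings will differ by machine and version.

```python
import time

import numpy as np
import pandas as pd


def build_frame(n_rows=100000, seed=0):
    """Build a frame shaped like the one in the report (hypothetical helper)."""
    rng = np.random.RandomState(seed)
    frame = pd.DataFrame({
        "orderid": rng.random_sample(size=n_rows),
        "qty": rng.random_sample(size=n_rows),
    })
    frame["date"] = "1900-01-01"          # constant string used as a groupby key
    frame["textcol"] = "lorem ipsum etc"  # string column that is NOT a groupby key
    return frame


def time_first(frame, keys):
    """Return (elapsed seconds, result) for frame.groupby(keys).first()."""
    start = time.perf_counter()
    result = frame.groupby(keys).first()
    elapsed = time.perf_counter() - start
    return elapsed, result


msgs = build_frame()
keys = ["date", "orderid"]

# Case 1: the text column is dropped before grouping.
t_without, orders_without = time_first(msgs[["date", "orderid", "qty"]], keys)

# Case 2: the text column is left in the data.
t_with, orders_with = time_first(msgs, keys)

print("pandas version", pd.__version__)
print("without textcol: {:.3f} s".format(t_without))
print("with textcol:    {:.3f} s".format(t_with))
```

Running the script once under each pandas version being compared gives the two timings side by side; a large ratio between the second and first numbers corresponds to the slowdown described above.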

Expected Output

When I run the same code under 0.21.1, the times are 55 ms without the text column and 67 ms with it.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 44 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: 0.26.1
numpy: 1.13.3
scipy: 0.19.1
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.0
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: 2.7.3.2 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None