0.24.0 vs 0.23.4: scalar + DataFrame is 3000x slower · Issue #24990 · pandas-dev/pandas
Code Sample, a copy-pastable example if possible
First, two conda environments:
name: pandas_timing_0_23
dependencies:
  - python=3.7
  - pandas=0.23

name: pandas_timing_0_24
dependencies:
  - python=3.7
  - pandas=0.24
Now, a generalized script that times all 8 combinations of scalar + ndframe, where scalar is 0 or 0.0 and ndframe is a Series or DataFrame holding 5000 NaNs; the 8 combinations include both operand orders.
from itertools import product
from os import system

from pandas import show_versions

if __name__ == "__main__":
    show_versions()
    print()

    # Every (scalar, ndframe) pair, timed in both operand orders.
    for scalar, ndframe in product(
        ["0", "0.0"],
        [
            "pd.Series(np.nan, index=range(5000), dtype=float)",
            "pd.DataFrame(np.nan, index=range(1), columns=range(5000), dtype=float)",
        ],
    ):
        for left, right in [(scalar, ndframe), (ndframe, scalar)]:
            stmt = f"{left} + {right}"
            print(f"Timing {stmt!r}...")
            # Run timeit in a fresh interpreter so each statement is measured in isolation.
            system(
                f"""python -m timeit --setup='import numpy as np; import pandas as pd;' {stmt!r}"""
            )
            print()
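As a possible variant of the script above, the sketch below builds the same 8 statements but prepares each timeit invocation as an argument list for `subprocess.run`, using `sys.executable` so the active conda environment's interpreter is the one that gets timed. The helper names (`build_statements`, `build_command`, `run_all`) are hypothetical and not part of the original report; the statements themselves are copied from it.

```python
import subprocess
import sys
from itertools import product

SETUP = "import numpy as np; import pandas as pd"
SCALARS = ["0", "0.0"]
NDFRAMES = [
    "pd.Series(np.nan, index=range(5000), dtype=float)",
    "pd.DataFrame(np.nan, index=range(1), columns=range(5000), dtype=float)",
]


def build_statements():
    """Return every scalar/ndframe pair as a statement, in both operand orders."""
    statements = []
    for scalar, ndframe in product(SCALARS, NDFRAMES):
        for left, right in [(scalar, ndframe), (ndframe, scalar)]:
            statements.append(f"{left} + {right}")
    return statements


def build_command(stmt):
    """Build an argument list for timeit; no shell is involved, so no quoting is needed."""
    return [sys.executable, "-m", "timeit", "-s", SETUP, stmt]


def run_all():
    """Time each statement in a fresh interpreter (requires numpy and pandas)."""
    for stmt in build_statements():
        print(f"Timing {stmt!r}...")
        subprocess.run(build_command(stmt), check=True)
        print()


# Listing the statements is cheap and does not need pandas installed.
for stmt in build_statements():
    print(stmt)
```

Calling `run_all()` inside each of the two environments should produce output in the same shape as the logs in this report. Passing a list instead of a shell string avoids relying on `repr` to produce shell-safe quoting.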
Problem description
A table of timings:
+--------------+---------------+---------+--------+----------+--------------+
| left operand | right operand | 0.23.4  | 0.24.0 | % slower | times slower |
+--------------+---------------+---------+--------+----------+--------------+
| int          | Series        | 102 us  | 140 us | 37.3     |              |
| Series       | int           | 98.7 us | 140 us | 41.8     |              |
| int          | DataFrame     | 188 us  | 592 ms |          | 3149         |
| DataFrame    | int           | 187 us  | 588 ms |          | 3144         |
| float        | Series        | 97 us   | 140 us | 44.3     |              |
| Series       | float         | 102 us  | 138 us | 35.3     |              |
| float        | DataFrame     | 176 us  | 609 ms |          | 3460         |
| DataFrame    | float         | 185 us  | 591 ms |          | 3195         |
+--------------+---------------+---------+--------+----------+--------------+
These are collated from the following:
0.23.4
Timing '0 + pd.Series(np.nan, index=range(5000), dtype=float)'...
5000 loops, best of 5: 102 usec per loop
Timing 'pd.Series(np.nan, index=range(5000), dtype=float) + 0'...
5000 loops, best of 5: 98.7 usec per loop
Timing '0 + pd.DataFrame(np.nan, index=range(1), columns=range(5000), dtype=float)'...
2000 loops, best of 5: 188 usec per loop
Timing 'pd.DataFrame(np.nan, index=range(1), columns=range(5000), dtype=float) + 0'...
2000 loops, best of 5: 187 usec per loop
Timing '0.0 + pd.Series(np.nan, index=range(5000), dtype=float)'...
5000 loops, best of 5: 97 usec per loop
Timing 'pd.Series(np.nan, index=range(5000), dtype=float) + 0.0'...
5000 loops, best of 5: 102 usec per loop
Timing '0.0 + pd.DataFrame(np.nan, index=range(1), columns=range(5000), dtype=float)'...
2000 loops, best of 5: 176 usec per loop
Timing 'pd.DataFrame(np.nan, index=range(1), columns=range(5000), dtype=float) + 0.0'...
2000 loops, best of 5: 185 usec per loop
0.24.0
Timing '0 + pd.Series(np.nan, index=range(5000), dtype=float)'...
2000 loops, best of 5: 140 usec per loop
Timing 'pd.Series(np.nan, index=range(5000), dtype=float) + 0'...
2000 loops, best of 5: 140 usec per loop
Timing '0 + pd.DataFrame(np.nan, index=range(1), columns=range(5000), dtype=float)'...
1 loop, best of 5: 592 msec per loop
Timing 'pd.DataFrame(np.nan, index=range(1), columns=range(5000), dtype=float) + 0'...
1 loop, best of 5: 588 msec per loop
Timing '0.0 + pd.Series(np.nan, index=range(5000), dtype=float)'...
2000 loops, best of 5: 140 usec per loop
Timing 'pd.Series(np.nan, index=range(5000), dtype=float) + 0.0'...
2000 loops, best of 5: 138 usec per loop
Timing '0.0 + pd.DataFrame(np.nan, index=range(1), columns=range(5000), dtype=float)'...
1 loop, best of 5: 609 msec per loop
Timing 'pd.DataFrame(np.nan, index=range(1), columns=range(5000), dtype=float) + 0.0'...
1 loop, best of 5: 591 msec per loop
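For reference, the "% slower" and "times slower" columns of the table can be recomputed from these raw timings. The sketch below is one way to do it, assuming every value is first converted to microseconds; the `slowdown` helper and the `TIMINGS` layout are hypothetical names introduced only for illustration.

```python
# (left operand, right operand, 0.23.4 time in us, 0.24.0 time in us)
# The 0.24.0 DataFrame timings were reported in msec and are converted to usec here.
TIMINGS = [
    ("int", "Series", 102.0, 140.0),
    ("Series", "int", 98.7, 140.0),
    ("int", "DataFrame", 188.0, 592_000.0),
    ("DataFrame", "int", 187.0, 588_000.0),
    ("float", "Series", 97.0, 140.0),
    ("Series", "float", 102.0, 138.0),
    ("float", "DataFrame", 176.0, 609_000.0),
    ("DataFrame", "float", 185.0, 591_000.0),
]


def slowdown(old_us, new_us):
    """Return (times slower, percent slower) of new_us relative to old_us."""
    ratio = new_us / old_us
    return ratio, (ratio - 1.0) * 100.0


for left, right, old_us, new_us in TIMINGS:
    ratio, percent = slowdown(old_us, new_us)
    if ratio >= 2.0:
        # Large regressions read better as a multiple.
        print(f"{left:>9} + {right:<9} {ratio:8.0f}x slower")
    else:
        # Small regressions read better as a percentage.
        print(f"{left:>9} + {right:<9} {percent:8.1f}% slower")
```

The Series cases come out roughly 35 to 45 percent slower, while the DataFrame cases come out more than 3000 times slower, which matches the table.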
Expected Output
Output of pd.show_versions()
0.23.4
INSTALLED VERSIONS
commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-862.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.15.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
0.24.0
INSTALLED VERSIONS
commit: None
python: 3.7.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-862.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.0
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.15.4
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None