BUG: Subsequent calls to df.sub() are much faster than the first call · Issue #34297 · pandas-dev/pandas
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
import numpy as np
import pandas as pd
# Build some general structure
my_date_range = pd.date_range('20200101 00:00', '20200102 0:00', freq='S')
level_0_names = list(str(i) for i in range(30))
#level_0_names = list(range(30))
index = pd.MultiIndex.from_product([level_0_names, my_date_range])
column_names = ['col_1', 'col_2']
# Build a df that represents some value over time (think sensors),
# indexed by sensor and time
value_df = pd.DataFrame(np.random.rand(len(index), 2), index=index, columns=column_names)
# Build a reference df holding the reference value each sensor can take (like its max),
# indexed by sensor
ref_df = pd.DataFrame(np.random.randint(1, 10, (len(level_0_names), 2)), index=level_0_names, columns=column_names)
# For each time index in value_df, we now want the deviation of the observed value from the reference value
# In a notebook, this first execution is slow: 8-10 s on my machine
%%time
value_df.sub(ref_df, level=0)
# This second execution is fast: 100-150 ms
%%time
value_df.sub(ref_df, level=0)
# For reference (this is NOT the problem): the following lines produce the same output
# On my machine this takes ~2 s
%%time
same_w_merge = pd.merge(left=value_df.reset_index(level=1), right=ref_df, right_index=True, left_index=True)
same_w_merge['col_1_x'] -= same_w_merge['col_1_y']
same_w_merge['col_2_x'] -= same_w_merge['col_2_y']
same_w_merge = same_w_merge.drop(columns=['col_1_y', 'col_2_y'])
# rename needs columns=...; a bare mapping would be applied to the index labels instead
same_w_merge = same_w_merge.rename(columns={'col_1_x': 'col_1', 'col_2_x': 'col_2'})
same_w_merge = same_w_merge.set_index('level_1', append=True).sort_index()
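To check that the merge route really matches `sub(..., level=0)`, here is a minimal sketch on a much smaller frame (3 sensors, 4 timestamps). The names `via_sub`, `via_merge` and `same_values` are mine, not from the code above, and the sizes are chosen only so the check runs instantly.

```python
import numpy as np
import pandas as pd

# Small stand-ins for value_df / ref_df (sizes are an assumption for illustration)
rng = np.random.default_rng(0)
sensors = ["0", "1", "2"]
times = pd.date_range("2020-01-01", periods=4, freq=pd.Timedelta(seconds=1))
index = pd.MultiIndex.from_product([sensors, times])
columns = ["col_1", "col_2"]

value_df = pd.DataFrame(rng.random((len(index), 2)), index=index, columns=columns)
ref_df = pd.DataFrame(rng.integers(1, 10, (len(sensors), 2)), index=sensors, columns=columns)

# Route 1: broadcast ref_df across level 0 of the MultiIndex
via_sub = value_df.sub(ref_df, level=0).sort_index()

# Route 2: merge on the sensor label, subtract column by column, restore the index
merged = pd.merge(
    value_df.reset_index(level=1),  # unnamed level 1 becomes the column "level_1"
    ref_df,
    left_index=True,
    right_index=True,
)
for col in columns:
    merged[col] = merged[f"{col}_x"] - merged[f"{col}_y"]
via_merge = merged.set_index("level_1", append=True)[columns].sort_index()

# Compare the numeric content of both routes
same_values = np.allclose(via_sub.to_numpy(), via_merge.to_numpy())
print("values match:", same_values)
```

The only difference I would expect between the two results is the name of the second index level (`level_1` on the merge route, unnamed on the `sub` route).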
Problem description
There is a significant difference in speed between the first and second calls to sub in the code above, even though they are the same instruction. I don't understand where this comes from, and in particular why the first call is notably slower than merge (whose performance remains consistent).
Upon investigation, I noticed that the difference between runs is much smaller if value_df.index.levels[0] is of type int (80 ms for the first run, 60 ms for subsequent runs).
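For anyone who wants to reproduce the comparison outside a notebook, here is a rough sketch that times repeated `sub` calls with `time.perf_counter`, once with string labels and once with int labels on level 0. The helpers `build_frames` and `time_sub_calls` are hypothetical names introduced here, and the frame is deliberately smaller than in the report (600 timestamps per sensor instead of a full day), so absolute numbers will differ from the ones quoted above.

```python
import time

import numpy as np
import pandas as pd


def build_frames(labels, n_seconds=600, seed=0):
    """Build fresh (value_df, ref_df) frames shaped like the ones in the report."""
    rng = np.random.default_rng(seed)
    times = pd.date_range("2020-01-01", periods=n_seconds, freq=pd.Timedelta(seconds=1))
    index = pd.MultiIndex.from_product([labels, times])
    columns = ["col_1", "col_2"]
    value_df = pd.DataFrame(rng.random((len(index), 2)), index=index, columns=columns)
    ref_df = pd.DataFrame(rng.integers(1, 10, (len(labels), 2)), index=labels, columns=columns)
    return value_df, ref_df


def time_sub_calls(value_df, ref_df, repeats=3):
    """Return wall-clock seconds for each of `repeats` identical sub() calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        value_df.sub(ref_df, level=0)
        timings.append(time.perf_counter() - start)
    return timings


n_sensors = 30
label_sets = {
    "str": [str(i) for i in range(n_sensors)],
    "int": list(range(n_sensors)),
}

results = {}
for kind, labels in label_sets.items():
    # Fresh frames per label type, so the first timed call is a true first call
    value_df, ref_df = build_frames(labels)
    results[kind] = time_sub_calls(value_df, ref_df)
    print(kind, ["%.1f ms" % (t * 1e3) for t in results[kind]])
```

The frames are rebuilt for each label type on purpose: the first call is the one under investigation, so it has to run on objects that `sub` has not touched yet.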
Expected Output
The current output is correct; the speed of the first call is the issue here.
Output of pd.show_versions()
The bug is reproduced here on a conda / OS X install for simplicity, but I can confirm it also exists in an Ubuntu-based Docker image.
INSTALLED VERSIONS
commit : None
python : 3.7.7.final.0
python-bits : 64
OS : Darwin
OS-release : 18.7.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_CA.UTF-8
LOCALE : en_CA.UTF-8
pandas : 1.0.3
numpy : 1.18.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.4.0.post20200518
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.13.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None