PERF: Unreliable performance of .loc
with big non-unique index dataframe · Issue #54550 · pandas-dev/pandas (original) (raw)
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this issue exists on the latest version of pandas.
- I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
import random import time
import pandas as pd
if name == "main": # Create a large pandas dataframe with non-unique indexes. table_size = 10_000_000 num_index = 1_000_000 df = pd.DataFrame([1] * table_size) index = random.choices(range(num_index), k=table_size) df.index = index df = df.sort_index()
# Seems like we need to pre-query the index to force optimizations.
df.loc[[0, 1, 2, 131, 341414]]
df.loc[[3434]]
for i in range(10):
indexes = random.sample(df.index.tolist(), k=i+1)
start = time.monotonic()
df.loc[indexes]
measure = time.monotonic() - start
print(f"With all at once: num_indexes={i+1} => {measure:.5f}s")
print("---")
for i in range(10):
indexes = random.sample(df.index.tolist(), k=i+1)
start = time.monotonic()
pd.concat([df.loc[[idx]] for idx in indexes])
measure = time.monotonic() - start
print(f"With one at a time: num_indexes={i+1} => {measure:.5f}s")
Hello everyone,
In this example, I create a big DataFrame
with non-unique indexes. I then make sure the index is sorted and that it's been already queried (You could confirm, but since the first calls to loc
are always slower, I assume pandas is doing some kind of lazy optimization behind).
Then the problem. I get very good performance (~0.5ms) until I ask for more than 4 indexes, then it drastically increases to ~500ms (X1000). If I instead index one at a time and concat
after, performance stays in the same order of magnitude (there is a small linear increase from the loop+concat of course and it does get weird for 4...). Here is the output for pandas version 2.1.0rc0+10.gfc308235f
(but the same problem arises with pandas==2.0.3 and 1.5.3):
With all at once: num_indexes=1 => 0.00039s
With all at once: num_indexes=2 => 0.00045s
With all at once: num_indexes=3 => 0.00041s
With all at once: num_indexes=4 => 0.00041s
With all at once: num_indexes=5 => 0.47082s
With all at once: num_indexes=6 => 0.48561s
With all at once: num_indexes=7 => 0.47594s
With all at once: num_indexes=8 => 0.47567s
---
With one at a time: num_indexes=1 => 0.00066s
With one at a time: num_indexes=2 => 0.00086s
With one at a time: num_indexes=3 => 0.00103s
With one at a time: num_indexes=4 => 0.05667s
With one at a time: num_indexes=5 => 0.00156s
With one at a time: num_indexes=6 => 0.00160s
With one at a time: num_indexes=7 => 0.00181s
With one at a time: num_indexes=8 => 0.00200s
I would expect loc
to have steady performance since, in my understanding, it should act close to a hashmap. Even if it is not the case, I think pandas would definitely benefit from a stable way of querying multiple indexes.
Thanks in advance for the help!
Installed Versions
INSTALLED VERSIONS
commit : fc30823
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 6.2.6-76060206-generic
Version : #202303130630168547333822.04~995127e SMP PREEMPT_DYNAMIC Tue M
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 2.1.0rc0+10.gfc308235f
numpy : 1.25.2
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.8.0
pip : 23.1.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
bs4 : None
bottleneck : None
brotli : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None
Prior Performance
No response