PERF: Unreliable performance of .loc with big non-unique index dataframe · Issue #54550 · pandas-dev/pandas (original) (raw)

Pandas version checks

Reproducible Example

import random import time

import pandas as pd

if name == "main": # Create a large pandas dataframe with non-unique indexes. table_size = 10_000_000 num_index = 1_000_000 df = pd.DataFrame([1] * table_size) index = random.choices(range(num_index), k=table_size) df.index = index df = df.sort_index()

# Seems like we need to pre-query the index to force optimizations.
df.loc[[0, 1, 2, 131, 341414]]
df.loc[[3434]]

for i in range(10):
    indexes = random.sample(df.index.tolist(), k=i+1)
    start = time.monotonic()
    df.loc[indexes]
    measure = time.monotonic() - start
    print(f"With all at once: num_indexes={i+1} => {measure:.5f}s")

print("---")

for i in range(10):
    indexes = random.sample(df.index.tolist(), k=i+1)
    start = time.monotonic()
    pd.concat([df.loc[[idx]] for idx in indexes])
    measure = time.monotonic() - start
    print(f"With one at a time: num_indexes={i+1} => {measure:.5f}s")

Hello everyone,

In this example, I create a big DataFrame with non-unique indexes. I then make sure the index is sorted and that it's been already queried (You could confirm, but since the first calls to loc are always slower, I assume pandas is doing some kind of lazy optimization behind).

Then the problem. I get very good performance (~0.5ms) until I ask for more than 4 indexes, then it drastically increases to ~500ms (X1000). If I instead index one at a time and concat after, performance stays in the same order of magnitude (there is a small linear increase from the loop+concat of course and it does get weird for 4...). Here is the output for pandas version 2.1.0rc0+10.gfc308235f (but the same problem arises with pandas==2.0.3 and 1.5.3):

With all at once: num_indexes=1 => 0.00039s                                                                                             
With all at once: num_indexes=2 => 0.00045s                                                                                             
With all at once: num_indexes=3 => 0.00041s                                                                                             
With all at once: num_indexes=4 => 0.00041s                                                                                             
With all at once: num_indexes=5 => 0.47082s                                                                                             
With all at once: num_indexes=6 => 0.48561s                                                                                             
With all at once: num_indexes=7 => 0.47594s                                                                                             
With all at once: num_indexes=8 => 0.47567s                                                                                             
---                                                                                                                                     
With one at a time: num_indexes=1 => 0.00066s                                                                                           
With one at a time: num_indexes=2 => 0.00086s                                                                                           
With one at a time: num_indexes=3 => 0.00103s                                                                                           
With one at a time: num_indexes=4 => 0.05667s                                                                                           
With one at a time: num_indexes=5 => 0.00156s                                                                                           
With one at a time: num_indexes=6 => 0.00160s                                                                                           
With one at a time: num_indexes=7 => 0.00181s                                                                                           
With one at a time: num_indexes=8 => 0.00200s

I would expect loc to have steady performance since, in my understanding, it should act close to a hashmap. Even if it is not the case, I think pandas would definitely benefit from a stable way of querying multiple indexes.

Thanks in advance for the help!

Installed Versions

INSTALLED VERSIONS

commit : fc30823
python : 3.10.6.final.0
python-bits : 64
OS : Linux
OS-release : 6.2.6-76060206-generic
Version : #202303130630168547333822.04~995127e SMP PREEMPT_DYNAMIC Tue M
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.1.0rc0+10.gfc308235f
numpy : 1.25.2
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.8.0
pip : 23.1.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
bs4 : None
bottleneck : None
brotli : None
dataframe-api-compat: None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None

Prior Performance

No response