PERF: Why does specifying the index column in pandas significantly increase the read time of a CSV? · Issue #44158 · pandas-dev/pandas
- I have checked that this issue has not already been reported.
- I have confirmed this issue exists on the latest version of pandas.
- I have confirmed this issue exists on the master branch of pandas.
Reproducible Example
I am seeing a significantly increased read time for a CSV in pandas when I specify index_col. I do not understand the reason behind it. Can you help me understand why this is happening and whether it is actually the expected behaviour? Below is the code I am using:
import numpy as np
import pandas as pd
# Save the CSV to be used (100,000,000 rows)
n = 100_000_000
pd.DataFrame({
    'id': np.arange(n),
    'b': np.random.choice(['a', 'b', 'c', 'd'], size=n, p=[0.25, 0.25, 0.25, 0.25]),
}).to_csv('df_sp.csv', index=False)
dfpd = pd.read_csv('df_sp.csv')
# read time - 10.3 seconds
dfpd = pd.read_csv('df_sp.csv', index_col='id')
# read time - 1 minute 38.6 seconds
In fact, I see a significant improvement if I read the dataset without specifying index_col and then set the index with dfpd = dfpd.set_index('id'). This takes just 1.6 more seconds. Why does pandas not default to always reading the index_col column as an ordinary column and then setting it as the index internally with set_index(index_col) when index_col is specified?
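For anyone who wants to compare the two routes without writing a multi-gigabyte file, here is a minimal sketch. It assumes a much smaller, in-memory CSV (the row count `n_rows` and the helper `timed` are my own illustrative choices, not part of the report above), so the absolute timings will not match the numbers quoted, and the gap may not reproduce at this size or on other pandas versions.

```python
import io
import time

import numpy as np
import pandas as pd

# Small synthetic CSV held in memory (hypothetical size; the report used
# 100,000,000 rows written to disk).
n_rows = 200_000
rng = np.random.default_rng(0)
csv_text = pd.DataFrame(
    {"id": np.arange(n_rows), "b": rng.choice(["a", "b", "c", "d"], size=n_rows)}
).to_csv(index=False)


def timed(label, func):
    """Run func once, print the elapsed wall-clock time, and return its result."""
    start = time.perf_counter()
    result = func()
    print(f"{label}: {time.perf_counter() - start:.3f} s")
    return result


# Route 1: let read_csv build the index while parsing.
direct = timed(
    "read_csv(index_col='id')",
    lambda: pd.read_csv(io.StringIO(csv_text), index_col="id"),
)

# Route 2: parse 'id' as an ordinary column, then promote it to the index.
two_step = timed(
    "read_csv() + set_index('id')",
    lambda: pd.read_csv(io.StringIO(csv_text)).set_index("id"),
)

# Both routes should yield the same frame; only the timing may differ.
frames_match = direct.equals(two_step)
print("same result:", frames_match)
```

Increasing `n_rows` should make any difference between the two routes easier to see, at the cost of more memory.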
Installed Versions
INSTALLED VERSIONS
commit : 73c6825
python : 3.9.7.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-89-generic
Version : #100-Ubuntu SMP Fri Sep 24 14:50:10 UTC 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_IN
LOCALE : en_IN.ISO8859-1
pandas : 1.3.3
numpy : 1.21.2
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.0.4
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.25.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 2021.09.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : 1.4.23
tables : None
tabulate : 0.8.9
xarray : 0.19.0
xlrd : 2.0.1
xlwt : None
numba : 0.53.1
Prior Performance
No response