PERF: Series.fillna (that is part of DataFrame) with inplace=True high memory usage / memory leaking · Issue #46149 · pandas-dev/pandas
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this issue exists on the latest version of pandas.
- I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
I have recently migrated from pandas 1.1.3 to pandas 1.4.1 and I'm experiencing some memory-related issues. Code that used to work just fine is now crashing due to memory limitations.
Reproducible example:
import pandas as pd
import numpy as np
arr = np.full((500, 400), np.nan)
df = pd.DataFrame(arr)
for col in df.columns:
    df[col].fillna(-1, inplace=True)
Interestingly enough, this snippet, which I would expect to have bigger memory requirements (as it's not inplace), works just fine:
import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan] * 400]*500000)
for col in df.columns:
    df[col] = df[col].fillna(-1)
The real code is obviously more complicated than the example, so I'd like to keep filling NAs column by column and doing this in place. Is that an unintended way of using fillna on a DataFrame's Series, or is it a bug?
I don't have any memory profiling results that I could share, but I can try producing some if that's necessary. I've just noticed that the code works on the old pandas version and doesn't work on the new one, and I can see the memory usage in the system's resource monitor.
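In case it helps with triage, here is a rough sketch of how the peak memory of the two loops could be compared using the standard-library tracemalloc module. The helper names (peak_memory_mib, fill_inplace, fill_reassign, make_frame) and the frame sizes are made up for illustration and are not part of pandas. I have not run this against both pandas versions, so no numbers are claimed here.

```python
import tracemalloc

import numpy as np
import pandas as pd


def peak_memory_mib(func, *args):
    """Run func(*args) and return the peak traced allocation in MiB."""
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak / 2**20


def fill_inplace(df):
    # Pattern from the first snippet: fill each column's Series in place.
    for col in df.columns:
        df[col].fillna(-1, inplace=True)


def fill_reassign(df):
    # Pattern from the second snippet: assign the filled column back.
    for col in df.columns:
        df[col] = df[col].fillna(-1)


def make_frame(rows=500, cols=400):
    # All-NaN frame; the default shape is an arbitrary choice for this sketch.
    return pd.DataFrame(np.full((rows, cols), np.nan))


if __name__ == "__main__":
    print("inplace :", peak_memory_mib(fill_inplace, make_frame()), "MiB")
    print("reassign:", peak_memory_mib(fill_reassign, make_frame()), "MiB")
```

tracemalloc only sees allocations reported through Python's allocator hooks, so the figures may differ from what the system's resource monitor shows. Newer pandas versions may also emit a chained-assignment warning for the in-place loop.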
Installed Versions
INSTALLED VERSIONS
------------------
commit : 06d230151e6f18fdb8139d09abf539867a8cd481
python : 3.8.3.final.0
python-bits : 64
OS : Darwin
OS-release : 20.4.0
Version : Darwin Kernel Version 20.4.0: Thu Apr 22 21:46:47 PDT 2021; root:xnu-7195.101.2~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8
pandas : 1.4.1
numpy : 1.21.0
pytz : 2020.1
dateutil : 2.8.1
pip : 22.0.3
setuptools : 41.2.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.6
jinja2 : 3.0.3
IPython : 8.0.1
pandas_datareader: None
bs4 : None
bottleneck : 1.3.2
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.3.2
numba : 0.53.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 2.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.5.3
sqlalchemy : 1.4.11
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None
Prior Performance
In pandas 1.1.3 the first snippet works just fine.
Both snippets also seem to run far faster on the older pandas version.
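The speed difference could be checked with a simple wall-clock comparison along these lines, run once under each pandas version. This is only a sketch: time_fill and the two fill helpers are illustrative names, and the default frame shape is an arbitrary choice.

```python
import time

import numpy as np
import pandas as pd


def fill_inplace(df):
    # In-place fill on each column's Series.
    for col in df.columns:
        df[col].fillna(-1, inplace=True)


def fill_reassign(df):
    # Fill and assign the result back to the column.
    for col in df.columns:
        df[col] = df[col].fillna(-1)


def time_fill(fill, rows=500, cols=400):
    """Return the seconds taken by fill() on a fresh all-NaN frame."""
    df = pd.DataFrame(np.full((rows, cols), np.nan))
    start = time.perf_counter()
    fill(df)
    return time.perf_counter() - start


if __name__ == "__main__":
    print("pandas", pd.__version__)
    print("inplace :", time_fill(fill_inplace), "s")
    print("reassign:", time_fill(fill_reassign), "s")
```

A single run is noisy, so repeating the measurement a few times and taking the minimum would give a fairer comparison.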