BUG: Memory leak in json encoding for time related objects · Issue #40443 · pandas-dev/pandas
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Hello,
While using pandas in my project, I saw the memory usage of my process rising steadily. After some digging, it looks like there is a memory leak in the JSON encoding code.
This simple test should be able to reproduce the issue:
import pandas as pd

if __name__ == "__main__":
    df = pd.DataFrame(
        [[1, 2], [1, 2], [3, 4], [3, 4], [3, 6], [5, 6], [7, 8], [7, 8]],
        columns=["a", "b"],
        index=pd.date_range("1/1/2000", periods=8, freq="T"),
    )
    # Round-trip through JSON many times so the leaked bytes add up.
    for i in range(10000):
        json_str = df.to_json(orient="table")
        df = pd.read_json(json_str, orient="table")
Running this test under Valgrind should produce output like this:
$ PYTHONMALLOC=malloc valgrind --leak-check=yes --track-origins=yes --log-file=valgrind-log.txt python test.py
...
==214631== 3,358,152 bytes in 79,956 blocks are definitely lost in loss record 15,015 of 15,015
==214631== at 0x483E77F: malloc (vg_replace_malloc.c:307)
==214631== by 0x4F811482: int64ToIso (in /lib/python3.7/site-packages/pandas/_libs/json.cpython-37m-x86_64-linux-gnu.so)
==214631== by 0x4F81364B: encode (in /lib/python3.7/site-packages/pandas/_libs/json.cpython-37m-x86_64-linux-gnu.so)
==214631== by 0x4F813514: encode (in /lib/python3.7/site-packages/pandas/_libs/json.cpython-37m-x86_64-linux-gnu.so)
==214631== by 0x4F8135F7: encode (in /lib/python3.7/site-packages/pandas/_libs/json.cpython-37m-x86_64-linux-gnu.so)
==214631== by 0x4F813514: encode (in /lib/python3.7/site-packages/pandas/_libs/json.cpython-37m-x86_64-linux-gnu.so)
==214631== by 0x4F813A80: JSON_EncodeObject (in /lib/python3.7/site-packages/pandas/_libs/json.cpython-37m-x86_64-linux-gnu.so)
==214631== by 0x4F811119: objToJSON (in /lib/python3.7/site-packages/pandas/_libs/json.cpython-37m-x86_64-linux-gnu.so)
==214631== by 0x4993B18: _PyMethodDef_RawFastCallKeywords (in /usr/lib/libpython3.7m.so.1.0)
==214631== by 0x4993713: _PyCFunction_FastCallKeywords (in /usr/lib/libpython3.7m.so.1.0)
==214631== by 0x499364C: ??? (in /usr/lib/libpython3.7m.so.1.0)
==214631== by 0x498E0CD: _PyEval_EvalFrameDefault (in /usr/lib/libpython3.7m.so.1.0)
...
This points to the int64ToIso() function in this case, but nearly every function used by getStringValue() allocates memory, and that memory does not appear to be freed afterwards (unless I'm missing something).
value = enc->getStringValue(obj, &tc, &szlen);
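For anyone who wants a quick check without Valgrind, here is a rough sketch that watches the process's peak resident set size while repeating the JSON round trip. The helper names `peak_rss_growth` and `make_roundtrip_step` are made up for this illustration and are not part of pandas. It assumes a Unix system (the `resource` module), and `ru_maxrss` is reported in kilobytes on Linux.

```python
import io
import resource


def peak_rss_growth(step, iterations):
    """Run step() repeatedly and return how much the peak RSS grew.

    ru_maxrss is in kilobytes on Linux (bytes on macOS).
    """
    before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    for _ in range(iterations):
        step()
    after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return after - before


def make_roundtrip_step():
    """Build a callable that does one to_json/read_json round trip (needs pandas)."""
    import pandas as pd  # imported lazily so peak_rss_growth stays stdlib-only

    # Keep the current frame in a dict so the inner function can replace it.
    state = {
        "df": pd.DataFrame(
            [[1, 2], [3, 4], [5, 6], [7, 8]],
            columns=["a", "b"],
            index=pd.date_range("1/1/2000", periods=4, freq="min"),
        )
    }

    def step():
        payload = state["df"].to_json(orient="table")
        state["df"] = pd.read_json(io.StringIO(payload), orient="table")

    return step
```

Calling `peak_rss_growth(make_roundtrip_step(), 10000)` with increasing iteration counts and seeing the result keep climbing would be consistent with a leak. Peak RSS can also grow for unrelated reasons such as allocator fragmentation, so the Valgrind trace remains the stronger evidence.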
It would be great if someone can confirm my deduction. If I'm right, I will try to submit a PR.
Thanks.
The environment I used:
pd.show_versions()
INSTALLED VERSIONS
------------------
commit : 7d32926db8f7541c356066dcadabf854487738de
python : 3.7.10.final.0
python-bits : 64
OS : Linux
OS-release : 5.11.2-1-MANJARO
Version : #1 SMP PREEMPT Fri Feb 26 12:17:53 UTC 2021
machine : x86_64
processor :
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.2.2
numpy : 1.19.5