read_json with lines=True not using buff/cache memory · Issue #17048 · pandas-dev/pandas
I have a 3.2 GB JSON file that I am trying to read into pandas using `pd.read_json(lines=True)`. When I run that, I get a `MemoryError`, even though my system has >12 GB of available memory. This is pandas version 0.20.2.
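For reference, the direct call that fails is just this (a minimal sketch; `fp` stands in for the path to the file):

```python
import pandas as pd

# Reading the whole 3.2 GB newline-delimited JSON file at once;
# this raises MemoryError
df = pd.read_json(fp, lines=True)
```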
I'm on Ubuntu, and the `free` command shows >12 GB of "available" memory, most of which is "buff/cache".
I'm able to read the file into a DataFrame by iterating over it in chunks, like so:
```python
import itertools
from io import StringIO

import pandas as pd

dfs = []
with open(fp, 'r') as f:
    while True:
        # Read the file 1000 lines at a time
        lines = list(itertools.islice(f, 1000))
        if lines:
            # Parse each chunk as line-delimited JSON
            lines_str = ''.join(lines)
            dfs.append(pd.read_json(StringIO(lines_str), lines=True))
        else:
            break
df = pd.concat(dfs)
```
You'll notice that at the end of this I have the original data in memory twice (once in the list and once in the final df), but I have no problems.
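If the intermediate list were a concern, the same loop could presumably be wrapped in a generator so the chunk list is released as soon as the concatenation finishes. A rough, untested sketch (`read_json_chunks` is just an illustrative name):

```python
import itertools
from io import StringIO

import pandas as pd

def read_json_chunks(path, chunk_size=1000):
    # Yield a DataFrame for every chunk_size lines of a
    # newline-delimited JSON file
    with open(path, 'r') as f:
        while True:
            lines = list(itertools.islice(f, chunk_size))
            if not lines:
                break
            yield pd.read_json(StringIO(''.join(lines)), lines=True)

df = pd.concat(read_json_chunks(fp))
```

Peak usage should be about the same, since `pd.concat` materializes all the chunks before building the result, but nothing keeps holding the chunk DataFrames after it returns.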
It seems that `pd.read_json` with `lines=True` doesn't use the available memory, which looks to me like a bug.