read_json with lines=True not using buff/cache memory · Issue #17048 · pandas-dev/pandas

I have a 3.2 GB JSON file that I am trying to read into pandas using pd.read_json(lines=True). When I run that, I get a MemoryError, even though my system has >12 GB of available memory. This is pandas version 0.20.2.
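For reference, the failing call is just the following (the file path here is illustrative):

```python
import pandas as pd

# Illustrative path; the real file is 3.2 GB of newline-delimited JSON
fp = 'data/records.jsonl'

# Raises MemoryError on pandas 0.20.2 despite >12 GB of available memory
df = pd.read_json(fp, lines=True)
```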

I'm on Ubuntu, and the `free` command shows >12 GB of "available" memory, most of which is "buff/cache".

I'm able to read the file into a dataframe by iterating over the file like so:

```python
import itertools
from io import StringIO

import pandas as pd

dfs = []
with open(fp, 'r') as f:
    while True:
        # Read the file in batches of 1000 lines
        lines = list(itertools.islice(f, 1000))
        if lines:
            lines_str = ''.join(lines)
            dfs.append(pd.read_json(StringIO(lines_str), lines=True))
        else:
            break

df = pd.concat(dfs)
```

You'll notice that at the end of this I have the original data in memory twice (once in the list of DataFrames and once in the final df), yet this runs without any problems.

It seems that pd.read_json with lines=True doesn't make use of the available (buff/cache) memory, which looks to me like a bug.
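As an aside, a tidier version of the workaround may be possible if read_json does the chunking itself; I believe newer pandas releases accept a chunksize argument alongside lines=True, returning an iterator of DataFrames (untested sketch):

```python
import pandas as pd

# With chunksize, read_json returns an iterator that yields DataFrames
# of `chunksize` lines each instead of parsing the whole file at once.
chunks = pd.read_json(fp, lines=True, chunksize=1000)
df = pd.concat(chunks)
```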