MemoryError with more than 1E9 rows · Issue #8252 · pandas-dev/pandas

I have 240GB of RAM and nothing else running on the machine. I'm trying to create 1.5E9 rows, which I think should produce a data frame of around 100GB, but I'm getting this MemoryError. The same code works fine with 1E9 rows but not with 1.5E9. I could understand a limit at about 2^31 (2E9) or 2^32 (4E9), but all 240GB appears to be exhausted (according to htop) somewhere between 1E9 and 1.5E9 rows. Any ideas? Thanks.
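For context, here is the back-of-envelope behind the ~100GB figure (my own estimate, not something pandas reports): three object columns of 8-byte pointers plus six 8-byte int/float columns.

```python
# Rough size of the final frame's column data on a 64-bit build (assumption:
# the object columns store 8-byte pointers into a small pool of unique strings).
N = int(1.5e9)
object_cols = 3    # id1, id2, id3 (strings, stored as object pointers)
numeric_cols = 6   # id4-id6, v1-v3 (int64/float64)
print("~%d GB" % (N * (object_cols + numeric_cols) * 8 / 1e9))  # ~108 GB
```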

$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import pandas as pd
>>> import numpy as np
>>> import timeit
>>> pd.__version__
'0.14.1'
>>> def randChar(f, numGrp, N) :
...    things = [f%x for x in range(numGrp)]
...    return [things[x] for x in np.random.choice(numGrp, N)]
...
>>> def randFloat(numGrp, N) :
...    things = [round(100*np.random.random(),4) for x in range(numGrp)]
...    return [things[x] for x in np.random.choice(numGrp, N)]
...
>>> N = int(1.5e9)   # N=int(1e9) works fine
>>> K = 100
>>> DF = pd.DataFrame({
...   'id1' : randChar("id%03d", K, N),       # large groups (char)
...   'id2' : randChar("id%03d", K, N),       # large groups (char)
...   'id3' : randChar("id%010d", N//K, N),   # small groups (char)
...   'id4' : np.random.choice(K, N),         # large groups (int)
...   'id5' : np.random.choice(K, N),         # large groups (int)
...   'id6' : np.random.choice(N//K, N),      # small groups (int)
...   'v1' :  np.random.choice(5, N),         # int in range [1,5]
...   'v2' :  np.random.choice(5, N),         # int in range [1,5]
...   'v3' :  randFloat(100, N)               # numeric e.g. 23.5749
... })
Traceback (most recent call last):
  File "<stdin>", line 10, in <module>
  File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 203, in __init__
    mgr = self._init_dict(data, index, columns, dtype=dtype)
  File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 327, in _init_dict
    dtype=dtype)
  File "/usr/lib/python3/dist-packages/pandas/core/frame.py", line 4630, in arrays_to_mgr
    return create_block_manager_from_arrays(arrays, arr_names, axes)
  File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 3235, in create_block_manager_from_arrays
    blocks = form_blocks(arrays, names, axes)
  File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 3322, in form_blocks
    object_items, np.object)
  File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 3346, in _simple_blockify
    values, placement = _stack_arrays(tuples, dtype)
  File "/usr/lib/python3/dist-packages/pandas/core/internals.py", line 3410, in _stack_arrays
    stacked = np.empty(shape, dtype=dtype)
MemoryError
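The traceback dies in `_stack_arrays`, where pandas consolidates the three object columns into one contiguous 2-D block. A minimal sketch of what that step amounts to (simplified names; the real code is in `pandas/core/internals.py`, and the copying details here are my paraphrase):

```python
import numpy as np

def stack_object_columns(arrays):
    # pandas allocates one contiguous (n_columns, N) object block up front,
    # then copies each column into it. With N = 1.5e9 and three columns,
    # np.empty asks for 3 * 1.5e9 * 8 bytes = 36 GB of pointer storage in
    # one shot, on top of the lists and arrays still referenced by the
    # input dict.
    shape = (len(arrays), len(arrays[0]))
    stacked = np.empty(shape, dtype=object)  # the line that raises MemoryError
    for i, arr in enumerate(arrays):
        stacked[i] = arr
    return stacked
```

So the failure looks like a peak-memory problem rather than a row-count limit: the input Python lists, their object-array conversions, and the stacked block are all alive at the same moment.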

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Stepping:              4
CPU MHz:               2494.070
BogoMIPS:              5054.21
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31

$ free -h
             total       used       free     shared    buffers     cached
Mem:          240G       2.3G       237G       364K        66M       632M
-/+ buffers/cache:       1.6G       238G
Swap:           0B         0B         0B
$
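For what it's worth, a variant that may lower the peak (my suggestion, untested at this scale): build the string columns as NumPy object arrays via fancy indexing rather than 1.5e9-element Python lists, and assign columns one at a time so only one intermediate is alive at once. pandas can still consolidate the object columns into a single block later, so this may only move the failure point rather than remove it.

```python
import numpy as np
import pandas as pd

def rand_char(fmt, num_grp, n):
    # Index a small pool of unique strings with an int array; the result is
    # an object ndarray from the start, with no giant Python list in between.
    pool = np.array([fmt % x for x in range(num_grp)], dtype=object)
    return pool[np.random.choice(num_grp, n)]

N = int(1.5e9)
K = 100
DF = pd.DataFrame(index=np.arange(N))
DF['id1'] = rand_char("id%03d", K, N)
DF['id2'] = rand_char("id%03d", K, N)
DF['id3'] = rand_char("id%010d", N // K, N)
DF['id4'] = np.random.choice(K, N)
DF['id5'] = np.random.choice(K, N)
DF['id6'] = np.random.choice(N // K, N)
DF['v1'] = np.random.choice(5, N)
DF['v2'] = np.random.choice(5, N)
# Same idea for the rounded floats: draw from a pool of 100 values.
pool = np.round(100 * np.random.random(100), 4)
DF['v3'] = pool[np.random.choice(100, N)]
```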

An earlier question on Stack Overflow is here: http://stackoverflow.com/questions/25631076/is-this-the-fastest-way-to-group-in-pandas