.iterrows takes too long and generates a large memory footprint · Issue #7683 · pandas-dev/pandas

When using `df.iterrows` on a large DataFrame, it takes a long time to run and consumes a huge amount of memory.

The name of the method implies that it is an iterator and should return immediately. However, internally it uses the built-in `zip`, which can materialize a huge temporary list of tuples rather than iterating lazily.

Below is code that reproduces the issue on a box with 16 GB of memory.

```python
import numpy as np
import pandas as pd

s1 = range(30000000)
s2 = np.random.randn(30000000)
ts = pd.date_range('20140101', freq='S', periods=30000000)
df = pd.DataFrame({'s1': s1, 's2': s2}, index=ts)

for r in df.iterrows():
    # expected to return immediately, yet it takes more than
    # 2 minutes and uses 4 GB of memory
    break
```
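As an aside (not part of the original report), a common workaround is `df.itertuples()`, which yields rows lazily as namedtuples and avoids the per-row `Series` construction that `iterrows` performs. A minimal sketch on a small frame, assuming the same column names as the repro:

```python
import numpy as np
import pandas as pd

# Small frame for illustration; the same pattern applies at scale.
df = pd.DataFrame({'s1': range(5), 's2': np.random.randn(5)})

# itertuples is lazy: the first row is available immediately,
# and each row is a lightweight namedtuple.
first = next(df.itertuples())
print(first.Index, first.s1)

# iterrows yields (index, Series) pairs; building a Series per row
# is what makes it comparatively slow and memory-hungry.
idx, row = next(iter(df.iterrows()))
print(idx, row['s1'])
```

Note that `itertuples` preserves dtypes per column, whereas `iterrows` coerces each row into a single `Series`, which can also upcast mixed dtypes.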