.iterrows takes too long and generates a large memory footprint · Issue #7683 · pandas-dev/pandas

When using `df.iterrows` on a large DataFrame, it takes a long time to run and consumes a huge amount of memory.

The name of the method implies that it is an iterator and should return immediately. However, internally it uses the built-in `zip`, which can materialize a huge temporary list of tuples rather than iterating lazily.

Below is code that reproduces the issue on a box with 16 GB of memory.

```python
import numpy as np
import pandas as pd

s1 = range(30000000)
s2 = np.random.randn(30000000)
ts = pd.date_range('20140101', freq='S', periods=30000000)
df = pd.DataFrame({'s1': s1, 's2': s2}, index=ts)

for r in df.iterrows():
    # expected to return immediately, yet it takes more than
    # 2 minutes and uses 4 GB of memory
    break
```
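As an aside (not part of the original report), a common workaround is `df.itertuples()`, which yields rows lazily as namedtuples and avoids the per-row `Series` construction that `iterrows` performs. A minimal sketch on a small frame, assuming the same column names as the repro:

```python
import numpy as np
import pandas as pd

# Small frame for illustration; the same pattern applies at scale.
df = pd.DataFrame({'s1': range(5), 's2': np.random.randn(5)})

# itertuples is lazy: the first row is available immediately,
# and each row is a lightweight namedtuple.
first = next(df.itertuples())
print(first.Index, first.s1)

# iterrows yields (index, Series) pairs; building a Series per row
# is what makes it comparatively slow and memory-hungry.
idx, row = next(iter(df.iterrows()))
print(idx, row['s1'])
```

Note that `itertuples` preserves dtypes per column, whereas `iterrows` coerces each row into a single `Series`, which can also upcast mixed dtypes.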