It would be helpful to have a tarfile iterator that does not cache every archive member it encounters. The current caching makes it nearly impossible to iterate over an archive with millions of files.
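For illustration, a minimal sketch of the pattern at issue (the archive name is a placeholder): even though the loop only ever looks at one member at a time, every TarInfo seen so far stays alive on the TarFile object.

    import tarfile

    tar = tarfile.open("huge.tar")   # hypothetical multi-million-member archive
    for tarinfo in tar:
        pass                         # process one member at a time
    print(len(tar.members))          # every TarInfo is still cached here
    tar.close()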
I assume you're using Python 2.x, because tarfile's memory footprint was significantly reduced in Python 3.0; see the patch in r62337. This patch was not backported to the 2.x branch at the time, and as the 2.x branch has been closed for new features, this is not going to happen in the future.
Yes, I'm on 2.6. I checked the Python 3.x tarfile, and this one line in TarFile.next() was enough to conclude it has the same problem: self.members.append(tarinfo). Even if the patch reduced the 2.5 GB memory usage measured in my particular case to 60% of that, the remaining 1.5 GB of RAM burned is still too much on a 32-bit machine with 2 GB of RAM. My solution was to comment out that line, which worked perfectly for my case but may not be the right solution for the module.
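For context, the relevant part of TarFile.next() looks roughly like this (a simplified sketch, not the verbatim stdlib code):

    def next(self):
        ...
        tarinfo = self.tarinfo.fromtarfile(self)  # read the next member's header
        self.members.append(tarinfo)              # cached forever, hence the growth
        return tarinfo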
There is no trivial or backwards-compatible solution to this problem. The way it works now, there is no alternative to storing all TarInfo objects: a tar archive has no central table of contents we could use, so tarfile must create its own. In other words, tarfile does not "burn" memory without a reason.

The problem you encounter is something of a corner case, fortunately with a simple workaround:

    for tarinfo in tar:
        ...
        tar.members = []

There are two things that I will clearly refuse to do. One is to add yet another option to the TarFile class to switch off caching, as this would make many TarFile methods dysfunctional without the user knowing why. The other is to add an extra non-caching Iterator class. Sorry that I have nothing more to offer. Maybe someone else comes up with a brilliant idea.
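If it helps, the workaround above can be wrapped in a small helper generator so callers don't have to remember to clear the cache themselves (iter_tar_members is a hypothetical name, not part of the tarfile API):

    import tarfile

    def iter_tar_members(path):
        """Yield TarInfo objects one at a time without accumulating them."""
        tar = tarfile.open(path)
        try:
            for tarinfo in tar:
                yield tarinfo
                tar.members = []   # drop the cache after each member
        finally:
            tar.close()

    # usage:
    # for tarinfo in iter_tar_members("huge.tar"):
    #     print(tarinfo.name)

Note that, as with the inline workaround, random-access methods such as getmember() will not work afterwards, since the table of contents has been thrown away.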