It would be helpful to have a tarfile iterator that does not cache every archive member it encounters. The current caching makes it nearly impossible to iterate over an archive with millions of files.
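For illustration, a minimal sketch of the pattern at issue (the archive name is a placeholder): even though the loop only ever looks at one member at a time, every TarInfo seen so far stays alive on the TarFile object.

    import tarfile

    tar = tarfile.open("huge.tar")   # hypothetical multi-million-member archive
    for tarinfo in tar:
        pass                         # process one member at a time
    print(len(tar.members))          # every TarInfo is still cached here
    tar.close()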
I assume you're using Python 2.x, because tarfile's memory footprint was significantly reduced in Python 3.0; see the patch in r62337. This patch was not backported to the 2.x branch at the time, and as the 2.x branch has been closed for new features, this is not going to happen in the future.
Yes, I'm on 2.6. I checked the Python 3.x tarfile, and this one line in TarFile.next() was enough to conclude it has the same problem: self.members.append(tarinfo). Even if the patch reduced the 2.5 GB memory usage measured in my particular case to 60% of that, the remaining 1.5 GB of RAM burned is still too much on a 32-bit machine with 2 GB of RAM. My solution was to comment out that line, which worked perfectly for my case but may not be the right solution for the module.
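For context, the relevant part of TarFile.next() looks roughly like this (a simplified sketch, not the verbatim stdlib code):

    def next(self):
        ...
        tarinfo = self.tarinfo.fromtarfile(self)  # read the next member's header
        self.members.append(tarinfo)              # cached forever, hence the growth
        return tarinfo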
There is no trivial or backwards-compatible solution to this problem. The way it works now, there is no alternative to storing all TarInfo objects: a tar archive has no central table of contents we could use, so tarfile must create its own. In other words, tarfile does not "burn" memory without a reason.

The problem you encounter is something of a corner case, fortunately with a simple workaround:

    for tarinfo in tar:
        ...
        tar.members = []

There are two things that I will clearly refuse to do. One is to add yet another option to the TarFile class to switch off caching, as this would make many TarFile methods dysfunctional without the user knowing why. The other is to add an extra non-caching Iterator class. Sorry that I have nothing more to offer. Maybe someone else comes up with a brilliant idea.
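If it helps, the workaround above can be wrapped in a small helper generator so callers don't have to remember to clear the cache themselves (iter_tar_members is a hypothetical name, not part of the tarfile API):

    import tarfile

    def iter_tar_members(path):
        """Yield TarInfo objects one at a time without accumulating them."""
        tar = tarfile.open(path)
        try:
            for tarinfo in tar:
                yield tarinfo
                tar.members = []   # drop the cache after each member
        finally:
            tar.close()

    # usage:
    # for tarinfo in iter_tar_members("huge.tar"):
    #     print(tarinfo.name)

Note that, as with the inline workaround, random-access methods such as getmember() will not work afterwards, since the table of contents has been thrown away.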