msg55444
Author: (Kevin Ar18)
Date: 2007-08-29 21:34
Summary: If you have a zip file that contains a file inside of it greater than 2GB, then the zipfile module is unable to read that file.

Steps to Reproduce:
1. Create a zip file several GB in size with a file inside of it that is over 2GB in size.
2. Attempt to read the large file inside the zip file.

Here's some sample code:

    import zipfile
    import re

    dataObj = zipfile.ZipFile("zip.zip", "r")
    for i in dataObj.namelist():
        if i[-1] == "/":
            print "dir"
        else:
            fileName = re.split(r".*/", i, 0)[1]
            fileData = dataObj.read(i)

Result: Python returns the following error:

      File "...\zipfile.py", line 491, in read
        bytes = self.fp.read(zinfo.compress_size)
    OverflowError: long int too large to convert to int

Expected Result: It should copy the data into the variable fileData. I'll try to post more info in a follow-up.
|
|
msg55461
Author: Gregory P. Smith (gregory.p.smith)
Date: 2007-08-30 04:48
I'll take care of it. Any more info in the interim will be appreciated.
|
|
msg55482
Author: (Kevin Ar18)
Date: 2007-08-30 14:52
Here's another bug report that talks about a 2GB file limit: http://bugs.python.org/issue1189216

The diff offered there does not solve the problem; in fact, it may not have anything to do with fixing it (though I'm not certain) and may just be a readability change.

I tried to program a solution based on other things I saw and read on the internet, but ran into different problems. I took the line

    bytes = self.fp.read(zinfo.compress_size)

and made it read a little bit at a time, appending each chunk to bytes as it went along. This was really slow, since every iteration had to concatenate onto an ever-larger bytes string. I then tried collecting the chunks in a list, but I couldn't find how to join the list back together into a string when done (similar to the JavaScript join() method). Even with the list method, I ran into an odd MemoryError as the loop reached higher counts, which I couldn't explain, so I gave up at that point.

Also, I have no idea if this one line in the zipfile module is the only problem, or if others will pop up once that part is fixed.
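[A minimal sketch of the list-and-join pattern described above, on a plain file object; the names are illustrative, not zipfile's internals. str.join() is the Python analogue of JavaScript's join(), and accumulating chunks in a list avoids the quadratic cost of repeated string concatenation. Note that the final join still has to build one string big enough for the whole file, so it cannot escape the 2GB limit discussed in the next message.]

    CHUNK = 64 * 1024  # read 64 KiB at a time; an arbitrary choice

    def read_all(fp):
        # Collect chunks in a list (linear cost) instead of
        # repeatedly concatenating onto one string (quadratic cost).
        chunks = []
        while True:
            data = fp.read(CHUNK)
            if not data:
                break
            chunks.append(data)
        # str.join is the Python equivalent of JavaScript's join().
        return "".join(chunks)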
|
|
msg55485
Author: Martin v. Löwis (loewis)
Date: 2007-08-30 15:33
I now see the problem. What you want to do cannot possibly work: you are trying to create a string object that is larger than 2GB, which is not possible on a 32-bit system (which I assume you are using). No matter how you modify the read() function, it would always return a string so large that it cannot fit into the address space.

This will be fixed in Python 2.6, which has a separate .open() method that allows the individual files in the zipfile to be read as streams.
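[A sketch of that streaming approach, assuming Python 2.6's ZipFile.open(); the filename, member choice, and output path are placeholders. The member is copied to disk in bounded chunks, so no single string ever holds more than one chunk at a time.]

    import zipfile
    import shutil

    zf = zipfile.ZipFile("zip.zip", "r")
    member = zf.namelist()[0]   # placeholder: pick the >2GB member
    src = zf.open(member)       # file-like stream, new in 2.6
    dst = open("extracted.bin", "wb")
    shutil.copyfileobj(src, dst, 64 * 1024)  # bounded 64 KiB reads
    dst.close()
    src.close()
    zf.close()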
|
|
msg55486
Author: (Kevin Ar18)
Date: 2007-08-30 15:43
Just some thoughts... In posting about this problem elsewhere, it has been argued that you shouldn't be copying that much data into memory anyway (though there are cases where you may need to). Still, the question is what the zipfile module should do: at the very least, it should account for this 2GB limitation and report that it can't do the read. How it should interact with the programmer beyond that is another question.

In one of the replies, I am told that strings have a 2GB limitation, which means the zipfile module can't be used in its current form for such files, even if this line is fixed. Does this mean the zipfile module needs some additional methods for incrementally getting and writing data? Or should programmers implement an incremental system themselves when they need one? Or something else?
|
|
msg55487
Author: (Kevin Ar18)
Date: 2007-08-30 15:45
So, just add an error to the module (so it won't crash)? BTW, is Python 2.6 ready for use? I could use that feature now. :) |
|
|
msg55488
Author: (Kevin Ar18)
Date: 2007-08-30 15:46
Maybe a message that says that strings on 32-bit CPUs cannot handle more than 2GB of data; use the stream instead? |
|
|
msg60248
Author: Gregory P. Smith (gregory.p.smith)
Date: 2008-01-19 23:39
The issue here was that reading more data than will fit into an in-memory string fails. While the zipfile module could detect this in some cases, it is not really worth such a runtime check. This is just a fact of Python and of sane programming: if you're reading data from a file-like object, you should never use unbounded reads without first checking your input for sanity.
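[For illustration, one way caller code might apply that advice; this is a sketch with made-up names, not anything the zipfile module provides. The idea is to cap each read at an explicit limit and fail loudly instead of attempting a huge allocation.]

    MAX_READ = 100 * 1024 * 1024  # 100 MiB; choose what fits your app

    def bounded_read(fp, size):
        # Refuse oversized or nonsensical sizes up front rather than
        # letting fp.read() try to allocate the whole thing at once.
        if size < 0 or size > MAX_READ:
            raise ValueError("refusing to read %d bytes at once" % size)
        return fp.read(size)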
|
|