Issue 1527974: tarfile chokes on ipython archive on Windows

Created on 2006-07-24 21:00 by arve_knudsen, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (10)
msg29260 - (view) Author: Arve Knudsen (arve_knudsen) Date: 2006-07-24 21:00
I'm trying to extract files from the latest ipython tar archive, available from http://ipython.scipy.org/dist/ipython-0.7.2.tar.gz, using tarfile. This is on Windows XP, using Python 2.4.3. The problem only occurs if I open the archive in stream mode (the "mode" argument to tarfile.open is "r|gz"), in which case tarfile raises StreamError. I'd be happy if this error could be sorted out. The following script should trigger the error:

    import tarfile

    f = file(r"ipython-0.7.2.tar.gz", "rb")
    tar = tarfile.open(fileobj=f, mode="r|gz")
    try:
        for m in tar:
            tar.extract(m)
    finally:
        tar.close()
        f.close()

The resulting exception:

    Traceback (most recent call last):
      File "tst.py", line 7, in ?
        tar.extract(m)
      File "C:\Program Files\Python24\lib\tarfile.py", line 1335, in extract
        self._extract_member(tarinfo, os.path.join(path, tarinfo.name))
      File "C:\Program Files\Python24\lib\tarfile.py", line 1431, in _extract_member
        self.makelink(tarinfo, targetpath)
      File "C:\Program Files\Python24\lib\tarfile.py", line 1515, in makelink
        self._extract_member(self.getmember(linkpath), targetpath)
      File "C:\Program Files\Python24\lib\tarfile.py", line 1423, in _extract_member
        self.makefile(tarinfo, targetpath)
      File "C:\Program Files\Python24\lib\tarfile.py", line 1461, in makefile
        copyfileobj(source, target)
      File "C:\Program Files\Python24\lib\tarfile.py", line 158, in copyfileobj
        shutil.copyfileobj(src, dst)
      File "C:\Program Files\Python24\lib\shutil.py", line 22, in copyfileobj
        buf = fsrc.read(length)
      File "C:\Program Files\Python24\lib\tarfile.py", line 551, in _readnormal
        self.fileobj.seek(self.offset + self.pos)
      File "C:\Program Files\Python24\lib\tarfile.py", line 420, in seek
        raise StreamError, "seeking backwards is not allowed"
    tarfile.StreamError: seeking backwards is not allowed
msg29261 - (view) Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2006-07-25 03:35
I tested this on Linux with both 2.5 and 2.4.3+ without problems. I believe there were some fixes in this area. Could you try testing with the 2.4.3+ current which will become 2.4.4 (or 2.5b2)? If this is still a problem, it looks like it may be Windows specific.
msg29262 - (view) Author: Arve Knudsen (arve_knudsen) Date: 2006-07-25 07:29
Well yeah, it appears to be Windows specific. I just tested on Linux (Ubuntu), also with Python 2.4.3. I'll try 2.4.3+ on Windows to see if it makes any difference. Come to think of it, I think I experienced this problem in the past on Linux, but then I solved it by repacking ipython. Also, if I pack it myself on Windows using bsdtar it works fine.
msg29263 - (view) Author: Arve Knudsen (arve_knudsen) Date: 2006-07-25 08:04
Ok, I've verified now that the problem persists with Python 2.4.4 (from the 2.4 branch in svn). The exact same thing happens.
msg29264 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2006-07-25 08:42
The traceback tells me that there is a hard link inside the archive, which means that a file in the archive is referenced twice. This hard link can be extracted only on platforms that have an os.link() function. On Win32 they're not supported by the file system, but tarfile works around this by extracting the referenced file twice. In order to extract the file the second time, tarfile has to seek back in the input file to access the file's data again. But "seeking backwards is not allowed" when a file is opened in streaming mode ;-) If you do not necessarily need streaming mode for your application, better use "r:gz" or "r" and the problem will be gone.
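The difference between the two modes can be reproduced without the ipython archive. The following is a minimal sketch, assuming a current Python 3 interpreter rather than the 2.4 series discussed here; `build_archive` and `read_hard_link` are hypothetical helper names, and the archive is built in memory with one regular file and one hard link pointing at it. Reading the data behind the link needs access to the earlier member, which works in "r:gz" mode and is expected to raise StreamError in "r|gz" mode.

```python
import io
import tarfile


def build_archive():
    """Return a gzip-compressed tar archive holding a file and a hard link to it."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        data = b"hello"
        info = tarfile.TarInfo("a.txt")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

        # A hard link member is a header only; its data lives in "a.txt".
        link = tarfile.TarInfo("b.txt")
        link.type = tarfile.LNKTYPE
        link.linkname = "a.txt"
        tar.addfile(link)
    return buf.getvalue()


def read_hard_link(archive, mode):
    """Read the data behind the first hard link member using the given open mode."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode=mode) as tar:
        for member in tar:
            if member.islnk():
                # Resolving the link means going back to an earlier member.
                return tar.extractfile(member).read()
    return None
```

Calling `read_hard_link(build_archive(), "r:gz")` returns the linked file's bytes, while the same call with `"r|gz"` should fail with `tarfile.StreamError`, because a stream can only be read forwards.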
msg29265 - (view) Author: Arve Knudsen (arve_knudsen) Date: 2006-07-25 08:59
Thanks for the clarification, Lars. I'd prefer to continue with my current approach however, since it allows me to report progress as the tarfile is unpacked/decompressed. Also, I don't think it would be satisfactory at all if tarfile would just die with a mysterious error in such cases. In order to resolve this, why must tarfile extract the file again, can't it copy the already extracted file?
msg29266 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2006-07-25 09:31
Copying the previously extracted file is no option. When the archive is extracted inside a loop, you never know what happens between two extract() calls. The original file could have been renamed, changed or removed. Suppose you want to extract just those members which are hard links:

    for tarinfo in tar:
        if tarinfo.islnk():
            tar.extract(tarinfo)

I agree with you that the error message is bad because it does not give the slightest idea of what's going wrong. I'll see what I can do about that. To work around your particular problem, my idea is to subclass the TarFile class and replace the makelink() method with one that simply copies the file as you proposed.
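The copy workaround can also be applied in the calling loop rather than by overriding makelink(), which avoids depending on TarFile internals. The following is a sketch only, assuming a current Python 3 interpreter; `extract_stream_copying_links` and `demo_copy_links` are hypothetical names. It also assumes exactly the conditions questioned above: the link target was extracted earlier into the same directory and has not been renamed, changed or removed since, and the archive is trusted (the link name is not validated).

```python
import io
import os
import shutil
import tarfile
import tempfile


def extract_stream_copying_links(fileobj, dest):
    """Extract a gzip'd tar stream, turning hard links into plain copies."""
    # Use the "data" extraction filter on Python versions that provide it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(fileobj=fileobj, mode="r|gz") as tar:
        for member in tar:
            if member.islnk():
                # Assumption: the target is already on disk and unchanged.
                source = os.path.join(dest, member.linkname)
                target = os.path.join(dest, member.name)
                os.makedirs(os.path.dirname(target), exist_ok=True)
                shutil.copyfile(source, target)
            else:
                # Forward-only read, so this is fine in stream mode.
                tar.extract(member, path=dest, **kwargs)


def demo_copy_links():
    """Build a small archive with a hard link and return the copied link's bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        data = b"hello"
        info = tarfile.TarInfo("a.txt")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
        link = tarfile.TarInfo("b.txt")
        link.type = tarfile.LNKTYPE
        link.linkname = "a.txt"
        tar.addfile(link)
    buf.seek(0)
    with tempfile.TemporaryDirectory() as dest:
        extract_stream_copying_links(buf, dest)
        with open(os.path.join(dest, "b.txt"), "rb") as copied:
            return copied.read()
```

The result is two independent files rather than a real hard link, so later changes to one are not reflected in the other.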
msg29267 - (view) Author: Arve Knudsen (arve_knudsen) Date: 2006-07-25 09:58
Yes, I admit that is a weakness of my proposed approach. Perhaps it would be a better idea to extract hardlinked files to a temporary location and copy those files when needed, as a cache? The only problem that I can think of with this approach is the overhead, but perhaps this could be configurable through a keyword if you think it would pose a significant problem (i.e. keeping extra copies of potentially huge files)? The temporary cache would be private to tarfile, so there should be no need to worry about modifications to the contained files.
msg29268 - (view) Author: Arve Knudsen (arve_knudsen) Date: 2006-07-26 22:20
Regarding my last comment, sorry about the noise. After giving it some more thought I realized it was not very realistic implementation-wise, seeing as you can't know whether a file is being linked to when you encounter it in the stream (right?). So I followed your suggestion instead and handled the links on the client level. What I think I'd like to see in TarFile though is an 'extractall' method with the ability to report progress to an optional callback, since I'm only opening in stream mode as a hack to implement this myself (by monitoring file position). From browsing tarfile's source it seems it might require some effort though (with e.g. BZ2File you can't know the amount of data without decompressing everything?).
msg59477 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2008-01-07 18:55
I'm closing this issue because it is out of date. The new TarFile.extractall() method in Python 2.5 provides a way to solve the original problem IMO.
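As far as I know, extractall() has no progress callback of its own, but its `members` argument accepts a generator, which leaves room for a hook between members. The following is a sketch under assumptions, not a tarfile feature: it targets a current Python 3 interpreter, `extract_with_progress` and `demo_progress` are hypothetical names, and progress is measured in uncompressed member bytes rather than position in the compressed file. It opens the archive in "r:gz" mode, so hard links do not hit the stream-mode limitation.

```python
import io
import tarfile
import tempfile


def extract_with_progress(fileobj, dest, callback):
    """Extract everything, calling callback(done_bytes, total_bytes) per member."""
    # Use the "data" extraction filter on Python versions that provide it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(fileobj=fileobj, mode="r:gz") as tar:
        members = tar.getmembers()          # requires a seekable archive
        total = sum(m.size for m in members)

        def tracked():
            done = 0
            for m in members:
                yield m                     # extractall() extracts this member
                done += m.size              # and only then asks for the next one
                callback(done, total)

        tar.extractall(path=dest, members=tracked(), **kwargs)


def demo_progress():
    """Extract a two-file in-memory archive and return the progress reports."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in (("a.txt", b"hello"), ("b.txt", b"abc")):
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    reports = []
    with tempfile.TemporaryDirectory() as dest:
        extract_with_progress(buf, dest, lambda done, total: reports.append((done, total)))
    return reports
```

Calling getmembers() up front scans the whole archive once, which is the cost of knowing the total before extraction starts.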
History
Date User Action Args
2022-04-11 14:56:19 admin set github: 43713
2008-01-07 18:55:05 lars.gustaebel set status: open -> closed; resolution: out of date; messages: +
2006-07-24 21:00:58 arve_knudsen create