Issue 17153: tarfile extract fails when Unicode in pathname
Created on 2013-02-07 16:43 by vinay.sajip, last changed 2022-04-11 14:57 by admin. This issue is now closed.
Messages (7)
Author: Vinay Sajip (vinay.sajip) *
Date: 2013-02-07 16:43
The attached file failing.tar.gz contains a path with UTF-8-encoded Unicode. This causes extractall() to fail, but only when the destination path is Unicode. That's because it leads to an implicit str->unicode conversion using ASCII.
Test script:
    import shutil, tarfile, tempfile

    tf = tarfile.open('failing.tar.gz', 'r:gz')
    workdir = tempfile.mkdtemp()
    try:
        # N.B. ensure dest path is Unicode to trigger the failure
        tf.extractall(unicode(workdir))
    finally:
        shutil.rmtree(workdir)
Result:
    $ python untar.py
    Traceback (most recent call last):
      File "untar.py", line 8, in <module>
        tf.extractall(unicode(workdir))
      File "/usr/lib/python2.7/tarfile.py", line 2046, in extractall
        self.extract(tarinfo, path)
      File "/usr/lib/python2.7/tarfile.py", line 2083, in extract
        self._extract_member(tarinfo, os.path.join(path, tarinfo.name))
      File "/usr/lib/python2.7/posixpath.py", line 71, in join
        path += '/' + b
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 44: ordinal not in range(128)
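The failure mode described above can be imitated on Python 3, where the decode step has to be written out. This is only a minimal sketch, not the tarfile code: the member name and `py2_style_join` are hypothetical, and the explicit `.decode("ascii")` stands in for the implicit conversion Python 2 performs when a unicode object is concatenated with a byte string.

```python
# Hypothetical UTF-8 encoded member name, standing in for tarinfo.name under 2.7.
member_name = u"caf\u00e9/readme.txt".encode("utf-8")


def py2_style_join(dest, name_bytes):
    """Mimic Python 2's `unicode + str`: the bytes are decoded as ASCII first."""
    return dest + u"/" + name_bytes.decode("ascii")


try:
    py2_style_join(u"/tmp/work", member_name)
    error = None
except UnicodeDecodeError as exc:
    # Same exception type as in the traceback; the offending byte is 0xc3.
    error = exc

print(error)
```

The byte position reported here differs from the traceback because the hypothetical name is shorter than the path in failing.tar.gz.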
Author: Mark Lawrence (BreamoreBoy) *
Date: 2014-06-20 23:46
@Lars can we have a response on this issue please?
Author: Lars Gustäbel (lars.gustaebel) *
Date: 2014-07-08 10:40
IIRC, tarfile under 2.7 has never been explicitly unicode-safe; support for unicode objects is heterogeneous at best. The obvious work-around is to work exclusively with str objects.
What we can't do is to decode the utf-8 pathname from the archive to a unicode object, because we have no way to detect an archive's encoding. We can either emit a warning if the user passes a unicode object to extract() or we implicitly encode the passed unicode object using TarFile.encoding, so that the os.path.join() succeeds.
Unfortunately, I am not entirely sure if there was possibly a rationale behind the current behaviour of extract(). This needs more inspection.
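The second option, encoding the destination so that both operands of the join are byte strings, might look roughly like the following. This is a sketch under stated assumptions, written so it also runs on Python 3: `encode_dest` is a hypothetical helper, not part of tarfile, and the `"utf-8"` default stands in for whatever TarFile.encoding holds for the archive in question.

```python
import posixpath


def encode_dest(dest, encoding="utf-8"):
    """Encode a text destination path to bytes; leave byte paths untouched."""
    if isinstance(dest, bytes):
        return dest
    return dest.encode(encoding)


# Hypothetical raw member name as stored in the archive (UTF-8 in this example).
member_name = u"caf\u00e9/readme.txt".encode("utf-8")

# Both operands are bytes, so the join never has to decode anything.
target = posixpath.join(encode_dest(u"/tmp/work"), member_name)
print(repr(target))
```

The archive's own bytes are passed through unchanged, which sidesteps the problem of not knowing the archive's encoding.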
Author: Vadim Markovtsev (Vadim Markovtsev2)
Date: 2016-08-10 12:50
So... The bug persists in 3.5 and 3.6. It prevents, e.g., unpacking tarballs coming from GitHub repos with Unicode file names.
Author: Vadim Markovtsev (Vadim Markovtsev2)
Date: 2016-08-10 12:54
Relevant issue in pip: https://github.com/pypa/setuptools/issues/710
Author: Vinay Sajip (vinay.sajip) *
Date: 2016-08-10 20:01
Could you point to some suitable projects from GitHub whose tarballs fail on 3.5 / 3.6? My script in the first post, with "unicode(...)" replaced by "str(...)" and my original failing archive, works on Python 3.5 and 3.6 on Linux. On which platform have you seen failures?
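For anyone wanting to check the Python 3 behaviour without the original attachment, a self-contained test could look like the sketch below. It does not use failing.tar.gz: the archive is built in memory, and the member name and contents are hypothetical. The `filter` argument is passed only where the running Python provides extraction filters.

```python
import io
import os
import tarfile
import tempfile

NAME = u"caf\u00e9/readme.txt"  # hypothetical non-ASCII member name
DATA = b"hello"                 # hypothetical file contents


def build_archive():
    """Return an in-memory .tar.gz holding one member with a non-ASCII name."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        info = tarfile.TarInfo(NAME)
        info.size = len(DATA)
        tf.addfile(info, io.BytesIO(DATA))
    buf.seek(0)
    return buf


def extract_and_read():
    """Extract into a temporary directory (a str path) and read the member back."""
    # Use the 'data' extraction filter where this Python version has it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tempfile.TemporaryDirectory() as workdir:
        with tarfile.open(fileobj=build_archive(), mode="r:gz") as tf:
            tf.extractall(workdir, **kwargs)
        with open(os.path.join(workdir, *NAME.split("/")), "rb") as fh:
            return fh.read()


if __name__ == "__main__":
    print(extract_and_read())
```

Whether this succeeds can still depend on the platform's filesystem encoding, which is why the platform question matters.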
Author: Zackery Spytz (ZackerySpytz) *
Date: 2021-05-31 21:06
Python 2.7 is no longer supported, so I think this issue should be closed.
History
Date                 User               Action  Args
2022-04-11 14:57:41  admin              set     github: 61355
2021-05-31 22:27:36  vinay.sajip        set     status: open -> closed; resolution: out of date; stage: resolved
2021-05-31 21:06:36  ZackerySpytz       set     nosy: + ZackerySpytz; messages: +
2016-08-11 15:17:31  BreamoreBoy        set     nosy: - BreamoreBoy
2016-08-10 20:01:27  vinay.sajip        set     messages: +
2016-08-10 12:54:14  Vadim Markovtsev2  set     messages: +
2016-08-10 12:50:55  Vadim Markovtsev2  set     nosy: + Vadim Markovtsev2; messages: +
2014-07-08 10:40:12  lars.gustaebel     set     messages: +
2014-06-20 23:46:17  BreamoreBoy        set     nosy: + BreamoreBoy; messages: +
2013-02-08 10:19:47  hynek              set     nosy: + hynek
2013-02-07 16:45:09  vinay.sajip        set     nosy: + lars.gustaebel
2013-02-07 16:44:07  vinay.sajip        set     files: + failing.tar.gz
2013-02-07 16:43:21  vinay.sajip        create