msg257355 - (view) |
Author: Patrik Dufresne (Patrik Dufresne) |
Date: 2016-01-02 19:53 |
With Python 3.4, tarfile doesn't properly support adding files with bytes paths. Only unicode is supported. It fails with an exception similar to:

    tar.add(os.path.join(dirpath, filename), filename)
  File "/usr/lib/python3.4/tarfile.py", line 1907, in add
    tarinfo = self.gettarinfo(name, arcname)
  File "/usr/lib/python3.4/tarfile.py", line 1767, in gettarinfo
    arcname = arcname.replace(os.sep, "/")
TypeError: expected bytes, bytearray or buffer compatible object

It uses os.sep and u"/". Instead, it should use something like posixpath.py:_get_sep(path). |
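For illustration only, the separator handling suggested above could look roughly like the sketch below. The helper name `normalize_arcname` is hypothetical and is not part of tarfile; it only shows how the separator can be chosen to match the type of the path, in the spirit of `posixpath._get_sep`.

```python
import os


def normalize_arcname(arcname):
    """Replace the platform separator with "/" for both str and bytes.

    Hypothetical helper, not tarfile's actual code: it picks a separator
    of the same type as the path before calling replace().
    """
    if isinstance(arcname, bytes):
        # os.fsencode turns the str separator into its bytes form.
        return arcname.replace(os.fsencode(os.sep), b"/")
    return arcname.replace(os.sep, "/")


str_result = normalize_arcname(os.path.join("dir", "file.txt"))
bytes_result = normalize_arcname(os.path.join(b"dir", b"file.txt"))
```

With this kind of type-aware replacement, `arcname.replace` no longer mixes a bytes path with str arguments, which is what triggers the TypeError in the traceback.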
|
|
msg257356 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2016-01-02 20:01 |
See also issue 21996. |
|
|
msg257357 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2016-01-02 20:03 |
Does using a surrogateescape encoded filename work? (You won't get the error you report...my question is, does that do the right thing when building the archive?) |
|
|
msg257381 - (view) |
Author: Martin Panter (martin.panter) *  |
Date: 2016-01-02 22:39 |
Is the tarfile module designed to support bytes for file names in general? The documentation doesn’t seem to mention bytes anywhere relevant. This seems more like a new feature rather than a bug to me. |
|
|
msg257386 - (view) |
Author: Patrik Dufresne (Patrik Dufresne) |
Date: 2016-01-02 23:39 |
> Is the tarfile module designed to support bytes for file names in general? The documentation doesn’t seem to mention bytes anywhere relevant. This seems more like a new feature rather than a bug to me.

I'm using bytes on Unix to represent a path. From the `os.path` docs:

The path parameters can be passed as either strings, or bytes. Applications are encouraged to represent file names as (Unicode) character strings. Unfortunately, some file names may not be representable as strings on Unix, so applications that need to support arbitrary file names on Unix should use bytes objects to represent path names. Vice versa, using bytes objects cannot represent all file names on Windows (in the standard mbcs encoding), hence Windows applications should use string objects to access all files.

As such, I expect to be able to use bytes to represent a path with tarfile. Also, the tar file format doesn't define any specific encoding for file names. I'm expecting to be able to put any kind of bytes data in a given filename, since this was working in tarfile with py2.

> Does using a surrogateescape encoded filename work? (You won't get the error you report...my question is, does that do the right thing when building the archive?)

I will need to take a further look into surrogateescape. I read somewhere it was an experimental feature, so I didn't try it.

Thanks to both of you for your quick feedback during the holidays. |
|
|
msg257388 - (view) |
Author: Martin Panter (martin.panter) *  |
Date: 2016-01-03 00:16 |
It looks like surrogate-escaped bytes should be supported thanks to Issue 8390, although this is not so useful if you use the “pax” format (which always uses UTF-8 internally). To generate a surrogate-escaped string, you can “decode” it with the following error handler:

>>> b"non-as\xA9ii".decode("ascii", "surrogateescape")
'non-as\udca9ii' |
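As a rough sketch of how this could be applied when building an archive, the example below writes one member into an in-memory tar file and reads its name back. The file name, the payload, and the choice of `GNU_FORMAT` with an explicit `utf-8` encoding are assumptions made for illustration, not something stated in this thread.

```python
import io
import tarfile

# Assumed example: a file name that is not valid UTF-8.
raw_name = b"non-as\xa9ii.txt"
# Smuggle the undecodable byte into a str via surrogateescape.
arcname = raw_name.decode("utf-8", "surrogateescape")

payload = b"hello"
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT,
                  encoding="utf-8", errors="surrogateescape") as tar:
    info = tarfile.TarInfo(name=arcname)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

archive_bytes = buf.getvalue()

# Read the archive back and recover the original bytes of the name.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r",
                  encoding="utf-8", errors="surrogateescape") as tar:
    stored_name = tar.getnames()[0]
round_tripped = stored_name.encode("utf-8", "surrogateescape")
```

The same error handler has to be used in both directions: it turns the raw byte into a lone surrogate when decoding and restores the original byte when the header is encoded.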
|
|
msg257422 - (view) |
Author: Patrik Dufresne (Patrik Dufresne) |
Date: 2016-01-03 15:33 |
It's a bit tricky, but with the help of surrogateescape I get the expected result. I'm closing this bug. Thanks. |
|
|