msg106276 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2010-05-22 01:22 |
The mbcs encoding replaces non-encodable characters (losing information) and doesn't support the surrogateescape error handler. It ignores the error handler argument (see #850997), and tarfile now uses the surrogateescape error handler by default (#8390). This encoding is just horrible for Unicode support :-) Since the Windows native API uses Unicode characters (UTF-16), I think that it would be better to use utf-8 as the default encoding on Windows. utf-8 is able to encode and decode the full Unicode charset and supports all error handlers (especially surrogateescape). The attached patch sets the default encoding to utf-8 on Windows, and removes the test "ENCODING is None" because sys.getfilesystemencoding() cannot be None anymore (in 3.2 only, it's a recent change: #8610). |
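A minimal sketch of why surrogateescape matters here, assuming a hypothetical file name stored as CP1252 bytes. It only uses the utf-8 codec (mbcs exists on Windows only), and shows that surrogateescape keeps undecodable bytes recoverable while "replace" does not.

```python
# Hypothetical file name: "café.txt" stored as CP1252/Latin-1 bytes.
# The byte 0xE9 is not valid UTF-8 on its own.
raw_name = b"caf\xe9.txt"

# surrogateescape maps the undecodable byte 0xE9 to the lone surrogate U+DCE9 ...
decoded = raw_name.decode("utf-8", "surrogateescape")

# ... and maps it back to the original byte when encoding.
restored = raw_name.decode("utf-8", "surrogateescape").encode("utf-8", "surrogateescape")

# "replace" is lossy: the byte becomes U+FFFD and cannot be recovered.
lossy = raw_name.decode("utf-8", "replace").encode("utf-8")

print(ascii(decoded))
print(restored == raw_name)
print(lossy == raw_name)
```

An encoding that ignores the error handler argument cannot offer this round trip, which is the concern raised above.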
|
|
msg106758 - (view) |
Author: Lars Gustäbel (lars.gustaebel) *  |
Date: 2010-05-30 12:31 |
My expertise on Windows is rather limited, but as far as I understand the issue, I consider this a reasonable idea. I think it is impossible to find a perfect default encoding, and utf-8 seems to be the best bet with regard to portability. IIRC most of the archivers on the Windows machines I have access to use latin-1, but I don't think that latin-1 is a suitable default value. I don't know much about Windows internals and have no idea what mbcs really is, but it is actually not available on other platforms. |
|
|
msg107435 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2010-06-09 22:57 |
I created a TAR archive with the 7-zip archiver from files with diacritics in their names (e.g. "é" and "à"). Then I opened the archive with WinRAR: the file names were not displayed correctly :-/ 7-zip encodes "à" (U+00e0) as 0x85 (1 byte), and "é" (U+00e9) as 0x82 (1 byte). I don't know this encoding. |
|
|
msg107438 - (view) |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2010-06-09 23:09 |
STINNER Victor wrote:
> I created a TAR archive with the 7-zip archiver of file with diacritics in their name (eg. "é" and "à"). Then I opened the archive with WinRAR: the file names were not displayed correctly :-/
>
> 7-zip encodes "à" (U+00e0) as 0x85 (1 byte), and "é" (U+00e9) as 0x82 (1 byte). I don't know this encoding.
That's an old DOS code page used in Europe: CP850
http://en.wikipedia.org/wiki/Code_page_850 |
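The identification above can be checked with a short sketch: encode the known text with a few candidate codecs and keep those that reproduce the observed bytes. The helper name `matching_codecs` and the candidate list are assumptions made for illustration.

```python
def matching_codecs(text, raw, candidates):
    """Return the candidate codecs that encode `text` to exactly `raw`."""
    matches = []
    for name in candidates:
        try:
            if text.encode(name) == raw:
                matches.append(name)
        except UnicodeEncodeError:
            # The codec cannot represent the text at all.
            pass
    return matches


# Bytes reported for "à" followed by "é" in the 7-zip archive.
observed = b"\x85\x82"
text = "\u00e0\u00e9"  # "àé"

result = matching_codecs(text, observed, ["cp850", "cp1252", "latin-1", "utf-8"])
print(result)
```

Other DOS code pages may share these two byte positions, so a match narrows the choice rather than proving which code page the tool used.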
|
|
msg107440 - (view) |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2010-06-09 23:21 |
Marc-Andre Lemburg wrote:
> STINNER Victor wrote:
>> I created a TAR archive with the 7-zip archiver of file with diacritics in their name (eg. "é" and "à"). Then I opened the archive with WinRAR: the file names were not displayed correctly :-/
>>
>> 7-zip encodes "à" (U+00e0) as 0x85 (1 byte), and "é" (U+00e9) as 0x82 (1 byte). I don't know this encoding.
>
> That's an old DOS code page used in Europe: CP850
>
> http://en.wikipedia.org/wiki/Code_page_850
Looks like the cmd.exe on WinXP still uses it. At least on my German WinXP it does for Python 2.3 and older. Starting with Python 2.4, the behavior changed to use CP1252 instead:
D:\Python26>python
Python 2.6 (r26:66721, Oct 2 2008, 11:35:03) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> u'àé'
u'\xe0\xe9'
D:\Python25>python
Python 2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> u'áé'
u'\xe1\xe9'
D:\Python24>python
Python 2.4 (#60, Nov 30 2004, 11:49:19) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> u'àé'
u'\xe0\xe9'
D:\Python23>python
Python 2.3.4 (#53, May 25 2004, 21:17:02) [MSC v.1200 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> u'àé'
u'\x85\x82'
>>> |
|
|
msg107455 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2010-06-10 11:47 |
I created a tarball (.tar.gz) on Windows with Python 3.1 (which uses the "mbcs" encoding). With locale.getpreferredencoding() == 'cp1252', "é" (U+00e9) is encoded as 0xe9 (1 byte) and "à" (U+00e0) as 0xe0 (1 byte). WinRAR displays the file names correctly, but 7-zip displays the wrong glyphs. So WinRAR expects CP1252 whereas 7-zip expects CP850.
I also tested an archive encoded with UTF-8: WinRAR and 7-zip both display the wrong glyphs, because they decode utf-8 with CP1252 / CP850 :-/
If an archive will be used on UNIX, I think that the archive should use UTF-8 (on Windows and UNIX). But if the archive is read on Windows with WinRAR or 7-zip, the archive should use a code page. Since mbcs looks to be the least bad choice, it may be used, but with the "replace" error handler (because it doesn't support the "surrogateescape" error handler).
--
About the code pages:
- the chcp command displays "Active code page: 850"
- python -c "import locale; print(locale.getpreferredencoding())" displays "cp1252"
- python -c "import sys; print(sys.stdout.encoding)" displays "cp850"
Python calls GetConsoleOutputCP() to get the stdout/stderr encoding (code page), whereas locale.getpreferredencoding() (_locale._getdefaultlocale()) calls GetACP(). |
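The mismatches described above can be reproduced without any archiver. This is a sketch only: the helper `reinterpret` is hypothetical, and which code page each tool actually applies is taken from the observations in this thread, not verified here.

```python
def reinterpret(text, written_with, read_with):
    """Encode `text` with one codec, then decode the bytes with another."""
    return text.encode(written_with).decode(read_with, "replace")


name = "\u00e9"  # "é"

# UTF-8 bytes (0xC3 0xA9) read by a tool that assumes CP1252.
utf8_as_cp1252 = reinterpret(name, "utf-8", "cp1252")

# The same UTF-8 bytes read by a tool that assumes CP850.
utf8_as_cp850 = reinterpret(name, "utf-8", "cp850")

# CP1252 bytes (0xE9) read by a tool that assumes CP850.
cp1252_as_cp850 = reinterpret(name, "cp1252", "cp850")

for label, shown in [
    ("utf-8 read as cp1252", utf8_as_cp1252),
    ("utf-8 read as cp850", utf8_as_cp850),
    ("cp1252 read as cp850", cp1252_as_cp850),
]:
    print(label, ascii(shown))
```

Only a writer and reader that agree on the code page show the original character, which is why no single default can satisfy both tools.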
|
|
msg107466 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2010-06-10 17:18 |
My tests with 7-zip and WinRAR convinced me that it's not a good idea to use utf-8 *by default* on Windows. But since mbcs doesn't support the surrogateescape error handler, we should restore the previous behaviour just for this encoding.
tarfile_mbcs_errors.patch creates a function choose_errors() which determines the best error handler depending on the encoding and the mode (read or write):
- "strict" to write with mbcs
- "replace" to read with mbcs
- "surrogateescape" otherwise
Please review my changes to the documentation :-)
On Windows, the patched tarfile works exactly as in Python 3.1. |
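The rule listed above could look roughly like the following. This is a hypothetical re-creation from the description, not the code in tarfile_mbcs_errors.patch; the signature and the "r"/"w" mode values are assumptions.

```python
def choose_errors(encoding, mode):
    """Pick an error handler for an encoding and a mode ("r" or "w").

    Hypothetical sketch of the rule described in the message above.
    """
    if mode not in ("r", "w"):
        raise ValueError("mode must be 'r' or 'w'")
    if encoding.lower() == "mbcs":
        # mbcs cannot use surrogateescape: fail loudly on write,
        # substitute on read.
        return "strict" if mode == "w" else "replace"
    return "surrogateescape"


print(choose_errors("mbcs", "w"))
print(choose_errors("mbcs", "r"))
print(choose_errors("utf-8", "r"))
```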
|
|
msg107467 - (view) |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2010-06-10 17:27 |
STINNER Victor wrote:
> My tests with 7-zip and WinRAR convinced me that it's not a good idea to use utf-8 *by default* on Windows. But since mbcs doesn't support surrogateescape error handler, we should restore the previous behaviour just for this encoding.
>
> tarfile_mbcs_errors.patch creates a function choose_errors() which determines the best error handler depending on the encoding and the mode (read or write):
> - "strict" to write with mbcs
> - "replace" to read with mbcs
> - "surrogateescape" otherwise
I think you should implement this in a more general way: have the class test whether the codec supports "surrogateescape" and then use it. Otherwise fall back to "strict" for writing and "replace" for reading. |
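One possible way to sketch the more general approach is to probe the codec with a lone surrogate and see whether the escaped byte comes back. Both function names are hypothetical, and the probe is an assumption about how such a test might be written, not something taken from a patch.

```python
def supports_surrogateescape(encoding):
    """Return True if `encoding` honours the surrogateescape handler on encode."""
    probe = "\udc80"  # what surrogateescape produces when decoding byte 0x80
    try:
        return probe.encode(encoding, "surrogateescape") == b"\x80"
    except (LookupError, UnicodeError):
        # Unknown codec, or a codec that rejects the probe outright.
        return False


def pick_errors(encoding, mode):
    """Use surrogateescape when available, else strict (write) or replace (read)."""
    if supports_surrogateescape(encoding):
        return "surrogateescape"
    return "strict" if mode == "w" else "replace"


print(supports_surrogateescape("utf-8"))
print(pick_errors("utf-8", "r"))
```

A codec that ignores the error handler argument, as described for mbcs at the start of this issue, would fail the probe and get the fallback handlers.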
|
|
msg107468 - (view) |
Author: Martin v. Löwis (loewis) *  |
Date: 2010-06-10 18:40 |
>> 7-zip encodes "à" (U+00e0) as 0x85 (1 byte), and "é" (U+00e9) as 0x82 (1 byte). I don't know this encoding.
>
> That's an old DOS code page used in Europe: CP850
There is a good chance that they use it because it is the OEM code page on the system.
In any case, I think that both cp850 and cp1252 are inherently incorrect for tarfiles (despite these tools using them). tar is a POSIX thing, and these encodings have nothing to do with POSIX. So using UTF-8 is a reasonable choice, IMO. The other reasonable choice would be ASCII. |
|
|
msg107469 - (view) |
Author: Lars Gustäbel (lars.gustaebel) *  |
Date: 2010-06-10 18:51 |
Maybe I'm going out on a limb here, but I think we should again consider what tarfile users on Windows(!) actually use it for, and under which circumstances. The following list is probably not exhaustive, but IMHO covers 90%:
1. Download tar archives from a webpage (when no zip is supplied) for viewing or extracting.
2. Create backups for personal use.
3. Create source archives from a project for unix users who hate zipfiles.
I am convinced that the tarfile module is not very popular on Windows, for a simple reason: tar archives are not. Windows users will always prefer zip archives and hence the zipfile module, because it's something they're familiar with.
The point I am trying to make is that, first, we should not choose a default encoding based on what works best with WinRAR, 7-zip and such, because they all act very differently, which makes it impossible. Second, we must not overemphasize the encoding issue to a point where portability is in danger. In practice, in almost all real-life cases there are no encoding issues. In my whole tarfile maintaining career I cannot remember a single incident of a tar archive that I got from an external source that contained special characters. The only tar archives that contain special characters in my experience are backups. But these backups are created and later restored on one and the same system. Again, no encoding issues.
Long story short, I still vote for utf-8, because it enables Windows users to create backups without losing special characters, and it's ASCII-"compatible" and should be able to read 99% of the files that you get from the internet. |
|
|
msg107488 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2010-06-10 21:14 |
> 2. Create backups for personal use.
What? Really? I'm sure that all Windows users will use ZIP or maybe RAR, but never the geek choice.
> 1. Download tar archives from a webpage (when no zip is supplied) for viewing or extracting.
Tarballs come from the UNIX/BSD world, which has used UTF-8 by default for some years.
> 3. Create source archives from a project for unix users who hate zipfiles.
In this case, UTF-8 is also better.
--
Did I mention that 7-zip is only able to create TAR archives? I mean uncompressed archives. Who will use that? (not me ;-)) WinRAR is unable to create tarballs, or even (uncompressed) .tar archives.
--
If the maintainer of the tarfile module agrees that UTF-8 is the best choice, I will commit my initial patch. I would prefer to commit tarfile_windows_utf8.patch because it changes 4 lines, whereas tarfile_mbcs_errors.patch changes... much more code :-)
tarfile_windows_utf8.patch is not complete: the documentation should also be updated:
.. data:: ENCODING
   The default character encoding i.e. the value from either :func:`sys.getfilesystemencoding` or :func:`sys.getdefaultencoding`.
=>
.. data:: ENCODING
   The default character encoding: ``'utf-8'`` on Windows, :func:`sys.getfilesystemencoding` otherwise. |
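Whatever the default is, the encoding can be passed explicitly to tarfile.open(). The sketch below builds a tiny in-memory archive to show the effect; the helper names are hypothetical, and USTAR_FORMAT is chosen here as an assumption so that the member name is stored in the header using the given encoding.

```python
import io
import tarfile


def make_tar(member_name, encoding):
    """Return the bytes of an archive holding one empty member."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT,
                      encoding=encoding) as tar:
        tar.addfile(tarfile.TarInfo(member_name))  # size 0, so no file object
    return buf.getvalue()


def list_names(data, encoding):
    """Return the member names of an archive, decoded with `encoding`."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r",
                      encoding=encoding) as tar:
        return tar.getnames()


name = "\u00e9.txt"  # "é.txt"
archive = make_tar(name, "cp1252")

same_codec = list_names(archive, "cp1252")
# With the default surrogateescape handler, the stray byte 0xE9 is kept
# as a lone surrogate instead of being dropped or replaced.
read_as_utf8 = list_names(archive, "utf-8")

print(same_codec)
print([ascii(n) for n in read_as_utf8])
```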
|
|
msg107491 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2010-06-10 21:20 |
Updated version of the utf-8 patch:
- Also use UTF-8 for Windows CE
- Update the documentation
- Prepare the NEWS entry |
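The resulting default could be sketched as follows. The function name is hypothetical, and using os.name values "nt" and "ce" to detect Windows and Windows CE is an assumption about how the check might be written, not a quote from the patch.

```python
import os
import sys


def default_tar_encoding(os_name=os.name):
    """Return the default encoding as described in the updated documentation."""
    # Assumption: "nt" identifies Windows and "ce" identified Windows CE.
    if os_name in ("nt", "ce"):
        return "utf-8"
    return sys.getfilesystemencoding()


print(default_tar_encoding())
```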
|
|
msg107492 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2010-06-10 21:24 |
FWIW, I agree with Lars: the main use of tar files under Windows is when they come from other systems. Windows users almost never generate tar files by themselves; they will generate zip, rar or 7z files instead. |
|
|
msg107609 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2010-06-11 23:49 |
Ok. I committed the patch to set the default encoding to utf-8 on Windows: r81925. |
|
|