msg106139 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2010-05-20 12:09 |
The file system is hardcoded to UTF-8 on Mac OS X, whereas the locale encoding... depends on the locale. See issue #4388 for the details. I think that we should use the locale encoding to encode and decode command line arguments. We have to create a new encoding variable used for the command line arguments: * Py_CommandLineEncoding * sys.getcmdlineencoding() * (no sys.setcmdlineencoding() please!) * ... This encoding should only be used on POSIX: the native Windows type is Unicode (wchar_t*). It should be used to decode sys.argv and to encode child process arguments (subprocess, os.exec*(), etc.). On Linux, it should not change anything because the file system encoding is the locale encoding. Said differently, Python 3 already uses the locale encoding for the command line arguments on Linux. If you pass a filename on the command line and then open it, the filename is decoded with the locale encoding and then encoded with the file system encoding. I fear that it will fail if the two encodings are different... |
|
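To illustrate the decode/re-encode concern raised in msg106139, here is a minimal sketch. The helper name `roundtrip` and the choice of ISO-8859-1 as the mismatched locale encoding are assumptions made for the example; they do not come from the thread, and a real interpreter may apply additional error handlers.

```python
def roundtrip(raw: bytes, locale_encoding: str, fs_encoding: str) -> bytes:
    """Decode argv bytes with the locale encoding, then re-encode the
    resulting text with the file system encoding (hypothetical helper)."""
    text = raw.decode(locale_encoding)
    return text.encode(fs_encoding)


# A file name stored as UTF-8 bytes, as on a UTF-8 file system.
raw_name = "café".encode("utf-8")

# Both encodings agree: the bytes survive unchanged.
same = roundtrip(raw_name, "utf-8", "utf-8")

# Locale encoding differs from the file system encoding: the bytes that
# would be passed to open() no longer match the name on disk.
mismatch = roundtrip(raw_name, "iso-8859-1", "utf-8")

print(raw_name, same, mismatch)
```

When the two encodings agree the name is preserved; when they differ, the re-encoded bytes name a different (probably nonexistent) file.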
|
msg106150 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2010-05-20 13:02 |
Fix the title: sys.argv is already decoded using the locale encoding on Unix; the problem is that Python uses a (possibly) different encoding to encode command line arguments: the file system encoding. |
|
|
msg106171 - (view) |
Author: Martin v. Löwis (loewis) *  |
Date: 2010-05-20 17:23 |
> I think that we should use the locale encoding to encode and decode command line arguments. I disagree. IIUC, this is only about OSX. Now, we shouldn't take any action until either some OSX expert explains to us how command line arguments are being passed on OSX, or we find some Apple documentation that can be taken as a specification. I think the C locale is very poorly supported on OSX, and we shouldn't really use it for anything. What may be useful is the terminal encoding (which may be different both from UTF-8 and the locale encoding); however, it's not possible to find out what the terminal encoding is. In addition, programs may be started "directly" (i.e. not from the terminal), in which case the terminal encoding would be irrelevant. For file name arguments at least, it's very clear that the command line arguments also use the file system encoding. |
|
|
msg106543 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2010-05-26 17:01 |
@loewis: You restored the original (wrong) title "Use locale encoding to decode sys.argv, not the file system encoding", instead of the new (good) title "Use locale encoding to encode command line arguments (subprocess, os.exec*(), etc.)". Was that intentional? |
|
|
msg108151 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2010-06-18 23:42 |
The attached patch is a draft adding a new encoding: the command line encoding. It is used to encode (subprocess) and decode (python) the command line arguments. It adds sys.getcmdlineencoding(). |
|
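Note that sys.getcmdlineencoding() exists only in the draft patch. As a rough sketch of how the two encodings under discussion can be compared with functions that already exist in the standard library (the helper name `canonical` is an assumption introduced for this example):

```python
import codecs
import locale
import sys


def canonical(name: str) -> str:
    """Normalize an encoding name so that aliases compare equal
    (hypothetical helper, not part of the patch)."""
    return codecs.lookup(name).name


# Encoding Python uses for file names.
fs_encoding = canonical(sys.getfilesystemencoding())

# Encoding derived from the current locale settings.
locale_encoding = canonical(locale.getpreferredencoding(False))

# The case discussed in this issue is the one where these differ.
encodings_agree = fs_encoding == locale_encoding

print(fs_encoding, locale_encoding, encodings_agree)
```

The values printed depend on the platform and environment, so no particular output should be expected.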
|
msg108153 - (view) |
Author: Martin v. Löwis (loewis) *  |
Date: 2010-06-18 23:54 |
I'm still -1, failing to see the problem that is solved. |
|
|
msg108154 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2010-06-18 23:55 |
> I'm still -1, failing to see the problem that is solved. I know (and I agree), but I don't want to lose the patch :-) |
|
|
msg111432 - (view) |
Author: Ronald Oussoren (ronaldoussoren) *  |
Date: 2010-07-24 09:14 |
This issue only seems to be relevant for OSX, and then only for OSX releases before 10.5, because in that release Apple made sure that the LANG variable and similar LC_* ones specify a UTF-8 encoding and we're back at the common case where the filesystem encoding matches the locale encoding. A system where the filesystem encoding doesn't match the locale encoding is hard to get right. While it would be possible to add sys.cmdlineencoding, that doesn't actually solve the semantic problem because external tools might not cooperate. That is, most system tools seem to work with bytes internally and do not treat arguments as text encoded in the locale encoding that should be re-encoded in the filesystem encoding before passing them to the C APIs. That is, when calling "ls somefile" the "ls" command will pass the bytes in argv[1] to the POSIX routines for getting file information without trying to re-encode. In short, having a filesystem encoding that is different from the command-line encoding only works when all system tools cooperate and are Unicode-aware. To be honest, I'd say the behavior of OSX 10.4 is a bug and we might add a workaround on that platform that uses CFStringGetSystemEncoding() to fetch the actual system encoding when LANG=C. (And I'm -1 on adding the patch) See also: |
|
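The "work with bytes internally" behavior described in msg111456's preceding message can be sketched in Python by passing a bytes path straight to the OS. This is only an illustration: the function name `stat_size_by_bytes` and the file name are assumptions, and the "no re-encoding" statement applies to POSIX systems.

```python
import os
import tempfile


def stat_size_by_bytes(raw_path: bytes) -> int:
    """Stat a path given as raw bytes (hypothetical helper). On POSIX the
    bytes are handed to the OS as-is, much as "ls" does with argv[1]."""
    return os.stat(raw_path).st_size


with tempfile.TemporaryDirectory() as tmp:
    # Build the whole path as bytes so no text codec is involved afterwards.
    raw_path = os.path.join(os.fsencode(tmp), "café.txt".encode("utf-8"))
    with open(raw_path, "wb") as f:
        f.write(b"hello")
    size = stat_size_by_bytes(raw_path)

print(size)
```

Because the path never passes through a decode/encode step, the result does not depend on whether the locale encoding and the file system encoding agree.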
|
msg111456 - (view) |
Author: Martin v. Löwis (loewis) *  |
Date: 2010-07-24 11:26 |
It seems that everybody now agrees to close this issue as "won't fix". |
|
|