[Python-Dev] Python-3.0, unicode, and os.environ
Toshio Kuratomi a.badger at gmail.com
Thu Dec 4 23:51:25 CET 2008
Terry Reedy wrote:
> Toshio Kuratomi wrote:
>> I opened up bug http://bugs.python.org/issue4006 a while ago and it was suggested in the report that it's not a bug but a feature and so I should come here to see about getting the feature changed :-)
>
> It does you no good (and will irritate others) to conflate 'design decision I do not agree with' with 'mistaken documentation or implementation of a design decision'. The former is opinion, the latter is usually fact (with occasional border cases). The latter is what core developers mean by 'bug'.

Noted. However, there's also a difference between "Prevents us from doing useful things" and "Allows doing a useful thing in a non-trivial manner". The latter I would call a difference in design decision and the former I would call a bug in the design.
>> Currently in python3 there's no way to get at environment variables that are not encoded in the system default encoding. My understanding is that this isn't a problem on Windows systems but on *nix this is a huge problem. Environment variables on *nix are a sequence of non-null bytes. These bytes are almost always "characters" but they do not have to be. Further, there is nothing that requires that the characters be in the same encoding; some of the characters could be in the UTF-8 character set while others are in latin-1, shift-jis, or big-5.
>
> To me, mixing encodings within a string is at least slightly insane. If by design, maybe even a 'design bug' ;-).

As an application level developer I echo your sentiment :-) I recognize, though, that *nix filesystem semantics were designed many years before unicode and the decision to treat filenames, environment variables, and so much else as bytes follows naturally from the C definition of a char. It's up to a higher level than the OS to decide how to display the bytes.
> [shell server and fileserver result in this insane PATH]
>> PATH=/bin:/usr/bin:/usr/local/bin:/mnt/\xe3\x83\x8d\xe3\x83\x83\xe3\x83\x88\xe3\x83\xaf\xe3\x83\xbc\xe3\x82\xaf/\x83v\x83\x8d\x83O\x83\x89\x83\x80
>
> I would think life would be ultimately easier if either the file server or the shell server automatically translated file names from jis and utf8 and back, so that the PATH on the *nix shell server is entirely utf8.
This is not possible because no part of the computer knows what the encoding is. To the computer, it's just a sequence of bytes. Unlike XML or the Windows filesystem (winfs? ntfs?), where the encoding is specified as part of the document or filesystem, there is nothing to tell what encoding the filenames are in.
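To make the point concrete, here is a small sketch (not part of the original report) that treats the PATH value above as raw bytes and tries to decode it. The codec chosen for each directory is an assumption supplied by a human who already knows which server produced which name; nothing in the bytes themselves records it.

```python
# The PATH from the example above, as the raw bytes a process receives.
raw_path = (
    b"/bin:/usr/bin:/usr/local/bin:"
    b"/mnt/\xe3\x83\x8d\xe3\x83\x83\xe3\x83\x88\xe3\x83\xaf\xe3\x83\xbc\xe3\x82\xaf"
    b"/\x83v\x83\x8d\x83O\x83\x89\x83\x80"
)

def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

# Decoding the whole value as UTF-8 fails because of the last component.
whole_as_utf8 = try_decode(raw_path, "utf-8")

# Decoding piece by piece works only if each piece's encoding is already known.
last_entry = raw_path.split(b":")[-1]
parts = last_entry.split(b"/")   # [b"", b"mnt", <utf-8 name>, <shift-jis name>]
network_dir = try_decode(parts[2], "utf-8")
program_dir = try_decode(parts[3], "shift_jis")
program_dir_as_utf8 = try_decode(parts[3], "utf-8")   # wrong guess: None
```

With the right guesses both directory names come out as readable katakana, but a server sitting between the two machines has no way to make those guesses reliably.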
> How would you ever display a mixture to users?
This is up to the application. My recommendation would be to keep the raw bytes (to access the file on the filesystem) and display the results of str(filename, errors='replace') to the user.
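A minimal sketch of that recommendation follows. The helper names `display_name` and `list_for_display` are hypothetical, and UTF-8 is assumed as the display encoding; the bytes are kept for every filesystem call and the decoded string is used only for output.

```python
import os

def display_name(raw_name, encoding="utf-8"):
    """Decode a bytes filename for display only.

    Bytes that are invalid in `encoding` become U+FFFD instead of raising.
    """
    return str(raw_name, encoding, errors="replace")

def list_for_display(directory=b"."):
    """Map each raw bytes name in `directory` to a printable string.

    Passing bytes to os.listdir() makes it return bytes names, so no entry
    is dropped just because it is not valid in the display encoding.
    """
    return {raw: display_name(raw) for raw in os.listdir(directory)}

# Typical use: keep `raw` for os.stat(), open(), and so on; show `shown`.
# for raw, shown in list_for_display(b"/some/dir").items():
#     print(shown)
```

The displayed string is lossy by design, so it should never be encoded again to get back to the file.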
> What if there were an ambiguous component that could be legally decoded more than one way?

The ambiguity is the reason that the fileserver and shell server can't automatically translate the filename: many encodings simply use all of the 2^8 byte values available in a C char, which makes any byte decodable in any one of those encodings. In the application, using only the raw bytes to access the file also prevents ambiguity, because the raw bytes reference only one file.
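The ambiguity can be illustrated with a single byte. In the sketch below the filename and the three codec names are only examples picked for illustration; each codec accepts the same bytes without complaint, so a successful decode says nothing about which encoding the file's creator meant.

```python
raw = b"caf\xe9"   # one filename, as stored on disk

# Each of these single-byte encodings assigns some character to byte 0xE9,
# so all three decodes succeed and none can be ruled out mechanically.
candidates = ["latin-1", "cp1251", "koi8-r"]
decoded = {enc: raw.decode(enc) for enc in candidates}

# The resulting texts differ, yet every one of them maps back to the same
# bytes, which is why only the bytes identify the file unambiguously.
round_trips = {enc: text.encode(enc) == raw for enc, text in decoded.items()}
```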
>> Now comes the problematic part. One of the users on the system wants to write a python3 program that needs to determine if a needed program is in the user's PATH. He tries to code it like this::
>>
>>     #!/usr/bin/python3.0
>>     import os
>>     # PATH is a single string, so split it into directories before iterating
>>     for directory in os.environ['PATH'].split(os.pathsep):
>>         programs = os.listdir(directory)
>>
>> That code raises a KeyError because python3 has silently discarded the PATH due to the shift-jis encoded path elements. Much more importantly, there's no way the programmer can handle the KeyError and actually get the PATH from within python.
>
> Have you tried os.system or os.popen or the subprocess module to run a native *nix command and get its response? On Windows

Sure, you can subprocess your way out of a lot of sticky situations since you're essentially delegating the task to a C routine. But there are drawbacks:
- You become dependent on an external program being available. What happens if your code is run in a chroot, for instance?
- Do we want anyone writing programs that access the environment on *NIX to have to discover this pattern themselves and implement it?
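For reference, the subprocess workaround under discussion might look like the sketch below. It is written for a current Python 3 rather than 3.0 (`subprocess.run` is a later addition), it assumes a `printenv` executable can be found by the child, which is exactly the first drawback listed, and the function name `path_via_subprocess` is made up for this example.

```python
import subprocess

def path_via_subprocess():
    """Ask an external `printenv` for PATH and return it as raw bytes.

    Returns None when the helper program cannot be run (for example in a
    minimal chroot) or when PATH is not set in the environment.
    """
    try:
        result = subprocess.run(
            ["printenv", "PATH"],
            stdout=subprocess.PIPE,      # bytes, since no text mode is requested
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        return None
    if result.returncode != 0:
        return None
    return result.stdout.rstrip(b"\n")
```

Even when it works, every program that needs the environment has to carry a helper like this around, which is the second drawback.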
As for wrapping this up in os.*, that isn't necessary -- the python3 interpreter already knows about the byte-oriented environment; it just isn't making it available to people programming in python.
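For readers on later releases: since Python 3.2 the `os` module provides `os.environb`, a bytes view of the environment on platforms where `os.supports_bytes_environ` is true, together with `os.fsencode()`. None of this existed in 3.0. The following is a sketch of the PATH search from earlier in this message written against those later APIs; the helper names `raw_path_entries` and `find_program` are hypothetical.

```python
import os

def raw_path_entries():
    """Return the PATH directories as bytes, without decoding them."""
    if os.supports_bytes_environ:
        raw = os.environb.get(b"PATH", b"")
    else:
        # Platforms whose environment is natively Unicode (e.g. Windows).
        raw = os.fsencode(os.environ.get("PATH", ""))
    return [entry for entry in raw.split(os.fsencode(os.pathsep)) if entry]

def find_program(name):
    """Return the bytes path of `name` in the first PATH entry holding it, or None."""
    raw_name = os.fsencode(name)
    for directory in raw_path_entries():
        try:
            programs = os.listdir(directory)   # bytes in, bytes out
        except OSError:
            continue                           # missing or unreadable directory
        if raw_name in programs:
            return os.path.join(directory, raw_name)
    return None
```

Because the directory names never pass through a decoder, a shift-jis element in PATH is handled the same way as any other.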
-Toshio