[Python-Dev] Python-3.0, unicode, and os.environ
Steve Holden steve at holdenweb.com
Thu Dec 11 13:13:49 CET 2008
Ulrich Eckhardt wrote:
> On Wednesday 10 December 2008, Adam Olsen wrote:
>> On Wed, Dec 10, 2008 at 3:39 AM, Ulrich Eckhardt <eckhardt at satorlaser.com> wrote:
>>> On Tuesday 09 December 2008, Adam Olsen wrote:
>>>> The only thing separating this from a bikeshed discussion is that a bikeshed has many equally good solutions, while we have no good solutions. Instead we're trying to find the least-bad one. The unicode/bytes separation is pretty close to that. Adding a warning gets even closer. Adding magic makes it worse.
>>>
>>> Well, I see two cases:
>>> 1. Converting from an uncertain representation to a known one.
>>> 2. Converting from a known representation to a known one.
>>
>> Not quite:
>> 1. Using a garbage file name locally (within a single process, not talking to any libs)
>> 2. Using a unicode filename everywhere (libs, saved to config files, displayed to the user, etc.)
>
> I think there is some misunderstanding. I was referring to conversions and whether it is good to perform them implicitly. For that, I saw the above two cases.
>
>> On linux the bytes/unicode separation is perfect for this. You decide which approach you're using and use it consistently. If you mess up (mixing bytes and unicode) you'll consistently get an error.
>>
>> We currently don't follow this model on windows, so a garbage file name gets passed around as if it was unicode, but fails when passed to a lib, saved to a config file, is displayed to a user, etc.
>
> I'm not sure I agree with this. Facts I know are:
> 1. On POSIX systems, there is no reliable encoding for filenames while the system APIs use char/byte strings.
> 2. On MS Windows, the encoding for filenames is Unicode/UTF-16.
>
> Returning Unicode strings from readdir() is wrong because it can't handle the case 1 above. Returning byte strings is wrong because it can't handle case 2 above because it gives you useless roundtrips from UTF-16 to either UTF-8 or, worst case, to the locale-dependent MBCS. Returning something different depending on the system is also broken because that would make Python code that uses this function and assumes a certain type unportable.
>
> Note that this doesn't get much better if you provide a separate readdirb() API or one that simply returns a byte string or Unicode string depending on its argument. It just shifts the brokenness from readdir() to the code that uses it, unless this code makes a distinction between the target systems. Since way too many programmers are not aware of the problem, they will not handle these systems differently, so code will become non-portable.
>
> What I'd just like some feedback on is the approach to return a distinct type (neither a byte string nor a Unicode string) from readdir(). In order to use this, a programmer will have to convert it explicitly, otherwise e.g. printing it will just produce <envstring at 0x01234567>. This will immediately bump each programmer with their heads on the issue of unknown encodings and they will have to make the application-specific choice whether an approximation of the filename, an exception or ignoring the file is the right choice. Also, it presents the options for doing this conversion in a single class, which I personally find much better than providing overloads for hundreds of functions.
>
> Sorry for ranting, but I'm a bit confused and desperate, because either I'm unable to explain what I mean or I'm really not understanding something that everybody else here seems to agree upon. I just know that using a distinct path type has helped me in C++ in the past, and I don't see why it shouldn't in Python.

Seems to me this just threatens to add to the confusion. If you know what your filesystem produces, you can take the appropriate action to convert it into a type that makes sense to the user. If you don't, then at least if you have the string in its bytes form you can re-present it to the filesystem to manipulate the file. What are we supposed to do with the "special type"?
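
The byte-string round trip mentioned above can be illustrated with a minimal sketch. It assumes a current Python 3 (os.fsencode() is not available in Python 3.0) and a filesystem that accepts arbitrary byte names, as most Linux filesystems do. The helper name roundtrip_bytes_name and the sample file names are hypothetical, chosen only for illustration.

```python
import os
import tempfile


def roundtrip_bytes_name(directory, raw_name):
    """Create a file from a bytes name, find it in a bytes listing, reopen it.

    Returns the name exactly as the OS reported it. Nothing is decoded,
    so the encoding of the name never has to be known.
    """
    raw_dir = os.fsencode(directory)
    with open(os.path.join(raw_dir, raw_name), "wb") as handle:
        handle.write(b"payload")

    # A bytes argument makes os.listdir() return bytes entries.
    listed = [entry for entry in os.listdir(raw_dir) if entry == raw_name]

    # Hand the untouched bytes straight back to the filesystem.
    with open(os.path.join(raw_dir, listed[0]), "rb") as handle:
        assert handle.read() == b"payload"
    return listed[0]


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical name: Latin-1 bytes that are not valid UTF-8.
    # Some platforms (for example macOS or Windows) may reject it,
    # so fall back to a plain ASCII name there.
    try:
        name = roundtrip_bytes_name(tmp, b"caf\xe9.txt")
    except (OSError, UnicodeError):
        name = roundtrip_bytes_name(tmp, b"cafe.txt")
    print(type(name).__name__, name)
```

The file can be reopened even though the program never learns which encoding, if any, the name was written in. Only displaying the name to a user would force a decision about decoding.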
regards
 Steve

Steve Holden +1 571 484 6266 +1 800 494 3119
Holden Web LLC http://www.holdenweb.com/