
The whole purpose of PEP 383 is to send the exact same bytes that were
read from the OS back to the OS => violating (2) (for whatever the
apparent system file-encoding is, not limited to UTF-8),

It's fine to read a file name from a file system and write the same file name back as the same raw byte sequence. I don't have a problem with that; it's not quite right, but it's harmless.
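For concreteness, here is a minimal sketch of that harmless round trip. It assumes the error handler is exposed under the name "surrogateescape" (the name the PEP uses for what this thread calls utf-8b) and a Python 3 interpreter that ships it; the file name is made up for illustration.

```python
# Hypothetical file name: Latin-1 bytes, which are not valid UTF-8.
raw_name = b"caf\xe9.txt"

# Decoding maps the undecodable byte 0xE9 to the lone surrogate U+DCE9.
decoded = raw_name.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns U+DCE9 back into the byte 0xE9.
round_tripped = decoded.encode("utf-8", errors="surrogateescape")

print(repr(decoded))
print(round_tripped == raw_name)
```

As long as the string goes nowhere else, the bytes handed back to the OS are the ones that came from it.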

The problem with this PEP is that the malformed Unicode it produces can end up in so many other places: as file names on another file system, in string processing libraries, in text files, in databases, in user interfaces, etc. Some of those destinations will use the utf-8b decoder, so they will get byte sequences that never could occur before and that are illegal under Unicode.
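A rough sketch of one way this can surface, again assuming the handler name "surrogateescape" and a Python 3 interpreter; the string and variable names are made up. A strict encoder refuses the string outright, while a lenient one ("surrogatepass" stands in here for any code that serializes surrogates as-is) emits bytes that a strict UTF-8 decoder rejects.

```python
# Hypothetical string that escaped from file-name handling into other code.
leaked = b"caf\xe9.txt".decode("utf-8", errors="surrogateescape")

# A strict UTF-8 encoder (text file, database driver, ...) rejects it.
try:
    leaked.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# A lenient encoder writes the lone surrogate out as three bytes.
lenient_bytes = leaked.encode("utf-8", errors="surrogatepass")

# Those bytes are not well-formed UTF-8, so a strict decoder rejects them.
try:
    lenient_bytes.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

print(strict_encode_ok, lenient_bytes, strict_decode_ok)
```

Which of these outcomes a given destination produces depends on how it was written, which is the point: the behavior is not something the PEP controls.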

Nobody knows what will happen. And, yes, Martin is proposing that this is the default behavior.

There are several other issues that are unresolved: utf-8b makes some current practices illegal; for example, it might break CESU-8 encodings. Also, what are Jython and IronPython supposed to do on UNIX? Can they implement these semantics at all?

and that has overwhelmingly popular support.

I think people don't fully understand the tradeoffs. I certainly don't. Although there is a slight benefit, there are unknown and potentially large costs. We'd be changing Python's entire Unicode string behavior for the sake of one use case. Since our uses of Python actually involve a lot of Unicode, I am wary of having malformed Unicode crop up legally in Python code.

And that's why I think this proposal should be shelved for a while, until people have had more time to try to understand the issues and to come up with alternative proposals. Once this is adopted and implemented in CPython, Python is stuck with it forever.

Tom