[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Simon Cross hodgestar+pythondev at gmail.com
Fri Apr 24 09:59:03 CEST 2009


On Wed, Apr 22, 2009 at 8:50 AM, "Martin v. Löwis" <martin at v.loewis.de> wrote:

> For Python 3, one proposed solution is to provide two sets of APIs: a byte-oriented one, and a character-oriented one, where the character-oriented one would be limited to not being able to represent all data accurately. Unfortunately, for Windows, the situation would be exactly the opposite: the byte-oriented interface cannot represent all data; only the character-oriented API can. As a consequence, libraries and applications that want to support all user data in a cross-platform manner have to accept mish-mash of bytes and characters exactly in the way that caused endless troubles for Python 2.x.

Is the second part of this actually true? My understanding may be flawed, but surely all Unicode data can be converted to and from bytes using UTF-8? Obviously not all byte sequences are valid UTF-8, but this doesn't prevent one from creating an arbitrary Unicode string using "utf-8 bytes".decode("utf-8"). Given this, can't people who must have access to all files / environment data just use the bytes interface?
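
To make the asymmetry concrete, here is a minimal sketch, assuming Python 3 syntax (str is Unicode, bytes literals are written b"..."). The sample values are made up for illustration and do not come from any real file system.

```python
# A well-formed Unicode string round-trips through UTF-8 unchanged.
text = "L\u00f6wis \u2603"          # illustrative sample text
encoded = text.encode("utf-8")
assert encoded.decode("utf-8") == text

# The reverse does not hold: not every byte sequence is valid UTF-8.
raw = b"caf\xe9"                    # hypothetical Latin-1 encoded file name
try:
    raw.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    # 0xE9 starts a multi-byte sequence that is never completed.
    decoded_ok = False
```

So character data can be pushed through a bytes interface, while going from arbitrary bytes to characters can fail.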

Disclosure: My gut reaction is that the solution described in the PEP is a hack, but I'm hardly a character encoding expert. My feeling is that the correct solution is to either standardise on the bytes interface as the lowest common denominator, or to add a Path type (and I guess an EnvironmentalData type) and use the new type to attempt to hide the differences.

Schiavo Simon


