[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Baptiste Carvello baptiste13z at free.fr
Wed Apr 29 11:09:09 CEST 2009


Glenn Linderman wrote:

If there is going to be a required transformation from de novo strings to funny-encoded strings, then why not make one that people can actually see and compare and decode from the displayable form, by using displayable characters instead of lone surrogates?

The problem with your "escape character" scheme is that the meaning is lost with slicing of the strings, which is a very common operation.
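The slicing problem can be sketched in a few lines. The `escape_decode` helper and its "?XX" notation are hypothetical, made up here for illustration and not the exact scheme proposed; they also ignore the question of how a literal "?" would be escaped. The lone-surrogate side uses the `surrogateescape` error handler that PEP 383 eventually added in Python 3.1.

```python
# Hypothetical displayable-escape scheme, for illustration only:
# every non-ASCII byte becomes "?" followed by two hex digits.
def escape_decode(raw: bytes) -> str:
    return "".join(chr(b) if b < 0x80 else "?%02X" % b for b in raw)


raw = b"caf\xe9.txt"  # Latin-1 file name, not decodable as ASCII or UTF-8

escaped = escape_decode(raw)                        # 'caf?E9.txt'
surrogate = raw.decode("ascii", "surrogateescape")  # 'caf\udce9.txt'

# Cutting after five characters splits the three-character escape sequence,
# leaving a fragment that no longer identifies the byte 0xE9.
print(repr(escaped[:5]))                            # 'caf?E'

# The lone surrogate is a single code point, so a slice either keeps it
# whole or drops it whole, and the kept part still round-trips to bytes.
stem = surrogate[:4]
print(stem.encode("ascii", "surrogateescape"))      # b'caf\xe9'
```

With a multi-character escape, every slicing site would have to know about the escape syntax; with a one-code-point marker, ordinary slicing stays safe.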

I thought half-surrogates were illegal in well-formed Unicode. I confess to being weak in this area. By "legitimate" above I meant things like half-surrogates which, like quarks, should not occur alone?

"Illegal" just means violating the accepted rules. In this case, the accepted rules are those enforced by the file system (at the bytes or str API levels), and by Python (for the str manipulations). None of those rules outlaw lone surrogates. [...]
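As a quick check of the claim about str manipulations, here is a minimal sketch, assuming Python 3: a lone surrogate can be built, concatenated and sliced like any other character, and only a strict codec objects to it. The variable names are illustrative.

```python
lone = "\udce9"        # a lone low surrogate; Python accepts it in a str literal
name = "caf" + lone    # concatenation works as for any other character
sliced = name[3:]      # so does slicing

# Only strict encoding refuses it: UTF-8 has no encoding for surrogates.
strict_utf8_ok = True
try:
    name.encode("utf-8")
except UnicodeEncodeError:
    strict_utf8_ok = False

print(len(name), sliced == lone, strict_utf8_ok)
```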

Python could just as well specify that lone surrogates are illegal, since Unicode leaves their meaning undefined. If this rule is respected throughout the language, there is no ambiguity. It might be unrealistic on Windows, though.

This rule could even be specified only for strings that represent filesystem paths. Sure, they are the same type as other strings, but the programmer usually knows whether a given string is intended to be a path.
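One way such a rule might be applied at the application level is sketched below. Both function names are hypothetical, and the sketch assumes Python 3.3 or later, where a str is a sequence of code points, so any code point in U+D800..U+DFFF found in a str is necessarily a lone surrogate.

```python
def has_lone_surrogate(s: str) -> bool:
    """Return True if s contains any code point in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)


def require_clean_path(path: str) -> str:
    """Reject a path string built by the program if it contains a lone surrogate.

    Hypothetical policy: only strings that came from the file system
    are allowed to carry lone surrogates.
    """
    if has_lone_surrogate(path):
        raise ValueError("lone surrogate in path: %r" % path)
    return path


print(has_lone_surrogate("caf\u00e9.txt"))   # False: U+00E9 is an ordinary character
print(has_lone_surrogate("caf\udce9.txt"))   # True: U+DCE9 is a lone surrogate
```

A program would call `require_clean_path` only on strings it intends to use as paths, which matches the point that the programmer usually knows which strings those are.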

Baptiste


