[Python-Dev] Multilingual programming article on the Red Hat Developer blog

Stephen J. Turnbull stephen at xemacs.org
Wed Sep 17 02:21:56 CEST 2014

R. David Murray writes:

> > Do what, exactly? As I understand you, you treat the unknown bytes as completely opaque, not representing any characters at all. Which is what I'm saying: those are not characters.
>
> Yes. I thought you were saying that one could not treat the string with smuggled bytes as if it were a string.

Guido's mantra is something like "Python's str doesn't contain characters or even code points[1], it contains code units." Implying that dealing with characters (or the grapheme globs that occasionally raise their ugly heads here) is an issue for higher-level facilities than str to deal with.

The point being that

> Basically, we are pretending that the each smuggled byte is single character

is something of a misstatement (good enough for the present purpose of discussing email, but not good enough for the general case of understanding how this is supposed to work when porting the construct to other Python implementations), while

> for string parsing purposes...but they don't match any of our parsing constants.

is precisely Pythonically correct. You might want to add "because all parsing constants contain only valid characters by construction."
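To make the parsing point concrete, here is a minimal sketch. It assumes the bytes are smuggled with the "surrogateescape" error handler (PEP 383), which this message does not name explicitly; the header line and the helper is_smuggled are made up for illustration.

```python
# Assumption: undecodable bytes are smuggled via errors="surrogateescape".
raw = b"Subject: caf\xe9\r\n"  # 0xE9 is not valid ASCII
text = raw.decode("ascii", errors="surrogateescape")

# Parsing constants are built from valid characters only.
COLON, CRLF = ":", "\r\n"


def is_smuggled(ch):
    """True if ch is a lone surrogate that stands for a raw byte 0x80-0xFF."""
    return 0xDC80 <= ord(ch) <= 0xDCFF


# Ordinary str methods work on the whole string, smuggled byte included.
ends_with_crlf = text.endswith(CRLF)
name, _, value = text.partition(COLON)
value = value.strip()

# The smuggled byte rides along but never equals a parsing constant.
smuggled = [ch for ch in value if is_smuggled(ch)]

# Encoding with the same handler restores the original bytes.
round_trip = text.encode("ascii", errors="surrogateescape")
```

The str is split on ":" and stripped like any other, and the opaque code unit is carried through untouched until it is encoded back to bytes.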

> [*] I worried a lot that this was re-introducing the bytes/string problem from python2.

It isn't, because the bytes/str problem was that given a str object out of context you could not tell whether it was a binary blob or text, and if text, you couldn't tell if it was external encoded text or internal abstract text.

That is not true here because the representations of characters vs. smuggled bytes in str are disjoint sets.
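A rough sketch of why the two sets are disjoint, again assuming the "surrogateescape" handler (an assumption, not something stated in this message). The helper split_smuggled is hypothetical; it relies on the handler mapping byte 0xNN to code point U+DCNN.

```python
def split_smuggled(text):
    """Separate real characters from smuggled bytes in a str.

    Assumes bytes were smuggled with errors="surrogateescape", so each
    raw byte 0x80-0xFF appears as one code point in U+DC80..U+DCFF.
    """
    chars = []
    raw_bytes = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            raw_bytes.append(cp - 0xDC00)  # recover the original byte value
        else:
            chars.append(ch)
    return "".join(chars), bytes(raw_bytes)


text = b"na\xefve \xff".decode("ascii", errors="surrogateescape")
characters, smuggled = split_smuggled(text)

# A strict codec refuses lone surrogates, so they cannot pass for text.
try:
    "\udcef".encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# Bytes that look like an encoded surrogate are not decoded into one;
# each undecodable byte is smuggled separately instead.
lookalike = b"\xed\xb3\xa9".decode("utf-8", errors="surrogateescape")
```

Given any str out of context, a check like this tells text from smuggled bytes, which the Python 2 str could not do.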

Footnotes:
[1] In Unicode terminology, a code unit is the smallest computer object that can represent a character (this is uniquely and sanely defined for all real Unicode transformation formats aka UTFs). A code point is an integer 0 - (17*256*256-1), i.e. up to 0x10FFFF, that can represent a character, but many code points such as surrogates and 0xFFFF are defined to be non-characters. Characters are those code points that may be assigned an interpretation as a character, including undefined characters (private space and reserved).
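The terms in the footnote can be checked from CPython 3.3 or later with the standard sys and unicodedata modules. This is only an illustrative sketch; the sample code points are arbitrary choices.

```python
import sys
import unicodedata

# 17 planes of 256*256 code points each.
MAX_CODE_POINT = 17 * 256 * 256 - 1

# General categories: Cs = surrogate, Cn = unassigned (includes
# noncharacters such as U+FFFF), Co = private use, Lu = uppercase letter.
cat_surrogate = unicodedata.category("\ud800")
cat_nonchar = unicodedata.category("\uffff")
cat_private = unicodedata.category("\ue000")
cat_letter = unicodedata.category("A")

# One code point outside the BMP needs a different number of code units
# in each UTF.
emoji = "\U0001F600"
utf8_units = len(emoji.encode("utf-8"))            # 8-bit code units
utf16_units = len(emoji.encode("utf-16-le")) // 2  # 16-bit code units
utf32_units = len(emoji.encode("utf-32-le")) // 4  # 32-bit code units
```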
