[Python-Dev] Multilingual programming article on the Red Hat Developer blog

Antoine Pitrou solipsis at pitrou.net
Wed Sep 17 11:37:43 CEST 2014


Seriously, can this discussion move somewhere else? It has no place on python-dev.

Thank you

Antoine.

On Wed, 17 Sep 2014 18:56:02 +1000 Steven D'Aprano <steve at pearwood.info> wrote:

On Wed, Sep 17, 2014 at 09:21:56AM +0900, Stephen J. Turnbull wrote:

> Guido's mantra is something like "Python's str doesn't contain characters or even code points[1], it contains code units."

But is that true? If it were true, I would expect to be able to make Python text strings containing code units that aren't code points, e.g. something like "\U12340000" or chr(0x12340000) should work, but neither does. As far as I can tell, there is no way to build a string containing items which aren't code points.

I don't think it is useful to say that strings contain code units, more that they are made up from code units. Code units are the implementation: 16-bit code units in narrow builds, 32-bit code units in wide builds, and either 8-, 16- or 32-bit code units in Python 3.3 and beyond. (I don't know of any Python implementation which uses UTF-8 internally, but if there were one, it would use 8-bit code units.)

It isn't very useful to say that in Python 3.3 the string "A" contains the 8-bit code unit 0x41. That's conflating two different levels of explanation (the high-level interface and the underlying implementation) and potentially leads to user confusion like

    # 8-bit code units are bytes, right?
    assert b'\x41' in "A"

which is Not Even Wrong.

http://rationalwiki.org/wiki/Not_even_wrong

I think it is correct to say that Python strings are sequences of Unicode code points U+0000 through U+10FFFF. There are no other restrictions, e.g. strings can contain surrogates, noncharacters, or nonsensical combinations of code points such as a U+0300 COMBINING GRAVE ACCENT combined with U+000A (newline).
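The claims above are easy to check interactively. What follows is a minimal sketch, assuming CPython 3.3 or later; the variable names are illustrative only and do not come from the thread.

```python
# Code points run from U+0000 to U+10FFFF; chr() rejects anything larger.
try:
    chr(0x12340000)
    chr_error = None
except ValueError as exc:
    chr_error = exc

# The same value written as a string escape is rejected when the literal
# is compiled, so such a string can never be built this way either.
try:
    compile(r'"\U12340000"', "<example>", "eval")
    escape_error = None
except SyntaxError as exc:
    escape_error = exc

# Surrogates, noncharacters and nonsensical combinations are all accepted:
# each one is simply a code point in the allowed range.
lone_surrogate = "\ud800"
noncharacter = "\uffff"
accent_on_newline = "\n\u0300"  # U+000A followed by COMBINING GRAVE ACCENT

# Code units only become visible once an encoding is chosen. The same
# single code point needs a different number of units in each UTF.
astral = "\U0001F600"
utf8_units = len(astral.encode("utf-8"))            # 8-bit code units
utf16_units = len(astral.encode("utf-16-le")) // 2  # 16-bit code units
utf32_units = len(astral.encode("utf-32-le")) // 4  # 32-bit code units

# Mixing the two levels of explanation does not give a wrong answer;
# it gives no answer at all, because bytes are not items of a str.
try:
    b"\x41" in "A"
    mixing_error = None
except TypeError as exc:
    mixing_error = exc
```

At the str level every one of these strings is a sequence of code points, and the 8-, 16- or 32-bit storage chosen by the interpreter is never observable from `len()` or indexing.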

> Implying that dealing with characters (or the grapheme globs that occasionally raise their ugly heads here) is an issue for higher-level facilities than str to deal with.

Agreed that Python doesn't offer a string type based on graphemes, and that such a facility belongs as a high-level library, not a built-in type.

Also agreed that talking about characters is sloppy. Nevertheless, for English speakers at least, "code point = character" isn't too awful a first approximation.

> The point being that
>
> > Basically, we are pretending that each smuggled byte is a single
> > character
>
> is something of a misstatement (good enough for present purpose of discussing email, but not good enough for the general case of understanding how this is supposed to work when porting the construct to other Python implementations), while
>
> > for string parsing purposes...but they don't match any of our
> > parsing constants.
>
> is precisely Pythonically correct. You might want to add "because all parsing constants contain only valid characters by construction."

I don't understand what you are trying to say here.

> > [*] I worried a lot that this was re-introducing the bytes/string
> > problem from python2.
>
> It isn't, because the bytes/str problem was that given a str object out of context you could not tell whether it was a binary blob or text, and if text, you couldn't tell if it was external encoded text or internal abstract text.
>
> That is not true here because the representations of characters vs. smuggled bytes in str are disjoint sets.

Nor am I sure what you are trying to say here.

> Footnotes:
> [1] In Unicode terminology, a code unit is the smallest computer object that can represent a character (this is uniquely and sanely defined for all real Unicode transformation formats aka UTFs). A code point is an integer 0 - (17*256*256-1) that can represent a character, but many code points such as surrogates and 0xFFFF are defined to be non-characters.

Actually not quite. "Noncharacter" is concretely defined in Unicode, and there are only 66 of them, many fewer than the surrogate code points alone. Surrogates are reserved, not noncharacters.

http://www.unicode.org/glossary/#surrogate_code_point
http://www.unicode.org/faq/private_use.html#nonchar1

It is wrong to talk about "surrogate characters", but perhaps you mean to say that surrogates (by which I understand you to mean surrogate code points) are "not human-meaningful characters", which is not the same thing as a Unicode noncharacter.

> Characters are those code points that may be assigned an interpretation as a character, including undefined characters (private space and reserved).

So characters are code points which are characters, including undefined characters? :-)

http://www.unicode.org/glossary/#character
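The arithmetic behind the footnote and the reply can be checked with a short sketch. The helper names below are hypothetical. The last part assumes that "smuggled bytes" refers to the surrogateescape error handler of PEP 383, which the quoted text does not say explicitly.

```python
import unicodedata

# 17 planes of 256*256 code points each: U+0000 through U+10FFFF.
CODE_POINT_COUNT = 17 * 256 * 256


def is_surrogate(cp):
    """True for the surrogate code points U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


surrogate_count = sum(is_surrogate(cp) for cp in range(CODE_POINT_COUNT))
noncharacter_count = sum(is_noncharacter(cp) for cp in range(CODE_POINT_COUNT))

# The Unicode database keeps the two groups apart as well:
# surrogates are category Cs, noncharacters are unassigned (Cn).
surrogate_category = unicodedata.category(chr(0xD800))
noncharacter_category = unicodedata.category(chr(0xFFFF))

# Assumption: smuggling via surrogateescape. An undecodable byte 0xNN
# becomes the lone surrogate U+DCNN, which strict decoding of valid
# text never produces, so smuggled bytes and real characters are disjoint.
raw = b"caf\xe9"
text = raw.decode("ascii", errors="surrogateescape")
smuggled = [ch for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]
round_trip = text.encode("ascii", errors="surrogateescape")
```

Here `surrogate_count` comes to 2048 against a `noncharacter_count` of 66, which matches the reply's point that the noncharacters are far fewer than the surrogates alone.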


