[Python-3000] Are strs sequences of characters or disguised byte strings?
Guido van Rossum guido at python.org
Wed Oct 3 05:28:56 CEST 2007
String objects are arrays of code units. They can represent normalized and unnormalized Unicode text just as easily, and even invalid data, like half a surrogate and other illegal code units. It is up to the application (or perhaps at some point the library) to implement various checks and normalizations. AFAIK this is the same stance that Java and C# take -- the String types there don't concern themselves with the higher levels of Unicode standard compliance. (Though those languages probably have more library support than Python does -- perhaps someone can contribute something, like wrappers for ICU?)
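As a rough illustration of the paragraph above (a sketch added for clarity, not part of the original message, and written against current CPython 3 behaviour rather than 3.0a1): the string type stores whatever code points it is given, and any normalization is left to the caller. The helper name `same_text` is hypothetical.

```python
import unicodedata

composed = "\u00C7"      # LATIN CAPITAL LETTER C WITH CEDILLA, one code point
decomposed = "C\u0327"   # "C" followed by COMBINING CEDILLA, two code points

# The str type compares code points, so these two spellings are unequal.
assert composed != decomposed
assert (len(composed), len(decomposed)) == (1, 2)


def same_text(a, b, form="NFC"):
    """Hypothetical helper: compare two strings after normalizing both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# The application opts in to normalization explicitly.
assert same_text(composed, decomposed)

# Half a surrogate pair is accepted as string data, but it cannot be
# encoded as UTF-8, so the check surfaces only at the encoding boundary.
half = "\ud800"
assert len(half) == 1
try:
    half.encode("utf-8")
except UnicodeEncodeError:
    encodable = False
else:
    encodable = True
assert not encodable
```

NFC is used here only as an example; which normalization form is appropriate depends on the application.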
However, for identifiers occurring in source code, we do normalize before comparing them. PEP 3131 should explain this.
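The identifier case can be checked with a short sketch like the following (added for illustration; it assumes a current CPython 3 interpreter, where PEP 3131 specifies NFKC normalization of identifiers, and the dictionary name `namespace` is arbitrary):

```python
import unicodedata

namespace = {}

# The identifier is written in decomposed form: "C" + COMBINING CEDILLA.
exec("C\u0327 = 5", namespace)

# The compiler stores the name in its NFKC-normalized (composed) form.
composed_name = unicodedata.normalize("NFKC", "C\u0327")
assert composed_name == "\u00C7"
assert composed_name in namespace
assert "C\u0327" not in namespace

# Assigning through the composed spelling rebinds the same variable.
exec("\u00C7 = -7", namespace)
assert namespace["\u00C7"] == -7
```

Only names in source code are treated this way; string literals and string values are left untouched.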
--Guido
On 10/2/07, Mark Summerfield <mark at qtrac.eu> wrote:
In Python 3.0a1, exec() appears to normalize strings, but in other cases they don't appear to be normalized, and this leads to results that appear to be counter-intuitive in some cases, at least to me.
>>> c1 = "\u00C7"
>>> c2 = "C\u0327"
>>> c3 = "\u0043\u0327"
>>> c1, c2, c3
('\xc7', 'C\u0327', 'C\u0327')
>>> print(c1, c2)
Ç Ç

Clearly c1 and c2 are different at the byte level. But if we use them to create variables using exec(), Python appears to normalize them:

>>> dir()
['__builtins__', '__doc__', '__name__', 'c1', 'c2', 'c3']
>>> exec("C\u0327 = 5")
>>> dir()
['__builtins__', '__doc__', '__name__', 'c1', 'c2', 'c3', '\xc7']
>>> Ç
5
>>> exec("\u00C7 = -7")
>>> dir()
['__builtins__', '__doc__', '__name__', 'c1', 'c2', 'c3', '\xc7']
>>> Ç
-7

This seems to be the right behaviour to me, since from the point of view of a programmer, Ç is the name of the variable, no matter what the underlying byte encoding used to represent the variable's name.

>>> print(c1, c2)
Ç Ç
>>> c1.encode("utf8") == c2.encode("utf8")
False

This is what I'd expect, since here I'm comparing the actual bytes. But when I compare them as strings I really expect them to be compared as sequences of characters (in a human sense), so this:

>>> c1 == c2
False

seems counter-intuitive to me. It is easy to fix:

>>> from unicodedata import normalize
>>> normalize("NFKD", c1) == normalize("NFKD", c2)
True

but isn't it asking a lot of Python users to use normalize() whenever they want to perform such a basic operation as string comparison?

Another issue that arises is that you can end up with duplicate dictionary keys and set elements. (The duplication is in human terms, in byte terms the keys/set elements differ of course):

>>> d = {c1: 1, c2: 2}
>>> d
{'C\u0327': 2, '\xc7': 1}
>>> for k, v in d.items():
...     print(k, v)
...
Ç 2
Ç 1

I think this is surprising.

>>> s = {c1, c2}
>>> s
{'C\u0327', '\xc7'}
>>> for x in s:
...     print(x)
...
Ç
Ç

And the same result applies to sets of course.
I don't know what the performance costs would be for always normalizing strings, but it seems to me that if strings are not normalized, then they are really being treated as byte strings thinly disguised as strings rather than as true sequences of characters whose byte representation is a detail that programmers can ignore (unless they choose to explicitly decode).

--
Mark Summerfield, Qtrac Ltd., www.qtrac.eu
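For the duplicate-key case described in the quoted message, one possible application-level workaround is to normalize keys before storing them. This is only a sketch under that assumption, not a feature of dict or set; the helper name `nfc` is hypothetical and NFC is an arbitrary choice of form.

```python
import unicodedata


def nfc(text):
    """Hypothetical helper: return text in Normalization Form C."""
    return unicodedata.normalize("NFC", text)


c1 = "\u00C7"    # composed
c2 = "C\u0327"   # decomposed

# Without normalization the two spellings are distinct keys.
raw = {c1: 1, c2: 2}
assert len(raw) == 2

# Normalizing each key on insertion collapses them into one entry;
# the later value overwrites the earlier one.
merged = {}
for key, value in raw.items():
    merged[nfc(key)] = value
assert merged == {"\u00C7": 2}

# The same idea applies to sets.
assert {nfc(c1), nfc(c2)} == {"\u00C7"}
```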
--
--Guido van Rossum (home page: http://www.python.org/~guido/)