Issue 10568: Python and the Unicode Character Database

Two recently reported issues brought to light the fact that the Python language definition is closely tied to character properties maintained by the Unicode Consortium. [1,2] For example, when Python switches to Unicode 6.0.0 (planned for the upcoming 3.2 release), we will gain two additional characters that Python can use in identifiers. [3]

With Python 3.1:

Traceback (most recent call last):
  File "", line 1, in
  File "", line 1
    ೱ = 1
    ^
SyntaxError: invalid character in identifier

but with Python 3.2a4:

1
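To see which side of this change a given interpreter falls on, one can query the bundled database version and the character's properties directly. This is only a minimal sketch: `identifier_report` is a hypothetical helper name, and U+0CF1 is used on the assumption that it is the character shown in the sessions above.

```python
import unicodedata


def identifier_report(ch):
    """Summarize the properties that decide whether `ch` can start an identifier.

    Hypothetical helper for illustration only; not part of any library.
    """
    return {
        # Version of the Unicode Character Database this interpreter ships with.
        "unidata_version": unicodedata.unidata_version,
        # General category, e.g. "Lo" (Letter, other) or "So" (Symbol, other).
        "category": unicodedata.category(ch),
        # Whether the one-character string is accepted as an identifier.
        "is_identifier": ch.isidentifier(),
    }


if __name__ == "__main__":
    # U+0CF1 KANNADA SIGN JIHVAMULIYA
    print(identifier_report("\u0cf1"))
```

On an interpreter built against Unicode 6.0.0 or later, the category is reported as a letter and the character is accepted; an older database classifies it as a symbol and rejects it.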

Of course, the likelihood is low that this change will affect any user, but the change in str.isspace() reported in [1] is likely to cause some trouble:

[u'A', u'B']

[u'A\u200bB']
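The two results above differ only in whether U+200B (ZERO WIDTH SPACE) counts as whitespace. A program that wants splitting behavior independent of the UCD version can spell out the separator set itself. The sketch below assumes Python 3; `split_ascii_whitespace` is a hypothetical helper, not an existing API.

```python
import re

# The six ASCII whitespace characters, written out so that the result
# does not depend on which Unicode database the interpreter was built with.
_ASCII_WS = re.compile(r"[ \t\n\r\f\v]+")


def split_ascii_whitespace(text):
    """Split `text` on runs of ASCII whitespace only (hypothetical helper)."""
    return [part for part in _ASCII_WS.split(text) if part]


if __name__ == "__main__":
    sample = "A\u200bB C"
    # str.split() with no arguments follows str.isspace(), i.e. the UCD.
    print(sample.split())
    # The explicit version treats U+200B as ordinary text.
    print(split_ascii_whitespace(sample))
```

The trade-off is that genuinely Unicode-aware splitting (for example on U+2003 EM SPACE) is lost; whether that is acceptable depends on the input the program expects.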

While we have little choice but to follow the UCD in defining str.isidentifier(), I think Python can promise users more stability in what it treats as space or as a digit in its builtins. For example, I don't think that supporting

1234.56

is more important than assuring users that once their program has accepted some text as a number, they can assume that the text is ASCII.
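Until the builtins make such a promise, a program that needs the ASCII guarantee has to enforce it itself. A minimal sketch, assuming Python 3, where float() accepts any Unicode decimal digits; `parse_ascii_float` is a hypothetical wrapper name, and the Arabic-Indic digits are just one example of non-ASCII input.

```python
def parse_ascii_float(text):
    """Parse `text` as a float, rejecting any non-ASCII input.

    Hypothetical wrapper for illustration; float() alone would accept
    decimal digits from other scripts.
    """
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        raise ValueError("non-ASCII input rejected: %r" % (text,))
    return float(text)


if __name__ == "__main__":
    arabic_indic = "\u0661\u0662\u0663\u0664.\u0665\u0666"
    # The builtin accepts the Arabic-Indic digits as a number.
    print(float(arabic_indic))
    # The wrapper refuses them, so accepted text is known to be ASCII.
    try:
        parse_ascii_float(arabic_indic)
    except ValueError as exc:
        print(exc)
```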

[1] http://bugs.python.org/issue10567
[2] http://bugs.python.org/issue10557
[3] http://www.unicode.org/versions/Unicode6.0.0/#Database_Changes