[Python-Dev] Python and the Unicode Character Database

"Martin v. Löwis" martin at v.loewis.de
Thu Dec 2 21:23:41 CET 2010


Then these users should speak up and indicate their need, or somebody should speak up and confirm that there are users who actually want '١٢٣٤.٥٦' to denote 1234.56. To my knowledge, there is no writing system in which '١٢٣٤.٥٦e4' means 12345600.0.

I'm not sure what you're after here.

That the current float() constructor accepts tons of bogus character strings as numbers, and that it should stop doing so.
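For readers who want to reproduce the behaviour under discussion, here is a small sketch. It is not part of the original message, and it assumes a CPython 3 interpreter, where float() and int() map Unicode decimal digits to their ASCII values before parsing. The variable names are illustrative only.

```python
import unicodedata

# The Arabic-Indic string from the message, written with escapes:
# digits 1 2 3 4, an ASCII full stop, then digits 5 6.
text = '\u0661\u0662\u0663\u0664.\u0665\u0666'

# Each non-ASCII character here is a Unicode decimal digit.
digit_values = [unicodedata.decimal(ch) for ch in text if ch != '.']

value = float(text)           # parsed as if the digits were ASCII
scaled = float(text + 'e4')   # an ASCII exponent marker is accepted too
mixed = int('1\u06623')       # digits from different scripts can be mixed

print(digit_values, value, scaled, mixed)
```

The last line is the kind of input the complaint is about: a string that mixes ASCII and Arabic-Indic digits is still accepted as a number.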

The decision to add this support was deliberate, based on the desire to support as many of the nice features of Unicode in Python as we could. At least that was what was driving me at the time.

At the time, this may have been the right thing to do. With the experience gained, we should now decide to revert this particular aspect.

Some references you may want to read up on:

http://en.wikipedia.org/wiki/Numbers_in_Chinese_culture
http://en.wikipedia.org/wiki/Vietnamese_numerals
http://en.wikipedia.org/wiki/Korean_numerals
http://en.wikipedia.org/wiki/Japanese_numerals

I don't question that people use non-ASCII characters to denote numbers. I claim that the specific support in Python for that has no connection to reality. I further claim that the use of non-ASCII numbers is a local convention, and that if you provide a library to parse numbers, users (of that library) will somehow have to specify which notational convention(s) are reasonable for the input they have.
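As a rough illustration of what "specifying the notational convention" could look like, here is a minimal sketch. The function parse_decimal and its parameters zero and decimal_sep are hypothetical names, not an existing Python API. It relies on the Unicode property that the ten decimal digits of a script occupy consecutive code points starting at that script's zero.

```python
import unicodedata

def parse_decimal(text, zero="0", decimal_sep="."):
    """Parse text that uses digits from exactly one script.

    zero is the digit zero of the expected script and decimal_sep is
    the expected decimal separator (both hypothetical parameters).
    """
    if unicodedata.decimal(zero, None) != 0:
        raise ValueError("zero must be a digit zero of some script")
    base = ord(zero)
    ascii_chars = []
    for ch in text:
        offset = ord(ch) - base
        if ch == decimal_sep:
            ascii_chars.append(".")
        elif 0 <= offset <= 9:
            ascii_chars.append(str(offset))
        else:
            # Anything outside the declared convention is rejected,
            # including digits from other scripts.
            raise ValueError(f"unexpected character {ch!r}")
    return float("".join(ascii_chars))

# Arabic-Indic digits with U+066B ARABIC DECIMAL SEPARATOR.
arabic = "\u0661\u0662\u0663\u0664\u066b\u0665\u0666"
print(parse_decimal(arabic, zero="\u0660", decimal_sep="\u066b"))
```

With this shape, a mixed-script string such as "12\u0663" is rejected when the caller asked for Arabic-Indic digits, because the convention is stated by the caller and not guessed by the parser.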

Even MS Office supports them:

http://languages.siuc.edu/Chinese/LanguageSettings.html

That's printing, though, not parsing.

Notice that Python does not currently support printing numbers in other scripts - even though this may actually be more useful than parsing.
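Printing in another script can be approximated by hand with str.translate. The sketch below is only an illustration: format_arabic_indic is a hypothetical helper, and the choice of U+066B as the decimal separator is an assumption about the desired convention, not something Python provides.

```python
# Map ASCII digits and the full stop to Arabic-Indic digits and
# U+066B ARABIC DECIMAL SEPARATOR (an assumed convention).
ARABIC_INDIC = str.maketrans(
    "0123456789.",
    "\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669\u066b",
)

def format_arabic_indic(value, ndigits=2):
    """Format a number with a fixed number of decimals, then swap digits."""
    return format(value, f".{ndigits}f").translate(ARABIC_INDIC)

formatted = format_arabic_indic(1234.56)
print(formatted)
```

This only swaps characters; grouping, sign placement, and other locale rules would still need to be handled separately.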

Note that the support in float() (and the other numeric constructors) to work with Unicode code points was explicitly added when Unicode support was added to Python and has been available since Python 1.6.

That doesn't necessarily make it useful. Alexander's complaint is that it makes Python unstable (i.e. changing as the UCD changes).

If that were true, then all Unicode database (UCD) changes would make Python unstable.

That's indeed the case - they do (see the recent bug report on white space processing). However, any change makes Python unstable (in the sense that it can potentially break existing applications), and, in many cases, the risk of breaking something is well worth it.

In the case of number parsing, I think Python would be better if float() rejected non-ASCII strings, and any support for such parsing should be redone correctly in a different place (preferably along with printing of numbers).
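Until such a change is made, code that wants the stricter behaviour can wrap float() itself. The following is a minimal sketch with a hypothetical helper name, ascii_float; it assumes Python 3.7 or later for str.isascii().

```python
def ascii_float(text):
    """Like float(), but reject any string containing non-ASCII characters."""
    if not isinstance(text, str):
        raise TypeError("ascii_float() expects a str")
    if not text.isascii():
        raise ValueError(f"non-ASCII input rejected: {text!r}")
    return float(text)

print(ascii_float("1234.56"))
```

The wrapper changes nothing about how ASCII input is parsed; it only refuses input that the built-in constructor would otherwise convert silently.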

Most certainly it is: the documentation is either underspecified, or deviates from the implementation (when taking the most plausible interpretation). This is the very definition of "bug".

The implementation is not a bug and neither was this a bug in the 2.x series of the Python documentation.

Of course the 2.x documentation is wrong, in that it is severely underspecified, and the most straightforward interpretation of the specific wording gives an incorrect impression of the implementation.

The Python 3.x docs apparently introduced a reference to the language spec which is clearly not capturing the wealth of possible inputs.

Right - but only because the 2.x documentation already suggested that the supported syntax matches the literal syntax - as that's the most natural thing to assume.

Regards,
Martin


