[Python-Dev] Python and the Unicode Character Database
Alexander Belopolsky alexander.belopolsky at gmail.com
Thu Dec 2 19:14:29 CET 2010
On Thu, Dec 2, 2010 at 11:56 AM, Antoine Pitrou <solipsis at pitrou.net> wrote:
> On Thursday, December 2, 2010 at 11:41 -0500, Alexander Belopolsky wrote:
>> Note that my point is not to find the correct answer here, but to
>> demonstrate that we as a group don't have the expertise to get parsing
>> of Arabic text right.
>
> I don't understand why you think Arabic or Hebrew text is any different
> from Western text. Surely right-to-left isn't more conceptually
> complicated than left-to-right, is it?
No, but a mix of LTR and RTL is certainly more difficult than either of the two. I invite you to digest Unicode Standard Annex #9 before we continue this discussion.
See <http://unicode.org/reports/tr9/>.
> The fact that mixed rtl + ltr can render bizarrely or is awkward to cut
> and paste is quite off-topic for our discussion.
No, it is not. One of the invented use cases in this thread was naive users' desire to enter numbers using their preferred local decimals. The same users may want to be able to cut and paste their decimals as well. More importantly, however, legacy formats may not have support for mixed-direction text and may require that "John is 41" be stored as "41 si nhoJ". A Unicode converter would turn that into "[RTL]John is 14", which will still display as "41 si nhoJ", but int(s[-2:]) will return 14, not 41.
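Here is a rough sketch of that round trip (reversing the code points and prepending a RIGHT-TO-LEFT MARK is just a stand-in for what a naive legacy converter might do, not anything a real UAX #9 implementation prescribes):

>>> legacy_visual = "41 si nhoJ"                 # "John is 41" stored in visual order
>>> converted = "\u200f" + legacy_visual[::-1]   # naive conversion: RLM + reversed code points
>>> converted[1:]                                # logical order after the mark
'John is 14'
>>> int(converted[-2:])                          # not the 41 the user typed
14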
>> If we've got it right for Arabic, it is by chance and not by design.
>> This still leaves us with 41 other types of digits for at least 30
>> different languages.
>
> So why do you trust the Unicode standard on other things and not on this
> one?
What other things? As far as I understand, the only str method that was designed to comply with Unicode recommendations was str.isidentifier(). And we have some really bizarre results:
>>> '\u2164'.isidentifier()
True
>>> '\u2164'.isalpha()
False
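The bizarre pair apparently comes down to the character's general category: U+2164 is ROMAN NUMERAL FIVE, category Nl, which is allowed in identifiers (it is in XID_Start) but is not one of the letter categories that isalpha() checks:

>>> import unicodedata
>>> unicodedata.name('\u2164')
'ROMAN NUMERAL FIVE'
>>> unicodedata.category('\u2164')   # Nl: a "letter number", not an L* category
'Nl'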
And can you describe the difference between str.isdigit() and str.isdecimal()? According to the reference manual,
""" str.isdecimal() Return true if all characters in the string are decimal characters and there is at least one character, false otherwise. Decimal characters include digit characters, and all characters that that can be used to form decimal-radix numbers, e.g. U+0660, ARABIC-INDIC DIGIT ZERO.
str.isdigit() Return true if all characters in the string are digits and there is at least one character, false otherwise. """ http://docs.python.org/dev/library/stdtypes.html#str.isdecimal
Since U+0660 is mentioned in the first definition and not in the second, I may conclude that it is not a digit, but
>>> '\u0660'.isdigit()
True
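For what it's worth, unicodedata exposes separate "decimal" and "digit" numeric properties, and they seem to be what the two methods track. SUPERSCRIPT TWO (my example, not the manual's) has the digit property but not the decimal one, while U+0660 has both:

>>> import unicodedata
>>> unicodedata.decimal('\u0660'), unicodedata.digit('\u0660')
(0, 0)
>>> '\u00b2'.isdecimal(), '\u00b2'.isdigit()              # SUPERSCRIPT TWO
(False, True)
>>> unicodedata.decimal('\u00b2', None), unicodedata.digit('\u00b2', None)
(None, 2)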
If you know the correct answer, please contribute it here: <http://bugs.python.org/issue10587>.