[Python-Dev] Python and the Unicode Character Database

Alexander Belopolsky alexander.belopolsky at gmail.com
Thu Dec 2 17:41:11 CET 2010


On Thu, Dec 2, 2010 at 8:36 AM, Antoine Pitrou <solipsis at pitrou.net> wrote:

> On Wed, 1 Dec 2010 22:28:49 -0500
> Alexander Belopolsky <alexander.belopolsky at gmail.com> wrote:
> ..
>> This matches my limited research on this topic as well.  However, I am
>> not sure that when these codes are embedded in Arabic text, their
>> logical order always matches their display order.
>
> That shouldn't matter, since unicode text follows logical order. The
> display order is up to the graphical representation library.

I am not so sure. On my Mac, U+200F (RIGHT-TO-LEFT MARK) affects 0-9 and Arabic-Indic decimals differently:

>>> print('\u200F123')
‏123
>>> print('\u200F\u0661\u0662\u0663')
231

I replaced Arabic-Indic decimals with 0-9 in the output to demonstrate the point. Cut-n-paste does not work well in the presence of RTL directives.

and U+202E (RIGHT-TO-LEFT OVERRIDE) reverses the display order for both:

>>> print('\u202E123')
321
>>> print('\u202E\u0661\u0662\u0663')
321

(Again, the output display is simulated, not copied.) I don't know if explicit RTL directives are ever used in Arabic texts, but it is quite possible that texts converted from older formats would use them for efficiency.
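One way to see why the two kinds of digits may render differently is to look at their bidirectional classes in the Unicode Character Database. The sketch below lists each character of the test strings in logical (storage) order; `describe` is a hypothetical helper name, and only the standard `unicodedata` module is assumed. ASCII digits are class EN (European Number) while Arabic-Indic digits are class AN (Arabic Number), and the rendering library treats the two classes differently. The stored order never changes.

```python
import unicodedata

def describe(text):
    """Return (code point, bidi class, name) per character, in logical order.

    `describe` is an illustrative helper, not part of any library.
    """
    return [
        (f"U+{ord(ch):04X}",
         unicodedata.bidirectional(ch),
         unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

# The same strings as in the interactive sessions above.
for sample in ("\u200F123", "\u200F\u0661\u0662\u0663", "\u202E123"):
    for row in describe(sample):
        print(*row)
    print()
```

Whatever the terminal shows, the list returned by `describe` is in the order the characters were typed, so the display difference comes from the rendering layer and not from the string itself.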

Note that my point is not to find the correct answer here, but to demonstrate that we as a group don't have the expertise to get parsing of Arabic text right. If we've got it right for Arabic, it is by chance and not by design. This still leaves us with 41 other types of digits for at least 30 different languages. Nobody will ever assume that Python builtins are suitable for use with all these variants. This "feature" is only good for nefarious purposes such as hiding extra digits in innocent-looking files or smuggling binary data through naive interfaces.
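To make the concern concrete, here is a minimal sketch of the behavior in question, together with a stricter alternative. `parse_ascii_int` is a hypothetical helper written for this illustration, not an existing builtin, and `str.isascii` assumes Python 3.7 or later.

```python
def parse_ascii_int(text):
    """Parse a decimal integer, accepting only ASCII digits and an optional sign.

    Hypothetical helper for illustration; int() itself is more permissive.
    """
    stripped = text.strip()
    body = stripped[1:] if stripped[:1] in "+-" else stripped
    if not (body.isascii() and body.isdigit()):
        raise ValueError(f"not an ASCII decimal integer: {text!r}")
    return int(stripped)

samples = [
    "123",                   # ASCII digits
    "\u0661\u0662\u0663",    # Arabic-Indic digits
    "\u0967\u0968\u0969",    # Devanagari digits
    "1\u0662\u0663",         # scripts mixed within one number
]

for s in samples:
    try:
        strict = parse_ascii_int(s)
    except ValueError:
        strict = "rejected"
    # int() accepts every sample, including the mixed-script one.
    print(ascii(s), "int ->", int(s), "| strict ->", strict)
```

The mixed-script case is the one that worries me most: a value that looks like a single digit followed by foreign text is silently read as a three-digit number.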

PS: BTW, shouldn't int('\u0661\u0662\u06DD') be valid? or is it int('\u06DD\u0661\u0662')?
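For reference, a small sketch of how one could check what int() does with U+06DD (ARABIC END OF AYAH). The database lists it as a format character with no decimal value, so I would expect both spellings to be rejected; `try_int` is a hypothetical wrapper that only turns the exception into None, and the actual outcome depends on the interpreter's int() implementation.

```python
import unicodedata

END_OF_AYAH = "\u06DD"

def try_int(text):
    """Return int(text), or None if int() raises ValueError (illustrative wrapper)."""
    try:
        return int(text)
    except ValueError:
        return None

# Properties recorded in the Unicode Character Database for U+06DD.
print(unicodedata.name(END_OF_AYAH))
print(unicodedata.category(END_OF_AYAH))
print(unicodedata.decimal(END_OF_AYAH, None))

# Both candidate orderings from the question above.
for candidate in ("\u0661\u0662" + END_OF_AYAH, END_OF_AYAH + "\u0661\u0662"):
    print(ascii(candidate), "->", try_int(candidate))
```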


