[Python-Dev] Python and the Unicode Character Database

M.-A. Lemburg mal at egenix.com
Fri Dec 3 11:15:51 CET 2010


Alexander Belopolsky wrote:

> On Thu, Dec 2, 2010 at 5:58 PM, M.-A. Lemburg <mal at egenix.com> wrote:
> ..
>>> I will change my mind on this issue when you present a machine-readable file with Arabic-Indic numerals and a program capable of reading it, and show that this program uses the same number parsing algorithm as Python's int() or float().
>>
>> Have you had a look at the examples I posted? They include texts and tables with numbers written using east asian arabic numerals.
>
> Yes, but this was all about output. I am pretty sure TeX was able to typeset the Qur'an in all its glory long before Unicode was invented. Yet, in machine-readable form it would be something like {\quran 1} (invented directive). I have asked for a file that is intended for machine processing, not for human enjoyment in print or on a display. I claim that if such a file exists, the program that reads it does not use the same rules as Python, and converting non-ASCII digits would be a tiny portion of what that program does.

Well, programs that take input from the keyboards I posted in this thread will have to deal with the digits. Since Python's input() accepts keyboard input, you have your use case :-)
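A minimal sketch of that point, in Python 3 syntax: the string literals below stand in for what input() might return from such a keyboard, and the sample values are made up for illustration. int() and float() accept characters that the Unicode Character Database classifies as decimal digits (category Nd):

```python
import unicodedata

# Hypothetical keyboard input (illustrative values, not from this thread):
arabic_indic = "\u0661\u0662\u0663"   # Arabic-Indic digits one, two, three
thai = "\u0e54\u0e55"                 # Thai digits four, five

# int() and float() map each decimal digit to its numeric value.
print(int(arabic_indic))
print(int(thai))
print(float(arabic_indic))

# The per-character data comes from the Unicode Character Database.
print([unicodedata.category(ch) for ch in arabic_indic])
print([unicodedata.decimal(ch) for ch in thai])
```

The unicodedata calls are only there to show where the digit values come from; the conversion itself needs nothing beyond int() or float().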

Seriously, I find the distinction between input and output forms of numerals somewhat misguided. Any output can also serve as input. For books and other printed material, images, etc. you have scanners and OCR. For screen output you have screen readers. For spreadsheets and data, you have CSV, TSV, XML, etc. etc. etc.

Just for the fun of it, I created a CSV file with Thai and Dzongkha numerals (in addition to Arabic ones) using OpenOffice. Here's the cut and paste version:

"""
Numbers in various scripts

Arabic  Thai  Dzongkha
1       ๑     ༡
2       ๒     ༢
3       ๓     ༣
4       ๔     ༤
5       ๕     ༥
6       ๖     ༦
7       ๗     ༧
8       ๘     ༨
9       ๙     ༩
10      ๑๐    ༡༠
11      ๑๑    ༡༡
12      ๑๒    ༡༢
13      ๑๓    ༡༣
14      ๑๔    ༡༤
15      ๑๕    ༡༥
16      ๑๖    ༡༦
17      ๑๗    ༡༧
18      ๑๘    ༡༨
19      ๑๙    ༡༩
20      ๒๐    ༢༠
"""

And here's the script that goes with it:

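import csv

# Python 2.7: the csv module yields byte strings, so each cell is decoded as UTF-8.
with open('Numbers-in-various-scripts.csv', 'rb') as f:
    c = csv.reader(f)
    # Skip the three header lines: title, blank line, column names.
    headers = [next(c) for i in range(3)]
    # Iterate over the remaining rows; this ends cleanly at end of file.
    for row in c:
        print [int(unicode(x, 'utf-8')) for x in row]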

and the output using Python 2.7:

[1, 1, 1]
[2, 2, 2]
[3, 3, 3]
[4, 4, 4]
[5, 5, 5]
[6, 6, 6]
[7, 7, 7]
[8, 8, 8]
[9, 9, 9]
[10, 10, 10]
[11, 11, 11]
[12, 12, 12]
[13, 13, 13]
[14, 14, 14]
[15, 15, 15]
[16, 16, 16]
[17, 17, 17]
[18, 18, 18]
[19, 19, 19]
[20, 20, 20]
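For comparison, here is a rough Python 3 sketch of the same idea. It is self-contained, so it reads a shortened, in-memory copy of the data instead of the attached file; the comma delimiter, the SAMPLE text, and the parse_rows helper are assumptions made for illustration. In Python 3 the csv module already yields str, so no explicit decoding step is needed:

```python
import csv
import io

# Assumed sample: a shortened, comma-separated copy of the table
# (Thai digits U+0E50..U+0E59, Tibetan-script digits U+0F20..U+0F29).
SAMPLE = """Numbers in various scripts

Arabic,Thai,Dzongkha
1,\u0e51,\u0f21
2,\u0e52,\u0f22
10,\u0e51\u0e50,\u0f21\u0f20
20,\u0e52\u0e50,\u0f22\u0f20
"""

def parse_rows(text, header_lines=3):
    """Hypothetical helper: skip the header lines, convert every cell with int()."""
    reader = csv.reader(io.StringIO(text))
    for _ in range(header_lines):
        next(reader)  # title, blank line (returned as []), column names
    return [[int(cell) for cell in row] for row in reader]

rows = parse_rows(SAMPLE)
for row in rows:
    print(row)
```

Each printed row should contain the same integer three times, as in the Python 2.7 output.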

If you need more such files, I can generate as many as you like ;-) I can send the OOo file as well, if you'd like to play around with it.

I'd say: case closed :-)

--
Marc-Andre Lemburg
eGenix.com

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Numbers-in-various-scripts.csv
URL: <http://mail.python.org/pipermail/python-dev/attachments/20101203/0f4a8bee/attachment.ksh>


