[Python-Dev] Python and the Unicode Character Database
Alexander Belopolsky alexander.belopolsky at gmail.com
Fri Dec 3 06:10:29 CET 2010
On Thu, Dec 2, 2010 at 4:57 PM, Mark Dickinson <dickinsm at gmail.com> wrote:
..
> (the decimal spec requires that non-European digits be accepted).
Mark,
Mark, I think "requires" is too strong a word for what the spec actually says. The decimal module documentation refers to two authorities:
- IBM’s General Decimal Arithmetic Specification
- IEEE standard 854-1987
The IEEE standard predates Unicode and, unsurprisingly, says nothing about the issue. IBM's spec says the following in the Conversions section:
""" It is recommended that implementations also provide additional number formatting routines (including some which are locale-dependent), and if available should accept non-European decimal digits in strings. """ http://speleotrove.com/decimal/daconvs.html
This cannot possibly be interpreted as normative text. The emphasis is clearly on "formatting routines", with "non-European decimal digits" added as an afterthought. The recommendation can reasonably be read as requiring that conversion routines accept whatever the formatting routines can produce. Python has no formatting routines that produce non-European numerals, so there is no requirement to accept them in conversions.
I don't think the decimal module should support non-European decimal digits. The only place where such support makes some sense is int(), because there we have a fighting chance of producing a reasonable definition. The motivating use case is conversion of numerical data extracted from text with simple '\d+' regex matches.
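For illustration, here is the kind of round trip that use case describes. The sample text is made up; under Python 3, str regex patterns are Unicode-aware by default and int() already accepts these digits:

    import re

    # '٤٢' is 42 written with ARABIC-INDIC digits
    text = "total: ٤٢ widgets"
    for token in re.findall(r'\d+', text):
        print(int(token))   # prints 42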
Here is how I would do it:
1. A string x of non-European decimal digits is accepted only by int(x), not by int(x, 0) or int(x, 10).

2. If x contains one or more non-European digits, then

   (a) all digits must be from the same block:

       import unicodedata

       def basepoint(c):
           return ord(c) - unicodedata.digit(c)

       all(basepoint(c) == basepoint(x[0]) for c in x)  # -> True

   (b) a '+' or '-' sign is not allowed.
A character c is a digit if it matches the '\d' regex. I think this means unicodedata.category(c) -> 'Nd'.
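A quick check of that claim (Python 3, where '\d' matches any Unicode decimal digit by default):

    >>> import re, unicodedata
    >>> bool(re.match(r'\d', '\u06f3'))   # EXTENDED ARABIC-INDIC DIGIT THREE
    True
    >>> unicodedata.category('\u06f3')
    'Nd'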
Condition 2(b) is important because there is no clear way to define what is acceptable as '+' or '-' using Unicode character properties, and not all number systems even have a local form of negation. (It is also YAGNI.)
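Pulling the pieces together, here is a rough sketch of how the proposed check might behave. The function name and error messages are mine and purely illustrative; this is not how int() itself is implemented:

    import unicodedata

    def basepoint(c):
        # code point of the zero of c's digit block
        return ord(c) - unicodedata.digit(c)

    def int_nonascii(s):
        """If s contains any non-European digit, every character must be
        a decimal digit (category 'Nd'), all from the same block, and no
        '+'/'-' sign is allowed (conditions 2(a) and 2(b) above)."""
        if any(c not in '0123456789' for c in s):
            if not all(unicodedata.category(c) == 'Nd' for c in s):
                raise ValueError('mixed digits and non-digits: %r' % s)
            if not all(basepoint(c) == basepoint(s[0]) for c in s):
                raise ValueError('digits from more than one block: %r' % s)
            value = 0
            for c in s:
                value = value * 10 + unicodedata.digit(c)
            return value
        return int(s, 10)

With this, int_nonascii('٢٠١٠') gives 2010, while '-٢٠١٠' and strings mixing digits from different blocks raise ValueError.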