Issue 36100: Document the differences between str.isdigit, isdecimal and isnumeric

Created on 2019-02-24 09:02 by StyXman, last changed 2022-04-11 14:59 by admin.

Messages (12)
msg336451 - Author: Marcos Dione (StyXman) * Date: 2019-02-24 09:02
Following https://blog.lerner.co.il/pythons-str-isdigit-vs-str-isnumeric/, we have this:

Python 3.8.0a1+ (heads/master:001fee14e0, Feb 20 2019, 08:28:02)
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> '一二三四五'.isnumeric()
True
>>> int('一二三四五')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '一二三四五'
>>> float('一二三四五')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: '一二三四五'

I think Reuven is right, these should be accepted as input. I just wonder if we should do the same for, for instance, Roman numerals...
msg336453 - Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2019-02-24 10:05
I think that analysis is wrong. The Wikipedia page describes the meaning of the Unicode Decimal/Digit/Numeric properties:

https://en.wikipedia.org/wiki/Unicode_character_property#Numeric_values_and_types

and the characters you show aren't appropriate for converting to ints:

py> for c in '一二三四五':
...     print(unicodedata.name(c))
...
CJK UNIFIED IDEOGRAPH-4E00
CJK UNIFIED IDEOGRAPH-4E8C
CJK UNIFIED IDEOGRAPH-4E09
CJK UNIFIED IDEOGRAPH-56DB
CJK UNIFIED IDEOGRAPH-4E94

The first one, for example, is translated as "one; a, an; alone"; it is better read as the *word* one rather than the numeral 1. (Disclaimer: I am not a Chinese speaker and I welcome correction from an expert.) Likewise U+4E8C, translated as "two; twice".

The blog post is factually wrong when it claims:

"str.isdigit only returns True for what I said before, strings containing solely the digits 0-9."

py> s = "\N{BENGALI DIGIT ONE}\N{BENGALI DIGIT TWO}"
py> s.isdigit()
True
py> int(s)
12

So I think that there's nothing to do here (unless it is perhaps to add a FAQ about it, or improve the docs).
msg336454 - Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2019-02-24 10:14
[Steven posted his answer while I was composing mine; posting mine anyway ...]

I don't think this would make sense. There are lots of characters that can't be interpreted as a decimal digit but for which `isnumeric` nevertheless gives True.

>>> s = "㉓⅗⒘Ⅻ"
>>> for c in s: print(unicodedata.name(c))
...
CIRCLED NUMBER TWENTY THREE
VULGAR FRACTION THREE FIFTHS
NUMBER SEVENTEEN FULL STOP
ROMAN NUMERAL TWELVE
>>> s.isnumeric()
True

What value would you expect `int(s)` to have in this situation?

Note that `int` and `float` already accept non-ASCII digits:

>>> s = "١٢٣٤٥٦٧٨٩"
>>> int(s)
123456789
>>> float(s)
123456789.0
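
A minimal sketch of why no single integer falls out of such a string: the standard-library function `unicodedata.numeric` reports a separate value for each character, while `unicodedata.decimal` reports none for any of them. The variable names are illustrative only.

```python
import unicodedata

# The same four characters as in the example above, spelled by name.
s = (
    "\N{CIRCLED NUMBER TWENTY THREE}"
    "\N{VULGAR FRACTION THREE FIFTHS}"
    "\N{NUMBER SEVENTEEN FULL STOP}"
    "\N{ROMAN NUMERAL TWELVE}"
)

# Each character carries its own numeric value ...
numeric_values = [unicodedata.numeric(c) for c in s]

# ... but none of them is a decimal digit, so there is no positional
# reading of the string; the second argument is the default returned
# when the character has no decimal value.
decimal_values = [unicodedata.decimal(c, None) for c in s]

print(numeric_values)   # [23.0, 0.6, 17.0, 12.0]
print(decimal_values)   # [None, None, None, None]
```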
msg336455 - Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2019-02-24 10:24
> What value would you expect `int(s)` to have in this situation?

Actually, I guess that question was too easy. The value for `int(s)` should *obviously* be 23 * 1000 + (3/5) * 100 + 17 * 10 + 12 = 23242. I should have used ⅐ instead of ⅗.

Anyway, agreed with Steven that no change should be made here.
msg336456 - Author: Karthikeyan Singaravelan (xtreak) * (Python committer) Date: 2019-02-24 10:31
I'm not a Unicode expert, but searching along these lines I found a note in the docs saying that int() supports characters in the 'Nd' category. So to check whether a string can be converted to an integer with int(), should I be using str.isdecimal() instead of str.isnumeric()?

https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex

> The numeric literals accepted include the digits 0 to 9 or any Unicode equivalent (code points with the Nd property). See http://www.unicode.org/Public/10.0.0/ucd/extracted/DerivedNumericType.txt for a complete list of code points with the Nd property.

>>> [unicodedata.category(c) for c in '一二三四五']
['Lo', 'Lo', 'Lo', 'Lo', 'Lo']
>>> [unicodedata.category(c) for c in '\N{BENGALI DIGIT ONE}\N{BENGALI DIGIT TWO}']
['Nd', 'Nd']
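
The check asked about here could be sketched as follows. `parse_decimal_digits` is a hypothetical helper name, not something from the standard library, and it is deliberately stricter than `int()` itself: `int()` also accepts a sign, surrounding whitespace and underscores between digits, all of which `str.isdecimal()` rejects.

```python
def parse_decimal_digits(text):
    """Return int(text) if text is made up only of decimal digits, else None.

    Hypothetical helper for illustration. str.isdecimal() is False for the
    empty string, so that case also returns None.
    """
    if text.isdecimal():
        return int(text)
    return None


bengali = "\N{BENGALI DIGIT ONE}\N{BENGALI DIGIT TWO}"
print(parse_decimal_digits(bengali))       # 12
print(parse_decimal_digits("一二三四五"))    # None
print(parse_decimal_digits("\u00b2"))      # None (SUPERSCRIPT TWO)
```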
msg336459 - Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2019-02-24 10:57
> So to check if a string can be converted to integer with help of int() I should be using str.isdecimal() instead of str.isnumeric() ?

Yes, I think that's correct. The characters matched by `str.isdecimal` are a subset of those matched by `str.isdigit`, which in turn are a subset of those matched by `str.isnumeric`. `int` and `float` required general category Nd, which corresponds to `str.isdigit`.
msg336460 - Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2019-02-24 10:58
> which corresponds to `str.isdigit`.

Gah! That should have said:

> which corresponds to `str.isdecimal`.

Sorry.
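
The subset relationship (isdecimal within isdigit within isnumeric) can be checked by brute force over every code point. This is only a sketch: the exact sets depend on the Unicode database bundled with the interpreter running it, and the set names are illustrative.

```python
import sys

decimal_chars = set()
digit_chars = set()
numeric_chars = set()

# Walk every code point and record which predicate accepts it.
for code_point in range(sys.maxunicode + 1):
    ch = chr(code_point)
    if ch.isdecimal():
        decimal_chars.add(ch)
    if ch.isdigit():
        digit_chars.add(ch)
    if ch.isnumeric():
        numeric_chars.add(ch)

# Proper subsets: each predicate accepts strictly more than the previous one.
print(decimal_chars < digit_chars < numeric_chars)

# Witnesses for the two gaps.
print("\N{SUPERSCRIPT TWO}" in digit_chars - decimal_chars)
print("\N{VULGAR FRACTION ONE HALF}" in numeric_chars - digit_chars)
```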
msg336461 - Author: Karthikeyan Singaravelan (xtreak) * (Python committer) Date: 2019-02-24 11:07
> `int` and `float` required general category Nd, which corresponds to `str.isdigit`.

Sorry, did you mean str.isdecimal? There are characters for which isdigit returns True and isdecimal returns False.

>>> '\u00B2'.isdecimal()
False
>>> '\u00B2'.isdigit()
True
>>> import unicodedata
>>> unicodedata.category('\u00B2')
'No'
>>> int('\u00B2')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 10: '²'

Is this worth an FAQ or an addition to the existing note on int that specifies characters should belong to 'Nd' category to add a note that str.isdecimal should return True?
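
To see where a given character sits, the `unicodedata` lookups can be put side by side. `describe` is a hypothetical helper written for this illustration; the second argument to each lookup is the default returned when the character lacks that property.

```python
import unicodedata


def describe(ch):
    """Collect the general category and the three numeric lookups for ch.

    Hypothetical helper for illustration only.
    """
    return {
        "category": unicodedata.category(ch),
        "decimal": unicodedata.decimal(ch, None),
        "digit": unicodedata.digit(ch, None),
        "numeric": unicodedata.numeric(ch, None),
    }


print(describe("5"))        # Nd: has decimal, digit and numeric values
print(describe("\u00b2"))   # SUPERSCRIPT TWO: digit and numeric, no decimal
print(describe("\u4e00"))   # CJK ideograph for one: numeric value only
```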
msg336462 - Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2019-02-24 11:13
On Sun, Feb 24, 2019 at 11:07:41AM +0000, Karthikeyan Singaravelan wrote:

> Is this worth an FAQ or an addition to the existing note on int that
> specifies characters should belong to 'Nd' category to add a note that
> str.isdecimal should return True

Yes, I think that there should be a FAQ about the differences between isdigit, isdecimal and isnumeric, pointing to the relevant Unicode documentation.

I would also like to see a briefer note added to each of the string methods' docstrings as well.
msg336464 - Author: Karthikeyan Singaravelan (xtreak) * (Python committer) Date: 2019-02-24 11:44
Agreed. Although str.isnumeric's behavior is correct from the point of view of a user who knows Unicode internals, the name makes it easy for a general user, who does not know those internals, to reach for it when trying to determine whether a string can be passed to int(). I am not sure how this can be explained in simpler terms, but it would be good to clarify it in the docs to avoid confusion.

There seems to have been a thread [0] in the past about the multiple ways of checking whether a Unicode string is a number causing confusion. It is even more confusing on Python 2, where strings are not Unicode by default.

$ python2.7
Python 2.7.14 (default, Mar 12 2018, 13:54:56)
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> '\u00B2'.isdigit()
False
>>> u'\u00B2'.isdigit()
True

[0] https://mail.python.org/pipermail/python-list/2012-May/624340.html
msg336466 - Author: Marcos Dione (StyXman) * Date: 2019-02-24 12:39
Thanks for all the examples, I'm convinced.
msg336467 - Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2019-02-24 13:32
I'm re-opening the ticket with a change of subject, because I think this should be treated as a documentation enhancement:

- improve the docstrings for str.isdigit, isnumeric and isdecimal to make it clear what each does (e.g. what counts as a digit);

- similarly improve the documentation for int and float? although the existing comment may be sufficient
  https://docs.python.org/3/library/stdtypes.html#numeric-types-int-float-complex

- add a FAQ summarizing the situation.

I don't think we need to worry about backporting the docs to Python 2, but if others disagree, I won't object.
History
Date User Action Args
2022-04-11 14:59:11 admin set github: 80281
2019-02-24 13:32:58 steven.daprano set status: closed -> open; resolution: not a bug ->; assignee: docs@python; stage: resolved ->; title: int() and float() should accept any isnumeric() digit -> Document the differences between str.isdigit, isdecimal and isnumeric; nosy: + docs@python; versions: - Python 2.7, Python 3.4, Python 3.5, Python 3.6, Python 3.7, Python 3.8; messages: +; components: + Documentation, - Library (Lib); type: behavior -> enhancement
2019-02-24 12:39:23 StyXman set status: open -> closed; versions: + Python 3.4, Python 3.5, Python 3.6; messages: +; resolution: not a bug; stage: resolved
2019-02-24 11:44:36 xtreak set messages: +; versions: - Python 3.4, Python 3.5, Python 3.6
2019-02-24 11:13:40 steven.daprano set messages: +
2019-02-24 11:07:41 xtreak set messages: +
2019-02-24 10:58:35 mark.dickinson set messages: +
2019-02-24 10:57:40 mark.dickinson set messages: +
2019-02-24 10:31:50 xtreak set nosy: + xtreak; messages: +
2019-02-24 10:24:17 mark.dickinson set messages: +
2019-02-24 10:14:50 mark.dickinson set nosy: + mark.dickinson; messages: +
2019-02-24 10:05:25 steven.daprano set nosy: + steven.daprano; messages: +
2019-02-24 09:02:22 StyXman create