Issue 36502: str.isspace() for U+00A0 and U+202F differs from document

Created on 2019-04-02 06:36 by Jun, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL Status Linked
PR 15019 merged Greg Price,2019-08-03 08:10
PR 15296 merged Greg Price,2019-08-15 03:17
PR 15301 merged Greg Price,2019-08-15 04:48
PR 15332 merged miss-islington,2019-08-19 09:53
PR 15806 merged miss-islington,2019-09-09 16:37
PR 15807 merged miss-islington,2019-09-09 16:37
PR 15808 merged benjamin.peterson,2019-09-09 16:51
Messages (14)
msg339317 - (view) Author: Jun (Jun) * Date: 2019-04-02 06:36
I was looking for a list of the Unicode code points for which str.isspace() returns True. According to https://docs.python.org/3/library/stdtypes.html#str.isspace, "Whitespace characters are those characters defined in the Unicode character database as “Other” or “Separator” and those with bidirectional property being one of “WS”, “B”, or “S”." However, U+202F (https://www.fileformat.info/info/unicode/char/202f/index.htm) is a "Separator" whose bidirectional property is "CS", and str.isspace() returns True for it, which it should not if we follow the definition above:
>>> "\u202f".isspace()
True
I'm not sure whether the documentation or the behavior should be updated, but at least the two should be consistent.
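The properties cited in the report can be inspected with the standard unicodedata module. The following is a minimal sketch and not part of the original message; the helper name `describe` is made up for illustration.

```python
import unicodedata

def describe(char):
    """Return (general category, bidirectional class, isspace() result) for one character."""
    return (unicodedata.category(char),
            unicodedata.bidirectional(char),
            char.isspace())

# The two code points named in the issue title.
for char in ("\u00a0", "\u202f"):
    category, bidi, is_space = describe(char)
    print(f"U+{ord(char):04X} category={category} bidi={bidi} isspace={is_space}")
```

Both code points have general category Zs and bidirectional class CS, yet isspace() returns True for them, so the bidirectional property alone cannot account for the result.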
msg339318 - (view) Author: SilentGhost (SilentGhost) * (Python triager) Date: 2019-04-02 06:59
I think you have to read that "and" as "or". It's sufficient that '\u202f' is a separator for it to be considered a whitespace character.
msg339336 - (view) Author: Jun_ (Jun_) Date: 2019-04-02 14:32
Do you mean the statement should be read as follows? Whitespace characters are characters that satisfy any one of:
1. Character type is "Other"
2. Character type is "Separator"
3. Bidirectional property is "WS", "B", or "S"
If that's the case, this still does not reflect the behavior, as most characters in "Other" are not whitespace characters, and in fact str.isspace() returns False for them.
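The claim about "Other" can be checked by counting. This is a rough sketch, assuming "Other" means the general categories whose abbreviations start with C (Cc, Cf, Cs, Co, Cn); the function name `other_category_counts` is hypothetical.

```python
import sys
import unicodedata

def other_category_counts():
    """Count code points in the C* ("Other") categories and collect those isspace() accepts."""
    total = 0
    whitespace = []
    for codepoint in range(sys.maxunicode + 1):
        char = chr(codepoint)
        if unicodedata.category(char).startswith("C"):
            total += 1
            if char.isspace():
                whitespace.append(codepoint)
    return total, whitespace

total, whitespace = other_category_counts()
print("code points in Other:", total)
print("of which whitespace:", [f"U+{cp:04X}" for cp in whitespace])
```

Only a handful of control characters such as tab and line feed show up in the second list; those are the ones whose bidirectional class is WS, B, or S.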
msg339339 - (view) Author: SilentGhost (SilentGhost) * (Python triager) Date: 2019-04-02 14:56
According to the comment for _PyUnicode_IsWhitespace, it's supposed to include the Zs category plus the documented BIDI properties. So I'm not sure where "Other" came from.
msg348947 - (view) Author: Greg Price (Greg Price) * Date: 2019-08-03 08:18
The actual behavior turns out to match that comment. See attached PR, which adds a test confirming that and also corrects the documentation. (A related issue is #18236 -- we should probably adjust the definition to match the one Unicode now provides. But meanwhile we'll want to correct the docs.)
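The rule from the _PyUnicode_IsWhitespace comment (general category Zs, or bidirectional class WS, B, or S) can be compared against str.isspace() over every code point. The sketch below is written in the spirit of the test mentioned above, not copied from the PR; `expected_isspace` and `find_mismatches` are illustrative names, and it assumes the interpreter's unicodedata module and str methods are built from the same Unicode version, as in CPython.

```python
import sys
import unicodedata

def expected_isspace(char):
    """Whitespace rule per the comment: category Zs, or bidi class WS, B, or S."""
    return (unicodedata.category(char) == "Zs"
            or unicodedata.bidirectional(char) in ("WS", "B", "S"))

def find_mismatches():
    """Return every code point where str.isspace() disagrees with the rule."""
    return [cp for cp in range(sys.maxunicode + 1)
            if chr(cp).isspace() != expected_isspace(chr(cp))]

mismatches = find_mismatches()
print("mismatches:", [f"U+{cp:04X}" for cp in mismatches])
```

An empty list means the implementation and the rule agree for the running interpreter, which is the behavior the corrected documentation describes.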
msg349678 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-14 11:05
New changeset 6bccbe7dfb998af862a183f2c36f0d4603af2c29 by Victor Stinner (Greg Price) in branch 'master': bpo-36502: Correct documentation of str.isspace() (GH-15019) https://github.com/python/cpython/commit/6bccbe7dfb998af862a183f2c36f0d4603af2c29
msg349947 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-19 09:53
New changeset 8c1c426a631ba02357112657193f82c58d3e08b4 by Victor Stinner (Greg Price) in branch '3.8': bpo-36502: Correct documentation of str.isspace() (GH-15019) (GH-15296) https://github.com/python/cpython/commit/8c1c426a631ba02357112657193f82c58d3e08b4
msg349948 - (view) Author: miss-islington (miss-islington) Date: 2019-08-19 10:10
New changeset 0fcdd8d6d67f57733203fc79e6a07a89b924a390 by Miss Islington (bot) in branch '3.7': bpo-36502: Correct documentation of str.isspace() (GH-15019) (GH-15296) https://github.com/python/cpython/commit/0fcdd8d6d67f57733203fc79e6a07a89b924a390
msg349950 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-08-19 10:14
str.isspace() documentation has been fixed, thanks Greg Price for the fix! I close the issue.
msg349983 - (view) Author: Greg Price (Greg Price) * Date: 2019-08-20 01:33
Thanks Victor for the reviews and merges! (Unmarking 2.7, because https://docs.python.org/2/library/stdtypes.html seems to not have this issue.)
msg351526 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2019-09-09 16:37
New changeset 64c6ac74e254d31f93fcc74bf02b3daa7d3e3f25 by Benjamin Peterson (Greg Price) in branch 'master': bpo-36502: Update link to UAX #44, the Unicode doc on the UCD. (GH-15301) https://github.com/python/cpython/commit/64c6ac74e254d31f93fcc74bf02b3daa7d3e3f25
msg351536 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2019-09-09 17:10
New changeset 58d61efd4cdece3b026868a66d829001198d29b1 by Benjamin Peterson in branch '2.7': [2.7] bpo-36502: Update link to UAX GH-44, the Unicode doc on the UCD. (GH-15808) https://github.com/python/cpython/commit/58d61efd4cdece3b026868a66d829001198d29b1
msg351545 - (view) Author: miss-islington (miss-islington) Date: 2019-09-09 18:40
New changeset 0a86da87da82c4a28d7ec91eb54c0b9ca40bbea7 by Miss Islington (bot) in branch '3.7': bpo-36502: Update link to UAX GH-44, the Unicode doc on the UCD. (GH-15301) https://github.com/python/cpython/commit/0a86da87da82c4a28d7ec91eb54c0b9ca40bbea7
msg351546 - (view) Author: miss-islington (miss-islington) Date: 2019-09-09 18:41
New changeset c1c04cbc24c11cd7a47579af3faffee05a16acd7 by Miss Islington (bot) in branch '3.8': bpo-36502: Update link to UAX GH-44, the Unicode doc on the UCD. (GH-15301) https://github.com/python/cpython/commit/c1c04cbc24c11cd7a47579af3faffee05a16acd7
History
Date User Action Args
2022-04-11 14:59:13 admin set github: 80683
2019-09-09 18:41:16 miss-islington set messages: +
2019-09-09 18:40:08 miss-islington set messages: +
2019-09-09 17:10:10 benjamin.peterson set messages: +
2019-09-09 16:51:56 benjamin.peterson set pull_requests: + pull_request15459
2019-09-09 16:37:32 miss-islington set pull_requests: + pull_request15458
2019-09-09 16:37:26 miss-islington set pull_requests: + pull_request15457
2019-09-09 16:37:16 benjamin.peterson set nosy: + benjamin.peterson; messages: +
2019-08-20 01:33:56 Greg Price set messages: +; versions: - Python 2.7
2019-08-19 10:14:33 vstinner set status: open -> closed; resolution: fixed; messages: +; stage: patch review -> resolved
2019-08-19 10:10:23 miss-islington set nosy: + miss-islington; messages: +
2019-08-19 09:53:53 miss-islington set pull_requests: + pull_request15050
2019-08-19 09:53:40 vstinner set messages: +
2019-08-15 04:48:01 Greg Price set pull_requests: + pull_request15026
2019-08-15 03:17:09 Greg Price set pull_requests: + pull_request15019
2019-08-14 11:05:23 vstinner set messages: +
2019-08-03 08:18:56 Greg Price set nosy: + Greg Price; messages: +
2019-08-03 08:10:30 Greg Price set keywords: + patch; stage: patch review; pull_requests: + pull_request14836
2019-04-05 18:38:44 terry.reedy set title: The behavior of str.isspace() for U+00A0 and U+202F is different from what is documented -> str.isspace() for U+00A0 and U+202F differs from document; versions: - Python 3.5, Python 3.6
2019-04-02 14:56:28 SilentGhost set messages: +; versions: + Python 3.7, Python 3.8
2019-04-02 14:32:57 Jun_ set nosy: + Jun_; messages: +
2019-04-02 06:59:53 SilentGhost set nosy: + SilentGhost; messages: +
2019-04-02 06:45:21 xtreak set nosy: + lemburg
2019-04-02 06:36:07 Jun create