bpo-36502: Correct documentation of str.isspace. by gnprice · Pull Request #15019 · python/cpython (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation26 Commits4 Checks0 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

gnprice

The documented definition was much broader than the real one:
there are tons of characters with general category "Other",
and we don't (and shouldn't) treat most of them as whitespace.

Rewrite the definition to agree with the comment on
_PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py,
which is what generates that function and so ultimately governs.

Add suitable breadcrumbs so that a reader who wants to pin down
exactly what this definition means (what's a "bidirectional class"
of "B"?) can do so. The unicodedata module documentation is an
appropriate central place for our references to Unicode's own copious
documentation, so point there.

Also add to the isspace test a thorough check that the
implementation agrees with the intended definition.

https://bugs.python.org/issue36502

@gnprice

The documented definition was much broader than the real one: there are tons of characters with general category "Other", and we don't (and shouldn't) treat most of them as whitespace.

Rewrite the definition to agree with the comment on _PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py, which is what generates that function and so ultimately governs.

Add suitable breadcrumbs so that a reader who wants to pin down exactly what this definition means (what's a "bidirectional class" of "B"?) can do so. The unicodedata module documentation is an appropriate central place for our references to Unicode's own copious documentation, so point there.

Also add to the isspace test a thorough check that the implementation agrees with the intended definition.

@the-knights-who-say-ni

Hello, and thanks for your contribution!

I'm a bot set up to make sure that the project can legally accept your contribution by verifying you have signed the PSF contributor agreement (CLA).

Unfortunately we couldn't find an account corresponding to your GitHub username on bugs.python.org (b.p.o) to verify you have signed the CLA (this might be simply due to a missing "GitHub Name" entry in your b.p.o account settings). This is necessary for legal reasons before we can look at your contribution. Please follow the steps outlined in the CPython devguide to rectify this issue.

You can check yourself to see if the CLA has been received.

Thanks again for your contribution, we look forward to reviewing it!

@tirkarthi

@mangrisano

@gnprice

There are several issues regarding definition of whitespace and usage. Some isspace related issues :

Ah, thanks for those links!

https://bugs.python.org/issue36502

Indeed this looks like exactly the documentation bug this patch fixes.

https://bugs.python.org/issue18236

This is a proposal to change the behavior of isspace to use the Unicode White_Space property, which didn't yet exist when Python first gained Unicode support. That sounds like very much the right thing to me.

That will represent a change in behavior; I'm not sure what that means for releasing such a change. It's also significant work (as we don't currently parse that part of the Unicode database at all), which perhaps partly explains why it hasn't been done in the 6 years since it was proposed.

For any release before the change to use White_Space gets done, I think a doc fix like this one would be an improvement. The "Other" category is very far from either the actual or desired behavior.

vstinner

@@ -617,7 +618,14 @@ def test_isspace(self):
self.checkequalnofix(True, '\u2000', 'isspace')
self.checkequalnofix(True, '\u200a', 'isspace')
self.checkequalnofix(False, '\u2014', 'isspace')
# apparently there are no non-BMP spaces chars in Unicode 6
for i in range(0x10000):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to have a constant declared at the top level of the module. Why not testing the full range of the UCS? range(0x110000)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to have a constant declared at the top level of the module.

Ah, good idea. I think in fact perhaps I'll make the range itself a constant, and then be able to write:

for codepoint in BMP_CODEPOINTS:
  char = chr(codepoint)

because that seems to get at the meaning very well.

(Or perhaps for char in bmp_chars():. OTOH I just spent a few minutes trying to see a clean way to make it for char in BMP_CHARS:, and couldn't find one.)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not testing the full range of the UCS? range(0x110000)

Well, the most causal answer to "why" 🙂 is that I first added this as a test in test_unicodedata.py before seeing the existing test_isspace here -- and two of the central tests there (running through the whole database and computing overall checksums) have a loop much like this one, so I followed their lead.

Those two tests (test_method_checksum and test_function_checksum) look to be substantially unchanged since they were first added in 6a20ee7 back in 2000. They probably should be updated to go through the full range of codepoints; I think back in 2000 this was the full range of codepoints.

(Hmm... as I look closer at what test_method_checksum in particular does, it's really all about methods on str, and doesn't actually mention unicodedata. Perhaps it should move to test_unicode.py.)

The one real cost of looping through all planes (vs. just the BMP) is test runtime -- the test takes about 50ms for just the BMP (on my fast desktop) but 900ms for everything. Fast test suites are precious, so in test_isspace it seems a bit of a shame to spend almost a second looping through values where nothing at all is expected to happen. Well worth it for the loops in test_unicodedata, though, where there's plenty happening the whole way along.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to have a constant declared at the top level of the module. Why not testing the full range of the UCS? range(0x110000)

Just pushed an update making both of these changes. Thanks for the review!

I was about to edit the existing two loops in test_unicodedata to run the full range too, but then realized that that requires updating the checksum. I feel like that should be a separate change and not mixed in with this one. So I'll keep that around as a commit on a branch on top of this one; I can send it right after this PR is complete, or also happy to squash it in here if you think that'd be preferable.

@gnprice

vstinner

@@ -102,6 +106,14 @@ def test_function_checksum(self):
result = h.hexdigest()
self.assertEqual(result, self.expectedchecksum)
def test_isspace_invariant(self):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like you are testing str.isspace() rather than the unicodedata module here. I suggest to move this test to test_unicode.py instead, after test_isspace().

You wrote that the test takes 900 ms: that's slow. I suggest to decorate the test with:

    @support.requires_resource('cpu')

So it will be skipped on slow CIs but only run on fast CIs.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like you are testing str.isspace() rather than the unicodedata module here. I suggest to move this test to test_unicode.py instead, after test_isspace().

Sure; I've waffled on that myself. Will move it back to there.

(Note that the existing test at the top of test_unicodedata.py similarly doesn't mention the unicodedata module. Perhaps it'd be good to move it to test_unicode.py too, though I think not as part of this PR.)

You wrote that the test takes 900 ms: that's slow. I suggest to decorate the test [...]

Ah, thanks! Will do.

@@ -14,6 +14,10 @@
encoding = 'utf-8'
errors = 'surrogatepass'
def all_chars():
'''Each Unicode codepoint, as a one-character string.'''
for codepoint in range(0x110000):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hum, I only expected a constant, not a whole function. I'm not sure that such function is needed, since it's only called once. I suggest to remove the function, and use sys.maxunicode rather than 0x110000. So the test will be updated automatically if sys.maxunicode is increased.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made it a generator because I think that makes the abstraction a bit crisper: the test just says for char in all_chars(), and the meaning is it iterates through all possible characters. But open-coding the loop works too; I can change it.

So the test will be updated automatically if sys.maxunicode is increased.

I doubt very much that sys.maxunicode will ever be changed in the future. 😉 Expanding Unicode beyond the original 16 bits was traumatic enough -- it's still a widespread cause of bugs, decades later -- and when they did, they gave themselves quite a lot of safety margin.

If you'd find the code clearer to read that way, though, that'd certainly be a good reason.

FWIW I find the hex constant somewhat clearer. One reason is that I don't have to wonder about the fencepost -- note that the correct version would be range(sys.maxunicode + 1).

As one form of data, a quick grep finds the existing usage in the codebase leans toward the explicit hex:

$ git grep maxunicode | wc -l
31

$ git grep -ie 0x10ffff -e 0x110000 | wc -l
100

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt very much that sys.maxunicode will ever be changed in the future.

It's not just about being future-proof: a named constant also helps to better self-explain the code, it helps the readability. I always wanted to replace hardcoded constants with named constants: I just did it in PR #15067.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. In this case, as I said I find the explicit literal somewhat clearer; but I definitely think that's the important criterion here, and I don't mind changing it.

@gnprice

@gnprice

@vstinner I just pushed another update:

* 578335f44 Move back to test_unicode; open-code loop; mark as uses-CPU.

Thanks again for the review!

I wrote the explicit range(0x110000) rather than range(sys.maxunicode + 1), for the reasons mentioned above. Happy to change it if you'd prefer, though.

vstinner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM apart the hardcoded 0x110000 constant (so I don't approve the change yet, sorry!).

for ch in ['\U00010401', '\U00010427', '\U00010429', '\U0001044E',
'\U0001F40D', '\U0001F46F']:
self.assertFalse(ch.isspace(), '{!a} is not space.'.format(ch))
@support.requires_resource('cpu')
def test_isspace_invariant(self):
for codepoint in range(0x110000):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to use sys.maxunicode + 1 here.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK; done!

@gnprice

@gnprice gnprice changed the titleCorrect documentation of str.isspace. bpo-36502: Correct documentation of str.isspace.

Aug 3, 2019

@gnprice

@vstinner Ping -- I believe this will be quick for you to act on, because I made the small change you requested in the last round.

I see you're back online this week, so replying just to make sure this returns to your inbox 🙂

@miss-islington

Thanks @gnprice for the PR, and @vstinner for merging it 🌮🎉.. I'm working now to backport this PR to: 3.8.
🐍🍒⛏🤖 I'm not a witch! I'm not a witch!

@miss-islington

I'm having trouble backporting to 3.8. Reason: 'Error 110 while writing to socket. Connection timed out.'. Please retry by removing and re-adding the needs backport to 3.8 label.

@miss-islington

Thanks @gnprice for the PR, and @vstinner for merging it 🌮🎉.. I'm working now to backport this PR to: 3.8.
🐍🍒⛏🤖

@miss-islington

Sorry @gnprice and @vstinner, I had trouble checking out the 3.8 backport branch.
Please backport using cherry_picker on command line.
cherry_picker 6bccbe7dfb998af862a183f2c36f0d4603af2c29 3.8

@vstinner

@gnprice: The bot failed to automatically backport the change to Python 3.8. Can you please try to create a PR for Python 3.8?

gnprice added a commit to gnprice/cpython that referenced this pull request

Aug 15, 2019

@gnprice

The documented definition was much broader than the real one: there are tons of characters with general category "Other", and we don't (and shouldn't) treat most of them as whitespace.

Rewrite the definition to agree with the comment on _PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py, which is what generates that function and so ultimately governs.

Add suitable breadcrumbs so that a reader who wants to pin down exactly what this definition means (what's a "bidirectional class" of "B"?) can do so. The unicodedata module documentation is an appropriate central place for our references to Unicode's own copious documentation, so point there.

Also add to the isspace() test a thorough check that the implementation agrees with the intended definition.

@bedevere-bot

@gnprice

Thanks @vstinner for the reviews and the merge!

Can you please try to create a PR for Python 3.8?

Done. Turned out to be an easy merge conflict in the imports block.

vstinner pushed a commit that referenced this pull request

Aug 19, 2019

@gnprice @vstinner

The documented definition was much broader than the real one: there are tons of characters with general category "Other", and we don't (and shouldn't) treat most of them as whitespace.

Rewrite the definition to agree with the comment on _PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py, which is what generates that function and so ultimately governs.

Add suitable breadcrumbs so that a reader who wants to pin down exactly what this definition means (what's a "bidirectional class" of "B"?) can do so. The unicodedata module documentation is an appropriate central place for our references to Unicode's own copious documentation, so point there.

Also add to the isspace() test a thorough check that the implementation agrees with the intended definition.

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request

Aug 19, 2019

@gnprice @miss-islington

…ythonGH-15296)

The documented definition was much broader than the real one: there are tons of characters with general category "Other", and we don't (and shouldn't) treat most of them as whitespace.

Rewrite the definition to agree with the comment on _PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py, which is what generates that function and so ultimately governs.

Add suitable breadcrumbs so that a reader who wants to pin down exactly what this definition means (what's a "bidirectional class" of "B"?) can do so. The unicodedata module documentation is an appropriate central place for our references to Unicode's own copious documentation, so point there.

Also add to the isspace() test a thorough check that the implementation agrees with the intended definition. (cherry picked from commit 8c1c426)

Co-authored-by: Greg Price gnprice@gmail.com

miss-islington added a commit that referenced this pull request

Aug 19, 2019

@miss-islington @gnprice

The documented definition was much broader than the real one: there are tons of characters with general category "Other", and we don't (and shouldn't) treat most of them as whitespace.

Rewrite the definition to agree with the comment on _PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py, which is what generates that function and so ultimately governs.

Add suitable breadcrumbs so that a reader who wants to pin down exactly what this definition means (what's a "bidirectional class" of "B"?) can do so. The unicodedata module documentation is an appropriate central place for our references to Unicode's own copious documentation, so point there.

Also add to the isspace() test a thorough check that the implementation agrees with the intended definition. (cherry picked from commit 8c1c426)

Co-authored-by: Greg Price gnprice@gmail.com

lisroach pushed a commit to lisroach/cpython that referenced this pull request

Sep 10, 2019

@gnprice @lisroach

The documented definition was much broader than the real one: there are tons of characters with general category "Other", and we don't (and shouldn't) treat most of them as whitespace.

Rewrite the definition to agree with the comment on _PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py, which is what generates that function and so ultimately governs.

Add suitable breadcrumbs so that a reader who wants to pin down exactly what this definition means (what's a "bidirectional class" of "B"?) can do so. The unicodedata module documentation is an appropriate central place for our references to Unicode's own copious documentation, so point there.

Also add to the isspace() test a thorough check that the implementation agrees with the intended definition.

DinoV pushed a commit to DinoV/cpython that referenced this pull request

Jan 14, 2020

@gnprice @DinoV

The documented definition was much broader than the real one: there are tons of characters with general category "Other", and we don't (and shouldn't) treat most of them as whitespace.

Rewrite the definition to agree with the comment on _PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py, which is what generates that function and so ultimately governs.

Add suitable breadcrumbs so that a reader who wants to pin down exactly what this definition means (what's a "bidirectional class" of "B"?) can do so. The unicodedata module documentation is an appropriate central place for our references to Unicode's own copious documentation, so point there.

Also add to the isspace() test a thorough check that the implementation agrees with the intended definition.

websurfer5 pushed a commit to websurfer5/cpython that referenced this pull request

Jul 20, 2020

@gnprice @websurfer5

The documented definition was much broader than the real one: there are tons of characters with general category "Other", and we don't (and shouldn't) treat most of them as whitespace.

Rewrite the definition to agree with the comment on _PyUnicode_IsWhitespace, and with the logic in makeunicodedata.py, which is what generates that function and so ultimately governs.

Add suitable breadcrumbs so that a reader who wants to pin down exactly what this definition means (what's a "bidirectional class" of "B"?) can do so. The unicodedata module documentation is an appropriate central place for our references to Unicode's own copious documentation, so point there.

Also add to the isspace() test a thorough check that the implementation agrees with the intended definition.