bpo-29456: bugs in unicodedata.normalize: u1176, u11a7 and u11c3 by Pusnow · Pull Request #1958 · python/cpython

Pusnow

@corona10

@Pusnow
I am not a committer on this project,
but there is one thing I would like to raise in review.
Could you add tests for your change?
You can add your test cases here.

Thank you.

@Pusnow

Okay, I added some tests for the issue.
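
As an illustration of the kind of regression check discussed here, the sketch below runs unicodedata.normalize over the three code points named in the PR title. It is not the test added in this PR: the input strings and the check_boundaries helper are assumptions made for the example, and the expected results assume an interpreter that already includes this fix.

```python
import unicodedata

# Hypothetical inputs: a leading consonant (U+1100) followed by each of the
# three boundary code points from the PR title, mapped to the NFC form that
# a fixed interpreter is expected to produce.
CASES = {
    "\u1100\u1176": "\u1100\u1176",        # U+1176 is not a modern vowel: nothing composes
    "\u1100\u1175\u11a7": "\uae30\u11a7",  # U+11A7 is not a trailing consonant
    "\u1100\u1175\u11c3": "\uae30\u11c3",  # U+11C3 lies past the last modern trailing consonant
}

def check_boundaries():
    """Return the inputs whose NFC form differs from the expected string."""
    return [s for s, expected in CASES.items()
            if unicodedata.normalize("NFC", s) != expected]

print(check_boundaries())
```

An empty list means all three boundary cases normalize as expected.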

mdickinson

  int LIndex, VIndex;
  LIndex = code - LBase;
  VIndex = PyUnicode_READ(kind, data, i+1) - VBase;
  code = SBase + (LIndex*VCount+VIndex)*TCount;
  i+=2;
  if (i < len &&
-     TBase <= PyUnicode_READ(kind, data, i) &&
-     PyUnicode_READ(kind, data, i) <= (TBase+TCount)) {
+     TBase < PyUnicode_READ(kind, data, i) &&

Are you sure this should be < rather than <=?

Yes.
That condition checks whether PyUnicode_READ(kind, data, i) is a trailing (final) consonant, but TBase (0x11A7) itself is the last vowel in the Hangul Jamo block.
So < is correct rather than <=.
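
The point about TBase can be checked directly. In this small sketch the constants come from the Hangul composition algorithm in the Unicode standard, while is_modern_trailing is a hypothetical helper that mirrors the strict comparison from the diff.

```python
import unicodedata

T_BASE = 0x11A7   # one below the first trailing consonant, U+11A8
T_COUNT = 28      # 27 trailing consonants plus the "no trailing consonant" slot

def is_modern_trailing(cp: int) -> bool:
    """True if cp is one of the 27 composable trailing consonants U+11A8..U+11C2."""
    return T_BASE < cp < T_BASE + T_COUNT

# Print the character name next to the result for the code points on each edge.
for cp in (0x11A7, 0x11A8, 0x11C2, 0x11C3):
    print(f"U+{cp:04X}", unicodedata.name(chr(cp)), is_modern_trailing(cp))
```

The printed names show which side of the vowel/consonant boundary each code point falls on.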

Thanks! And after checking (which I should have done before leaving my comment), I see that this agrees with section 3.12 of (version 10 of) the Unicode standard.

Still, Python eyes are rather used to seeing half-open ranges, so anything other than lower <= value < high looks surprising. Is it worth adding a comment explaining what's going on?
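
One way to make the asymmetry less surprising is to write both checks as half-open ranges. The following is a rough Python transcription of the composition step under review, not the CPython implementation: compose_hangul is a hypothetical name, and it assumes the input is already fully decomposed into jamo.

```python
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28

def compose_hangul(text: str) -> str:
    """Compose L+V(+T) jamo sequences into precomposed Hangul syllables."""
    out = []
    i = 0
    while i < len(text):
        code = ord(text[i])
        if (L_BASE <= code < L_BASE + L_COUNT
                and i + 1 < len(text)
                and V_BASE <= ord(text[i + 1]) < V_BASE + V_COUNT):
            l_index = code - L_BASE
            v_index = ord(text[i + 1]) - V_BASE
            code = S_BASE + (l_index * V_COUNT + v_index) * T_COUNT
            i += 2
            # Half-open form of TBase < t < TBase + TCount: index 0 means
            # "no trailing consonant", so the range starts at T_BASE + 1.
            if i < len(text) and T_BASE + 1 <= ord(text[i]) < T_BASE + T_COUNT:
                code += ord(text[i]) - T_BASE
                i += 1
            out.append(chr(code))
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

print(compose_hangul("\u1112\u1161\u11ab"))  # composes to the single syllable U+D55C
```

Writing the trailing check as T_BASE + 1 <= t keeps the familiar lower <= value < high shape while making the reserved zero index explicit.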

Okay, I'll add some comments.

I've just added some comments. Is it enough?

Thanks! Yes, that's helpful.

@Pusnow

I think it can be merged. Is there anything I need to do?

@corona10

@Pusnow

Done, thank you for the response.

zhangyangyu

@vstinner

@miss-islington

Thanks @Pusnow for the PR, and @zhangyangyu for merging it 🌮🎉. I'm working now to backport this PR to: 2.7, 3.6, 3.7.
🐍🍒⛏🤖

@bedevere-bot

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request

Jun 15, 2018

@Pusnow @miss-islington

…ythonGH-1958)

Hangul composition check boundaries are wrong for the second character ([0x1161, 0x1176) instead of [0x1161, 0x1176]) and third character ((0x11A7, 0x11C3) instead of [0x11A7, 0x11C3]). (cherry picked from commit d134809)
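
The boundary change described in this commit message can be spelled out with plain ranges. This sketch only compares the old inclusive bounds with the corrected ones; the variable names are made up for the example.

```python
V_BASE, V_COUNT = 0x1161, 21
T_BASE, T_COUNT = 0x11A7, 28

old_v = set(range(V_BASE, V_BASE + V_COUNT + 1))  # [0x1161, 0x1176]
new_v = set(range(V_BASE, V_BASE + V_COUNT))      # [0x1161, 0x1176)
old_t = set(range(T_BASE, T_BASE + T_COUNT + 1))  # [0x11A7, 0x11C3]
new_t = set(range(T_BASE + 1, T_BASE + T_COUNT))  # (0x11A7, 0x11C3)

# Code points that the old checks accepted and the corrected checks reject.
extra = sorted((old_v - new_v) | (old_t - new_t))
print([f"U+{cp:04X}" for cp in extra])  # ['U+1176', 'U+11A7', 'U+11C3']
```

The three code points that drop out are exactly the ones named in the PR title.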

Co-authored-by: Wonsup Yoon <pusnow@me.com>

@bedevere-bot

@miss-islington

Sorry, @Pusnow and @zhangyangyu, I could not cleanly backport this to 2.7 due to a conflict.
Please backport using cherry_picker on command line.
cherry_picker d134809cd3764c6a634eab7bb8995e3e2eff14d5 2.7

miss-islington added a commit that referenced this pull request

Jun 15, 2018

@miss-islington @Pusnow

…H-1958)

Hangul composition check boundaries are wrong for the second character ([0x1161, 0x1176) instead of [0x1161, 0x1176]) and third character ((0x11A7, 0x11C3) instead of [0x11A7, 0x11C3]). (cherry picked from commit d134809)

Co-authored-by: Wonsup Yoon <pusnow@me.com>

zhangyangyu pushed a commit to zhangyangyu/cpython that referenced this pull request

Jun 15, 2018

@Pusnow @zhangyangyu

…u11c3 (pythonGH-1958)

Hangul composition check boundaries are wrong for the second character ([0x1161, 0x1176) instead of [0x1161, 0x1176]) and third character ((0x11A7, 0x11C3) instead of [0x11A7, 0x11C3]). (cherry picked from commit d134809)

Co-authored-by: Wonsup Yoon <pusnow@me.com>

@bedevere-bot

zhangyangyu added a commit that referenced this pull request

Jun 15, 2018

@zhangyangyu @Pusnow

…H-1958) (GH-7704)

Hangul composition check boundaries are wrong for the second character ([0x1161, 0x1176) instead of [0x1161, 0x1176]) and third character ((0x11A7, 0x11C3) instead of [0x11A7, 0x11C3]). (cherry picked from commit d134809)

Co-authored-by: Wonsup Yoon <pusnow@me.com>