Issue 1054943: Python may contain NFC/NFKC bug per Unicode PRI #29

Created on 2004-10-26 23:58 by rick_mcgowan, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
unicode_pr29.patch	vstinner, 2009-05-04 10:42

Messages (11)
msg22884 - (view)	Author: Rick McGowan (rick_mcgowan)	Date: 2004-10-26 23:58
The Unicode Technical Committee posted Public Review Issue #29, describing a bug in the documentation of NFC and NFKC in the text of UAX #15 Unicode Normalization Forms. I have examined unicodedata.c in the Python implementation (2.3.4) and it appears the implementation of normalization in Python 2.3.4 may have the bug therein described. Please see the description of the bug and the textual fix that is being made to UAX #15, at the URL:

http://www.unicode.org/review/pr-29.html

The bug is in the definition of rule D2, affecting the characters "blocked" during re-composition. You may contact me by e-mail, or fill out the Unicode.org error reporting form if you have any questions or concerns.

Since Python uses Unicode internally, it may also be wise to have someone from the Python development community on the Unicode Consortium's notification list to receive immediate notifications of public review issues, bugs, and other announcements affecting implementation of the standard.
msg22885 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2004-10-27 18:11
Thanks for submitting a bug report. The problem does indeed occur in the Python normalization code:

    >>> unicodedata.normalize('NFC', u'\u0B47\u0300\u0B3E')
    u'\u0b4b\u0300'

I think the following line in unicodedata.c needs to be changed:

    if (comb1 && comb == comb1) {
        /* Character is blocked. */
        i1++;
        continue;
    }

to

    if (comb && (comb1 == 0 || comb == comb1)) {
        /* Character is blocked. */
        i1++;
        continue;
    }

Martin, what do you think?
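For anyone reproducing the report above on a current interpreter, here is a minimal sketch in Python 3 syntax. It assumes an interpreter that already includes the PR 29 fix committed later in this issue; the helper names `combining_classes` and `is_stable_under_nfc` are illustrative and not part of unicodedata.

```python
import unicodedata

# Sequence from the report: ORIYA VOWEL SIGN E, COMBINING GRAVE ACCENT,
# ORIYA VOWEL SIGN AA.
SEQ = "\u0B47\u0300\u0B3E"


def combining_classes(text):
    """Return the canonical combining class of each character in text."""
    return [unicodedata.combining(ch) for ch in text]


def is_stable_under_nfc(text):
    """True if NFC leaves text unchanged (assumes a PR 29-fixed interpreter)."""
    return unicodedata.normalize("NFC", text) == text


if __name__ == "__main__":
    # The grave accent (class 230) sits between two class-0 characters.
    print(combining_classes(SEQ))
    # With the corrected rule, the trailing vowel sign must not compose
    # with the first character across the intervening accent.
    print(is_stable_under_nfc(SEQ))
    # Without the intervening accent, the pair composes to U+0B4B.
    print(unicodedata.normalize("NFC", "\u0B47\u0B3E") == "\u0B4B")
```

On an unfixed build such as 2.3.4, the same normalize call returns the composed form shown in the message above.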
msg22886 - (view)	Author: Rick McGowan (rick_mcgowan)	Date: 2004-10-27 20:11
Thanks all for quick reply. My initial thoughts regarding a fix were as below. The relevant piece of code seems to be in function "nfc_nfkc()" in the file unicodedata.c:

> if (comb1 && comb == comb1) {
>     /* Character is blocked. */
>     i1++;
>     continue;
> }

That should possibly be changed to:

> if (comb1 && (comb <= comb1)) {
>     /* Character is blocked. */
>     i1++;
>     continue;
> }

because the new spec says "either B is a starter or it has the same or higher combining class as C".
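The quoted rule can be sketched as a predicate over combining classes. This is an illustration in Python of the wording in the message above, not the code in unicodedata.c. The name `is_blocked` and its `corrected` flag are hypothetical, and the `corrected=False` branch models the earlier wording (B is a starter or has the same combining class as C) as an assumption based on the PR 29 description.

```python
import unicodedata


def is_blocked(text, starter_index, c_index, corrected=True):
    """Return True if text[c_index] is blocked from the starter at starter_index.

    C is blocked if some character B strictly between the starter and C is
    itself a starter (class 0), or, under the corrected rule, has the same
    or a higher combining class than C. With corrected=False only an equal
    combining class blocks (assumed earlier wording).
    """
    ccc_c = unicodedata.combining(text[c_index])
    for b in text[starter_index + 1:c_index]:
        ccc_b = unicodedata.combining(b)
        if ccc_b == 0:
            return True
        if corrected and ccc_b >= ccc_c:
            return True
        if not corrected and ccc_b == ccc_c:
            return True
    return False


if __name__ == "__main__":
    seq = "\u0B47\u0300\u0B3E"
    print(is_blocked(seq, 0, 2, corrected=True))
    print(is_blocked(seq, 0, 2, corrected=False))
```

For the sequence from this report, the intervening accent has class 230 and the final vowel sign has class 0, so only the corrected rule treats the vowel sign as blocked.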
msg22887 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2005-03-15 08:59
Is it true that the most recent interpretation of this PR suggests that the correction should only apply to Unicode 4.1? If so, I think Python should abstain from adopting the change right now, and should defer that to the point when the Unicode 4.1 database is incorporated.
msg22888 - (view)	Author: Rick McGowan (rick_mcgowan)	Date: 2005-03-15 16:45
Yes. The "current" version of UAX #15 is an annex to Unicode 4.1, which will be coming out very soon. No previous versions of Unicode have been changed. Previous versions of UAX #15 apply to previous versions of the standard.

The UTC plans to issue a "corrigendum" for this problem, and the corrigendum is something that can be applied to implementations of earlier versions of Unicode. In that case, one would cite the implementation of "Unicode Version X with Corrigendum Y" as shown on the "Enumerated Versions" page of the Unicode web site. To follow corrigenda, you may want to keep tabs on the "Updates and Errata" page on the Unicode web site. This is likely to be Corrigendum #5.

You could fix the bug when you update Python to Unicode 4.1, or fix it when the corrigendum comes out. Of course, I would recommend fixing bugs sooner rather than later, but your release plans may be such that one path is easier. If it's going to be a long time before you update to 4.1, you may want to fix the bug and cite the corrigendum when it comes out. If you plan to update to 4.1 soon after it comes out, perhaps fixing the bug with that update is fine.
msg22889 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2006-03-10 12:00
When this is fixed, the @Part3 data in the normalization tests need to be considered as well.
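As a rough sketch of what checking such test data could look like: the code below assumes the normalization test file uses five semicolon-separated columns of hex code points (c1 through c5) and that the NFC conformance conditions are c2 == NFC(c1) == NFC(c2) == NFC(c3) and c4 == NFC(c4) == NFC(c5). The helpers `parse_columns` and `check_nfc` are hypothetical names, the sample line is hand-built in that layout from the sequence in this report rather than quoted from the file, and the result assumes an interpreter with the PR 29 fix.

```python
import unicodedata


def parse_columns(line):
    """Split a 'c1;c2;c3;c4;c5; # comment' line into five decoded strings."""
    data = line.split("#", 1)[0].strip()
    fields = data.split(";")[:5]
    return ["".join(chr(int(cp, 16)) for cp in field.split())
            for field in fields]


def check_nfc(line):
    """Check the assumed NFC conditions for one test line."""
    c1, c2, c3, c4, c5 = parse_columns(line)

    def nfc(s):
        return unicodedata.normalize("NFC", s)

    return (c2 == nfc(c1) == nfc(c2) == nfc(c3)) and (c4 == nfc(c4) == nfc(c5))


# Hand-built example line (not copied from the test file): the sequence
# from this report is expected to be unchanged in every column.
SAMPLE = ("0B47 0300 0B3E;0B47 0300 0B3E;0B47 0300 0B3E;"
          "0B47 0300 0B3E;0B47 0300 0B3E; # hand-built example")

if __name__ == "__main__":
    print(check_nfc(SAMPLE))
```

A line whose second column held the composed form U+0B4B U+0300 would fail this check on a fixed interpreter, which is the behavior the PR 29 data is meant to pin down.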
msg59202 - (view)	Author: Christian Heimes (christian.heimes) *	Date: 2008-01-04 01:21
Python 2.6, and probably also 2.5, still contains the line: if (comb1 && comb == comb1) {...}
msg86583 - (view)	Author: Daniel Diniz (ajaksu2) *	Date: 2009-04-26 01:05
The code is the same as described by MAL and we're now on Unicode DB 5.1.
msg87111 - (view)	Author: STINNER Victor (vstinner) *	Date: 2009-05-04 10:42
Here is a patch fixing Unicode issue "PR29". I used the test cases given in http://www.unicode.org/review/pr-29.html
msg100382 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-03-04 12:16
Committed: r78646 (trunk), r78647 (py3k), r78648 (3.1). I'm leaving the issue open to remind me that I have to backport to 2.6 (after the 2.6.5 release).
msg101424 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-03-21 13:42
> Committed: r78646 (trunk)

Backport done: r79201 (2.6).

History
Date	User	Action	Args
2022-04-11 14:56:07	admin	set	github: 41086
2010-03-21 13:42:03	vstinner	set	status: open -> closed; resolution: remind -> fixed; messages: +
2010-03-06 16:09:20	loewis	set	status: pending -> open
2010-03-06 15:30:21	ezio.melotti	set	status: open -> pending; versions: + Python 2.7, Python 3.2; nosy: + ezio.melotti; resolution: remind; stage: test needed -> resolved
2010-03-04 12:16:53	vstinner	set	messages: +
2009-05-04 10:42:34	vstinner	set	files: + unicode_pr29.patch; keywords: + patch; messages: +
2009-04-26 01:05:37	ajaksu2	set	versions: + Python 3.1, - Python 2.5; nosy: + ajaksu2, vstinner; messages: +; type: behavior; stage: test needed
2008-01-04 01:21:07	christian.heimes	set	nosy: + christian.heimes; messages: +; versions: + Python 2.6, Python 2.5, - Python 2.3
2004-10-26 23:58:06	rick_mcgowan	create