msg86400 |
Author: Jarek Sobieszek (jarek) |
Date: 2009-04-24 10:39 |
u'\u1d79'.lower() returns u'\x00'. I think it should return u'\u1d79', at least according to my understanding of UnicodeData.txt (the lowercase field is empty). |
|
|
msg86401 |
Author: Walter Dörwald (doerwalter) *  |
Date: 2009-04-24 10:49 |
It *does* return u'\u1d79' for me on Python 2.5.2:

>>> u'\u1d79'.lower()
u'\u1d79'
>>> import sys
>>> sys.version
'2.5.2 (r252:60911, Apr 8 2008, 18:54:00) \n[GCC 3.3.5 (Debian 1:3.3.5-13)]'

However on 2.6.2 it's broken:

>>> u'\u1d79'.lower()
u'\x00'
>>> import sys
>>> sys.version
'2.6.2 (r262:71600, Apr 19 2009, 18:38:49) \n[GCC 4.0.1 (Apple Inc. build 5490)]' |
|
|
msg86405 |
Author: Walter Dörwald (doerwalter) *  |
Date: 2009-04-24 12:57 |
The following patch fixes the problem for me; however, it breaks the test suite. The change seems to have been introduced in r66362. Assigning to Martin. |
|
|
msg86406 |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *  |
Date: 2009-04-24 13:05 |
The same change should be applied to _PyUnicode_ToTitlecase as well. |
|
|
msg86411 |
Author: Walter Dörwald (doerwalter) *  |
Date: 2009-04-24 14:15 |
Updated the patch (diff2.txt) as requested by Amaury. |
|
|
msg86425 |
Author: Terry J. Reedy (terry.reedy) *  |
Date: 2009-04-24 18:51 |
Py3.0.1:

>>> '\u1d79'.lower()
'\x00'

I am guessing that this bug is in 2.7 and 3.1 as well. |
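One way to check other interpreter versions for this symptom is to scan for characters whose case mapping contains a NUL. The following is a minimal sketch, assuming Python 3 syntax; the helper name `find_nul_case_mappings` is made up for illustration and is not part of any patch attached to this issue.

```python
import sys


def find_nul_case_mappings(limit=sys.maxunicode + 1):
    """Return (code point, method name) pairs whose case mapping contains NUL.

    U+0000 itself is skipped, since it legitimately maps to itself.
    """
    broken = []
    for cp in range(1, limit):
        ch = chr(cp)
        for method in ("lower", "upper", "title"):
            if "\x00" in getattr(ch, method)():
                broken.append((cp, method))
    return broken


if __name__ == "__main__":
    # Restrict the scan to the BMP to keep it short.
    for cp, method in find_nul_case_mappings(0x10000):
        print("U+%04X: %s() yields NUL" % (cp, method))
```

On an affected build the report should include U+1D79 for lower(); on a fixed build it should print nothing.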
|
|
msg86447 |
Author: Walter Dörwald (doerwalter) *  |
Date: 2009-04-25 09:16 |
Here is a third version of the patch. AFAICT the logic of the unicode database is as follows:

* If the NODELTA_MASK is not set, delta is an offset.
* If NODELTA_MASK is set and delta is != 0, delta is the upper/lower/title case character.
* If NODELTA_MASK is set and delta is == 0, there is no upper/lower/title case variant (i.e. the method returns the original character).

Is this the correct interpretation? I've also updated the testsuite (changed the checksum and added a new test). (BTW, the patch is against the py3k branch). |
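The three-case interpretation above can be written out as a small Python model. This is only an illustrative sketch of the reading proposed in this message, not the C code in the patch; the function name `apply_case_field` and the value chosen for `NODELTA_MASK` are assumptions.

```python
# Placeholder flag value, chosen for illustration only.
NODELTA_MASK = 0x100


def apply_case_field(ch, delta, flags):
    """Model of the proposed reading of a per-character case field.

    `delta` is the stored upper/lower/title value for `ch`,
    and `flags` is the record's flag word.
    """
    cp = ord(ch)
    if not flags & NODELTA_MASK:
        # Flag not set: delta is an offset from the character itself.
        return chr(cp + delta)
    if delta != 0:
        # Flag set, non-zero field: the field is the target code point.
        return chr(delta)
    # Flag set, zero field: no case variant, return the original character.
    return ch
```

Under this reading, `apply_case_field('\u1d79', 0, NODELTA_MASK)` gives back the original character rather than `'\x00'`.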
|
|
msg86476 |
Author: Martin v. Löwis (loewis) *  |
Date: 2009-04-25 11:38 |
I think the patch is incorrect; the bug is already in makeunicodedata.py. For U+1d79, it should set the lowercase letter to U+1d79. If you look at makeunicodedata.py, you see that the entire logic is bogus: when the column is absent, it should default it to the character itself (except for titlecase, where it should default it to uppercase). Then, if it finds that one of the characters can't be delta-encoded, it should go back to changing the previous mappings as well. I'm attaching an untested patch that should do that. Also see , which is related. |
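The generator-side logic described here (default absent columns first, then fall back to absolute values for all three mappings if any delta does not fit) might look roughly like the following. This is a hedged sketch, not the attached patch: the function name `encode_case_record`, the flag value, and the signed 16-bit range for deltas are all assumptions made for illustration.

```python
# Placeholder flag value, chosen for illustration only.
NODELTA_MASK = 0x100


def encode_case_record(cp, upper_field, lower_field, title_field):
    """Return (flags, (upper, lower, title)) for one UnicodeData.txt entry.

    The *_field arguments are the hex strings from the case-mapping
    columns, or '' when a column is absent.
    """
    # An absent column defaults to the character itself ...
    upper = int(upper_field, 16) if upper_field else cp
    lower = int(lower_field, 16) if lower_field else cp
    # ... except titlecase, which defaults to the uppercase mapping.
    title = int(title_field, 16) if title_field else upper

    deltas = (upper - cp, lower - cp, title - cp)
    # Assumed constraint: each delta must fit in a signed 16-bit field.
    if all(-32768 <= d <= 32767 for d in deltas):
        return 0, deltas
    # Otherwise store all three mappings as absolute code points,
    # including the ones that defaulted to the character itself.
    return NODELTA_MASK, (upper, lower, title)
```

For U+1D79, whose uppercase U+A77D is too far away for such a delta, this stores the lowercase mapping as U+1D79 itself instead of 0.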
|
|
msg86506 |
Author: Walter Dörwald (doerwalter) *  |
Date: 2009-04-25 13:37 |
I've merged your version of the patch with my changes to the test suite and regenerated the Unicode database. Attached is the resulting patch (diff4.txt). |
|
|
msg86507 |
Author: Martin v. Löwis (loewis) *  |
Date: 2009-04-25 13:47 |
Feel free to check it into trunk, and merge into the other three branches from there. If you don't want to do that, assign it back to me. |
|
|
msg86511 |
Author: Walter Dörwald (doerwalter) *  |
Date: 2009-04-25 14:10 |
Checked in:
r71894 (trunk)
r71895 (release26-maint) |
|
|
msg86512 |
Author: Walter Dörwald (doerwalter) *  |
Date: 2009-04-25 14:17 |
Checked in:
r71896 (py3k)
r71897 (release30-maint) |
|
|
msg86513 |
Author: Walter Dörwald (doerwalter) *  |
Date: 2009-04-25 14:20 |
BTW, are the steps to regenerate the Unicode database documented somewhere? What I did was:

cp /Volumes/ftp.unicode.org/Public/5.1.0/ucd/UnicodeData.txt .
cp /Volumes/ftp.unicode.org/Public/5.1.0/ucd/CompositionExclusions.txt .
cp /Volumes/ftp.unicode.org/Public/5.1.0/ucd/EastAsianWidth.txt .
cp /Volumes/ftp.unicode.org/Public/5.1.0/ucd/DerivedCoreProperties.txt .
cp /Volumes/ftp.unicode.org/Public/3.2-Update/ucd/UnicodeData-3.2.0.txt .
cp /Volumes/ftp.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.txt .
cp /Volumes/ftp.unicode.org/Public/3.2-Update/CompositionExclusions-3.2.0.txt .
cp /Volumes/ftp.unicode.org/Public/3.2-Update/EastAsianWidth-3.2.0.txt .
cp /Volumes/ftp.unicode.org/Public/3.2-Update/DerivedCoreProperties-3.2.0.txt .
./python.exe Tools/unicode/makeunicodedata.py |
|
|
msg86514 |
Author: Martin v. Löwis (loewis) *  |
Date: 2009-04-25 14:46 |
> BTW, are the steps to regenerate the Unicode database documented
> somewhere?

I don't think so - your procedure looks right, though. Regenerating the database is often more difficult, in particular when we upgrade to a new version. Often, the new version will add new complications which have to be dealt with, so a deep understanding of makeunicodedata.py is often needed to be able to use it. Welcome to the club :-) |
|
|