Issue 1571184: Generate numeric/space/linebreak from Unicode database.

Created on 2006-10-05 07:57 by andersch, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
Unicodedata_part1.patch andersch,2006-10-05 07:57 Generate unicodedata part1
Unicodedata_part2.patch andersch,2006-10-05 08:00 Generate unicodedata part2
Unicodedata.patch andersch,2006-10-06 09:44
unicodedata-2.7.patch amaury.forgeotdarc,2009-07-01 00:03
Messages (9)
msg51199 - (view) Author: Anders Chrigström (andersch) Date: 2006-10-05 07:57
This patch changes the functions _PyUnicode_ToNumeric, _PyUnicode_IsLinebreak and _PyUnicode_IsWhitespace from having to be manually updated into being generated from data in the Unicode database. It will also read numeric values for characters whose numeric type is defined in the Unihan.txt file and not in the UnicodeData.txt file. The patch should work for both the release25-maint branch and the trunk. The patch is so big I had to split it into two files for SourceForge to accept it.
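As an illustration of the idea described above (deriving the lookup functions from database records instead of maintaining them by hand), here is a minimal sketch. It is not the code from the patch: the helper names `build_tables` and `emit_cases`, the tiny `SAMPLE` excerpt, and the whitespace/linebreak rules are assumptions chosen for the example.

```python
from fractions import Fraction

# A few records in the semicolon-separated UnicodeData.txt layout.
# Illustrative sample only; the real file has many thousands of lines.
SAMPLE = """\
000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;
0020;SPACE;Zs;0;WS;;;;;N;;;;;
0035;DIGIT FIVE;Nd;0;EN;;5;5;5;N;;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
00BD;VULGAR FRACTION ONE HALF;No;0;ON;<fraction> 0031 2044 0032;;;1/2;N;FRACTION ONE HALF;;;;
"""


def build_tables(text):
    """Return (numeric, whitespace, linebreak) derived from the records."""
    numeric, whitespace, linebreak = {}, set(), set()
    for line in text.splitlines():
        fields = line.split(";")
        code = int(fields[0], 16)
        category, bidi, value = fields[2], fields[4], fields[8]
        if value:
            # Field 8 holds the numeric value, possibly as a fraction ("1/2").
            numeric[code] = float(Fraction(value))
        # Assumed rule for this sketch: space separators plus selected
        # bidirectional classes count as whitespace.
        if category == "Zs" or bidi in ("WS", "B", "S"):
            whitespace.add(code)
        # Assumed rule for this sketch: paragraph separators break lines.
        if bidi == "B":
            linebreak.add(code)
    return numeric, whitespace, linebreak


def emit_cases(numeric):
    """Render the numeric table as C-style 'case' lines (hypothetical format)."""
    return ["case 0x%04X: return (double) %r;" % (code, value)
            for code, value in sorted(numeric.items())]


numeric, whitespace, linebreak = build_tables(SAMPLE)
print("\n".join(emit_cases(numeric)))
```

Generating the table from the data files means a Unicode version bump only requires rerunning the script, not editing the C functions by hand.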
msg51200 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2006-10-05 10:45
Instead of attaching the patch with the generated code, could you please just attach the script that generates the files and/or any patch needed to support the new generation of the above three functions? That makes reviewing this a lot easier. Thanks.
msg51201 - (view) Author: Anders Chrigström (andersch) Date: 2006-10-06 09:44
Here is a patch without the generated files.
msg84457 - (view) Author: Daniel Diniz (ajaksu2) * (Python triager) Date: 2009-03-30 02:04
I believe this one is out of date, but without a sample test to check against, verifying that is harder...
msg89954 - (view) Author: Vernon Cole (vernondcole) Date: 2009-06-30 22:39
Adding Python 2.6 to the list of affected versions - as that is where I found the bug reported in issue 6383 (now superseded by this one.)
msg89959 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2009-07-01 00:03
Here is a refreshed version of the patch, without the generated files. The patch combines several changes which are fairly independent from each other:
- Using the unicode database to generate the functions adds 143 new codepoints to PyUnicode_ToNumeric, and one codepoint to PyUnicode_IsWhitespace.
- In addition, PyUnicode_ToNumeric now contains code for all numerics; previously those which are also digits fell in the 'default:' case and were converted with PyUnicode_ToDigit(). This adds 468 new codepoints, but removes the need to call PyUnicode_ToDigit().
- The Unihan.txt files (two files to download, 25Mb each) are now parsed, and this adds 73 more codepoints to PyUnicode_ToNumeric. (There are now 1009 entries in this function.) The 3.2.0 version of this file contains two huge numbers: 1e16 and 1e20, I had to widen the type of 'change_record.numeric_changed' from 'int' to 'double'. It is possible that these were removed from the Unicode database between versions 4.1 and 5.1.
- The database has a new flag, NUMERIC_MASK, used by PyUnicode_IsNumeric. This adds ~350 lines in the arrays of numbers in unicodetype_db.h
If this patch is accepted, the md5 checksum in test_unicodedata.py will need to change.
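The relationship between "digits" and "numerics" mentioned in this comment can be observed from Python through the standard `unicodedata` module. The following is a small sketch, not part of the patch; the helper names `numeric_kinds` and `count_numeric` are made up for the example, and the counts it reports depend on the Unicode version bundled with the interpreter.

```python
import sys
import unicodedata


def numeric_kinds(ch):
    """Return (decimal, digit, numeric) for ch, with None where undefined."""
    return (unicodedata.decimal(ch, None),
            unicodedata.digit(ch, None),
            unicodedata.numeric(ch, None))


def count_numeric():
    """Count code points that are digits versus numeric-only."""
    digits = numeric_only = 0
    for code in range(sys.maxunicode + 1):
        ch = chr(code)
        if unicodedata.numeric(ch, None) is None:
            continue  # no numeric value at all
        if unicodedata.digit(ch, None) is None:
            numeric_only += 1  # e.g. fractions such as U+00BD
        else:
            digits += 1  # digits also carry a numeric value
    return digits, numeric_only


print(numeric_kinds("5"))       # ordinary decimal digit
print(numeric_kinds("\u00b2"))  # SUPERSCRIPT TWO: a digit, but not a decimal
print(numeric_kinds("\u00bd"))  # VULGAR FRACTION ONE HALF: numeric only
print(count_numeric())
```

A character with a digit value always has a numeric value too, which is why a single numeric table can cover both cases without falling back to a separate digit lookup.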
msg93597 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2009-10-05 12:33
Marc-Andre, could you comment on this patch? The comments above were made by inspecting the generated code, comparing with the previous version. IMO the only drawback is the increased memory usage.
msg93600 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009-10-05 12:55
Amaury Forgeot d'Arc wrote:
> Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:
>
> Marc-Andre, could you comment on this patch?
> The comments above were made by inspecting the generated code, comparing
> with the previous version.
> IMO the only drawback is the increased memory usage.

I haven't tried applying the patch, but from reading it, it looks good.
msg93663 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2009-10-06 21:35
Patch applied with r75272. Merged to py3k, adapted and regenerated files with r75274.
History
Date User Action Args
2022-04-11 14:56:20 admin set github: 44085
2010-04-01 11:58:25 flox link issue1498930 superseder
2009-10-06 21:35:55 amaury.forgeotdarc set status: open -> closed; resolution: fixed; messages: +
2009-10-05 12:55:02 lemburg set messages: +
2009-10-05 12:38:07 amaury.forgeotdarc set files: - unicodectype_ucs4-2.patch
2009-10-05 12:37:39 amaury.forgeotdarc set files: + unicodectype_ucs4-2.patch
2009-10-05 12:33:21 amaury.forgeotdarc set messages: +
2009-07-01 00:03:30 amaury.forgeotdarc set files: + unicodedata-2.7.patch; nosy: + amaury.forgeotdarc; messages: +
2009-06-30 23:01:22 ezio.melotti set nosy: + ezio.melotti
2009-06-30 22:39:43 vernondcole set nosy: + vernondcole; messages: +; versions: + Python 2.6, Python 3.0
2009-06-30 21:29:30 amaury.forgeotdarc link issue6383 superseder
2009-06-30 21:29:30 amaury.forgeotdarc unlink issue6383 dependencies
2009-06-30 19:11:47 loewis link issue6383 dependencies
2009-03-30 02:04:52 ajaksu2 link issue1571170 dependencies
2009-03-30 02:04:07 ajaksu2 set versions: + Python 3.1, Python 2.7; nosy: + ajaksu2; messages: +; type: enhancement; stage: test needed
2006-10-05 07:57:32 andersch create