linebreak from Unicode database.

Created on 2006-10-05 07:57 by andersch, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
Unicodedata_part1.patch	andersch,2006-10-05 07:57	Generate unicodedata part1
Unicodedata_part2.patch	andersch,2006-10-05 08:00	Generate unicodedata part2
Unicodedata.patch	andersch,2006-10-06 09:44
unicodedata-2.7.patch	amaury.forgeotdarc,2009-07-01 00:03

Messages (9)
msg51199	Author: Anders Chrigström (andersch)	Date: 2006-10-05 07:57
This patch changes the functions _PyUnicode_ToNumeric, _PyUnicode_IsLinebreak and _PyUnicode_IsWhitespace from having to be manually updated into being generated from data in the Unicode database. It will also read numeric values for characters whose numeric type is defined in the Unihan.txt file and not in the UnicodeData.txt file. The patch should work for both the release25-maint branch and the trunk. The patch is so big I had to split it into two files for SourceForge to accept it.
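The idea of generating such a function from the database, rather than maintaining it by hand, can be sketched roughly as follows. This is a minimal illustration and not the script from the patch: the names `SAMPLE_ROWS`, `build_numeric_table` and `emit_c_switch` are hypothetical, the four rows are a tiny excerpt in the UnicodeData.txt layout, and the emitted C text (including the `-1.0` default) is only an assumed shape for the generated code.

```python
from fractions import Fraction

# A few rows in the semicolon-separated UnicodeData.txt layout.
# Field 0 is the code point (hex); field 8 is the numeric value, if any.
SAMPLE_ROWS = """\
0020;SPACE;Zs;0;WS;;;;;N;;;;;
0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
00BD;VULGAR FRACTION ONE HALF;No;0;ON;<fraction> 0031 2044 0032;;;1/2;N;FRACTION ONE HALF;;;;
"""


def build_numeric_table(text):
    """Map code point -> float for every row with a non-empty numeric field."""
    table = {}
    for line in text.splitlines():
        if not line:
            continue
        fields = line.split(";")
        numeric = fields[8]
        if numeric:
            # Fraction parses both "1" and "1/2".
            table[int(fields[0], 16)] = float(Fraction(numeric))
    return table


def emit_c_switch(table):
    """Render the table as the body of a C switch (illustrative shape only)."""
    lines = ["switch (ch) {"]
    for code, value in sorted(table.items()):
        lines.append(f"case 0x{code:04X}:")
        lines.append(f"    return (double) {value!r};")
    lines.append("default:")
    lines.append("    return -1.0;")
    lines.append("}")
    return "\n".join(lines)


NUMERIC_TABLE = build_numeric_table(SAMPLE_ROWS)
print(emit_c_switch(NUMERIC_TABLE))
```

Only the rows with a numeric field (U+0031 and U+00BD here) end up in the table, so a new Unicode release would be picked up by rerunning the generator instead of editing the C source.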
msg51200	Author: Marc-Andre Lemburg (lemburg) *	Date: 2006-10-05 10:45
Instead of attaching the patch with the generated code, could you please just attach the script that generates the files and/or any patch needed to support the new generation of the above three functions? That makes reviewing this a lot easier. Thanks.
msg51201	Author: Anders Chrigström (andersch)	Date: 2006-10-06 09:44
Here is a patch without the generated files.
msg84457	Author: Daniel Diniz (ajaksu2) *	Date: 2009-03-30 02:04
I believe this one is out of date, but without a sample test to check, verifying is harder...
msg89954	Author: Vernon Cole (vernondcole)	Date: 2009-06-30 22:39
Adding Python 2.6 to the list of affected versions - as that is where I found the bug reported in issue 6383 (now superseded by this one.)
msg89959	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2009-07-01 00:03
Here is a refreshed version of the patch, without the generated files. The patch combines several changes which are fairly independent from each other:
- Using the unicode database to generate the functions adds 143 new codepoints to PyUnicode_ToNumeric, and one codepoint to PyUnicode_IsWhitespace.
- In addition, PyUnicode_ToNumeric now contains code for all numerics; previously those which are also digits fell in the 'default:' case and were converted with PyUnicode_ToDigit(). This adds 468 new codepoints, but removes the need to call PyUnicode_ToDigit().
- The Unihan.txt files (two files to download, 25Mb each) are now parsed, and this adds 73 more codepoints to PyUnicode_ToNumeric. (There are now 1009 entries in this function.) The 3.2.0 version of this file contains two huge numbers, 1e16 and 1e20, so I had to widen the type of 'change_record.numeric_changed' from 'int' to 'double'. It is possible that these were removed from the Unicode database between versions 4.1 and 5.1.
- The database has a new flag, NUMERIC_MASK, used by PyUnicode_IsNumeric. This adds ~350 lines in the arrays of numbers in unicodetype_db.h.
If this patch is accepted, the md5 checksum in test_unicodedata.py will need to change.
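The three kinds of numeric code points discussed above (digits, non-digit numerics, and ideographs whose value comes from Unihan data) can be observed from Python through the standard `unicodedata` module and the `str` predicates. This is a small sketch run against whatever Unicode version the interpreter bundles, not a test taken from the patch; the variable names are illustrative.

```python
import unicodedata

ascii_digit = "9"      # a digit, and therefore also a numeric
one_half = "\u00bd"    # VULGAR FRACTION ONE HALF: numeric but not a digit
han_five = "\u4e94"    # CJK ideograph for five: numeric value defined in Unihan data

for ch in (ascii_digit, one_half, han_five):
    # Show the code point, its numeric value, and the two str predicates.
    print(f"U+{ord(ch):04X}", unicodedata.numeric(ch), ch.isdigit(), ch.isnumeric())

# A character without a numeric value raises ValueError unless a default is passed.
print(unicodedata.numeric("A", None))
```

The digit case is the one that previously went through the 'default:' branch; the ideograph case depends on the Unihan data being parsed.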
msg93597	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2009-10-05 12:33
Marc-Andre, could you comment on this patch? The comments above were made by inspecting the generated code, comparing with the previous version. IMO the only drawback is the increased memory usage.
msg93600	Author: Marc-Andre Lemburg (lemburg) *	Date: 2009-10-05 12:55
Amaury Forgeot d'Arc wrote:
>
> Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:
>
> Marc-Andre, could you comment on this patch?
> The comments above were made by inspecting the generated code, comparing
> with the previous version.
> IMO the only drawback is the increased memory usage.

I haven't tried applying the patch, but from reading it, it looks good.
msg93663	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2009-10-06 21:35
Patch applied with r75272. Merged to py3k, adapted and regenerated files with r75274.

History
Date	User	Action	Args
2022-04-11 14:56:20	admin	set	github: 44085
2010-04-01 11:58:25	flox	link	issue1498930 superseder
2009-10-06 21:35:55	amaury.forgeotdarc	set	status: open -> closed; resolution: fixed; messages: +
2009-10-05 12:55:02	lemburg	set	messages: +
2009-10-05 12:38:07	amaury.forgeotdarc	set	files: - unicodectype_ucs4-2.patch
2009-10-05 12:37:39	amaury.forgeotdarc	set	files: + unicodectype_ucs4-2.patch
2009-10-05 12:33:21	amaury.forgeotdarc	set	messages: +
2009-07-01 00:03:30	amaury.forgeotdarc	set	files: + unicodedata-2.7.patch; nosy: + amaury.forgeotdarc; messages: +
2009-06-30 23:01:22	ezio.melotti	set	nosy: + ezio.melotti
2009-06-30 22:39:43	vernondcole	set	nosy: + vernondcole; messages: +; versions: + Python 2.6, Python 3.0
2009-06-30 21:29:30	amaury.forgeotdarc	link	issue6383 superseder
2009-06-30 21:29:30	amaury.forgeotdarc	unlink	issue6383 dependencies
2009-06-30 19:11:47	loewis	link	issue6383 dependencies
2009-03-30 02:04:52	ajaksu2	link	issue1571170 dependencies
2009-03-30 02:04:07	ajaksu2	set	versions: + Python 3.1, Python 2.7; nosy: + ajaksu2; messages: +; type: enhancement; stage: test needed
2006-10-05 07:57:32	andersch	create