Issue 2834: re.IGNORECASE not Unicode-ready

Created on 2008-05-12 08:44 by sven.siegmund, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
reunicode.patch	pitrou, 2008-06-29 01:19
reunicode2.patch	pitrou, 2008-06-29 20:21
reunicode3.patch	pitrou, 2008-06-29 20:36
reunicode4.patch	pitrou, 2008-07-05 21:09
reunicode5.patch	pitrou, 2008-07-28 16:39

Messages (24)
msg66715 - (view)	Author: Sven Siegmund (sven.siegmund)	Date: 2008-05-12 08:43
re cannot ignore case of special latin characters:
Python 3.0a5 (py3k:62932M, May 9 2008, 16:23:11) [MSC v.1500 32 bit (Intel)] on win32
>>> 'Á'.lower() == 'á' and 'á'.upper() == 'Á'
True
>>> import re
>>> rx = re.compile('Á', re.IGNORECASE)
>>> rx.match('á') # should match but won't
>>> rx.match('Á') # will match
<_sre.SRE_Match object at 0x014B08A8>
>>> rx = re.compile('á', re.IGNORECASE)
>>> rx.match('Á') # should match but won't
>>> rx.match('á') # will match
<_sre.SRE_Match object at 0x014B08A8>
msg66727 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2008-05-12 14:44
Try adding re.LOCALE to the flags. I'm not sure why that is needed but it seems to fix this issue. I still think this is a legitimate bug though.
msg67622 - (view)	Author: Manuel Kaufmann (humitos) *	Date: 2008-06-02 00:23
I have the same error with the re.LOCALE flag...
[humitos] [~]$ python3.0
Python 3.0a5+ (py3k:63855, Jun 1 2008, 13:05:09)
[GCC 4.1.3 20080114 (prerelease) (Debian 4.1.2-19)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> rx = re.compile('á', re.LOCALE | re.IGNORECASE)
>>> rx.match('Á')
>>> rx.match('á')
<_sre.SRE_Match object at 0x2b955e204d30>
>>> rx = re.compile('Á', re.IGNORECASE | re.LOCALE)
>>> rx.match('Á')
<_sre.SRE_Match object at 0x2b955e204e00>
>>> rx.match('á')
>>> 'Á'.lower() == 'á' and 'á'.upper() == 'Á'
True
>>>
msg68901 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-06-28 19:40
Same here, re.LOCALE doesn't circumvent the problem.
msg68905 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-06-28 20:27
Uh, actually, it works if you specify re.UNICODE. If you don't, the getlower() function in _sre.c falls back to the plain ASCII algorithm.
>>> pat = re.compile('Á', re.IGNORECASE | re.UNICODE)
>>> pat.match('á')
<_sre.SRE_Match object at 0xb7c66c28>
>>> pat.match('Á')
<_sre.SRE_Match object at 0xb7c66cd0>
I wonder if re.UNICODE shouldn't be the default in Py3k, at least when the pattern is a string and not a bytes object. There may also be a re.ASCII flag for those cases where people want to fall back to the old behaviour.
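The behaviour proposed in this message can be checked on a current Python 3. This is a minimal sketch, assuming an interpreter that already includes the fix from this issue (str patterns imply re.UNICODE, and re.ASCII restores ASCII-only case folding); the variable names are illustrative only.

```python
import re

# A str pattern is matched with Unicode semantics by default,
# so IGNORECASE folds non-ASCII letters as well.
unicode_default = re.compile('Á', re.IGNORECASE)
print(unicode_default.match('á') is not None)

# re.ASCII opts back into ASCII-only case folding, under which
# 'Á' and 'á' are not treated as the same letter.
ascii_only = re.compile('Á', re.IGNORECASE | re.ASCII)
print(ascii_only.match('á') is not None)
print(ascii_only.match('Á') is not None)
```

On Python 3.0a5, as used in the original report, the first check would have needed an explicit re.UNICODE flag.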
msg68920 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2008-06-28 22:19
Sounds like re.UNICODE should be on by default when the pattern is a str instance. Also (per mailing list discussion) we should probably only allow matching bytes when the pattern is bytes, and matching str when the pattern is str. Finally, is there a use case of re.LOCALE any more? I'm thinking not.
msg68922 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-06-28 22:35
On Saturday 28 June 2008 at 22:20 +0000, Guido van Rossum wrote:
> Finally, is there a use case of re.LOCALE any more? I'm thinking not.
It's used for locale-specific case matching in the non-unicode case. But it looks to me like a bad practice and we could probably remove it.
'C'
>>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE)
>>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE | re.LOCALE)
>>> locale.setlocale(locale.LC_CTYPE, 'fr_FR.ISO-8859-1')
'fr_FR.ISO-8859-1'
>>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE)
>>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE | re.LOCALE)
<_sre.SRE_Match object at 0xb7b9ac28>
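The locale-independent part of that session can be reproduced on a current Python 3. With a bytes pattern and no re.LOCALE, IGNORECASE only folds ASCII letters, so the Latin-1 bytes for 'À' and 'à' do not match each other. This is a minimal sketch; the re.LOCALE results depend on the process locale and are deliberately left out, and the variable names are illustrative.

```python
import re

upper = 'À'.encode('latin1')   # the single byte 0xC0
lower = 'à'.encode('latin1')   # the single byte 0xE0

# Bytes patterns use ASCII-only case folding unless re.LOCALE is given,
# so these two non-ASCII bytes are not considered equal.
no_locale = re.match(upper, lower, re.IGNORECASE)
print(no_locale)

# ASCII letters in a bytes pattern are still folded as usual.
ascii_fold = re.match(b'abc', b'ABC', re.IGNORECASE)
print(ascii_fold is not None)
```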
msg68932 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-06-29 01:15
Here is a preliminary patch which doesn't remove re.LOCALE, but adds TypeErrors for mistyped matchings (a str pattern against bytes, or a bytes pattern against str), a ValueError when specifying re.UNICODE with a bytes pattern, and implies re.UNICODE for unicode patterns. The test suite runs fine after a few fixes. It also includes the patch for #3231 ("re.compile fails with some bytes patterns").
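The three behaviours described in this patch can be observed on a current Python 3. This is a minimal sketch rather than a copy of the patch's tests; error_type is a hypothetical helper introduced only for this illustration.

```python
import re

# A str pattern implies re.UNICODE even when no flag is passed.
implied_unicode = bool(re.compile('abc').flags & re.UNICODE)
print(implied_unicode)


def error_type(func):
    """Return the name of the exception raised by func(), or None."""
    try:
        func()
    except Exception as exc:
        return type(exc).__name__
    return None


# Mixing pattern and subject types is rejected in both directions.
str_on_bytes = error_type(lambda: re.match('a', b'a'))
bytes_on_str = error_type(lambda: re.match(b'a', 'a'))

# re.UNICODE is refused for a bytes pattern.
unicode_on_bytes = error_type(lambda: re.compile(b'a', re.UNICODE))

print(str_on_bytes, bytes_on_str, unicode_on_bytes)
```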
msg68966 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-06-29 20:21
This new patch also introduces re.ASCII as discussed on the mailing-list.
msg68967 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-06-29 20:36
Improved patch which also detects incompatibilities for "(?u)".
msg69298 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-07-05 21:09
This new patch adds re.ASCII in all sensitive places I could find in the stdlib (except lib2to3 which as far as I understand is maintained in a separate branch, and even has its own copy of tokenize.py...). Also, I didn't get an answer to the following question on the ML: should an inline flag "(?a)" be introduced to mirror the existing "(?u)" - so as to set the ASCII flag from inside a pattern string.
msg69301 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-07-05 21:30
http://codereview.appspot.com/2439
msg70354 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-07-28 16:39
Final patch adding the (?a) inline flag (equivalent to re.ASCII). Please review: http://codereview.appspot.com/2439
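A short illustration of the inline flag on a current Python 3. This is a sketch with an arbitrary sample word; it only shows that (?a) at the start of a pattern behaves like passing re.ASCII.

```python
import re

word = 'été'

# Default for str patterns: \w matches Unicode word characters.
unicode_match = re.match(r'\w+', word)

# (?a) inside the pattern has the same effect as the re.ASCII flag:
# the non-ASCII first letter no longer counts as a word character.
inline_ascii = re.match(r'(?a)\w+', word)
flag_ascii = re.match(r'\w+', word, re.ASCII)

print(unicode_match.group(), inline_ascii, flag_ascii)

# The inline flag is reflected in the compiled pattern's flags.
inline_sets_flag = bool(re.compile(r'(?a)\w+').flags & re.ASCII)
print(inline_sets_flag)
```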
msg70370 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2008-07-28 20:41
Are all those re.ASCII flags mandatory, or are they here just for theoretical correctness? For example, the output of "gcc -dumpversion" is certainly plain ASCII. I don't mind that \d also matches some exotic digit - it just won't happen.
msg70371 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-07-28 20:49
On Monday 28 July 2008 at 20:41 +0000, Amaury Forgeot d'Arc wrote:
> Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:
>
> Are all those re.ASCII flags mandatory, or are they here just for
> theoretical correctness?
For theoretical correctness. I just don't want to analyze each case individually and I'm probably not competent for many of them.
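The "exotic digit" case raised in this exchange looks like this on a current Python 3. This is a minimal sketch; the Arabic-Indic digits and the version-like string are arbitrary sample inputs, not taken from the patch.

```python
import re

# U+0663 and U+0664 are Arabic-Indic digits (Unicode category Nd).
arabic_indic = '\u0663\u0664'

# With the Unicode default, \d accepts any Unicode decimal digit.
unicode_digits = re.fullmatch(r'\d+', arabic_indic)

# With re.ASCII, \d is restricted to [0-9].
ascii_digits = re.fullmatch(r'\d+', arabic_indic, re.ASCII)

# Plain ASCII input, such as a compiler version string, matches either way.
version_ascii = re.fullmatch(r'\d+\.\d+', '4.1', re.ASCII)
version_unicode = re.fullmatch(r'\d+\.\d+', '4.1')

print(unicode_digits is not None, ascii_digits is not None)
print(version_ascii is not None, version_unicode is not None)
```

For input that is known to be plain ASCII the flag changes nothing, which is consistent with treating it as a matter of theoretical correctness.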
msg70780 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-08-06 10:29
If nobody (except Amaury :-)) has anything to say about the current patch, should it be committed?
msg70787 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2008-08-06 16:34
Let's make sure the release manager is OK with this.
msg71186 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-08-15 21:31
Barry?
msg71413 - (view)	Author: Barry A. Warsaw (barry) *	Date: 2008-08-19 12:57
I haven't looked at the specific patch, but based on the description of the behavior, I'm +1 on committing this before beta 3. I'm fine with leaving the re.ASCII flags in there -- it will be a marker to indicate perhaps the code needs a closer examination (eventually).
msg71414 - (view)	Author: Barry A. Warsaw (barry) *	Date: 2008-08-19 12:58
Make sure of course that the documentation is updated and a NEWS file entry is added.
msg71455 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-08-19 17:59
Fixed in r65860. Someone should check the docs though (at least try to generate them, and review my changes a bit since English isn't my mother tongue).
msg71516 - (view)	Author: Mark Summerfield (mark) *	Date: 2008-08-20 07:36
On 2008-08-19, Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
> Fixed in r65860. Someone should check the docs though (at least try to
> generate them, and review my changes a bit since English isn't my mother
> tongue).
I've revised the ASCII and LOCALE-related texts in re.rst in r65903.
msg71517 - (view)	Author: Mark Summerfield (mark) *	Date: 2008-08-20 07:40
On 2008-08-19, Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
> Fixed in r65860. Someone should check the docs though (at least try to
> generate them, and review my changes a bit since English isn't my mother
> tongue).
And two more (tiny) fixes in r65904; that's my lot :-)
msg71519 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-08-20 08:49
Thanks a lot Mark!

History
Date	User	Action	Args
2022-04-11 14:56:34	admin	set	github: 47083
2009-02-13 14:02:50	ezio.melotti	set	nosy: + ezio.melotti
2009-02-13 13:52:31	ocean-city	link	issue5239 dependencies
2009-02-13 12:34:49	ocean-city	link	issue5240 dependencies
2008-08-20 08:49:38	pitrou	set	messages: +
2008-08-20 07:40:55	mark	set	messages: +
2008-08-20 07:36:30	mark	set	messages: +
2008-08-19 17:59:29	pitrou	set	status: open -> closed; resolution: accepted -> fixed; messages: +
2008-08-19 12:58:07	barry	set	messages: +
2008-08-19 12:57:41	barry	set	resolution: accepted; messages: +
2008-08-15 21:31:11	pitrou	set	messages: +
2008-08-06 16:34:33	gvanrossum	set	nosy: + barry; messages: +
2008-08-06 10:29:16	pitrou	set	messages: +
2008-07-28 20:49:16	pitrou	set	messages: +
2008-07-28 20:41:56	amaury.forgeotdarc	set	nosy: + amaury.forgeotdarc; messages: +
2008-07-28 16:39:31	pitrou	set	files: + reunicode5.patch; messages: +
2008-07-24 15:07:53	pitrou	set	priority: critical; assignee: pitrou
2008-07-24 12:39:00	mark	set	nosy: + mark
2008-07-05 21:30:04	pitrou	set	messages: +
2008-07-05 21:10:11	pitrou	set	files: + reunicode4.patch; messages: +
2008-06-29 20:36:38	pitrou	set	files: + reunicode3.patch; messages: +
2008-06-29 20:21:07	pitrou	set	files: + reunicode2.patch; messages: +
2008-06-29 01:19:44	pitrou	set	files: + reunicode.patch
2008-06-29 01:19:17	pitrou	set	files: - reunicode.patch
2008-06-29 01:15:28	pitrou	set	files: + reunicode.patch; keywords: + patch; messages: +
2008-06-28 22:35:39	pitrou	set	messages: +
2008-06-28 22:19:03	gvanrossum	set	messages: +
2008-06-28 20:27:24	pitrou	set	messages: +
2008-06-28 19:40:35	pitrou	set	nosy: + pitrou; messages: +
2008-06-02 00:23:02	humitos	set	nosy: + humitos; messages: +
2008-05-12 14:44:03	gvanrossum	set	nosy: + gvanrossum; messages: +
2008-05-12 08:44:03	sven.siegmund	create