msg66715 - (view) |
Author: Sven Siegmund (sven.siegmund) |
Date: 2008-05-12 08:43 |
re cannot ignore case of special latin characters:

Python 3.0a5 (py3k:62932M, May 9 2008, 16:23:11) [MSC v.1500 32 bit (Intel)] on win32
>>> 'Á'.lower() == 'á' and 'á'.upper() == 'Á'
True
>>> import re
>>> rx = re.compile('Á', re.IGNORECASE)
>>> rx.match('á') # should match but won't
>>> rx.match('Á') # will match
<_sre.SRE_Match object at 0x014B08A8>
>>> rx = re.compile('á', re.IGNORECASE)
>>> rx.match('Á') # should match but won't
>>> rx.match('á') # will match
<_sre.SRE_Match object at 0x014B08A8>
|
|
msg66727 - (view) |
Author: Guido van Rossum (gvanrossum) *  |
Date: 2008-05-12 14:44 |
Try adding re.LOCALE to the flags. I'm not sure why that is needed but it seems to fix this issue. I still think this is a legitimate bug though. |
|
|
msg67622 - (view) |
Author: Manuel Kaufmann (humitos) * |
Date: 2008-06-02 00:23 |
I have the same error with the re.LOCALE flag...

[humitos] [~]$ python3.0
Python 3.0a5+ (py3k:63855, Jun 1 2008, 13:05:09)
[GCC 4.1.3 20080114 (prerelease) (Debian 4.1.2-19)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> rx = re.compile('á', re.LOCALE | re.IGNORECASE)
>>> rx.match('Á')
>>> rx.match('á')
<_sre.SRE_Match object at 0x2b955e204d30>
>>> rx = re.compile('Á', re.IGNORECASE | re.LOCALE)
>>> rx.match('Á')
<_sre.SRE_Match object at 0x2b955e204e00>
>>> rx.match('á')
>>> 'Á'.lower() == 'á' and 'á'.upper() == 'Á'
True
>>>
|
msg68901 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2008-06-28 19:40 |
Same here, re.LOCALE doesn't circumvent the problem. |
|
|
msg68905 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2008-06-28 20:27 |
Uh, actually, it works if you specify re.UNICODE. If you don't, the getlower() function in _sre.c falls back to the plain ASCII algorithm.

>>> pat = re.compile('Á', re.IGNORECASE | re.UNICODE)
>>> pat.match('á')
<_sre.SRE_Match object at 0xb7c66c28>
>>> pat.match('Á')
<_sre.SRE_Match object at 0xb7c66cd0>

I wonder if re.UNICODE shouldn't be the default in Py3k, at least when the pattern is a string and not a bytes object. There may also be a re.ASCII flag for those cases where people want to fall back to the old behaviour.
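The proposal above can be checked against a current CPython 3 interpreter, where it eventually became the default. The snippet below is only a sketch of that later behaviour (it does not reproduce the 3.0a5 session from the report), and the variable names are made up for illustration.

```python
import re

# str pattern: Unicode-aware case-insensitive matching is the default
# in current Python 3, so no explicit re.UNICODE is needed.
default_pat = re.compile('Á', re.IGNORECASE)
default_match = default_pat.match('á')

# re.ASCII restricts case-insensitive matching to ASCII letters,
# which brings back the behaviour described in the original report.
ascii_pat = re.compile('Á', re.IGNORECASE | re.ASCII)
ascii_match = ascii_pat.match('á')

print(default_match is not None)  # accented pair matches by default
print(ascii_match is None)        # and does not match under re.ASCII
```

With re.ASCII set, the pattern still matches the identical character 'Á'; only the case folding of non-ASCII letters is switched off.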
|
|
msg68920 - (view) |
Author: Guido van Rossum (gvanrossum) *  |
Date: 2008-06-28 22:19 |
Sounds like re.UNICODE should be on by default when the pattern is a str instance. Also (per mailing list discussion) we should probably only allow matching bytes when the pattern is bytes, and matching str when the pattern is str. Finally, is there a use case of re.LOCALE any more? I'm thinking not. |
|
|
msg68922 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2008-06-28 22:35 |
On Saturday 28 June 2008 at 22:20 +0000, Guido van Rossum wrote:
> Finally, is there a use case of re.LOCALE any more? I'm thinking not.

It's used for locale-specific case matching in the non-unicode case. But it looks to me like a bad practice and we could probably remove it.

'C'
>>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE)
>>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE | re.LOCALE)
>>> locale.setlocale(locale.LC_CTYPE, 'fr_FR.ISO-8859-1')
'fr_FR.ISO-8859-1'
>>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE)
>>> re.match('À'.encode('latin1'), 'à'.encode('latin1'), re.IGNORECASE | re.LOCALE)
<_sre.SRE_Match object at 0xb7b9ac28>
|
msg68932 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2008-06-29 01:15 |
Here is a preliminary patch which doesn't remove re.LOCALE, but raises a TypeError when a str pattern is matched against bytes (or a bytes pattern against str), raises a ValueError when re.UNICODE is specified with a bytes pattern, and implies re.UNICODE for unicode patterns. The test suite runs fine after a few fixes. It also includes the patch for #3231 ("re.compile fails with some bytes patterns"). |
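The checks described in this message can be observed in a current CPython 3, roughly as sketched below. The helper `try_call` is a hypothetical convenience written for this example, not part of the patch or of the re module.

```python
import re

def try_call(func):
    """Return the name of the exception raised by func(), or None."""
    try:
        func()
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return None

# A str pattern cannot be matched against bytes, and vice versa.
str_on_bytes = try_call(lambda: re.compile('a').match(b'a'))
bytes_on_str = try_call(lambda: re.compile(b'a').match('a'))

# re.UNICODE makes no sense for a bytes pattern and is rejected at compile time.
unicode_with_bytes = try_call(lambda: re.compile(b'a', re.UNICODE))

# Matching operands of the same type works as usual.
same_type = try_call(lambda: re.compile(b'a').match(b'a'))

print(str_on_bytes, bytes_on_str, unicode_with_bytes, same_type)
```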
|
|
msg68966 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2008-06-29 20:21 |
This new patch also introduces re.ASCII as discussed on the mailing-list. |
|
|
msg68967 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2008-06-29 20:36 |
Improved patch which also detects incompatibilities for "(?u)". |
|
|
msg69298 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2008-07-05 21:09 |
This new patch adds re.ASCII in all sensitive places I could find in the stdlib (except lib2to3 which as far as I understand is maintained in a separate branch, and even has its own copy of tokenize.py...). Also, I didn't get an answer to the following question on the ML: should an inline flag "(?a)" be introduced to mirror the existing "(?u)" - so as to set the ASCII flag from inside a pattern string. |
|
|
msg69301 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2008-07-05 21:30 |
http://codereview.appspot.com/2439 |
|
|
msg70354 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2008-07-28 16:39 |
Final patch adding the (?a) inline flag (equivalent to re.ASCII). Please review: http://codereview.appspot.com/2439 |
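As a rough illustration of the inline flag, the sketch below compares `(?a)` with passing re.ASCII explicitly, as it behaves in a current CPython 3. The sample word and variable names are assumptions chosen for the example.

```python
import re

text = 'na\u00efve'  # "naïve", with a non-ASCII letter in the middle

# Default for str patterns: \w is Unicode-aware and consumes the whole word.
unicode_word = re.match(r'\w+', text).group()

# (?a) at the start of the pattern turns on ASCII-only matching from inside
# the pattern string, so \w stops at the first non-ASCII letter.
inline_ascii_word = re.match(r'(?a)\w+', text).group()

# Passing re.ASCII as a flag argument gives the same result.
flag_ascii_word = re.match(r'\w+', text, re.ASCII).group()

print(unicode_word, inline_ascii_word, flag_ascii_word)
```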
|
|
msg70370 - (view) |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *  |
Date: 2008-07-28 20:41 |
Are all those re.ASCII flags mandatory, or are they here just for theoretical correctness? For example, the output of "gcc -dumpversion" is certainly plain ASCII. I don't mind that \d also matches some exotic digit - it just won't happen. |
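To make the "exotic digit" point concrete, here is a small sketch of how `\d` behaves with and without re.ASCII in a current CPython 3. The version string is a made-up sample standing in for tool output such as that of "gcc -dumpversion"; it is not taken from the patch.

```python
import re

arabic_three = '\u0663'  # ARABIC-INDIC DIGIT THREE, a non-ASCII decimal digit

# For str patterns, \d matches any Unicode decimal digit by default.
unicode_digit = re.fullmatch(r'\d', arabic_three)

# With re.ASCII, \d only matches the characters 0-9.
ascii_digit = re.fullmatch(r'\d', arabic_three, re.ASCII)

# For plain-ASCII input the flag makes no observable difference.
sample_version = '4.1.2'  # hypothetical sample output
with_flag = re.fullmatch(r'\d+\.\d+\.\d+', sample_version, re.ASCII)
without_flag = re.fullmatch(r'\d+\.\d+\.\d+', sample_version)

print(unicode_digit is not None, ascii_digit is None)
print(with_flag.group() == without_flag.group())
```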
|
|
msg70371 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2008-07-28 20:49 |
On Monday 28 July 2008 at 20:41 +0000, Amaury Forgeot d'Arc wrote:
> Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:
>
> Are all those re.ASCII flags mandatory, or are they here just for
> theoretical correctness?

For theoretical correctness. I just don't want to analyze each case individually and I'm probably not competent for many of them.
|
|
msg70780 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2008-08-06 10:29 |
If nobody (except Amaury :-)) has anything to say about the current patch, should it be committed? |
|
|
msg70787 - (view) |
Author: Guido van Rossum (gvanrossum) *  |
Date: 2008-08-06 16:34 |
Let's make sure the release manager is OK with this. |
|
|
msg71186 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2008-08-15 21:31 |
Barry? |
|
|
msg71413 - (view) |
Author: Barry A. Warsaw (barry) *  |
Date: 2008-08-19 12:57 |
I haven't looked at the specific patch, but based on the description of the behavior, I'm +1 on committing this before beta 3. I'm fine with leaving the re.ASCII flags in there -- it will be a marker to indicate perhaps the code needs a closer examination (eventually). |
|
|
msg71414 - (view) |
Author: Barry A. Warsaw (barry) *  |
Date: 2008-08-19 12:58 |
Make sure of course that the documentation is updated and a NEWS file entry is added. |
|
|
msg71455 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2008-08-19 17:59 |
Fixed in r65860. Someone should check the docs though (at least try to generate them, and review my changes a bit since English isn't my mother tongue). |
|
|
msg71516 - (view) |
Author: Mark Summerfield (mark) * |
Date: 2008-08-20 07:36 |
On 2008-08-19, Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
> Fixed in r65860. Someone should check the docs though (at least try to
> generate them, and review my changes a bit since English isn't my mother
> tongue).

I've revised the ASCII and LOCALE-related texts in re.rst in r65903.
|
|
msg71517 - (view) |
Author: Mark Summerfield (mark) * |
Date: 2008-08-20 07:40 |
On 2008-08-19, Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
> Fixed in r65860. Someone should check the docs though (at least try to
> generate them, and review my changes a bit since English isn't my mother
> tongue).

And two more (tiny) fixes in r65904; that's my lot :-)
|
|
msg71519 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2008-08-20 08:49 |
Thanks a lot Mark! |
|
|