Issue 26784: regular expression problem at umlaut handling
Created on 2016-04-16 16:48 by arbyter, last changed 2022-04-11 14:58 by admin. This issue is now closed.
Messages (6) | ||
---|---|---|
msg263567 - (view) | Author: Marcus (arbyter) | Date: 2016-04-16 16:48 |
Working with the example string "E-112233-555-11 | Bläh - Bläh", the following code leads to an exception under Python 2.7.10 (OSX), whereas the same code works under Python 3.5.1 (OSX): `import re` `s = "E-112233-555-11 | Bläh - Bläh"` `expr = re.compile(r"(?P<p>[A-Z]{1}-[0-9]{0,}(-[0-9]{0,}(-[0-9]{0,})?)?)?(( [|] )?(?P<a>[\s\w]*)?)? - (?P<j>[\s\w]*)?", re.UNICODE)` `res = re.match(expr, s)` `a = (res.group('p'), res.group('a'), res.group('j'))` `print(a)` When I change the first umlaut in "Bläh" from ä to ü, it works as expected on Python 2 and 3. A change from ä to ö, however, leads to a crash again. Ideas? |
msg263569 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) | Date: 2016-04-16 17:41 |
First, in the context of Python, a crash means a core dump or its analogue on Windows. In this case the code just does not work as you expected. The short answer: s should be a unicode string. In your code "ä" is encoded as the 8-bit string '\xc3\xa4'. When matching, every byte is independently expanded to the Unicode range. The first byte becomes u'\xc3' = u'Ã'; the second byte becomes u'\xa4' = u'¤', which is non-alphanumeric, so '[\s\w]*' doesn't match all of "ä". "ü" is encoded as the 8-bit string '\xc3\xbc'. The second byte becomes u'\xbc' = u'¼', which is numeric, so '[\s\w]*' matches both bytes of "ü". | ||
msg263570 - (view) | Author: Marcus (arbyter) | Date: 2016-04-16 17:54 |
Thanks for your explanation. You explained why [\s\w] didn't match "ä". In my situation it didn't match the first "ä", but the second time I used [\s\w] in the same regex it matched the second "ä". What's the explanation for this? | ||
msg263572 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) | Date: 2016-04-16 18:10 |
Sorry, I don't understand you. If the regex failed to match the first "ä", it can't match the second "ä". Do you have an example? | ||
msg263575 - (view) | Author: Marcus (arbyter) | Date: 2016-04-16 18:32 |
When I replace the first "ä" with a random letter, the untouched expression has no problem matching the second word, which also contains an "ä": `s = "E-112233-555-11 | Bläh - Bläh"  # untouched string` `s = "E-112233-555-11 | Bloh - Bläh"  # string where the first ä is replaced by an "o"` | |
msg263577 - (view) | Author: Serhiy Storchaka (serhiy.storchaka) | Date: 2016-04-16 18:48 |
Because "[\s\w]*" matches only a part of "Bläh": "Bl\xc3". |
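
The explanation in this thread can be reproduced on Python 3 with a rough emulation. The sketch below assumes that decoding UTF-8 bytes as Latin-1 is a fair stand-in for how Python 2's `re` with `re.UNICODE` expands each byte of an 8-bit string to a code point; the helper names (`as_py2_bytes_seen_by_re`, `is_word_or_space`, `groups`) are hypothetical, and the group names `p`, `a`, `j` in the pattern are inferred from the `res.group(...)` calls in the report.

```python
import re

# Pattern from the report; group names p, a, j are an assumption
# inferred from the res.group('p'), ('a'), ('j') calls.
EXPR = re.compile(
    r"(?P<p>[A-Z]{1}-[0-9]{0,}(-[0-9]{0,}(-[0-9]{0,})?)?)?"
    r"(( [|] )?(?P<a>[\s\w]*)?)? - (?P<j>[\s\w]*)?",
    re.UNICODE,
)


def as_py2_bytes_seen_by_re(text):
    """Emulate a Python 2 UTF-8 byte string under re.UNICODE:
    every byte becomes the code point with the same value (Latin-1)."""
    return text.encode("utf-8").decode("latin-1")


def is_word_or_space(ch):
    """True if a single character is matched by [\\s\\w]."""
    return re.match(r"[\s\w]", ch, re.UNICODE) is not None


def groups(s):
    """Return the (p, a, j) groups, or None if the pattern does not match."""
    m = EXPR.match(s)
    return None if m is None else (m.group("p"), m.group("a"), m.group("j"))


# Per-byte view of each umlaut: only the second byte of "ü" is a word character.
for letter in "äüö":
    expanded = as_py2_bytes_seen_by_re(letter)
    print(letter, [(hex(ord(c)), is_word_or_space(c)) for c in expanded])

# Real Unicode text: all three groups match in full.
print(groups("E-112233-555-11 | Bläh - Bläh"))

# Emulated byte string: the first "Bläh" cannot be consumed up to " - ".
print(groups(as_py2_bytes_seen_by_re("E-112233-555-11 | Bläh - Bläh")))

# First umlaut replaced: the match succeeds, but j stops after "Bl\xc3".
print(groups(as_py2_bytes_seen_by_re("E-112233-555-11 | Bloh - Bläh")))
```

In this emulation the second call returns `None`, which is why `res.group(...)` raised an exception in the original report, and the third call succeeds only because nothing in the pattern requires the last group to reach the end of the string. On Python 2 itself, the fix suggested above is to make `s` a unicode string, for example with a `u"..."` literal.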
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:58:29 | admin | set | github: 70971 |
2016-04-16 18:48:00 | serhiy.storchaka | set | messages: + |
2016-04-16 18:32:35 | arbyter | set | messages: + |
2016-04-16 18:10:10 | serhiy.storchaka | set | messages: + |
2016-04-16 17:54:11 | arbyter | set | messages: + |
2016-04-16 17:41:33 | serhiy.storchaka | set | status: open -> closed; resolution: not a bug; messages: +; stage: resolved |
2016-04-16 17:11:18 | SilentGhost | set | nosy: + ezio.melotti, pitrou, serhiy.storchaka, mrabarnett; components: + Regular Expressions |
2016-04-16 16:48:11 | arbyter | create |