Issue 34294: re module: wrong capturing groups

Created on 2018-07-31 13:11 by beardypig, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 11546 merged malin,2019-01-14 01:26
PR 11919 merged miss-islington,2019-02-18 13:27
Messages (10)
msg322771 - Author: beardypig (beardypig) Date: 2018-07-31 13:11
I am experiencing an issue with the following regex when using finditer:
    (?=<(?P<tag>\w+)/?>(?:(?P<text>.+?)</(?P=tag)>)?)
(I know it's not the best method of dealing with HTML, and this is a simplified version.)
For example:
    [m.groupdict() for m in re.finditer(r"(?=<(?P<tag>\w+)/?>(?:(?P<text>.+?)</(?P=tag)>)?)", "")]
In Python 2.7, 3.5, and 3.6 it returns
    [{'tag': 'test', 'text': ''}, {'tag': 'foo2', 'text': None}]
But starting with 3.7 it returns
    [{'tag': 'test', 'text': ''}, {'tag': 'foo2', 'text': ''}]
The "text" group appears to be a copy of the previous "text" group.
Some other examples:
    "Hello" => [{'tag': 'test', 'text': 'Hello'}, {'tag': 'foo', 'text': 'Hello'}]
    (expected: [{'tag': 'test', 'text': 'Hello'}, {'tag': 'foo', 'text': None}])
    "Hello" => [{'tag': 'test', 'text': 'Hello'}, {'tag': 'foo', 'text': 'Hello'}, {'tag': 'foo', 'text': None}]
    (expected: [{'tag': 'test', 'text': 'Hello'}, {'tag': 'foo', 'text': None}, {'tag': 'foo', 'text': None}])
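The report can be turned into a small self-checking script. This is only a sketch: the input string `<test><foo2/></test>` is an assumption chosen to be consistent with the tag names in the report, and the helper name `tag_matches` is hypothetical. On an interpreter without the regression, the second match should leave "text" unset.

```python
import re

# Pattern from the report: a zero-width lookahead that captures a tag name
# and, optionally, the text up to the matching closing tag.
PATTERN = re.compile(r"(?=<(?P<tag>\w+)/?>(?:(?P<text>.+?)</(?P=tag)>)?)")


def tag_matches(source):
    """Return groupdict() for every match of PATTERN (hypothetical helper)."""
    return [m.groupdict() for m in PATTERN.finditer(source)]


# Assumed input, not copied from the report.
result = tag_matches("<test><foo2/></test>")
print(result)
```

On an affected 3.7 build, the second dictionary would repeat the first match's "text" value instead of None, which is the symptom described in the report.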
msg322799 - Author: Karthikeyan Singaravelan (xtreak) * (Python committer) Date: 2018-07-31 16:42
➜ cpython git:(70d56fb525) ✗ ./python.exe
Python 3.7.0a2+ (tags/v3.7.0a2-341-g70d56fb525:70d56fb525, Jul 31 2018, 21:58:10)
[Clang 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
➜ cpython git:(70d56fb525) ✗ ./python.exe -c 'import re; print([m.groupdict() for m in re.finditer(r"(?=<(?P<tag>\w+)/?>(?:(?P<text>.+?)</(?P=tag)>)?)", "")])'
[{'tag': 'test', 'text': ''}, {'tag': 'foo2', 'text': ''}]
➜ cpython git:(e69fbb6a56) ✗ ./python.exe
Python 3.7.0a2+ (tags/v3.7.0a2-340-ge69fbb6a56:e69fbb6a56, Jul 31 2018, 22:12:06)
[Clang 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
➜ cpython git:(e69fbb6a56) ✗ ./python.exe -c 'import re; print([m.groupdict() for m in re.finditer(r"(?=<(?P<tag>\w+)/?>(?:(?P<text>.+?)</(?P=tag)>)?)", "")])'
[{'tag': 'test', 'text': ''}, {'tag': 'foo2', 'text': None}]
Does this have something to do with 70d56fb52582d9d3f7c00860d6e90570c6259371 (bpo-25054, bpo-1647489)?
Thanks
msg324990 - Author: Ma Lin (malin) * Date: 2018-09-11 05:47
This bug silently generates wrong results, so I suggest marking it as a release blocker for 3.7.1.
msg333548 - Author: Ma Lin (malin) * Date: 2019-01-13 08:10
Here is a simplified test case. It seems the `state` is not reset properly.
Python 3.6.8 (tags/v3.6.8:3c6b436a57, Dec 24 2018, 00:16:47)
>>> import re
>>> re.findall(r"(?=(<\w+>)(<\w+>)?)", "")
[('', ''), ('', '')]
Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 23:09:28)
>>> import re
>>> re.findall(r"(?=(<\w+>)(<\w+>)?)", "")
[('', ''), ('', '')]
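For illustration, the simplified pattern can be exercised as below. This is a sketch under an assumption: the sample input `<aaa><bbb>` is made up here and is not taken from the message. Using `finditer` with `groups()` alongside `findall` helps, because `findall` reports an unmatched group as an empty string, while `groups()` reports it as None.

```python
import re

# Simplified pattern from the message: group 2 is optional inside a lookahead.
pattern = re.compile(r"(?=(<\w+>)(<\w+>)?)")

# Assumed sample input, chosen only for illustration.
sample = "<aaa><bbb>"

# groups() shows None for a group that did not take part in the match.
groups_per_match = [m.groups() for m in pattern.finditer(sample)]

# findall() shows '' for the same unmatched group.
findall_result = pattern.findall(sample)

print(groups_per_match)
print(findall_result)
```

If group 2 in the second match came back as `<bbb>` rather than unset, that would point to capture state leaking from the previous match.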
msg333580 - Author: Ma Lin (malin) * Date: 2019-01-14 03:08
I tried to fix it; feel free to create a new PR if you don't want this one.
I have a small question about PR 11546: should `state->data_stack` be deallocated as well?
FYI, see the function `state_reset(SRE_STATE* state)` in the file `_sre.c`:
https://github.com/python/cpython/blob/d4f9cf5545d6d8844e0726552ef2e366f5cc3abd/Modules/_sre.c#L340-L352
msg334078 - Author: Ma Lin (malin) * Date: 2019-01-20 03:58
Serhiy Storchaka lost his sight. Please stop any work and rest, because your left eye will carry more of the burden, and mental stress will make it worse. Go to a hospital ASAP.
If any other core developer wants to review this patch, I would be glad to give a detailed explanation; the logic is not very complicated.
msg334139 - Author: Ma Lin (malin) * Date: 2019-01-21 14:34
The bug in the original post was introduced in Python 3.7.0.
While investigating the code, I found another bug involving capturing groups. This one has existed since very early versions. The regex module doesn't have this bug.
Python 3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 19:28:18) [MSC v.1600 32 bit (Intel)] on win32
>>> import re
>>> re.search(r"\b(?=(\t)|(x))x", "a\tx").groups()
('', 'x')
Expected result: (None, 'x')
Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 23:09:28) [MSC v.1916 64 bit (AMD64)] on win32
>>> import regex
>>> regex.search(r"\b(?=(\t)|(x))x", "a\tx").groups()
(None, 'x')
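The second bug can be checked with a short script. This is a sketch that assumes an interpreter that already includes the fix; on the older versions mentioned above, group 1 reportedly comes back as an empty string instead of None.

```python
import re

# Pattern and input from the message. At offset 1 the lookahead captures the
# tab, but the trailing literal "x" fails, so that attempt is abandoned.
# The successful match starts at offset 2, where only group 2 participates.
match = re.search(r"\b(?=(\t)|(x))x", "a\tx")

groups = match.groups()
span = match.span()

print(groups)
print(span)
```

Group 1 should be unset in the final result because it only matched during the failed attempt at the earlier position.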
msg335832 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-02-18 13:26
New changeset 4a7f44a2ed49ff1e87db062e7177a56c6e4bbdb0 by Serhiy Storchaka (animalize) in branch 'master': bpo-34294: re module, fix wrong capturing groups in rare cases. (GH-11546) https://github.com/python/cpython/commit/4a7f44a2ed49ff1e87db062e7177a56c6e4bbdb0
msg335833 - Author: miss-islington (miss-islington) Date: 2019-02-18 13:48
New changeset 0e379d43acc25277f02262212932d3c589a2031b by Miss Islington (bot) in branch '3.7': bpo-34294: re module, fix wrong capturing groups in rare cases. (GH-11546) https://github.com/python/cpython/commit/0e379d43acc25277f02262212932d3c589a2031b
msg335836 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-02-18 14:10
Thank you for your PR, Ma Lin!
History
Date User Action Args
2022-04-11 14:59:04 admin set github: 78475
2019-02-18 14:10:13 serhiy.storchaka set status: open -> closed; messages: +; keywords: + 3.7regression; resolution: fixed; stage: patch review -> resolved
2019-02-18 13:48:26 miss-islington set nosy: + miss-islington; messages: +
2019-02-18 13:27:45 miss-islington set pull_requests: + pull_request11944
2019-02-18 13:26:46 serhiy.storchaka set messages: +
2019-01-21 14:34:58 malin set messages: + title: re.finditer and lookahead bug -> re module: wrong capturing groups
2019-01-20 03:58:27 malin set messages: +
2019-01-14 03:08:06 malin set messages: +
2019-01-14 01:27:14 malin set keywords: + patch; stage: patch review; pull_requests: + pull_request11164
2019-01-14 01:27:04 malin set keywords: + patch; stage: (no value); pull_requests: + pull_request11163
2019-01-14 01:26:51 malin set keywords: + patch; stage: (no value); pull_requests: + pull_request11162
2019-01-13 08:10:14 malin set messages: +
2018-09-11 05:47:21 malin set nosy: + malin; messages: +
2018-07-31 16:42:45 xtreak set messages: +
2018-07-31 16:17:15 xtreak set nosy: + xtreak
2018-07-31 13:20:51 serhiy.storchaka set assignee: serhiy.storchaka; nosy: + serhiy.storchaka
2018-07-31 13:11:05 beardypig create