msg322771 - (view) |
Author: beardypig (beardypig) |
Date: 2018-07-31 13:11 |
I am experiencing an issue with the following regex when using finditer:

(?=<(?P<tag>\w+)/?>(?:(?P<text>.+?)</(?P=tag)>)?)

(I know it's not the best method of dealing with HTML, and this is a simplified version.)

For example:

[m.groupdict() for m in re.finditer(r"(?=<(?P<tag>\w+)/?>(?:(?P<text>.+?)</(?P=tag)>)?)", "")]

In Python 2.7, 3.5, and 3.6 it returns

[{'tag': 'test', 'text': ''}, {'tag': 'foo2', 'text': None}]

But starting with 3.7 it returns

[{'tag': 'test', 'text': ''}, {'tag': 'foo2', 'text': ''}]

The "text" group appears to be a copy of the previous "text" group.

Some other examples:

"Hello" => [{'tag': 'test', 'text': 'Hello'}, {'tag': 'foo', 'text': 'Hello'}]
(expected: [{'tag': 'test', 'text': 'Hello'}, {'tag': 'foo', 'text': None}])

"Hello" => [{'tag': 'test', 'text': 'Hello'}, {'tag': 'foo', 'text': 'Hello'}, {'tag': 'foo', 'text': None}]
(expected: [{'tag': 'test', 'text': 'Hello'}, {'tag': 'foo', 'text': None}, {'tag': 'foo', 'text': None}]) |
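The report can be checked with a short script. This is only a sketch, not the reporter's exact code: the input string `<test><foo2/></test>` is an assumption chosen to be consistent with the `test` and `foo2` tags in the output above, and `find_tags` is a hypothetical helper name.

```python
import re

# Pattern from the report: a lookahead that captures a tag name and,
# optionally, the text up to the matching closing tag.
PATTERN = re.compile(r"(?=<(?P<tag>\w+)/?>(?:(?P<text>.+?)</(?P=tag)>)?)")


def find_tags(markup):
    """Return the named groups of every match (hypothetical helper)."""
    return [m.groupdict() for m in PATTERN.finditer(markup)]


# Assumed input: an outer <test> element wrapping a self-closing <foo2/>.
result = find_tags("<test><foo2/></test>")
print(result)
```

On an interpreter without the regression, the second match should report `None` for `text`, because `<foo2/>` has no closing tag. An affected 3.7 build would instead repeat the `text` value captured by the first match.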
|
|
msg322799 - (view) |
Author: Karthikeyan Singaravelan (xtreak) *  |
Date: 2018-07-31 16:42 |
➜ cpython git:(70d56fb525) ✗ ./python.exe
Python 3.7.0a2+ (tags/v3.7.0a2-341-g70d56fb525:70d56fb525, Jul 31 2018, 21:58:10)
[Clang 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
➜ cpython git:(70d56fb525) ✗ ./python.exe -c 'import re; print([m.groupdict() for m in re.finditer(r"(?=<(?P<tag>\w+)/?>(?:(?P<text>.+?)</(?P=tag)>)?)", "")])'
[{'tag': 'test', 'text': ''}, {'tag': 'foo2', 'text': ''}]

➜ cpython git:(e69fbb6a56) ✗ ./python.exe
Python 3.7.0a2+ (tags/v3.7.0a2-340-ge69fbb6a56:e69fbb6a56, Jul 31 2018, 22:12:06)
[Clang 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
➜ cpython git:(e69fbb6a56) ✗ ./python.exe -c 'import re; print([m.groupdict() for m in re.finditer(r"(?=<(?P<tag>\w+)/?>(?:(?P<text>.+?)</(?P=tag)>)?)", "")])'
[{'tag': 'test', 'text': ''}, {'tag': 'foo2', 'text': None}]

Does this have something to do with 70d56fb52582d9d3f7c00860d6e90570c6259371 (bpo-25054, bpo-1647489)?

Thanks |
|
|
msg324990 - (view) |
Author: Ma Lin (malin) * |
Date: 2018-09-11 05:47 |
This bug silently produces wrong results, so I suggest marking it as a release blocker for 3.7.1. |
|
|
msg333548 - (view) |
Author: Ma Lin (malin) * |
Date: 2019-01-13 08:10 |
I simplified the test case; it seems the `state` is not reset properly.

Python 3.6.8 (tags/v3.6.8:3c6b436a57, Dec 24 2018, 00:16:47)
>>> import re
>>> re.findall(r"(?=(<\w+>)(<\w+>)?)", "")
[('', ''), ('', '')]

Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 23:09:28)
>>> import re
>>> re.findall(r"(?=(<\w+>)(<\w+>)?)", "")
[('', ''), ('', '')] |
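A minimal sketch of the simplified case follows. The input `<aaa><bbb>` is an assumption (two adjacent tags, so that the optional second group matches on the first attempt and cannot match on the second); it is not taken from the message above.

```python
import re

PATTERN = r"(?=(<\w+>)(<\w+>)?)"
SAMPLE = "<aaa><bbb>"  # assumed input: two adjacent tags

# findall() reports an unmatched group as '', finditer() as None.
pairs = re.findall(PATTERN, SAMPLE)
groups = [m.groups() for m in re.finditer(PATTERN, SAMPLE)]

print(pairs)
print(groups)
```

If the capture state is reset between match attempts, the second result should have an empty second group (`''` from `findall`, `None` from `finditer`). A stale `<bbb>` in that slot would indicate the behaviour described in this issue.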
|
|
msg333580 - (view) |
Author: Ma Lin (malin) * |
Date: 2019-01-14 03:08 |
I tried to fix it; feel free to create a new PR if you don't want this one. I have a small question about PR 11546: should `state->data_stack` be deallocated as well? FYI, see the function `state_reset(SRE_STATE* state)` in `_sre.c`: https://github.com/python/cpython/blob/d4f9cf5545d6d8844e0726552ef2e366f5cc3abd/Modules/_sre.c#L340-L352 |
|
|
msg334078 - (view) |
Author: Ma Lin (malin) * |
Date: 2019-01-20 03:58 |
Serhiy Storchaka has lost his sight. Serhiy, please stop any work and rest, because your left eye will carry more of the burden, and mental strain will make it worse. Go to the hospital ASAP. If any other core developer wants to review this patch, I would be glad to give a detailed explanation; the logic is not very complicated. |
|
|
msg334139 - (view) |
Author: Ma Lin (malin) * |
Date: 2019-01-21 14:34 |
The original post's bug was introduced in Python 3.7.0. While investigating the code, I found another bug involving capturing groups. This bug has existed since very early versions. The regex module doesn't have this bug.

Python 3.4.4 (v3.4.4:737efcadf5a6, Dec 20 2015, 19:28:18) [MSC v.1600 32 bit (Intel)] on win32
>>> import re
>>> re.search(r"\b(?=(\t)|(x))x", "a\tx").groups()
('', 'x')

Expected result: (None, 'x')

Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 23:09:28) [MSC v.1916 64 bit (AMD64)] on win32
>>> import regex
>>> regex.search(r"\b(?=(\t)|(x))x", "a\tx").groups()
(None, 'x') |
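The second bug can be reproduced with the standard library alone. The sketch below reuses the pattern and input from the message above; the variable names are illustrative only.

```python
import re

# The search first tries the word boundary between "a" and "\t", where the
# lookahead captures "\t" into group 1 but the trailing "x" fails to match.
# It then succeeds at the boundary before "x", where only group 2 takes part.
match = re.search(r"\b(?=(\t)|(x))x", "a\tx")
captured = match.groups()
print(captured)
```

Group 1 took part only in the failed attempt, so an interpreter that resets capture state between attempts should print `(None, 'x')`. A leftover value in the first slot would point to the older bug described here.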
|
msg335832 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2019-02-18 13:26 |
New changeset 4a7f44a2ed49ff1e87db062e7177a56c6e4bbdb0 by Serhiy Storchaka (animalize) in branch 'master': bpo-34294: re module, fix wrong capturing groups in rare cases. (GH-11546) https://github.com/python/cpython/commit/4a7f44a2ed49ff1e87db062e7177a56c6e4bbdb0 |
|
|
msg335833 - (view) |
Author: miss-islington (miss-islington) |
Date: 2019-02-18 13:48 |
New changeset 0e379d43acc25277f02262212932d3c589a2031b by Miss Islington (bot) in branch '3.7': bpo-34294: re module, fix wrong capturing groups in rare cases. (GH-11546) https://github.com/python/cpython/commit/0e379d43acc25277f02262212932d3c589a2031b |
|
|
msg335836 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2019-02-18 14:10 |
Thank you for your PR Ma Lin! |
|
|