Issue 28690: Loop in re (regular expression) processing (original) (raw)

Created on 2016-11-14 16:24 by Walter Farrell, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (4)
msg280790 - (view) Author: Walter Farrell (Walter Farrell) Date: 2016-11-14 16:24
Given: pattern = r"(^|[^\\])<(pm [^ ]+( + '[^']*' \"[^\"]*\"
msg280813 - (view) Author: Gareth Rees (gdr@garethrees.org) * (Python triager) Date: 2016-11-14 20:55
This is a well-known gotcha with backtracking regexp implementations. The problem is that in the alternation "( +|'[^']*' \"[^\"]*\" [^>]+)" there are some characters (space, apostrophe, double quotes) that match multiple alternatives (for example a space matches both " +" and "[^>]+"). This causes the regexp engine to have to backtrack for each ambiguous character to try out the other alternatives, leading to runtime that's exponential in the number of ambiguous characters. Linear behaviour can be restored if you make the alternation unambiguous, like this: ( +
msg280814 - (view) Author: Walter Farrell (Walter Farrell) Date: 2016-11-14 21:09
Thanks, Gareth. That does work. Interesting that regex does still seem to work linearly with the original version, but your version seems cleaner. On Mon, Nov 14, 2016 at 3:55 PM, Gareth Rees <report@bugs.python.org> wrote: > > Gareth Rees added the comment: > > This is a well-known gotcha with backtracking regexp implementations. The > problem is that in the alternation "( +|'[^']*' \"[^\"]*\" [^>]+)" there > are some characters (space, apostrophe, double quotes) that match multiple > alternatives (for example a space matches both " +" and "[^>]+"). This > causes the regexp engine to have to backtrack for each ambiguous character > to try out the other alternatives, leading to runtime that's exponential in > the number of ambiguous characters. > > Linear behaviour can be restored if you make the alternation unambiguous, > like this: ( +
msg280829 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-11-15 10:03
It's not really a bug, but more a trap of regular expressions. It seems like you fixed your issue, so I close it.
History
Date User Action Args
2022-04-11 14:58:39 admin set github: 72876
2016-11-15 10:03:49 vstinner set status: open -> closednosy: + vstinnermessages: + resolution: not a bug
2016-11-14 21:09:42 Walter Farrell set messages: +
2016-11-14 20:55:25 gdr@garethrees.org set nosy: + gdr@garethrees.orgmessages: +
2016-11-14 16:24:47 Walter Farrell create