This is a well-known gotcha with backtracking regexp implementations. The problem is that in the alternation "( +|'[^']*'
\"[^\"]*\"
[^>]+)" there are some characters (space, apostrophe, double quotes) that match multiple alternatives (for example a space matches both " +" and "[^>]+"). This causes the regexp engine to have to backtrack for each ambiguous character to try out the other alternatives, leading to runtime that's exponential in the number of ambiguous characters. Linear behaviour can be restored if you make the alternation unambiguous, like this: ( +
Thanks, Gareth. That does work. Interesting that regex does still seem to work linearly with the original version, but your version seems cleaner. On Mon, Nov 14, 2016 at 3:55 PM, Gareth Rees <report@bugs.python.org> wrote: > > Gareth Rees added the comment: > > This is a well-known gotcha with backtracking regexp implementations. The > problem is that in the alternation "( +|'[^']*'
\"[^\"]*\"
[^>]+)" there > are some characters (space, apostrophe, double quotes) that match multiple > alternatives (for example a space matches both " +" and "[^>]+"). This > causes the regexp engine to have to backtrack for each ambiguous character > to try out the other alternatives, leading to runtime that's exponential in > the number of ambiguous characters. > > Linear behaviour can be restored if you make the alternation unambiguous, > like this: ( +