Issue 1566086: RE (regular expression) matching stuck in loop
Logged In: YES user_id=11375
I haven't dug very far into the code, but suspect this isn't a bug in the regex code.
The pattern uses lots of .*? subpatterns, and this often means the pattern takes a long time to fail if it isn't going to match. The regex engine matches the group, and then there's a .*? followed by a literal tag. The engine looks at every character and, each time it sees that tag, tries another .*?. This is O(n**2), where n is the number of characters in the string being searched, and that string is 93,000 characters long. If you limit the string to 5K or so, the match fails pretty quickly.
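To make the cost concrete, here is a minimal sketch of the same kind of failure. The pattern and the <a>/<b>/<c> markers are made up for illustration and are not the pattern from this report: two lazy wildcards are separated by a literal marker, and the final marker never appears, so the search has to fail.

```python
import re
import time

# Hypothetical pattern (not the one from the report): two lazy wildcards
# separated by literal markers.
PATTERN = re.compile(r"<a>.*?<b>.*?<c>")


def make_text(n_markers):
    # One <a>, many <b> markers, and no <c> anywhere, so the match must fail.
    return "<a>" + "x<b>" * n_markers


def time_failed_search(n_markers):
    # Returns (match result, elapsed seconds) for one failing search.
    text = make_text(n_markers)
    start = time.perf_counter()
    result = PATTERN.search(text)
    elapsed = time.perf_counter() - start
    return result, elapsed


if __name__ == "__main__":
    # Each <b> is a candidate end for the first .*?, and for each candidate
    # the second .*? scans the rest of the string looking for <c>.
    for n in (500, 1000, 2000):
        result, elapsed = time_failed_search(n)
        print(n, result, f"{elapsed:.4f}s")
```

Every search returns None. If the O(n**2) estimate holds, doubling the input should roughly quadruple the time, though the exact numbers depend on the machine and the Python version.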
I strongly suggest working with the HTML as a parsed document rather than matching it with regexes. You could run the HTML through tidy to convert it to XHTML and use ElementTree on the resulting XML.
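As a rough sketch of the ElementTree half of that suggestion: the snippet below assumes tidy has already been run and produced well-formed XHTML in the standard XHTML namespace. The sample document and the extract_links helper are hypothetical, and pulling out links is only a stand-in for whatever the original pattern was meant to extract.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for tidy's XHTML output.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <p><a href="http://example.com/one">One</a></p>
    <p><a href="http://example.com/two">Two</a></p>
  </body>
</html>"""

# XHTML elements live in this namespace, so lookups must be qualified.
NS = {"x": "http://www.w3.org/1999/xhtml"}


def extract_links(xhtml_text):
    # Parse the document once, then collect (href, text) for every <a>.
    root = ET.fromstring(xhtml_text)
    return [(a.get("href"), a.text) for a in root.iterfind(".//x:a", NS)]


if __name__ == "__main__":
    for href, text in extract_links(XHTML):
        print(href, text)
```

Parsing visits each element once, so the cost grows linearly with the size of the document instead of depending on how a pattern backtracks.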