[Python-Dev] Changes in html.parser may cause breakage in client code
Vinay Sajip vinay_sajip at yahoo.co.uk
Thu Apr 26 21:10:48 CEST 2012
Following recent changes in html.parser, the Python 3 port of Django I'm working on has started failing while parsing HTML.
The reason appears to be that Django uses some module-level data in html.parser, for example tagfind, which is a regular expression pattern. This has changed recently (Ezio changed it in ba4baaddac8d).
Now tagfind (and other such patterns) is not marked as private, though it is not documented either. Should it be? The following script (tagfind.py):
import html.parser as Parser
data = '<select name="stuff">'
m = Parser.tagfind.match(data, 1)
print('%r -> %r' % (Parser.tagfind.pattern, data[1:m.end()]))
gives different results on 3.2 and 3.3:
$ python3.2 tagfind.py
'[a-zA-Z][-.a-zA-Z0-9:_]*' -> 'select'
$ python3.3 tagfind.py
'([a-zA-Z][-.a-zA-Z0-9:_]*)(?:\\s|/(?!>))*' -> 'select '
The trailing space later causes a mismatch with the end tag, which leads to the errors. Django's use of the tagfind pattern is in a subclass of HTMLParser, in an overridden parse_starttag method.
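To illustrate how client code could tolerate both forms of the pattern, here is a minimal sketch. It compiles the two pattern strings shown in the output above rather than importing them from html.parser, and tag_name is a hypothetical helper, not something taken from Django or the standard library. It assumes that, whenever the pattern has a capturing group, the first group holds the tag name.

```python
import re

# The two pattern strings printed by tagfind.py above, compiled locally so the
# sketch does not depend on which html.parser version is installed.
TAGFIND_32 = re.compile(r'[a-zA-Z][-.a-zA-Z0-9:_]*')
TAGFIND_33 = re.compile(r'([a-zA-Z][-.a-zA-Z0-9:_]*)(?:\s|/(?!>))*')


def tag_name(pattern, data, pos=1):
    """Hypothetical helper: return the lowercased tag name starting at pos.

    Assumes that if the pattern defines a capturing group, group 1 is the
    tag name; otherwise the whole match is the tag name.
    """
    m = pattern.match(data, pos)
    if m is None:
        return None
    name = m.group(1) if pattern.groups else m.group(0)
    return name.lower()


data = '<select name="stuff">'
print(tag_name(TAGFIND_32, data))  # select
print(tag_name(TAGFIND_33, data))  # select
```

With this approach the trailing whitespace consumed by the newer pattern never reaches the tag name, so the later comparison with the end tag would see 'select' in both cases. It still relies on an undocumented pattern, so it is a workaround rather than an answer to the question of whether such names should be private.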
Do we need to indicate more strongly that data like tagfind are private? Or has the change introduced inadvertent breakage, requiring a fix in Python?
Regards,
Vinay Sajip