Issue 793753: sgmllib parser problem
Hi,
I noticed that the parser htmllib.HTMLParser has a problem parsing the following code:
<select value"" tabindex="11" name="status">
The problem is the missing = after 'value'. As a result, it will not report the tabindex and name attributes to the start_select method (it just stops parsing).
I'm unsure whether you aim to parse every piece of broken HTML out there, so maybe this is not a valid bug report. Anyway, this code snippet can be found at http://81.180.95.165/debtreduction/ (NB: If you use Mozilla's view source function you won't see the correct HTML code, because Mozilla automatically changes it to ...value tabindex=...)
I found that changing sgmllib.py line 35 from
attrfind = re.compile( r'\s*([a-zA-Z_][-:.a-zA-Z_0-9]*)(\s*=\s*'
r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~\'"@]*))?')
to
attrfind = re.compile( r'\s*([a-zA-Z_][-:.a-zA-Z_0-9]*)(\s*=?\s*'
r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~\'"@]*))?')
fixes this (I inserted a ? after the =). I cannot say whether it breaks something else.
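
The effect of the change can be checked with the re module alone. The following is only a rough sketch: scan_attributes is a hypothetical helper that imitates, in simplified form, how an attribute loop might apply attrfind repeatedly; it is not the actual sgmllib code. The two patterns assume the asterisks and backslashes of the original sgmllib.py source.

```python
import re

# Pattern before the change (assumed to match sgmllib.py as quoted above).
ATTRFIND_ORIGINAL = re.compile(
    r'\s*([a-zA-Z_][-:.a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~\'"@]*))?')

# Same pattern with the proposed "?" inserted after "=".
ATTRFIND_PATCHED = re.compile(
    r'\s*([a-zA-Z_][-:.a-zA-Z_0-9]*)(\s*=?\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~\'"@]*))?')


def scan_attributes(attrfind, text):
    """Hypothetical helper: collect (name, value) pairs until attrfind stops matching."""
    attrs = []
    pos = 0
    while pos < len(text):
        match = attrfind.match(text, pos)
        if not match:
            break  # scanning stops here; later attributes are never reported
        name, rest, value = match.group(1, 2, 3)
        if not rest:
            value = name  # attribute without a value: reuse the name
        elif value[:1] in ('"', "'") and value[:1] == value[-1:]:
            value = value[1:-1]  # strip the surrounding quotes
        attrs.append((name.lower(), value))
        pos = match.end(0)
    return attrs


# Attribute part of the tag from the report (text after "<select", before ">").
BROKEN = ' value"" tabindex="11" name="status"'

print(scan_attributes(ATTRFIND_ORIGINAL, BROKEN))
# [('value', 'value')]
print(scan_attributes(ATTRFIND_PATCHED, BROKEN))
# [('value', ''), ('tabindex', '11'), ('name', 'status')]
```

Under this sketch the patched pattern does have a side effect on attributes that legitimately have no value: for ' selected value="x"' the original pattern yields [('selected', 'selected'), ('value', 'x')], while the patched one yields [('selected', 'value="x"')], because the unquoted-value alternative now swallows the next attribute. Whether the real parser behaves the same way would need to be verified against sgmllib itself.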
Walter Hofmann, walterh@gmx.de