HTMLParser doesn't currently support entities in attributes, like this: foo

This patch fixes that. Simply replace the unescape in HTMLParser.py with:

    import htmlentitydefs

    def unescape(self, s):
        def replaceEntities(s):
            s = s.groups()[0]
            if s[0] == "#":
                s = s[1:]
                if s[0] in ['x','X']:
                    c = int(s[1:], 16)
                else:
                    c = int(s)
                return unichr(c)
            else:
                return unichr(htmlentitydefs.name2codepoint[c])
        return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)

Thanks for the patch. Committed as r54165, with the following changes:

- added documentation changes
- added testsuite changes
- fixed incorrect usage of c in name2codepoint[c] (should be [s])
- included &apos; in the list of supported entities, for compatibility with older versions of HTMLParser
- fall back to replacing an unsupported entity reference with &name;
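
For readers who want to try the behaviour described by the changes listed above, here is a minimal standalone sketch. It is not the code committed in r54165: it is written for Python 3 (so it assumes html.entities and chr in place of htmlentitydefs and unichr), it is a plain function rather than a method, and the names _ENTITY_RE and replace_entities are made up for illustration. Returning a malformed numeric reference unchanged is also an assumption, since the thread only specifies the fallback for unsupported names.

```python
import re
from html.entities import name2codepoint

# Same pattern as in the patch: numeric (&#65; / &#x41;) or named (&amp;) references.
_ENTITY_RE = re.compile(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));")

def unescape(s):
    def replace_entities(match):
        ref = match.group(1)
        if ref[0] == "#":
            ref = ref[1:]
            try:
                if ref[0] in ("x", "X"):
                    code = int(ref[1:], 16)   # hexadecimal reference
                else:
                    code = int(ref)           # decimal reference
                return chr(code)
            except (ValueError, OverflowError):
                # Assumption: leave malformed or out-of-range references as-is.
                return match.group(0)
        if ref == "apos":
            # &apos; is not in name2codepoint, so it is handled explicitly.
            return "'"
        try:
            # Look up the name itself (the original patch indexed with c by mistake).
            return chr(name2codepoint[ref])
        except KeyError:
            # Unsupported entity: fall back to the literal &name; text.
            return "&" + ref + ";"
    return _ENTITY_RE.sub(replace_entities, s)
```

Called on an attribute value such as "?a=1&amp;b=2", this sketch would return "?a=1&b=2", while an unknown reference such as "&bogus;" is passed through untouched.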