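sgmllib doesn't support hexadecimal character references or Unicode characters, both of which are commonly seen on web pages. The following replacements fix both problems.

    # Python 2: replacements for sgmllib's charref pattern and handle_charref method
    import re

    # Allow an optional x/X prefix so hexadecimal references are captured too
    charref = re.compile('&#([xX]?[0-9a-fA-F]+)[^0-9a-fA-F]')

    def handle_charref(self, ref):
        # Decode a decimal (&#65;) or hexadecimal (&#x41;) reference to UTF-8 data
        try:
            if ref[0] == 'x' or ref[0] == 'X':
                m = int(ref[1:], 16)
            else:
                m = int(ref)
            self.handle_data(unichr(m).encode('utf-8'))
        except ValueError:
            # Malformed or out-of-range reference
            self.unknown_charref(ref)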
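The replacement above targets Python 2 and sgmllib, which no longer exists in Python 3. As a rough, standalone illustration of the same decoding logic, here is a minimal Python 3 sketch. The names `CHARREF`, `decode_charref`, and `replace_charrefs` are hypothetical and not part of sgmllib, and the pattern assumes an optional `x`/`X` prefix, which is an assumption rather than the exact pattern posted in the report.

```python
import re

# Hypothetical pattern (an assumption, not sgmllib's): decimal or
# x/X-prefixed hexadecimal digits, with an optional trailing semicolon.
CHARREF = re.compile(r'&#([xX]?[0-9a-fA-F]+);?')


def decode_charref(ref):
    """Return the character for a reference body such as '65' or 'x41'."""
    if ref[:1] in ('x', 'X'):
        return chr(int(ref[1:], 16))
    return chr(int(ref))


def replace_charrefs(text):
    """Replace decodable numeric references; leave everything else as-is."""
    def _sub(match):
        try:
            return decode_charref(match.group(1))
        except (ValueError, OverflowError):
            # Malformed or out-of-range reference: keep the original text.
            return match.group(0)
    return CHARREF.sub(_sub, text)
```

For example, `replace_charrefs("&#65;&#x42;")` decodes both styles, while a malformed reference such as `&#1F;` (hex digits without the `x` prefix) is left untouched, mirroring the `unknown_charref` fallback in the report.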
I don't have the money to shell out for the XML spec, but according to http://developers.omnimark.com/documentation/concept/764.htm hexadecimal character references were added in SGML TC 2.
SGML TC 2 can be found here: http://www1.y12.doe.gov/capabilities/sgml/wg8/document/1955.htm
See section K.4.1 for hexadecimal character references.

Since this is really an update to the SGML standard, and not part of the original, any support for this should be an optional feature. It's really only interesting on the web, where standards compliance is... a little on the lax side. It would be reasonable to enable this by default from htmllib (if not already supported in htmllib; I don't remember). I'm fairly sure hex character references are already supported in HTMLParser.
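The remark about HTMLParser can be checked with a short sketch. This assumes Python 3's `html.parser` module (the successor to the Python 2 `HTMLParser` module discussed here), and `CharrefCollector` is a hypothetical helper class written only for this illustration.

```python
from html import unescape
from html.parser import HTMLParser


class CharrefCollector(HTMLParser):
    """Record the raw body of every numeric character reference seen."""

    def __init__(self):
        # convert_charrefs=False makes the parser call handle_charref
        # instead of decoding references into the text data itself.
        super().__init__(convert_charrefs=False)
        self.refs = []

    def handle_charref(self, name):
        self.refs.append(name)


collector = CharrefCollector()
collector.feed("<p>&#x41; and &#66;</p>")
collector.close()

# html.unescape decodes both decimal and hexadecimal references.
decoded = unescape("&#x41; and &#66;")
```

Here `collector.refs` holds the hexadecimal reference as `'x41'` and the decimal one as `'66'`, and `decoded` is `"A and B"`.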
Rejected since this didn't make it into Python 2.7.
History
Date | User | Action | Args
2022-04-10 16:11:04 | admin | set | github: 39204
2010-08-02 01:13:52 | fdrake | set | status: open -> closed; resolution: rejected; messages: +
2010-08-01 03:45:12 | meatballhat | set | nosy: + meatballhat; messages: +; title: gmllib doesn't support hex or Unicode character references -> sgmllib doesn't support hex or Unicode character references
2010-08-01 03:43:28 | meatballhat | set | title: sgmllib doesn't support hex or Unicode character references -> gmllib doesn't support hex or Unicode character references