Issue 803422: sgmllib doesn't support hex or Unicode character references

Created on 2003-09-09 20:53 by aaronsw, last changed 2022-04-10 16:11 by admin. This issue is now closed.

Messages (7)
msg60380 - Author: Aaron Swartz (aaronsw) Date: 2003-09-09 20:53
sgmllib supports neither hexadecimal character references nor Unicode characters, both of which are commonly seen on web pages. The following replacements fix both problems.

    charref = re.compile('&#([0-9a-fA-F]+)[^0-9a-fA-F]')

    def handle_charref(self, ref):
        try:
            if ref[0] == 'x' or ref[0] == 'X':
                m = int(ref[1:], 16)
            else:
                m = int(ref)
            self.handle_data(unichr(m).encode('utf-8'))
        except ValueError:
            self.unknown_charref(ref)
msg60381 - Author: Aaron Swartz (aaronsw) Date: 2003-09-09 21:00
Oops, that should be:

    charref = re.compile('&#([0-9a-fA-FxX][0-9a-fA-F]*)[^0-9a-fA-F]')
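The proposed fix can be tried outside sgmllib. The following is a minimal sketch, not the patch itself: it assumes Python 3 (so `chr` stands in for `unichr`), and the helper names `decode_charref` and `find_charrefs` are hypothetical, introduced only for illustration. It returns the decoded character instead of passing UTF-8 bytes to `handle_data`, and returns `None` where the proposed handler would call `unknown_charref`.

```python
import re

# Corrected pattern from the follow-up message: an optional leading x/X,
# then hex digits, then one terminating non-hex character.
charref = re.compile('&#([0-9a-fA-FxX][0-9a-fA-F]*)[^0-9a-fA-F]')


def decode_charref(ref):
    """Return the character for a reference body such as '65' or 'x41'.

    Returns None where the proposed handler would call unknown_charref().
    (Hypothetical helper, not part of sgmllib.)
    """
    try:
        if ref[0] in 'xX':
            code = int(ref[1:], 16)  # hexadecimal form: &#x41;
        else:
            code = int(ref)          # decimal form: &#65;
        return chr(code)             # Python 3 stand-in for unichr()
    except (ValueError, OverflowError):
        return None


def find_charrefs(text):
    """Return (reference body, decoded character) pairs found in text."""
    return [(m.group(1), decode_charref(m.group(1)))
            for m in charref.finditer(text)]
```

Note that the pattern also accepts bodies made of hex digits without a leading `x`, such as `ff`; those fail the decimal `int()` conversion and fall through to the unknown-reference path.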
msg60382 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2003-09-10 16:58
Are you sure hexadecimal character references are part of the SGML standard?
msg60383 - Author: Aaron Swartz (aaronsw) Date: 2003-09-10 22:42
I don't have the money to shell out for the XML spec, but according to http://developers.omnimark.com/documentation/concept/764.htm they were added in SGML TC 2.
msg63530 - Author: Fred Drake (fdrake) (Python committer) Date: 2008-03-14 16:30
SGML TC 2 can be found here: http://www1.y12.doe.gov/capabilities/sgml/wg8/document/1955.htm

See section K.4.1 for hexadecimal character references.

Since this is really an update to the SGML standard, and not part of the original, any support for this should be an optional feature. It's really only interesting on the web, where standards compliance is... a little on the lax side.

It would be reasonable to enable this by default from htmllib (if not already supported in htmllib; I don't remember). I'm fairly sure hex character references are already supported in HTMLParser.
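The remark about HTMLParser can be checked with a short sketch. This assumes Python 3, where the HTMLParser module lives on as `html.parser`; the class name `CharrefCollector` and the function `collect_charrefs` are hypothetical and exist only for this illustration. With `convert_charrefs=False`, the parser passes the body of each numeric reference, decimal or hexadecimal, to `handle_charref`.

```python
from html.parser import HTMLParser


class CharrefCollector(HTMLParser):
    """Record the body of every numeric character reference seen."""

    def __init__(self):
        # convert_charrefs=False keeps references out of handle_data()
        # so that handle_charref() is actually called.
        super().__init__(convert_charrefs=False)
        self.refs = []

    def handle_charref(self, name):
        # name is the text between '&#' and ';', e.g. '65' or 'x41'
        self.refs.append(name)


def collect_charrefs(markup):
    """Parse markup and return the reference bodies in document order."""
    parser = CharrefCollector()
    parser.feed(markup)
    parser.close()
    return parser.refs
```

Calling `collect_charrefs("<p>&#65; and &#x41;</p>")` should yield the two bodies `'65'` and `'x41'`, which a handler like the one proposed earlier in this issue could then decode.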
msg112262 - Author: Dan Buch (meatballhat) Date: 2010-08-01 03:45
Gads... didn't mean to submit a title change there.

Since sgmllib was removed in Python 3, should the status be changed to Rejected?
msg112414 - Author: Fred Drake (fdrake) (Python committer) Date: 2010-08-02 01:13
Rejected since this didn't make it into Python 2.7.
History
Date User Action Args
2022-04-10 16:11:04 admin set github: 39204
2010-08-02 01:13:52 fdrake set status: open -> closed; resolution: rejected; messages: +
2010-08-01 03:45:12 meatballhat set nosy: + meatballhat; messages: +; title: gmllib doesn't support hex or Unicode character references -> sgmllib doesn't support hex or Unicode character references
2010-08-01 03:43:28 meatballhat set title: sgmllib doesn't support hex or Unicode character references -> gmllib doesn't support hex or Unicode character references
2009-04-22 17:21:11 ajaksu2 set keywords: + easy
2009-02-13 03:41:24 ajaksu2 set priority: normal -> low; stage: test needed; type: enhancement; versions: + Python 2.7, - Python 2.3
2008-03-14 16:30:02 fdrake set nosy: + fdrake; messages: +
2003-09-09 20:53:13 aaronsw create