Issue 1459279: sgmllib.SGMLparser and hexadecimal numeric character refs


Created on 2006-03-27 12:51 by nerby, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (3)
msg60894 - Author: Francesco Ricciardi (nerby) Date: 2006-03-27 12:51
According to the HTML 4.0 specification, numeric character references may be hexadecimal as well as decimal (see http://www.w3.org/TR/REC-html40/charset.html#h-5.3.1). However, sgmllib.SGMLParser does not recognize the hexadecimal form. More and more HTML pages now use references to high code points that are not in the official HTML entity list, so proper handling of these references should be implemented. A possible solution could be:
- improving the "charref" regular expression so that it also matches hexadecimal values;
- considering all numeric references valid: those with n < 255 should be converted to the corresponding characters, and those above 255 should be left as numeric character references.
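The two-step proposal above could look roughly like the following. This is only a minimal standalone sketch in modern Python, not sgmllib's actual code: the names `CHARREF` and `replace_charrefs` and the `limit` parameter are hypothetical, and treating 255 itself as convertible is an assumption, since the report does not say which side of the boundary it falls on.

```python
import re

# Hypothetical pattern (not sgmllib's own): matches decimal references such as
# &#233; and hexadecimal ones such as &#xE9; or &#XE9;.
CHARREF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def replace_charrefs(text, limit=255):
    """Convert references up to `limit` to characters; leave the rest untouched."""
    def convert(match):
        hex_digits, dec_digits = match.groups()
        # Exactly one of the two groups is set, depending on the form matched.
        codepoint = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if codepoint <= limit:  # assumption: 255 itself is converted
            return chr(codepoint)
        return match.group(0)  # keep high code points as numeric references
    return CHARREF.sub(convert, text)


print(replace_charrefs("&#65;&#x42; &#x20AC;"))  # prints: AB &#x20AC;
```

Low code points are decoded regardless of whether they were written in decimal or hexadecimal, while a reference such as `&#x20AC;` passes through unchanged for the caller to handle.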
msg109853 - Author: Mark Lawrence (BreamoreBoy) * Date: 2010-07-10 11:21
sgmllib has been removed from py3k.
msg114670 - Author: Mark Lawrence (BreamoreBoy) * Date: 2010-08-22 10:45
sgmllib has been deprecated since 2.6 and has been removed from py3k.
History
Date User Action Args
2022-04-11 14:56:16 admin set github: 43097
2010-08-22 10:45:52 BreamoreBoy set status: open -> closed; resolution: out of date; messages: +; versions: + Python 3.2, - Python 2.7
2010-07-10 11:21:22 BreamoreBoy set nosy: + BreamoreBoy; messages: +; versions: - Python 3.1
2009-04-22 12:45:50 ajaksu2 set keywords: + easy
2009-03-21 02:02:53 ajaksu2 set stage: test needed; type: enhancement; versions: + Python 3.1, Python 2.7, - Python 2.4
2006-03-27 12:51:59 nerby create