[Python-Dev] sgmllib Comments

Sam Ruby rubys at intertwingly.net
Sun Jun 11 22:26:29 CEST 2006


Planet is a feed aggregator written in Python. It depends heavily on sgmllib. A recent bug report turned out to be a deficiency in sgmllib, and I've submitted a test case and a patch[1] (use or discard the patch; it is the test that I care about).

While looking around, a few things surfaced. For starters, it would seem that the version of sgmllib in SVN HEAD will selectively unescape certain character references that might appear in an attribute. I say selectively, as:

There are a number of issues here. While not unescaping anything is suboptimal, at least the recipient is aware of exactly which characters have been unescaped (i.e., none of them). The proposed solution makes it impossible for the recipient to know which characters are unescaped, and which are original. (Note: feeds often contain such abominations as &amp;copy; which the new code will treat indistinguishably from &copy;)
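To make the ambiguity concrete, here is a minimal sketch. It does not use sgmllib itself; selective_unescape and KNOWN_ENTITIES are hypothetical names, and the assumption that only the five predefined XML entities get unescaped is mine, chosen purely for illustration.

```python
import re

# Hypothetical table: assume only these entities are unescaped.
KNOWN_ENTITIES = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "apos": "'"}
_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def selective_unescape(value):
    """Unescape known entities; leave unknown ones (e.g. &copy;) untouched."""
    def replace(match):
        return KNOWN_ENTITIES.get(match.group(1), match.group(0))
    return _ENTITY.sub(replace, value)


# A doubly escaped reference and a singly escaped one...
escaped_twice = selective_unescape("&amp;copy;")
escaped_once = selective_unescape("&copy;")

# ...arrive at the recipient as the same string.
print(escaped_twice == escaped_once)
```

Once both inputs collapse to the same value, the recipient has no way to tell whether a second unescaping pass is appropriate.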

Additionally, there is a Unicode issue here, one that is shared by handle_charref, but at least that method is overridable. If unescaping remains, do it for hex character references and for values greater than 8 bits as well, i.e., use unichr instead of chr if the value is greater than 127.
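A rough sketch of what that could look like, written as a standalone function rather than as a patch to sgmllib. The name unescape_charrefs is hypothetical. It is written for Python 3, where chr covers the full Unicode range and so plays the role unichr plays in Python 2.

```python
import re

# Matches decimal (&#169;) and hexadecimal (&#xA9;) character references.
_CHARREF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def unescape_charrefs(value):
    """Replace numeric character references with the characters they name."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        codepoint = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            # Python 3 chr() handles values above 127 (unichr in Python 2).
            return chr(codepoint)
        except (ValueError, OverflowError):
            # Out-of-range reference: leave the original text alone.
            return match.group(0)
    return _CHARREF.sub(replace, value)


print(unescape_charrefs("&#169; &#xA9; &#x2014;"))
```

Named references such as &amp;copy; are deliberately left untouched here; only numeric references are handled.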

[1] http://tinyurl.com/j4a6n


