[Python-Dev] sgmllib Comments
Sam Ruby rubys at intertwingly.net
Mon Jun 12 06:01:23 CEST 2006
Fred L. Drake, Jr. wrote:
On Sunday 11 June 2006 16:26, Sam Ruby wrote:
> Planet is a feed aggregator written in Python. It depends heavily on
> SGMLLib. A recent bug report turned out to be a deficiency in sgmllib,
> and I've submitted a test case and a patch[1] (use or discard the patch,
> it is the test that I care about).

And it's a nice aggregator to use, indeed!

> While looking around, a few things surfaced. For starters, it would
> seem that the version of sgmllib in SVN HEAD will selectively unescape
> certain character references that might appear in an attribute. I say
> selectively, as:
>
> * it will unescape &amp;
> * it won't unescape &copy;
> * it will unescape &#38;
> * it won't unescape &#x26;
> * it will unescape &#146;
> * it won't unescape &#8217;

And just why would you use sgmllib to handle RSS or ATOM feeds? Neither is defined in terms of SGML. The sgmllib documentation also notes that it isn't really a fully general SGML parser (it isn't), but that it exists primarily as a foundation for htmllib.
The feed itself is read first with SAX (with a fallback to sgmllib if the feed is not well formed, but that's beside the point). The embedded HTML portions are then processed with subclasses of sgmllib.
> There are a number of issues here. While not unescaping anything is
> suboptimal, at least the recipient is aware of exactly which characters
> have been unescaped (i.e., none of them). The proposed solution makes
> it impossible for the recipient to know which characters are unescaped,
> and which are original. (Note: feeds often contain such abominations as
> &amp;copy; which the new code will treat indistinguishably from &copy;)
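To make the ambiguity concrete, here is a minimal sketch. The function `unescape_amp_only` is hypothetical and only stands in for any parser that unescapes some references and not others; it is not sgmllib's code. The sample strings are assumed inputs.

```python
def unescape_amp_only(value):
    """Hypothetical partial unescaper: rewrites &amp; and nothing else."""
    return value.replace("&amp;", "&")


double_escaped = "&amp;copy; 2006"  # assumed input: escaped twice by the publisher
single_escaped = "&copy; 2006"      # assumed input: escaped once

# Two different source strings come out identical, so the recipient
# cannot tell which one was sent.
assert unescape_amp_only(double_escaped) == unescape_amp_only(single_escaped)
```

With no unescaping at all, the two inputs would stay distinct and the recipient could decode them deliberately.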
My suspicion is that the "right" thing to do at the sgmllib level is to categorize the markup and call a method depending on what the entity reference is, and let that handle whatever it is. For SGML, that means we have things like &name; (entity references), &#123; (character references), and that's it. &#x123; isn't legal SGML under any circumstance; the "&#x;" syntax was introduced with XML.
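The "categorize and dispatch" idea could look roughly like the sketch below. It is written for Python 3 and is not sgmllib's implementation: the class `ReferenceDispatcher`, the hook `handle_hexcharref`, and the regular expression are all assumptions made for illustration. The hooks only record what they see, so each kind of reference stays distinguishable.

```python
import re

# One pattern for the three forms discussed above:
#   &#x123;  hexadecimal character reference (XML-style)
#   &#123;   decimal character reference
#   &name;   entity reference
REFERENCE = re.compile(
    r"&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|([A-Za-z][A-Za-z0-9]*));"
)


class ReferenceDispatcher:
    """Hypothetical helper: call one hook per kind of reference."""

    def __init__(self):
        self.events = []

    # Subclasses could override these hooks to unescape, pass through, etc.
    def handle_text(self, text):
        self.events.append(("text", text))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, digits):
        self.events.append(("charref", digits))

    def handle_hexcharref(self, digits):
        self.events.append(("hexcharref", digits))

    def feed(self, value):
        pos = 0
        for match in REFERENCE.finditer(value):
            if match.start() > pos:
                self.handle_text(value[pos:match.start()])
            hex_digits, dec_digits, name = match.groups()
            if hex_digits is not None:
                self.handle_hexcharref(hex_digits)
            elif dec_digits is not None:
                self.handle_charref(dec_digits)
            else:
                self.handle_entityref(name)
            pos = match.end()
        if pos < len(value):
            self.handle_text(value[pos:])
        return self.events
```

Because nothing is unescaped by the dispatcher itself, the subclass decides what to do with each reference and always knows which parts of the value were literal text.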
... but it effectively is valid HTML. And, as you point out below, sgmllib's raison d’être is to support htmllib.
> Additionally, there is a unicode issue here - one that is shared by
> handle_charref, but at least that method is overrideable. If unescaping
> remains, do it for hex character references and for values greater than
> 8-bits, i.e., use unichr instead of chr if the value is greater than 127.
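A sketch of the conversion being asked for, handling both decimal and hex forms and values beyond 8 bits. This is a standalone, hypothetical function written for Python 3, not the method in sgmllib. In Python 3 `chr()` covers all of Unicode; the Python 2 code under discussion would need `unichr()` for the larger values.

```python
def convert_charref(name):
    """Turn the body of a character reference into text, or None if invalid.

    `name` is the part between "&#" and ";", e.g. "38", "x26" or "8217".
    """
    try:
        if name[:1] in ("x", "X"):
            codepoint = int(name[1:], 16)  # hex form, e.g. "x26"
        else:
            codepoint = int(name, 10)      # decimal form, e.g. "38"
    except ValueError:
        return None
    if not 0 <= codepoint <= 0x10FFFF:
        return None
    # Python 3: chr() handles any code point. Python 2 would use
    # unichr() here once the value exceeds 127.
    return chr(codepoint)
```

Returning `None` for anything unparseable lets the caller leave the original reference text untouched.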
For SGML, it's worse than that, since the document character set is defined in the SGML declaration, which is a far hairier beast than an XML declaration. :-)
understood
It really sounds like sgmllib is the wrong foundation for this. While the module has some questionable behaviors, none of them are significant in its intended context (support for htmllib). Now, I understand that RSS has historical issues, with HTML-as-practiced getting embedded as payload data with various flavors of escaping applied, and I'm not an expert in the details of that. Have you looked at HTMLParser as an alternative to sgmllib? It has better support for XHTML constructs.
HTMLParser is less forgiving, and generally less suitable for consuming HTML as practiced.
- Sam Ruby