[Python-Dev] sgmllib Comments
Sam Ruby rubys at intertwingly.net
Mon Jun 12 12:49:50 CEST 2006
- Previous message: [Python-Dev] sgmllib Comments
- Next message: [Python-Dev] sgmllib Comments
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Martin v. Löwis wrote:
Sam Ruby wrote:
If we can agree on the behavior, I would be glad to write up a patch.
It seems to me that the simplest way to proceed would be for the code that attempts to resolve character references (both named and numeric) in attributes to be isolated in a single method. Subclasses that desire different behavior (including the existing Python 2.4 and prior behaviour) could simply override this method.
In SGML, this is problematic: the named things are not character references, they are entity references, and it isn't necessarily the case that they expand to a character. For example, &author; might expand to "Martin v. Löwis", and &logo; might refer to a bitmap image which is unparsed.
That said, providing an overridable replacement function sounds like the right approach. To keep with tradition, I would still distinguish between character references and entity references, i.e. provide two overridable functions instead. Returning None could mean that no replacement is available.
As for default implementations, I think they should do what currently happens: entity references are replaced according to entitydefs, and character references are replaced by bytes if their value is smaller than 256.
Contrary to what others said, it appears that SGML does support hexadecimal character references, provided that the SGML declaration contains the HCRO definition (which, for HTML and XML, is defined as HCRO "&#x"). So it seems safe to process hex character references by default (although it isn't safe to assume Unicode, IMO).
I don't see why expanding to multiple characters is a problem.
Just so that we have a tracking number and real code to anchor this discussion, I've opened the following and attached a patch:
This implementation does handle multiple character expansions. It does default to exactly what the current code does. It does not currently handle hexadecimal character references.
It also does pass all the current sgmllib tests, though I did not include any additional tests in this initial patch.
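For readers following the thread, here is a minimal standalone sketch of the design being discussed; it is not the attached patch and not sgmllib itself. The class and method names (AttrRefResolver, convert_charref, convert_hex_charref, convert_entityref, expand) are hypothetical. It assumes that each hook returns None when no replacement is available, that decimal references below 256 are replaced by default, and that hexadecimal references are left alone by default.

```python
import re

# Matches &#65; (decimal), &#x41; (hexadecimal) and &name; references.
_REF = re.compile(
    r"&(?:#([0-9]+)|#[xX]([0-9a-fA-F]+)|([A-Za-z][-.A-Za-z0-9]*));"
)


class AttrRefResolver:
    """Hypothetical illustration: one overridable hook per reference kind."""

    entitydefs = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}

    def convert_charref(self, digits):
        # Default: only replace values below 256; None means "no replacement".
        n = int(digits)
        return chr(n) if n < 256 else None

    def convert_hex_charref(self, digits):
        # Assumption: hexadecimal references are not handled by default.
        return None

    def convert_entityref(self, name):
        # The replacement may be several characters long.
        return self.entitydefs.get(name)

    def expand(self, value):
        def repl(match):
            dec, hexa, name = match.groups()
            if dec is not None:
                out = self.convert_charref(dec)
            elif hexa is not None:
                out = self.convert_hex_charref(hexa)
            else:
                out = self.convert_entityref(name)
            # Leave the reference untouched when a hook returns None.
            return match.group(0) if out is None else out

        return _REF.sub(repl, value)


class UnicodeResolver(AttrRefResolver):
    """A subclass that opts in to hex references and multi-character entities."""

    entitydefs = dict(AttrRefResolver.entitydefs, author="Martin v. L\u00f6wis")

    def convert_charref(self, digits):
        n = int(digits)
        return chr(n) if n <= 0x10FFFF else None

    def convert_hex_charref(self, digits):
        n = int(digits, 16)
        return chr(n) if n <= 0x10FFFF else None
```

With this sketch, AttrRefResolver().expand("a &lt; b &#x41;") leaves the hexadecimal reference in place, while UnicodeResolver().expand("&author; &#x41;") expands both.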
- Sam Ruby