[Python-Dev] sgmllib Comments

"Martin v. Löwis" martin at v.loewis.de
Mon Jun 12 08:18:50 CEST 2006

Sam Ruby wrote:

> If we can agree on the behavior, I would be glad to write up a patch.
>
> It seems to me that the simplest way to proceed would be for the code that attempts to resolve character references (both named and numeric) in attributes to be isolated in a single method. Subclasses that desire different behavior (including the existing Python 2.4 and prior behaviour) could simply override this method.

In SGML, this is problematic: The named things are not character references, they are entity references, and it isn't necessarily the case that they expand to a character. For example, &author; might expand to "Martin v. Löwis", and &logo; might refer to a bitmap image which is unparsed.

That said, providing an overridable replacement function sounds like the right approach. In keeping with tradition, I would still distinguish between character references and entity references, i.e. provide two overridable functions instead of one. Returning None could mean that no replacement is available.

As for default implementations, I think they should do what currently happens: entity references are replaced according to entitydefs, and character references are replaced by bytes if their value is smaller than 256.
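A minimal sketch of that design, to make the two hooks concrete. The class AttrRefResolver, the method names replace_charref and replace_entityref, the regular expression and the sample entitydefs table are all made up for illustration; none of this is sgmllib's actual code. It is written for Python 3, so a one-character str stands in for the byte that Python 2 would produce.

```python
import re


class AttrRefResolver:
    """Hypothetical helper; the names are illustrative, not sgmllib's API."""

    # Assumed sample table; a real parser would supply its own entitydefs.
    entitydefs = {"lt": "<", "gt": ">", "amp": "&", "quot": '"', "apos": "'"}

    # Decimal character reference (group 1) or entity reference (group 2).
    _ref = re.compile(r"&(?:#([0-9]+)|([A-Za-z][-.A-Za-z0-9]*));?")

    def replace_charref(self, digits):
        """Return the replacement for a character reference, or None."""
        n = int(digits)
        if n > 255:
            return None  # no replacement available by default
        return chr(n)

    def replace_entityref(self, name):
        """Return the replacement for an entity reference, or None."""
        return self.entitydefs.get(name)

    def resolve(self, value):
        """Expand references in an attribute value, keeping unknown ones."""
        def repl(match):
            if match.group(1) is not None:
                replacement = self.replace_charref(match.group(1))
            else:
                replacement = self.replace_entityref(match.group(2))
            # None means "no replacement": leave the reference untouched.
            return match.group(0) if replacement is None else replacement
        return self._ref.sub(repl, value)


class WithAuthor(AttrRefResolver):
    """A subclass overriding only the entity hook."""

    def replace_entityref(self, name):
        if name == "author":
            return "Martin v. Löwis"
        return super().replace_entityref(name)
```

With these definitions, AttrRefResolver().resolve("&#65; &#8364; &author;") leaves the last two references as they are, while WithAuthor expands &author; and still leaves &#8364; alone.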

Contrary to what others said, it appears that SGML does support hexadecimal character references, provided that the SGML declaration contains the HCRO definition (which, for HTML and XML, is defined as HCRO "&#x"). So it seems safe to process hex character references by default (although it isn't safe to assume Unicode, IMO).
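As a rough illustration of processing both forms by default, the sketch below parses the text between "&#" and ";" as either a decimal or a hexadecimal number. The function names charref_to_number and default_charref are hypothetical, and accepting only a lowercase "x" is an assumption based on the HCRO "&#x" definition quoted above. Values of 256 and above get no replacement, since Unicode is not assumed.

```python
import re

# Hypothetical helpers for illustration; not taken from sgmllib.
_DECIMAL = re.compile(r"[0-9]+\Z")
_HEX = re.compile(r"x[0-9a-fA-F]+\Z")  # assumes HCRO is "&#x" (lowercase x)


def charref_to_number(body):
    """Map the text between '&#' and ';' to an integer, or None if malformed."""
    if _DECIMAL.match(body):
        return int(body, 10)
    if _HEX.match(body):
        return int(body[1:], 16)
    return None


def default_charref(body):
    """Replace only values below 256; anything else has no replacement."""
    n = charref_to_number(body)
    if n is None or n > 255:
        return None
    return chr(n)  # a one-character str stands in for a Python 2 byte
```

Here default_charref("65") and default_charref("x41") both yield "A", while default_charref("x20AC") returns None.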

Regards, Martin
