[Python-Dev] Unicode entities in XML cause problems :-(

Martin v. Loewis martin@v.loewis.de
28 Apr 2002 11:16:44 +0200


"Matthias Urlichs" <smurf@noris.de> writes:

> The proper fix, IMO, is to have writexml accept an encoding argument,
> and, by default, write the output as UTF-8. Then there is no need for
> character or entity references.
>
> The encoding should probably default to the one from the document header (UTF-8 if that isn't given).

In .toxml, we are going to create a document header. We can put anything into there that we want.

If you think that the encoding should be the one that the "original" document had - that cannot work. First, the parser does not provide that information, and the DOM does not preserve it. Furthermore, there doesn't even have to be an original document - the DOM tree could have been created from scratch.
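To illustrate the point about a tree built from scratch, here is a minimal sketch using the minidom API as it exists in current Python 3 (not necessarily what was available when this was written). The element name and text are made up for the example. No input document exists, so the only encoding in play is the one chosen at serialization time.

```python
from xml.dom.minidom import Document

# Build a DOM tree from scratch: there is no "original" document
# and therefore no original encoding to fall back on.
doc = Document()
root = doc.createElement("greeting")  # hypothetical element name
root.appendChild(doc.createTextNode("Gr\u00fc\u00dfe"))
doc.appendChild(root)

# The encoding is picked when serializing and is written into the header.
data = doc.toxml(encoding="utf-8")
print(data)
```

With an encoding argument, toxml() returns bytes and the generated header names that encoding.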

> For XML escaping, the approach suggested by this patch would be to use xmlcharrefreplace() (see the test script) as the error handler. But that doesn't help with &<>". Personally, I rather dislike having to do a separate replace() for these.
>
> One approach would be to use character maps which have strategic holes where & < > and possibly " live..?

Depends on your output encoding. If you want to use us-ascii as an output encoding, then it would be easy to create a character map codec that has holes for these characters.
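A rough sketch of what such a character map with holes could look like for us-ascii, written against current Python 3. The handler name "xmlescape-sketch" and the helper encode_ascii_xml() are hypothetical names invented for this example, not part of the patch. The handler returns bytes, because a str replacement containing "&" would be looked up in the same map and hit the hole again.

```python
import codecs

# Named entities for the characters that get "holes" in the map.
SPECIALS = {"&": b"&amp;", "<": b"&lt;", ">": b"&gt;", '"': b"&quot;"}

# Identity map for us-ascii, minus the XML special characters.
ASCII_WITH_HOLES = {i: i for i in range(128) if chr(i) not in SPECIALS}

def xml_escape_handler(exc):
    """Hypothetical handler: named entity for a hole, numeric reference otherwise."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    out = b"".join(SPECIALS.get(ch, b"&#%d;" % ord(ch)) for ch in bad)
    # Returning bytes makes the encoder copy them to the output unchanged.
    return out, exc.end

codecs.register_error("xmlescape-sketch", xml_escape_handler)

def encode_ascii_xml(text):
    # Anything missing from the map, holes included, goes to the handler.
    data, _ = codecs.charmap_encode(text, "xmlescape-sketch", ASCII_WITH_HOLES)
    return data

print(encode_ascii_xml('a < b & "c" \u20ac'))
```

Both the markup characters and the non-ASCII characters are handled in a single encoding pass, with no separate replace() calls.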

If the user wants to specify the output encoding, this may be more difficult, since the codec for the output encoding may not be based on character maps. Since this is the application that the SF patch has in mind, I doubt you can avoid the replace calls.
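For an arbitrary output encoding, the replace-then-encode approach might look roughly like this in current Python 3. The function name escape_and_encode() is made up for the example. The markup characters are replaced first, and the "xmlcharrefreplace" error handler then covers whatever the chosen codec cannot represent.

```python
def escape_and_encode(text, encoding="utf-8"):
    # '&' has to be replaced first, otherwise the ampersands introduced
    # by the other entities would be escaped a second time.
    escaped = (
        text.replace("&", "&amp;")
            .replace("<", "&lt;")
            .replace(">", "&gt;")
            .replace('"', "&quot;")
    )
    # Characters the codec cannot encode become numeric character references.
    return escaped.encode(encoding, "xmlcharrefreplace")

print(escape_and_encode('x < "\u20ac" & \u00fc', "latin-1"))
```

This works with any codec, character map based or not, at the price of the extra replace() passes over the string.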

Regards, Martin