[Python-3000] str/unicode tests: pyexpat.c and read(n) (original) (raw)

Talin talin at acm.org
Sat Jul 21 19:36:05 CEST 2007


James Y Knight wrote:

On Jul 21, 2007, at 12:25 AM, Fred L. Drake, Jr. wrote:

On Saturday 21 July 2007, Joe Gregorio wrote: Should xml.parsers.expat.XMLParser.ParseFile(file) operate on both text and binary streams? No. XML is a serialization of a markup language containing Unicode character into an encoded stream. Well...there's many reasons why it is useful to be able to parse an already-decoded unicode stream into XML, and to serialize XML into a unicode string. For example, if combining into a larger unicode document, or parsing from a literal string in the source code. Sure, normally XML is serialized to bytes, but it is also serializable to unicode, and that's a useful feature to have (if implementable).

The general use case for XML is reading or writing a document, where "document" means a bytestream from either a file or a socket.

The question is whether it would also be useful to parse Python strings that contain XML markup, or format an XML document into a Python string.

Some care needs to be taken here, because XML has its own way of specifying the character encoding. For example, suppose I have a python string that contains the characters:

'<?xml version="1.0" encoding="utf-8" ?>'

Well, the problem with this is that the encoding isn't UTF-8. Python 3000 strings are internally encoded as UTF-16 (although generally it tries to hide that fact from you so most of the time you don't have to care.)

Suppose then that you write this string out to a file (perhaps after combining it with other strings.) If I happen to write the file as UTF-8, then everything is fine, but if I happen to pick some other encoding that doesn't match the encoding attribute in the prologue then we have the potential for confusion.

This matters because there are lots of people who write XML documents with print statements (and many of them forget to handle things like escaping of entities and such.)

This also matters because the Python XML parsing libraries are mostly based on expat, which is C code that doesn't have any special knowledge of Python strings - it only works on the encodings that it can detect, or which you tell it to use.

So if you wanted to directly parse a Python string as XML, you would probably have to treat it as a byte array and override the encoding detection, telling it explicitly to use UTF-16.

James


Python-3000 mailing list Python-3000 at python.org http://mail.python.org/mailman/listinfo/python-3000 Unsubscribe: http://mail.python.org/mailman/options/python-3000/talin%40acm.org



More information about the Python-3000 mailing list