Issue 1470540: XMLGenerator creates a mess with UTF-16
When output encoding in xml.sax.saxutils.XMLGenerator is set to UTF-16, the result is a terrible mess. Namely:
- it does not encode the XML declaration at the very top of the file (leaving it in single-byte Latin);
- it leaves the closing '>' of each start tag unencoded (that is, it always outputs a single byte);
- it inserts a spurious byte order mark for each tag, each attribute, each text node, and each processing instruction.
A test illustrating the issue is attached. The issue affects both the stable (2.4.3) and the current (2.5) versions of Python.
Looking in xml/sax/saxutils.py, I see the problem in XMLGenerator._write():
- byte strings (str) are written out as-is and aren't recoded at all (sic!);
- Unicode strings are converted using unicode.encode(); this results in a BOM for each call of _write() on Unicode strings.
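The second point can be reproduced without XMLGenerator at all. The following is only a rough sketch, written for Python 3 (where str plays the role of the unicode type mentioned above); the helper name encode_per_call and the sample fragments are made up for illustration and are not taken from saxutils.py.

```python
import codecs

def encode_per_call(chunks, encoding="utf-16"):
    """Encode every chunk on its own, the way a per-call encode() would."""
    return [chunk.encode(encoding) for chunk in chunks]

# Hypothetical fragments, roughly what a generator might pass to a
# write routine one at a time while emitting <doc>text</doc>.
chunks = ["<doc", ">", "text", "</doc>"]
encoded = encode_per_call(chunks)

# Count how many of the separately encoded pieces begin with a
# native-order UTF-16 byte order mark.
boms = sum(1 for piece in encoded if piece.startswith(codecs.BOM_UTF16))
print(boms, "BOMs for", len(chunks), "fragments")
```

Because the "utf-16" codec has no memory between calls, every independent encode() starts a fresh stream and prepends its own BOM.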
The issue is easy to fix by using StreamWriter instead of a plain stream as the output sink. I am going to submit a patch shortly.
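A minimal sketch of the StreamWriter idea, again for Python 3 and not the actual patch: the function name write_with_stream_writer and the in-memory sink are assumptions chosen for illustration. The only library calls used are codecs.getwriter() and io.BytesIO.

```python
import codecs
import io

def write_with_stream_writer(chunks, encoding="utf-16"):
    """Write all chunks through one StreamWriter wrapped around a byte sink."""
    sink = io.BytesIO()
    # codecs.getwriter() returns the StreamWriter class for the encoding;
    # instantiating it wraps the sink and keeps encoder state across writes.
    writer = codecs.getwriter(encoding)(sink)
    for chunk in chunks:
        writer.write(chunk)
    return sink.getvalue()

# Hypothetical fragments standing in for separate write calls.
data = write_with_stream_writer(["<doc", ">", "text", "</doc>"])

# The byte order mark is emitted once, at the start of the stream.
print(data.count(codecs.BOM_UTF16))
print(data.decode("utf-16"))
```

Since the writer remembers that it has already started the stream, only the first write() carries a BOM, and the whole output decodes back to a single well-formed string.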
Regards, Nikolai Grigoriev