Message 75864 - Python tracker

This is a bug in two halves.

  1. Not all characters in the file are encoded as UTF-16. The initial XML header isn't, and neither are the individual markup characters such as < and >. Fixing this is just a matter of extending the encoding step to cover all characters and not just the textual bits. The only work-around is a five-minute hack of the ElementTree.write() method.

  2. Every write emits its own BOM, which corrupts the file in a manner analogous to bug 555360. This results from using string.encode() and is well-known behaviour. It can be worked around by using UTF-16LE or UTF-16BE, which do not prepend a BOM, but then the file has no BOM at all. A complete solution would be to rewrite ElementTree.write() to use a different encoding mechanism such as a StreamWriter.
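
To illustrate the second half, here is a minimal sketch in present-day Python 3 (not the ElementTree code itself). It assumes the output is assembled from several small pieces, and the `pieces` list and the `count_boms` helper are made up for the demonstration. Encoding each piece separately yields one BOM per piece, whereas a writer obtained from `codecs.getwriter("utf-16")` should emit the BOM only on the first write.

```python
import codecs
import io

# Hypothetical fragments, standing in for the separate writes a serializer makes.
pieces = ["<root>", "text", "</root>"]

# Encoding each fragment on its own: every encode("utf-16") call prepends a BOM.
naive = b"".join(piece.encode("utf-16") for piece in pieces)

# Writing through a StreamWriter: the BOM is emitted once, on the first write.
buffer = io.BytesIO()
writer = codecs.getwriter("utf-16")(buffer)
for piece in pieces:
    writer.write(piece)
streamed = buffer.getvalue()

BOMS = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def count_boms(data):
    """Count BOMs at 2-byte-aligned offsets (illustrative helper)."""
    return sum(1 for i in range(0, len(data), 2) if data[i:i + 2] in BOMS)


print(count_boms(naive))     # one BOM per fragment
print(count_boms(streamed))  # a single leading BOM
```

Decoding `naive` as UTF-16 leaves stray U+FEFF characters in the middle of the text, which is the corruption described above.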

I have made the above hack and work-around for my own use, and I can report that it produces perfect UTF-16.
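
For anyone wanting something similar, the following is a rough sketch of one way to get a fully UTF-16 file with a single BOM in present-day Python 3. It is not the hack described above, whose code is not shown here. The helper name `serialize_utf16` is hypothetical, and the choice of little-endian byte order is an assumption. The idea is to build the whole document as text first, XML header and markup included, and then encode it exactly once.

```python
import codecs
import xml.etree.ElementTree as ET


def serialize_utf16(root):
    """Hypothetical helper: return the whole document as UTF-16 bytes with one BOM."""
    # The declaration is built as text so that it is encoded along with everything else.
    declaration = '<?xml version="1.0" encoding="UTF-16"?>\n'
    # encoding="unicode" makes tostring() return a str without a declaration.
    body = ET.tostring(root, encoding="unicode")
    text = declaration + body
    # Encode once with an explicit byte order (assumed little-endian) and
    # prepend the matching BOM by hand.
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


root = ET.Element("root")
child = ET.SubElement(root, "child")
child.text = "h\u00e9llo"

data = serialize_utf16(root)
```

Because the header, the markup characters and the text all pass through the same single encode step, both halves of the bug are avoided at once.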