Issue 9241: SAXParseError on unicode (Japanese) file

When parsing a UTF-16 little-endian encoded XML file containing some Japanese characters, the xml.sax.parse function raises a SAXParseException saying "no element found". The problem occurs on:

Python 2.5.2/Windows XP Pro SP3 32 bit
Python 2.6.4/Windows XP Pro SP3 32 bit
Python 2.5.2/Windows 2008 Server SP2 64 bit

The same file is processed successfully on:

Python 2.4.3/CentOS 5.4
Python 2.6.3/CentOS 5.4

I've attached a minimal XML file containing a single Japanese character, U+FF1A (FULLWIDTH COLON), that triggers the exception. The code for parsing the file follows:

import xml.sax
xml.sax.parse(open("ff1a.xml"), xml.sax.ContentHandler())

Best regards, Gianfranco

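Your file contains the byte \x1a (Ctrl-Z), which Windows treats as an end-of-file marker when a file is opened in text mode; U+FF1A is stored in UTF-16 little-endian as the bytes \x1a \xff. You should open the file in binary mode, not text mode, otherwise the data is truncated at that byte and the parser never sees the rest of the document.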

import xml.sax
xml.sax.parse(open("ff1a.xml", 'rb'), xml.sax.ContentHandler())

works on all versions I tried.
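
For anyone who wants to reproduce the effect without the attachment, here is a minimal sketch. It assumes a stand-in document (not the actual ff1a.xml) and a hypothetical Collector handler, and it is written for Python 3 rather than the 2.x versions discussed above. The truncation is simulated by slicing the bytes at 0x1A instead of relying on any platform's text-mode behavior.

```python
import io
import xml.sax


class Collector(xml.sax.ContentHandler):
    """Hypothetical handler that gathers all character data."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


# Stand-in for ff1a.xml (assumed content): BOM + UTF-16 little-endian bytes.
text = u'<?xml version="1.0" encoding="UTF-16"?><root>\uff1a</root>'
data = b"\xff\xfe" + text.encode("utf-16-le")

# U+FF1A is stored low byte first, so the byte 0x1A appears in the data.
eof_offset = data.index(b"\x1a")

# Full bytes, as a binary-mode read would return them: parsing succeeds.
handler = Collector()
xml.sax.parse(io.BytesIO(data), handler)
parsed_text = "".join(handler.chunks)

# Bytes cut at 0x1A, simulating the truncated text-mode read: parsing fails.
try:
    xml.sax.parse(io.BytesIO(data[:eof_offset]), Collector())
    truncated_ok = True
except xml.sax.SAXParseException:
    truncated_ok = False
```

With the full byte string the handler receives the U+FF1A character; with the truncated one the parser runs out of input before the root element is closed and raises SAXParseException.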