[Python-Dev] Unicode byte order mark decoding

M.-A. Lemburg mal at egenix.com
Thu Apr 7 11:07:58 CEST 2005

Nicholas Bastin wrote:

On Apr 5, 2005, at 6:19 AM, M.-A. Lemburg wrote:

Note that the UTF-16 codec is strict with respect to the presence of the BOM: you get a UnicodeError if a stream does not start with a BOM. For the UTF-8-SIG codec, this should probably be relaxed to not require the BOM.

I've actually been confused about this point for quite some time now, but never had a chance to bring it up. I do not understand why UnicodeError should be raised if there is no BOM. I know that PEP-100 says:

    'utf-16': 16-bit variable length encoding (little/big endian)

and:

    Note: 'utf-16' should be implemented by using and requiring byte order marks (BOM) for file input/output.

But this appears to be in error, at least in the current Unicode standard. 'utf-16', as defined by the Unicode standard, is big-endian in the absence of a BOM:

---
3.10.D42: UTF-16 encoding scheme: ...
* The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian.
---

The problem is "in the absence of a higher-level protocol": the codec doesn't know anything about a protocol. It's the application using the codec that knows which protocol gets used. It's a lot safer to require the BOM for UTF-16 streams and to raise an exception when it is missing, so that the application decides whether to use UTF-16-BE or the far more common UTF-16-LE.
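To illustrate, here is a minimal sketch of an application making that decision itself. It assumes a current CPython 3 interpreter, where the UTF-16 stream reader still rejects input that has no BOM. The helper name read_utf16 and its fallback parameter are invented for this example and are not part of the codecs module.

```python
import codecs
import io


def read_utf16(data, fallback=None):
    """Decode a complete UTF-16 byte stream.

    `fallback` is a hypothetical, application-level choice such as
    "utf-16-le" or "utf-16-be", used only when the stream has no BOM.
    """
    reader = codecs.getreader("utf-16")(io.BytesIO(data))
    try:
        # The stream reader picks the byte order from the BOM and
        # raises UnicodeError if the stream does not start with one.
        return reader.read()
    except UnicodeError:
        if fallback is None:
            raise
        # Only the application knows which byte order its protocol implies.
        return data.decode(fallback)


with_bom = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
without_bom = "hi".encode("utf-16-be")

print(read_utf16(with_bom))
print(read_utf16(without_bom, fallback="utf-16-be"))
```

The sketch treats every UnicodeError as "no BOM" for brevity; a real application would probably want to tell a missing BOM apart from genuinely malformed data.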

Unlike for the UTF-8 codec, the BOM for UTF-16 is a configuration parameter, not merely a signature.
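The difference can be sketched as follows, assuming a current CPython 3 where a "utf-8-sig" codec exists and accepts input with or without the signature. The variable names are only for illustration.

```python
import codecs

# UTF-8: the BOM is only a signature. The payload bytes mean the same
# thing with or without it, so "utf-8-sig" can simply skip it if present.
plain = "hi".encode("utf-8")
signed = codecs.BOM_UTF8 + plain
utf8_results = (plain.decode("utf-8-sig"), signed.decode("utf-8-sig"))

# UTF-16: the same payload bytes decode to different text depending on
# the byte order, so the BOM (or an explicit codec name) acts as a parameter.
payload = b"\x00h\x00i"
as_big_endian = payload.decode("utf-16-be")
as_little_endian = payload.decode("utf-16-le")

print(utf8_results)
print(ascii(as_big_endian), ascii(as_little_endian))
```

Both UTF-16 decodes succeed without error, which is why a wrong guess about the byte order goes unnoticed unless something forces the choice to be explicit.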

In terms of history, I don't recall whether your quote was already in the standard at the time I wrote the PEP. You are the first to have reported a problem with the current implementation (which has been around since 2000), so I believe that application writers are more comfortable with the way the UTF-16 codec is currently implemented. Explicit is better than implicit :-)

The current implementation of the utf-16 codecs makes for some irritating gymnastics to write the BOM into the file before reading it if it contains no BOM, which seems quite like a bug in the codec.

The codec writes a BOM in the first call to .write() - it doesn't write a BOM before reading from the file.
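A small sketch of that behaviour, assuming a current CPython 3 and using an in-memory buffer instead of a file. The function name write_utf16 is made up for this example.

```python
import codecs
import io


def write_utf16(chunks):
    """Write text chunks through a UTF-16 stream writer and return the bytes."""
    buffer = io.BytesIO()
    writer = codecs.getwriter("utf-16")(buffer)
    # Creating the writer emits nothing; the BOM is produced by the
    # first write() call, and later calls write payload only.
    for chunk in chunks:
        writer.write(chunk)
    return buffer.getvalue()


data = write_utf16(["a", "b"])
# codecs.BOM_UTF16 is the BOM in the platform's native byte order.
print(data[:2] == codecs.BOM_UTF16)
# One 2-byte BOM followed by two 2-byte code units.
print(len(data))
```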

I allow for the possibility that this was ambiguous in the standard when the PEP was written, but it is certainly not ambiguous now.

See above.

Thanks,

Marc-Andre Lemburg eGenix.com

