[Python-Dev] Unicode byte order mark decoding

Tue Apr 5 08:25:09 CEST 2005

MAL> The BOM (byte order mark) was a non-standard Microsoft
MAL> invention to detect Unicode text data as such (MS always uses
MAL> UTF-16-LE for Unicode text files).
MAL> -1; there's no standard for UTF-8 BOMs - adding it to the
MAL> codecs module was probably a mistake to begin with. You
MAL> usually only get UTF-8 files with BOM marks as the result of
MAL> recoding UTF-16 files into UTF-8.
MAL> BTW, how do you know that s came from the start of a file and
MAL> not from slicing some already loaded file somewhere in the
MAL> middle ?
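
To make the two points above concrete (a UTF-8 BOM is usually a leftover from recoding, and a BOM only means something at the very start of the data), here is a minimal sketch in present-day Python 3. It assumes the `utf-8-sig` codec, which was not available when this thread was written; the sample byte strings are made up for illustration.

```python
import codecs

# Hypothetical sample: UTF-8 data that starts with a BOM, as typically
# produced by recoding a UTF-16 file into UTF-8.
with_bom = codecs.BOM_UTF8 + "abc".encode("utf-8")

# The plain utf-8 codec keeps the BOM as U+FEFF in the result.
plain = with_bom.decode("utf-8")

# The utf-8-sig codec drops a BOM, but only at the start of the data.
stripped = with_bom.decode("utf-8-sig")

# Hypothetical sample: the same three bytes in the middle of the data,
# as you might get from slicing or concatenating already loaded text.
middle = "abc".encode("utf-8") + codecs.BOM_UTF8 + "def".encode("utf-8")
kept = middle.decode("utf-8-sig")

print(repr(plain), repr(stripped), repr(kept))
```

Note that neither codec can tell whether the bytes really came from the start of a file; that knowledge has to come from the caller.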
MAL> Evan Jones wrote:

>> This is *not* a valid Unicode character. The Unicode
>> specification (version 4, section 15.8) says the following
>> about non-characters:
>> 
>>> Applications are free to use any of these noncharacter code
>>> points internally but should never attempt to exchange
>>> them. If a noncharacter is received in open interchange, an
>>> application is not required to interpret it in any way. It is
>>> good practice, however, to recognize it as a noncharacter and
>>> to take appropriate action, such as removing it from the
>>> text. Note that Unicode conformance freely allows the removal
>>> of these characters. (See C10 in Section 3.2, Conformance
>>> Requirements.)
>> 
>> My interpretation of the specification means that Python should
>> silently remove the character, resulting in a zero length
>> Unicode string.  Similarly, both of the following lines should
>> also result in a zero length Unicode string:

>>>> '\xff\xfe\xfe\xff'.decode( "utf16" )
> u'\ufffe'
>>>> '\xff\xfe\xff\xff'.decode( "utf16" )
> u'\uffff'
MAL> Hmm, wouldn't it be better to raise an error ? After all, a
MAL> reversed BOM mark in the stream looks a lot like you're
MAL> trying to decode a UTF-16 stream assuming the wrong byte
MAL> order ?!
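
The two policies discussed here (silently removing noncharacters versus raising an error) could be sketched roughly as follows in present-day Python 3. The names `is_noncharacter` and `decode_utf16_checked` and the `policy` parameter are hypothetical and not part of the codecs module; this is only one possible way to layer the check on top of the existing `utf-16` codec.

```python
def is_noncharacter(ch):
    """Return True for Unicode noncharacters: U+FDD0..U+FDEF and the
    last two code points of every plane (U+xxFFFE and U+xxFFFF)."""
    cp = ord(ch)
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def decode_utf16_checked(data, policy="error"):
    """Hypothetical helper: decode UTF-16 bytes, then either strip
    noncharacters (policy="strip") or raise ValueError (policy="error")."""
    text = data.decode("utf-16")
    if policy == "strip":
        return "".join(ch for ch in text if not is_noncharacter(ch))
    for index, ch in enumerate(text):
        if is_noncharacter(ch):
            # U+FFFE in particular looks like a byte-swapped BOM, which
            # suggests the stream was decoded with the wrong byte order.
            raise ValueError(
                "noncharacter U+%04X at index %d" % (ord(ch), index)
            )
    return text


# The byte strings from the examples quoted above.
print(repr(decode_utf16_checked(b"\xff\xfe\xfe\xff", policy="strip")))
print(repr(decode_utf16_checked(b"\xff\xfe\xff\xff", policy="strip")))
```

With `policy="strip"` both quoted inputs decode to an empty string, as Evan suggests; with the default `policy="error"` they raise `ValueError`, which is closer to MAL's suggestion.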