[Python-Dev] Unicode byte order mark decoding

Stephen J. Turnbull stephen at xemacs.org
Fri Apr 8 04:22:50 CEST 2005


"MvL" == "Martin v. Löwis" <martin at v.loewis.de> writes:

MvL> This would also support your use case, and in a better way.
MvL> The Unicode assertion that UTF-16 is BE by default is void
MvL> these days - there is *always* a higher layer protocol, and
MvL> it more often than not specifies (perhaps not in English
MvL> words, but only in the source code of the generator) that the
MvL> default should be LE.

That is not a protocol. A protocol is a published specification, not merely a frequent accident of implementation. Anyway, both ISO 10646 and the Unicode standard consider that "internal use" and there is no requirement at all placed on those data. And such generators typically take great advantage of that freedom---have you looked in a .doc file recently? Have you noticed how many different options (previous implementations) of .doc are offered in the Import menu?

"MAL" == "M.-A. Lemburg" <mal at egenix.com> writes:

MAL> I've checked the various versions of the Unicode standard
MAL> docs: it seems that the quote you have was silently
MAL> introduced between 3.0 and 4.0.

Probably because ISO 10646 was always BE until the standards were unified. But note that ISO 10646 standardizes only use as a communications medium. Neither ISO 10646 nor Unicode makes any specification about internal usage. Conformance in internal processing is a matter of the programmer's convenience in producing conforming output.

MAL> Python currently uses version 3.2.0 of the standard and I
MAL> don't think enough people are aware of the change in the
MAL> standard

There's only one (corporate) person that matters: Microsoft.

MAL> By the time we switch to 4.1 or later, we can then make the
MAL> change in the native UTF-16 codec as you requested.

While in principle I sympathize with Nick, pragmatically Microsoft is unlikely to conform. They will take the position that files created by Windows are "internal" to the Windows environment, except where explicitly intended for exchange with arbitrary platforms, and only then will they conform. As Martin points out, that is what really matters for these defaults. I think you should look to see what Microsoft does.

MAL> Personally, I think that the Unicode consortium should not
MAL> have introduced a default for the UTF-16 encoding byte
MAL> order. Using big endian as default in a world where most
MAL> Unicode data is created on little endian machines is not very
MAL> realistic either.

It's not a default for the UTF-16 encoding byte order. It's a default for the UTF-16 encoding byte order when UTF-16 is a communications medium. Given that the generic network byte order is big-endian, I think it would be insane to specify little-endian as Unicode's default.
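As a minimal sketch of that rule on the reading side, assuming Python and a hypothetical helper name decode_utf16_interchange (neither comes from the standard): honor a BOM if one is present, and otherwise treat the data as big-endian.

```python
import codecs


def decode_utf16_interchange(data):
    """Decode UTF-16 bytes that arrived from outside the process.

    Hypothetical helper for illustration only.
    """
    # An explicit BOM settles the byte order; strip it before decoding.
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    # No BOM: fall back to big-endian, the default for interchange.
    return data.decode("utf-16-be")
```

Data that never leaves the process is not covered by this; there the byte order is whatever is convenient.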

With Unicode's default the same as network byte order, you represent UTF-16 strings internally as an array of uint16_t, and when you put them on the wire (including saving them to a file that might be put on the wire as octet-stream) you apply htons(3) to each element. On reading, you apply ntohs(3) to each element. The source code is portable, the file is portable. How can you beat that?
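A rough illustration of that round trip, written in Python rather than C and assuming hypothetical helper names to_wire and from_wire. socket.htons and socket.ntohs wrap the C functions, and the "=H" struct format stands in for a native uint16_t.

```python
import socket
import struct


def to_wire(units):
    """Serialize UTF-16 code units (ints in 0..0xFFFF) for the wire.

    Hypothetical helper: htons() puts each unit in network order, then
    the native-order pack writes it out, giving big-endian bytes on
    both little-endian and big-endian hosts.
    """
    return b"".join(struct.pack("=H", socket.htons(u)) for u in units)


def from_wire(data):
    """Read big-endian UTF-16 bytes back into host-order code units."""
    if len(data) % 2:
        raise ValueError("UTF-16 data must have an even number of bytes")
    return [
        socket.ntohs(struct.unpack_from("=H", data, i)[0])
        for i in range(0, len(data), 2)
    ]
```

For example, to_wire([0x0041, 0x20AC]) gives the bytes 00 41 20 AC whatever the host byte order, and from_wire() on those bytes returns the original code units.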

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.


