[Python-Dev] Improve open() to support reading file starting with an unicode BOM

James Y Knight foom at fuhm.net
Fri Jan 8 22:49:23 CET 2010


On Jan 8, 2010, at 4:14 PM, Tres Seaver wrote:

I understood this proposal as a general processing guideline, not something the io library should do (but, say, a text editor).

FWIW, I'm personally in favor of using the UTF-8 signature. If people consider them crazy talk, that may be because UTF-8 can't possibly have a byte order - hence I call it a signature, not the BOM. As a signature, I don't consider it crazy at all. There is a long tradition of having magic bytes in files (executable files, PostScript, PDF, ... - see /etc/magic). Having a magic byte sequence for plain text to denote the encoding is useful and helps reduce mojibake. This is the reason it's used on Windows: Notepad would normally assume that text is in the ANSI code page, and for compatibility, it can't stop doing that. So the UTF-8 signature gives them an exit strategy.

Agreed. Having that marker at the start of the file makes interop with other tools much easier.

Putting the BOM at the beginning of UTF-8 text files is not a good
idea: it makes interop much worse on a Unix system, not better.
Without the BOM, most commands do the right thing with UTF-8 text.
E.g. to concatenate two files:

$ cat file-1 file-2 > file-3

With a BOM at the beginning of the file, it won't work right. Of
course, you could modify "cat" (and every other stream processing
command) to know how to consume and emit BOMs, and omit the extra one
that would show up in the middle of the stream...but even that can't
work; what about:

$ (cat file-1; cat file-2) > file-3

Should the shell now know that when you run multiple commands, it
should eat the BOM emitted from the second command?
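
A rough Python sketch of the concatenation problem may help here. It treats "cat" as plain byte-wise concatenation; the file contents are made up for illustration, and the variable names only mirror the file-1, file-2 and file-3 of the example above.

```python
import codecs

# Hypothetical contents of file-1 and file-2, each saved with a UTF-8 signature.
file_1 = codecs.BOM_UTF8 + "first\n".encode("utf-8")
file_2 = codecs.BOM_UTF8 + "second\n".encode("utf-8")

# Byte-wise concatenation, which is all that cat does.
file_3 = file_1 + file_2

# Decoding with "utf-8-sig" strips only the leading signature;
# the second one survives as U+FEFF in the middle of the text.
text = file_3.decode("utf-8-sig")
print(repr(text))
```

A reader that only knows to skip a signature at the start of the input still ends up with a stray U+FEFF before the second file's content.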

Basically, using a BOM in a UTF-8 file is just not a good idea: it
completely ruins interop with every standard Unix tool.

This is not to say that Python shouldn't have a way to read a file
with a UTF-8 BOM: it just shouldn't encourage you to write such files.
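
As a minimal sketch of that split (tolerate the BOM on input, don't produce it on output), assuming Python 3's open() and the existing "utf-8-sig" codec; the file name and contents are hypothetical:

```python
import codecs
import os
import tempfile

# Hypothetical file that some other tool saved with a UTF-8 signature.
path = os.path.join(tempfile.mkdtemp(), "with-bom.txt")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + "hello\n".encode("utf-8"))

# "utf-8-sig" skips a leading signature if present and is harmless if absent.
with open(path, encoding="utf-8-sig") as f:
    content = f.read()

# Writing back with plain "utf-8" emits no signature.
out_path = path + ".out"
with open(out_path, "w", encoding="utf-8") as f:
    f.write(content)

# Read the raw bytes back to confirm nothing was prepended.
with open(out_path, "rb") as f:
    raw = f.read()
```

Here the decoded text carries no U+FEFF, and the rewritten file starts directly with the text bytes.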

James


