[Python-Dev] Improve open() to support reading file starting with an unicode BOM

Guido van Rossum guido at python.org
Fri Jan 8 16:56:46 CET 2010

On Fri, Jan 8, 2010 at 1:05 AM, "Martin v. Löwis" <martin at v.loewis.de> wrote:

It is crazy, but unfortunately rather common. Wikipedia has a good description of the issues: <http://en.wikipedia.org/wiki/UTF-8#Byte-ordermark>. Basically, some Windows text APIs will emit a UTF-8 "BOM" in order to identify the file as being UTF-8, so it's become a convention to do that. That's not good enough, so you need to guess the encoding as well to make sure, but if there is a BOM and you can otherwise verify that the file is probably UTF-8 encoded, you should discard it.

That doesn't make sense. If the file isn't UTF-8 you can't see the BOM, because the BOM itself is UTF-8-encoded.

I think what Glyph meant is this: if a file starts with the UTF-8 signature, assume it's UTF-8. Then validate that assumption against the rest of the file, and process it as UTF-8 if it holds. If the rest clearly is not UTF-8, assume that the UTF-8 signature is bogus.

I understood this proposal as a general processing guideline, not something the io library should do (but, say, a text editor).

FWIW, I'm personally in favor of using the UTF-8 signature. If people consider it crazy talk, that may be because UTF-8 can't possibly have a byte order - hence I call it a signature, not the BOM. As a signature, I don't consider it crazy at all. There is a long tradition of having magic bytes in files (executable files, PostScript, PDF, ... - see /etc/magic). Having a magic byte sequence for plain text to denote the encoding is useful and helps reduce mojibake. This is the reason it's used on Windows: Notepad would normally assume that text is in the ANSI code page, and for compatibility, it can't stop doing that. So the UTF-8 signature gives them an exit strategy.
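The guideline described above (trust the signature only if the rest of the data validates as UTF-8) could be sketched roughly as follows. This is only an illustration for application code such as a text editor, not something proposed for the io library; the helper name `decode_with_signature_check` and the `cp1252` fallback (standing in for "the ANSI code page") are assumptions, not part of the thread.

```python
import codecs


def decode_with_signature_check(data: bytes, fallback: str = "cp1252") -> str:
    """Hypothetical helper: honor the UTF-8 signature only if the rest validates.

    `fallback` is an assumed stand-in for the platform's legacy code page.
    """
    if data.startswith(codecs.BOM_UTF8):
        body = data[len(codecs.BOM_UTF8):]
        try:
            # Strict decoding doubles as validation of the remaining bytes.
            return body.decode("utf-8")
        except UnicodeDecodeError:
            # The rest is clearly not UTF-8, so treat the signature as bogus
            # and keep the leading bytes as ordinary data.
            pass
    return data.decode(fallback)
```

For example, `decode_with_signature_check(codecs.BOM_UTF8 + "héllo".encode("utf-8"))` yields the text without the signature, while input that starts with the same three bytes but continues with invalid UTF-8 is decoded whole using the fallback.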

Sure. I said "crazy talk" only to stir up discussion. Which worked. :-)

Also, I don't want Python's default behavior to change -- sniffing the encoding should be a separate option.
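As a small illustration of what "a separate option" can look like in practice: the standard `utf-8-sig` codec is one existing opt-in that strips a leading signature when the caller asks for it, while plain `utf-8` leaves it in the text. This sketch uses Python 3's `open()`; the file name is made up for the example, and `utf-8-sig` only handles the signature, it does not sniff other encodings.

```python
import codecs
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file that starts with the UTF-8 signature.
    path = os.path.join(tmp, "example.txt")
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + "héllo".encode("utf-8"))

    # Explicit plain UTF-8: the signature survives as U+FEFF.
    with open(path, encoding="utf-8") as f:
        default_text = f.read()

    # Explicit opt-in: the signature is consumed by the codec.
    with open(path, encoding="utf-8-sig") as f:
        opt_in_text = f.read()
```

Here `default_text` begins with `"\ufeff"` and `opt_in_text` does not; the caller chooses the behavior by naming the codec.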

--
--Guido van Rossum (python.org/~guido)
