[Python-Dev] Improve open() to support reading file starting with an unicode BOM

Tres Seaver tseaver at palladion.com
Fri Jan 8 22:14:59 CET 2010



Martin v. Löwis wrote:

It is crazy, but unfortunately rather common. Wikipedia has a good description of the issues: <http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark>. Basically, some Windows text APIs will emit a UTF-8 "BOM" in order to identify the file as being UTF-8, so it's become a convention to do that. That's not good enough, so you need to guess the encoding as well to make sure, but if there is a BOM and you can otherwise verify that the file is probably UTF-8 encoded, you should discard it.

That doesn't make sense. If the file isn't UTF-8 you can't see the BOM, because the BOM itself is UTF-8-encoded.

I think what Glyph meant is this: if a file starts with the UTF-8 signature, assume it's UTF-8. Then validate the assumption against the rest of the file also, and then process it as UTF-8. If the rest clearly is not UTF-8, assume that the UTF-8 signature is bogus.
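As a rough illustration of the guideline described above (for a tool such as a text editor, not for the io library), here is a minimal Python sketch. The function name `looks_like_signed_utf8` is hypothetical and does not come from the thread; the sketch assumes the whole file content is already in memory as bytes.

```python
import codecs


def looks_like_signed_utf8(data: bytes) -> bool:
    """Return True if data starts with the UTF-8 signature AND the rest
    decodes cleanly as UTF-8. Hypothetical helper, for illustration only."""
    if not data.startswith(codecs.BOM_UTF8):
        # No signature at all: this check has nothing to say.
        return False
    try:
        # Validate the assumption against the rest of the content.
        data[len(codecs.BOM_UTF8):].decode("utf-8")
    except UnicodeDecodeError:
        # The rest is clearly not UTF-8, so treat the signature as bogus.
        return False
    return True
```

A caller that gets `False` would fall back to whatever other encoding guess it has; what that fallback should be is outside the scope of this sketch.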

If the programmer opens the file using a "guess using the BOM" encoding, Python should not attempt to verify that the file is properly encoded: it should check for (and consume) any BOM, and then return a stream which uses the encoding inferred from the BOM. Any errors should be handled later, when characters are read, exactly as if the file had been opened with the same encoding guessed from the BOM.
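To make that behaviour concrete, here is a minimal sketch, not an actual or proposed io API. The helper name `open_guessing_bom` and its `default` parameter are assumptions made for illustration. It checks for a BOM, consumes it, and returns a text stream in the inferred encoding without validating the rest of the file, so any decoding error surfaces only when characters are read.

```python
import codecs
import io

# Longer signatures first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def open_guessing_bom(path, default="utf-8"):
    """Open path for reading text, inferring the encoding from a leading BOM.

    Hypothetical helper: no verification of the file body is done here.
    """
    raw = open(path, "rb")
    head = raw.read(4)  # the longest BOM is 4 bytes
    for bom, encoding in _BOMS:
        if head.startswith(bom):
            raw.seek(len(bom))  # consume the BOM and nothing else
            break
    else:
        encoding = default
        raw.seek(0)  # no BOM: start from the first byte
    return io.TextIOWrapper(raw, encoding=encoding)
```

With this sketch, a file that starts with the UTF-8 signature but contains invalid bytes afterwards still opens successfully; the `UnicodeDecodeError` is raised by the later `read()`, just as it would be if the file had been opened with `encoding="utf-8"` directly.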

I understood this proposal as a general processing guideline, not something the io library should do (but, say, a text editor).

FWIW, I'm personally in favor of using the UTF-8 signature. If people consider it crazy talk, that may be because UTF-8 can't possibly have a byte order - hence I call it a signature, not the BOM. As a signature, I don't consider it crazy at all. There is a long tradition of having magic bytes in files (executable files, PostScript, PDF, ... - see /etc/magic). Having a magic byte sequence for plain text to denote the encoding is useful and helps reduce mojibake. This is the reason it's used on Windows: Notepad would normally assume that text is in the ANSI code page, and for compatibility, it can't stop doing that. So the UTF-8 signature gives them an exit strategy.

Agreed. Having that marker at the start of the file makes interop with other tools much easier.
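For reference, Python's standard "utf-8-sig" codec already treats the three bytes as a signature: it writes them when encoding and drops them, if present, when decoding. A small sketch (the sample string is arbitrary):

```python
import codecs

text = "naïve"

# Encoding with utf-8-sig prepends the signature bytes EF BB BF.
signed = text.encode("utf-8-sig")
has_signature = signed.startswith(codecs.BOM_UTF8)

# Decoding with utf-8-sig strips the signature when it is there...
roundtrip = signed.decode("utf-8-sig")

# ...and accepts unsigned UTF-8 as well.
unsigned_ok = text.encode("utf-8").decode("utf-8-sig")

# Plain utf-8 decoding keeps the signature as a U+FEFF character.
leaked = signed.decode("utf-8")
```

The last line is the interop problem in miniature: a reader that does not know about the signature ends up with a stray U+FEFF at the start of its text.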

Tres.
--

Tres Seaver          +1 540-429-0999          tseaver at palladion.com
Palladion Software   "Excellence by Design"    http://palladion.com



