[Python-Dev] Improve open() to support reading file starting with an unicode BOM
Tres Seaver tseaver at palladion.com
Fri Jan 8 07:12:12 CET 2010
Guido van Rossum wrote:
> On Thu, Jan 7, 2010 at 7:34 PM, Glyph Lefkowitz <glyph at twistedmatrix.com> wrote:
>> On Jan 7, 2010, at 7:52 PM, Guido van Rossum wrote:
>>> On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner <victor.stinner at haypocalc.com> wrote:
>>>> Hi,
>>>>
>>>> Builtin open() function is unable to open an UTF-16/32 file starting with a BOM if the encoding is not specified (raise an unicode error). For an UTF-8 file starting with a BOM, read()/readline() returns also the BOM whereas the BOM should be "ignored".
>>>
>>> I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy talk. And for the other two, perhaps it would make more sense to have a separate encoding-guessing function that takes a binary stream and returns a text stream wrapping it with the proper encoding?
>>
>> It is crazy, but unfortunately rather common. Wikipedia has a good description of the issues: <http://en.wikipedia.org/wiki/UTF-8#Byte-ordermark>. Basically, some Windows text APIs will emit a UTF-8 "BOM" in order to identify the file as being UTF-8, so it's become a convention to do that. That's not good enough, so you need to guess the encoding as well to make sure, but if there is a BOM and you can otherwise verify that the file is probably UTF-8 encoded, you should discard it.
>
> That doesn't make sense. If the file isn't UTF-8 you can't see the BOM, because the BOM itself is UTF-8-encoded. (And yes, I know this happens. Doesn't mean we need to auto-guess by default; there are lots of issues e.g. what should happen after seeking to offset 0?)
The BOM should not be seekable if the file is opened with the proposed "guess encoding from BOM" mode: it isn't properly part of the stream at all in that case.
A UTF-8 BOM is an absurdity, but it exists everywhere in the wild: Python would do well to make it as easy as possible to consume such files, as well as the non-insane versions (UTF-16 / UTF-32 BOMs). In the best of all possible worlds, I would just try opening the file so:
f = open('/path/to/file', 'r', encoding="DWIFM")
and any BOM present would set the encoding for the remainder of the stream.
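For illustration only, here is a rough sketch of how the behaviour discussed in this thread could be approximated today with the existing `codecs` and `io` modules. The names `sniff_bom` and `open_sniffing_bom` are hypothetical helpers, not part of any proposal or of the standard library, and the UTF-8 fallback for BOM-less files is an assumption. The sketch peeks at the first bytes of a binary stream, picks a BOM-aware codec ("utf-8-sig", "utf-16" or "utf-32"), and wraps the stream in a text layer, roughly the separate encoding-guessing function suggested in the quoted text.

```python
import codecs
import io

# Check the longer BOMs first: the UTF-32-LE BOM (FF FE 00 00) begins with
# the UTF-16-LE BOM (FF FE), so the order of this table matters.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def sniff_bom(head, default="utf-8"):
    """Return a BOM-consuming codec name for the leading bytes `head`.

    Hypothetical helper: `default` is used when no BOM is found.
    """
    for bom, encoding in _BOMS:
        if head.startswith(bom):
            return encoding
    return default


def open_sniffing_bom(path, default="utf-8"):
    """Open `path` for reading text, choosing the encoding from a BOM if any.

    Hypothetical helper. The chosen codecs strip the BOM themselves, so the
    BOM never shows up in the decoded text.
    """
    raw = open(path, "rb")  # buffered binary reader, supports peek()
    try:
        head = raw.peek(4)[:4]  # look ahead without consuming any bytes
        return io.TextIOWrapper(raw, encoding=sniff_bom(head, default))
    except Exception:
        raw.close()
        raise
```

Because the codec, not the helper, consumes the BOM, the BOM is not part of the text stream. One caveat of this sketch: a UTF-16-LE file whose first character is U+0000 would be mistaken for UTF-32-LE.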
Tres.
--
Tres Seaver +1 540-429-0999 tseaver at palladion.com
Palladion Software "Excellence by Design" http://palladion.com