[Python-Dev] Improve open() to support reading file starting with an unicode BOM
Glyph Lefkowitz glyph at twistedmatrix.com
Fri Jan 8 04:34:36 CET 2010
On Jan 7, 2010, at 7:52 PM, Guido van Rossum wrote:
> On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner <victor.stinner at haypocalc.com> wrote:
>> Hi,
>>
>> The builtin open() function is unable to open a UTF-16/32 file starting with a BOM if the encoding is not specified (it raises a Unicode error). For a UTF-8 file starting with a BOM, read()/readline() also return the BOM, whereas the BOM should be "ignored".
>
> I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy talk. And for the other two, perhaps it would make more sense to have a separate encoding-guessing function that takes a binary stream and returns a text stream wrapping it with the proper encoding?
It is crazy, but unfortunately rather common. Wikipedia has a good description of the issues: <http://en.wikipedia.org/wiki/UTF-8#Byte-order_mark>. Basically, some Windows text APIs emit a UTF-8 "BOM" to identify the file as UTF-8, so doing that has become a convention. The BOM alone is not good enough, so you still need to guess the encoding to make sure; but if there is a BOM and you can otherwise verify that the file is probably UTF-8 encoded, you should discard the BOM.
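
To make the idea of a separate helper concrete, here is a rough sketch of a function that takes a binary stream and returns a text stream wrapping it. The name open_text_guessing_bom and its default parameter are made up for illustration and are not an existing or proposed API. It only looks at the BOM and does no further guessing or verification, and it assumes the stream is seekable.

```python
import codecs
import io

# Check the 4-byte UTF-32 BOMs first: the UTF-32 LE BOM starts with
# the same two bytes as the UTF-16 LE BOM.
# The "utf-16", "utf-32" and "utf-8-sig" codecs consume the BOM themselves.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def open_text_guessing_bom(binary_stream, default="utf-8"):
    """Hypothetical helper: pick an encoding from the BOM, if any.

    Assumes binary_stream is seekable. Falls back to `default` when
    no BOM is found. Note the ambiguity: UTF-16 LE text that begins
    with U+0000 is indistinguishable from a UTF-32 LE BOM.
    """
    start = binary_stream.tell()
    head = binary_stream.read(4)
    binary_stream.seek(start)  # rewind so the decoder sees the BOM too

    encoding = default
    for bom, name in _BOMS:
        if head.startswith(bom):
            encoding = name
            break
    return io.TextIOWrapper(binary_stream, encoding=encoding)
```

Usage would look something like open_text_guessing_bom(open(path, "rb")).read(). Keeping this outside of open() leaves the builtin's behaviour unchanged for callers who pass no encoding, and for the UTF-8 case alone, passing encoding="utf-8-sig" to open() already discards a leading BOM.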