[Python-Dev] Improve open() to support reading file starting with an unicode BOM

Olemis Lang olemis at gmail.com
Mon Jan 11 19:58:01 CET 2010


On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner <victor.stinner at haypocalc.com> wrote:

Hi,

Builtin open() function is unable to open an UTF-16/32 file starting with a BOM if the encoding is not specified (raise an unicode error). For an UTF-8 file starting with a BOM, read()/readline() returns also the BOM whereas the BOM should be "ignored". [...]

I had similar issues too (please read below ;o) ...

On Thu, Jan 7, 2010 at 7:52 PM, Guido van Rossum <guido at python.org> wrote:

I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy talk. And for the other two, perhaps it would make more sense to have a separate encoding-guessing function that takes a binary stream and returns a text stream wrapping it with the proper encoding?

About guessing the encoding: I ran into this issue while developing a Trac plugin. What I was doing was as follows:

{{{
#!python

>>> from codecs import EncodedFile
>>> mimetype
'utf-16-le'
>>> ef = EncodedFile(f, 'utf-8', mimetype)
}}}

... and I still get the BOM in the first value of the first row in the CSV file.
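A minimal sketch of the behaviour described here, written for Python 3 with an in-memory file instead of a real attachment. The sample data and the variable names are made up for illustration. It suggests the BOM survives when the byte order is named explicitly ('utf-16-le'), while the generic 'utf-16' codec consumes it:

```python
import codecs
import io

# Hypothetical CSV content, stored as UTF-16-LE with a leading BOM.
raw = codecs.BOM_UTF16_LE + 'name,value\n'.encode('utf-16-le')

# Explicit byte order: the codec treats the BOM as an ordinary character
# (U+FEFF), so it comes out re-encoded at the start of the UTF-8 data.
with_bom = codecs.EncodedFile(io.BytesIO(raw), 'utf-8', 'utf-16-le').read()

# Generic 'utf-16': the codec reads the BOM to pick the byte order and
# does not pass it on.
without_bom = codecs.EncodedFile(io.BytesIO(raw), 'utf-8', 'utf-16').read()

print(with_bom.startswith(codecs.BOM_UTF8))  # True
print(without_bom)                           # b'name,value\n'
```

So one workaround, assuming the file is known to start with a BOM, would be to pass 'utf-16' rather than 'utf-16-le' as the file encoding.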

IMO, I am +1 for leaving open() just as it is and using the codecs module to deal with encodings, but I am strongly -1 on returning the BOM when using EncodedFile (mainly because the encoding is explicitly supplied ;o)
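For what it's worth, the separate encoding-guessing function suggested above could look roughly like the sketch below (Python 3). The name open_guessing_bom and the 'utf-8' fallback are my own assumptions, not an existing API, and it only works on seekable binary streams:

```python
import codecs
import io

# Longer BOMs come first: the UTF-32-LE BOM starts with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]


def open_guessing_bom(binary, default='utf-8'):
    """Wrap a seekable binary stream in a text stream.

    The encoding is chosen from the BOM, if there is one; otherwise
    `default` is used. The chosen codecs consume the BOM themselves.
    """
    head = binary.read(4)
    binary.seek(0)  # rewind so the codec sees the BOM too
    encoding = default
    for bom, name in _BOMS:
        if head.startswith(bom):
            encoding = name
            break
    return io.TextIOWrapper(binary, encoding=encoding)


data = codecs.BOM_UTF16_LE + 'a,b\n'.encode('utf-16-le')
print(repr(open_guessing_bom(io.BytesIO(data)).read()))  # 'a,b\n'
```

This keeps open() untouched and leaves the guessing as an explicit, opt-in step.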


CMIIW anyway ...

-- Regards,

Olemis.

Blog ES: http://simelo-es.blogspot.com/
Blog EN: http://simelo-en.blogspot.com/
