[Python-Dev] Improve open() to support reading file starting with an unicode BOM

Olemis Lang olemis at gmail.com
Mon Jan 11 22:29:38 CET 2010


Probably one part of this is OT, but I think it could complement the discussion ;o)

On Mon, Jan 11, 2010 at 3:44 PM, M.-A. Lemburg <mal at egenix.com> wrote:

Olemis Lang wrote:

On Thu, Jan 7, 2010 at 4:10 PM, Victor Stinner <victor.stinner at haypocalc.com> wrote:

Hi,

Builtin open() function is unable to open an UTF-16/32 file starting with a BOM if the encoding is not specified (raise an unicode error). For an UTF-8 file starting with a BOM, read()/readline() returns also the BOM whereas the BOM should be "ignored".

[...]

I had similar issues too (please read below ;o) ...

On Thu, Jan 7, 2010 at 7:52 PM, Guido van Rossum <guido at python.org> wrote:

I'm a little hesitant about this. First of all, UTF-8 + BOM is crazy talk. And for the other two, perhaps it would make more sense to have a separate encoding-guessing function that takes a binary stream and returns a text stream wrapping it with the proper encoding?

About guessing the encoding, I experienced this issue while I was developing a Trac plugin. What I was doing is as follows:

- I guessed the MIME type + charset encoding using Trac MIME API (it was a CSV file encoded using UTF-16)
- I read the file using open
- Then wrapped the file using codecs.EncodedFile
- Then used csv.reader

... and still get the BOM in the first value of the first row in the CSV file.

You didn't say, but I presume that the charset guessing logic returned either 'utf-16-le' or 'utf-16-be'

Yes. In fact they return the full mimetype 'text/csv; charset=utf-16-le' ... ;o)

- those encodings don't remove the leading BOM.

... and they should ?

The 'utf-16' codec will remove the BOM.
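To illustrate the difference between the endian-specific codec and the generic one, here is a minimal sketch. It assumes Python 3 and uses a made-up in-memory byte string in place of the real CSV file; the variable names are only for illustration.

```python
import codecs

# Hypothetical sample data: a UTF-16 little-endian BOM followed by "a,b".
data = codecs.BOM_UTF16_LE + "a,b".encode("utf-16-le")

# The endian-specific codec treats the BOM as an ordinary character (U+FEFF),
# so it shows up at the start of the decoded text.
with_bom = data.decode("utf-16-le")

# The generic codec uses the BOM to pick the byte order and then drops it.
without_bom = data.decode("utf-16")

print(repr(with_bom))
print(repr(without_bom))
```

This matches the symptom described above: with 'utf-16-le' the first field of the first CSV row would start with U+FEFF.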

In this particular case there's nothing I can do, I have to process whatever charset is detected in the input ;o)

{{{
#!python

>>> mimetype
'utf-16-le'
>>> ef = EncodedFile(f, 'utf-8', mimetype)
}}}

Same here: the UTF-8 codec will not remove the BOM, you have to use the 'utf-8-sig' codec for that.

IMO I think I am +1 for leaving open just like it is, and use module codecs to deal with encodings, but I am strongly -1 for returning the BOM while using EncodedFile (mainly because encoding is explicitly supplied in ;o)

Note that EncodedFile() doesn't do any fancy BOM detection or filtering.

... directly.

This is the job of the codecs.
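A small sketch of what "the job of the codecs" looks like for UTF-8, assuming Python 3 and an invented sample string (not data from the thread): the plain 'utf-8' codec keeps the BOM, while 'utf-8-sig' strips it and also accepts input that has no BOM at all.

```python
import codecs

# Hypothetical sample data: a UTF-8 BOM followed by a CSV header.
payload = "name,value".encode("utf-8")
data = codecs.BOM_UTF8 + payload

# Plain UTF-8 decoding keeps the BOM as the character U+FEFF.
kept = data.decode("utf-8")

# The 'utf-8-sig' codec removes a leading BOM if one is present...
stripped = data.decode("utf-8-sig")

# ...and decodes BOM-less input unchanged.
no_bom = payload.decode("utf-8-sig")

print(repr(kept), repr(stripped), repr(no_bom))
```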

OK ... to come back to the scope of the subject, in the general case, I think that the BOM (if any) should be handled by codecs, and therefore indirectly by EncodedFile. If that's an explicit way of working with encodings, I'd prefer to use that wrapper explicitly in order to (encode | decode) the file and let the codec detect whether there's a BOM or not and «adjust» tell, read and everything else in that wrapper (instead of open).
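As a rough sketch of how the choice of codec decides what EncodedFile returns, the following assumes Python 3, with io.BytesIO standing in for the real file object and an invented two-column sample. EncodedFile itself does no BOM handling here; the only thing that changes between the two calls is the file encoding passed to it.

```python
import codecs
import io

# Hypothetical sample data: UTF-16-LE BOM followed by one CSV line.
raw = codecs.BOM_UTF16_LE + "x,y\n".encode("utf-16-le")

# File encoding 'utf-16-le': the BOM is decoded as U+FEFF and then
# re-encoded, so it comes back as the three UTF-8 BOM bytes.
leaky = codecs.EncodedFile(io.BytesIO(raw), "utf-8", "utf-16-le").read()

# File encoding 'utf-16': the codec consumes the BOM while detecting the
# byte order, so the transcoded output starts with the actual data.
clean = codecs.EncodedFile(io.BytesIO(raw), "utf-8", "utf-16").read()

print(leaky)
print(clean)
```

One possible workaround when the detected charset is 'utf-16-le' or 'utf-16-be' and the data is known to start with a BOM is to pass 'utf-16' instead; whether that is acceptable depends on the application.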

Also note that BOM removal is only valid at the beginning of a file. All subsequent BOM-bytes have to be read as-is (they map to a zero-width non-breaking space) - without removing them.
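A short sketch of that rule, assuming Python 3 and a made-up string containing U+FEFF in the middle: the BOM-aware codecs drop only the leading mark and leave later occurrences in place.

```python
import codecs

# Hypothetical text with a U+FEFF character in the middle.
text = "a\ufeffb"

# Prepend a real BOM, as a file writer would.
utf16_data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
utf8_data = codecs.BOM_UTF8 + text.encode("utf-8")

# Only the leading BOM is consumed; the inner U+FEFF is returned as-is.
from_utf16 = utf16_data.decode("utf-16")
from_utf8 = utf8_data.decode("utf-8-sig")

print(repr(from_utf16), repr(from_utf8))
```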

;o)

-- Regards,

Olemis.

Blog ES: http://simelo-es.blogspot.com/ Blog EN: http://simelo-en.blogspot.com/

Featured article: Test cases for custom query (i.e report 9) ... PASS (1.0.0) - http://simelo.hg.sourceforge.net/hgweb/simelo/trac-gviz/rev/d276011e7297


