[Python-Dev] Quick sum up about open() + BOM

M.-A. Lemburg mal at egenix.com
Sat Jan 9 13:45:58 CET 2010


Victor Stinner wrote:

(2) Check for a BOM while reading or detect it before?

Everybody agrees that checking for a BOM is an interesting option and should not be limited to open(). Marc-Andre proposed a codecs.guessfileencoding() function accepting a file name or a binary file-like object: it returns the encoding and seeks to the file start or to just after the BOM. I dislike this function because it requires extra file operations (open() (optional), read() and seek()) and it doesn't work if the file is not seekable (e.g. a pipe). I prefer to check for a BOM at first read in TextIOWrapper to avoid extra file operations. Note: I implemented the BOM check in TextIOWrapper, so it's already usable for any file-like object.

Yes, but the implementation is limited to just BOM checking and thus only supports UTF-8-SIG, UTF-16 and UTF-32.

With a codecs module function we could easily extend the encoding detection to more file types, e.g. XML files, Python source code files, etc. that use other mechanisms for defining the encoding.
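For Python source files, this kind of non-BOM detection can be illustrated with tokenize.detect_encoding() from the Python 3 stdlib, which reads at most two lines and looks for a BOM or a PEP 263 coding cookie. This is only an illustration of the idea, not the proposed codecs function; the sample bytes are made up:

```python
import io
import tokenize

# Made-up source bytes: a PEP 263 coding cookie on the first line,
# followed by a line holding a Latin-1 encoded e-acute (0xE9).
source = b"# -*- coding: latin-1 -*-\nname = 'caf\xe9'\n"

# detect_encoding() takes a readline callable and returns the encoding
# name plus the raw lines it had to consume to find it.
encoding, consumed_lines = tokenize.detect_encoding(io.BytesIO(source).readline)

text = source.decode(encoding)
```

Note that the function hands back the lines it consumed, so the caller can replay them instead of seeking.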

BTW: I haven't looked at your implementation, but what happens when your BOM check fails ? Will the implementation add the already read bytes back to a buffer ?

This rollback action is the only reason for needing a seekable stream in codecs.guess_stream_encoding().
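A minimal sketch of that rollback, assuming a seekable binary stream. The function name follows the proposal above and is hypothetical; no such function exists in the codecs module, and the default of "utf-8" is an assumption made for the example:

```python
import codecs
import io

# BOM -> codec that decodes the bytes *after* the BOM.
# UTF-32 comes first because BOM_UTF32_LE starts with BOM_UTF16_LE.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def guess_stream_encoding(stream, default="utf-8"):
    """Hypothetical helper: detect a BOM, leave the stream just after it.

    If no BOM is found, the bytes already read are "put back" by
    seeking to the original position, which is why the stream must be
    seekable.
    """
    start = stream.tell()
    head = stream.read(4)  # the longest BOM is 4 bytes
    for bom, name in _BOMS:
        if head.startswith(bom):
            stream.seek(start + len(bom))  # skip only the BOM
            return name
    stream.seek(start)  # rollback: no BOM, nothing consumed
    return default


stream = io.BytesIO(codecs.BOM_UTF16_BE + "hi".encode("utf-16-be"))
encoding = guess_stream_encoding(stream)
text = stream.read().decode(encoding)
```

On a pipe, tell() and seek() raise an error, so a variant for non-seekable streams would have to keep the bytes it read in a buffer and hand them to the decoder first.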

Another point to consider:

AFAIK, we currently have a moratorium on changes to Python builtins. How does that match up with the proposed changes ?

Using a new codec, as Walter suggested, would move the implementation into the stdlib, to which the moratorium doesn't apply.

-- Marc-Andre Lemburg eGenix.com



