[Python-Dev] Improve open() to support reading file starting with an unicode BOM

M.-A. Lemburg mal at egenix.com
Fri Jan 8 22:51:26 CET 2010


Tres Seaver wrote:

> M.-A. Lemburg wrote:
>> Shouldn't this encoding guessing be a separate function that you call
>> on either a file or a seekable stream ? After all, detecting encodings
>> is just as useful to have for non-file streams.
>
> Other stream sources typically have out-of-band ways to signal the
> encoding: only when reading from the filesystem do we pretty much have
> to guess, and in that case the BOM / signature is the best heuristic we
> have. Also, some non-file streams are not seekable, and so can't be
> guessed via a pre-pass.

Sure, there are non-seekable file streams, but at least when using StringIO-type streams you don't have that restriction.

An encoding detection function would be useful in other cases as well, so instead of hiding the functionality away in the open() constructor, I'm suggesting to expose it via the codecs module.

Applications can then use it where necessary and also provide their own defaults, using other heuristics.

You'd then avoid having to stuff everything into a single function call and also open the door to more complex application-specific guesswork or defaults.

>> The whole process would then have two steps:
>>
>> 1. guess encoding
>>
>>    import codecs
>>    encoding = codecs.guessfileencoding(filename)
>
> Filename is not enough information: or do you mean that API to actually
> open the stream?

Yes. The API would open the file, guess the encoding and then close it again. If you don't want that to happen, you could use the second API I mentioned below on the already open file.

Note that this function could detect a lot more encodings with reasonably high probability than just BOM-prefixed ones, e.g. we could also add support to detect encoding declarations such as the ones we use in Python source files.
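
To make the file-based step concrete, here is a minimal sketch of BOM-only detection, assuming Python 3. The names guess_bom_encoding and guess_file_encoding are hypothetical stand-ins for the proposed function, which is a suggestion in this thread and not something the codecs module provides; only the codecs.BOM_* constants are real. Detection of encoding declarations is left out.

```python
import codecs

# Longer signatures come first: the UTF-32 LE BOM begins with the same
# two bytes as the UTF-16 LE BOM. The codec names chosen here ("utf-8-sig",
# "utf-16", "utf-32") consume the BOM themselves while decoding.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_bom_encoding(data, default=None):
    """Return a codec name if data starts with a known BOM, else default."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return default


def guess_file_encoding(filename, default=None):
    """Hypothetical helper: open the file, inspect its first bytes, close it."""
    with open(filename, "rb") as f:
        head = f.read(4)  # the longest BOM is four bytes
    return guess_bom_encoding(head, default)
```

The default argument is where an application could plug in its own fallback, for example open(filename, encoding=guess_file_encoding(filename, default="latin-1")).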

>> 2. open the file with the found encoding
>>
>>    f = open(filename, encoding=encoding)
>>
>> For seekable streams f, you'd have:
>>
>> 1. guess encoding
>>
>>    import codecs
>>    encoding = codecs.guessstreamencoding(f)

I forgot to mention: This API needs to position the file pointer to the start of the first data byte.

>> 2. wrap the stream with a reader for the found encoding
>>
>>    readerclass = codecs.getreader(encoding)
>>    g = readerclass(f)
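
The two stream steps could look roughly like the sketch below, assuming Python 3 and a seekable binary stream. guess_stream_encoding is a hypothetical name for the proposed API, and the "utf-8" fallback is an assumption of this example. Because the function leaves the file pointer on the first data byte, it returns endian-specific codec names that do not expect a BOM; codecs.getreader() is the existing call used for the wrapping step.

```python
import codecs
import io

# UTF-32 signatures are tested before UTF-16 because the UTF-32 LE BOM
# starts with the UTF-16 LE BOM. The codec names do not consume a BOM,
# since the function below skips it on the stream.
_STREAM_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_stream_encoding(stream, default="utf-8"):
    """Hypothetical helper: guess from a BOM and position the stream
    at the first data byte (just past the BOM, or back at the start)."""
    start = stream.tell()
    head = stream.read(4)
    for bom, name in _STREAM_BOMS:
        if head.startswith(bom):
            stream.seek(start + len(bom))
            return name
    stream.seek(start)  # no BOM found: rewind to where we began
    return default


# Step 1: guess the encoding; step 2: wrap the stream with a reader.
f = io.BytesIO(codecs.BOM_UTF16_BE + "text".encode("utf-16-be"))
encoding = guess_stream_encoding(f)
readerclass = codecs.getreader(encoding)
g = readerclass(f)
```

Reading from g then yields decoded text without the BOM character, because the underlying stream was already advanced past it.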

-- Marc-Andre Lemburg eGenix.com

