[Python-ideas] TextIOWrapper callable encoding parameter

Rurpy rurpy at yahoo.com
Wed Jun 13 17:46:01 CEST 2012


On 06/11/2012 10:24 AM, Stephen J. Turnbull wrote:

> Nick Coghlan writes:
>
> > Immediate thought: it seems like it would be easier to offer a way to
> > inject data back into a buffered IO object's internal buffer.
>
> ungetch()?

What would the TextIOWrapper API for that look like?

> If you're only interested in the top of the file (see below), I would
> suggest allowing only one bufferful, and then simply rewinding the
> buffer pointer once you're done. This is one strategy used by Emacsen
> for encoding detection (for the reason pointed out by Rurpy: not all
> streams are rewindable).
>
> But is that really "easier"? It might be more general, but you still
> need to reinitialize the encoding (ie, from the trivial "binary" to
> whatever is detected), with all the hair that comes with that.

I don't think there is any hair involved. In at least the _pyio version of TextIOWrapper, initializing the encoding (in the read path) consists of calling self._get_decoder(). One needs to move the few places where that is called now to nearby places that are after the raw buffer has been read but before it is decoded. Some consideration may need to be given to raising errors at the old locations when the callable encoding hook is not being used (to maintain complete backwards compatibility; not sure that is necessary), but I wouldn't call that hairy. Of course there may be other factors I am missing...
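
To make the ordering concrete, here is a minimal sketch of the idea. It is not the _pyio code: LazyDecodingReader and sniff_bom are hypothetical names, and the class is a deliberately stripped-down stand-in for TextIOWrapper's read path. The only point it illustrates is that the decoder is created after the first raw chunk has been read but before that chunk is decoded, so a callable encoding gets to look at the bytes first.

```python
import codecs
import io


class LazyDecodingReader:
    """Hypothetical, simplified stand-in for TextIOWrapper's read path."""

    def __init__(self, buffer, encoding, chunk_size=8192):
        self.buffer = buffer          # binary stream with a read() method
        self.encoding = encoding      # str, or callable(bytes) -> str
        self.chunk_size = chunk_size
        self._decoder = None

    def _get_decoder(self, first_chunk):
        # Runs after the raw bytes are read but before they are decoded.
        if callable(self.encoding):
            self.encoding = self.encoding(first_chunk)
        factory = codecs.getincrementaldecoder(self.encoding)
        self._decoder = factory(errors="strict")

    def read(self):
        parts = []
        while True:
            chunk = self.buffer.read(self.chunk_size)
            if self._decoder is None:
                self._get_decoder(chunk)
            # An empty chunk means EOF, so flush the decoder with final=True.
            parts.append(self._decoder.decode(chunk, final=not chunk))
            if not chunk:
                break
        return "".join(parts)


def sniff_bom(head):
    """Hypothetical hook: choose an encoding from a byte-order mark."""
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return "utf-8"


raw = io.BytesIO("héllo".encode("utf-16"))
text = LazyDecodingReader(raw, sniff_bom).read()
```

Passing a plain string as the encoding behaves as before in this sketch; only the point at which the decoder is built has moved. Nothing is pushed back into the buffer and nothing is rewound.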

> > > Executive summary:
> > > ==================
> > >
> > > There is no good way to read a text file when the
> > > encoding has to be determined by reading the start
> > > of the file. A long-winded version of that follows.
> > > Scroll down to the "Proposal" section to skip it.
>
> This may be insufficiently general. Specifically, both Emacsen and vi
> allow specification of editor configuration variables at the bottom of
> the file as well as the top. I don't know whether vi allows encoding
> specs at the bottom, but Emacsen do (but only for files).
>
> I wouldn't recommend paying much attention to what Emacsen actually
> do when initializing a stream (it's, uh, "baroque").

Looking only at the beginning of an input stream is general enough for a large class of problems, including tokenizing Python source code.
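
For the Python source case, the standard library already does the "look at the start" part: tokenize.detect_encoding() reads at most two lines from a binary stream and returns the encoding along with the lines it consumed. A small sketch (the sample source bytes are made up for illustration) shows the awkward part, which is that the caller has to stitch the consumed lines back onto the rest of the stream by hand:

```python
import io
import tokenize

# Made-up sample: a latin-1 encoded source file with a coding cookie.
source = b"# -*- coding: latin-1 -*-\nname = '\xe9'\n"
stream = io.BytesIO(source)

# detect_encoding() calls readline at most twice and returns the bytes
# it consumed, since the stream may not be rewindable.
encoding, consumed = tokenize.detect_encoding(stream.readline)

# Reassemble the text: the already-consumed lines plus the remainder.
text = b"".join(consumed).decode(encoding) + stream.read().decode(encoding)
```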


