[Python-3000] [Python-3000-checkins] r54742 - in python/branches/p3yk/Lib: io.py test/test_io.py
Guido van Rossum guido at python.org
Thu Apr 12 19:30:04 CEST 2007
On 4/12/07, Walter Dörwald <walter at livinglogic.de> wrote:
Guido van Rossum wrote:
> On 4/11/07, Walter Dörwald <walter at livinglogic.de> wrote:
>> Would it make sense to make the state of the decoder public, e.g. by
>> adding setstate() and getstate() methods? This would give a cleaner API.
>
> I've been thinking of the same thing!
>
> I wonder if it would be possible to return the state as a pair
> (unread, flags) where unread is a (byte) string of unprocessed bytes
> and flags is some other state, with the constraint that in the initial
> state the flags must be zero. Then I can optimize the case where flags
> is returned as zero by subtracting len(unread) from the current
> position and that'd be the correct seek position.
I'd say that bytestream.tell() is the correct position. Or should seek() return to the last position where the codec was in a default state without anything buffered? (This can't work for UTF-16, because the codec almost never is in the default state.)
That was my hope, yes (and I realize that UTF-16 is an exception). Consider UTF-8 though. If the chunk we read from the byte stream ended in the middle of a multi-byte character, the codec will have the first part of that character buffered. In general we want to subtract buffered data from the byte stream's position when reporting the position of the text stream. The idea is that if we later seek to the reported position, we should be reading the same character data. This can be accomplished in two ways: by backing up the byte stream to the previous character boundary, and resetting the decoder to neutral; or by positioning the byte stream to where it was originally and setting the state of the decoder to what it was before. However, backing up the byte stream has the advantage that no decoder state needs to be encoded in the position cookie.
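The two ways of restoring a position can be sketched roughly as follows. This is only an illustration: it uses getstate() and setstate() as they exist in the codecs module of current Python, which may differ from the patch under discussion, and the names resume_by_backing_up and resume_with_state are made up for the example.

```python
import codecs
import io

data = "héllo".encode("utf-8")  # 'é' occupies two bytes
raw = io.BytesIO(data)
decoder = codecs.getincrementaldecoder("utf-8")()

# Read a chunk that ends in the middle of the two-byte character.
text = decoder.decode(raw.read(2))
unread, flags = decoder.getstate()  # unread holds the buffered first byte

# Way 1: the cookie is a plain byte offset, backed up over the buffered bytes.
# (Assumption: this is only valid because flags is zero here.)
offset_cookie = raw.tell() - len(unread)

# Way 2: the cookie carries the byte stream position plus the decoder state.
state_cookie = (raw.tell(), decoder.getstate())


def resume_by_backing_up(offset):
    """Hypothetical helper: seek to a character boundary, neutral decoder."""
    raw.seek(offset)
    decoder.reset()
    return decoder.decode(raw.read(), final=True)


def resume_with_state(cookie):
    """Hypothetical helper: seek to the original position, restore the state."""
    position, state = cookie
    raw.seek(position)
    decoder.setstate(state)
    return decoder.decode(raw.read(), final=True)


rest_1 = resume_by_backing_up(offset_cookie)
rest_2 = resume_with_state(state_cookie)
print(text, rest_1, rest_2)
```

Both calls yield the same remaining text; the first needs only an integer in the cookie, which is the advantage mentioned above.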
> I imagine most
> decoders have only very few flags they care about. (The worst might be
> the utf-16 decoder which must have a flag to remember whether it
> already saw a byte order marker, and another indicating the byte
> order. Maybe we'll have to special-case that one, so don't worry too
> much about it.)
>
>> Should I work on a patch?
>
> That would be great!
OK, here's the patch: http://bugs.python.org/1698994 The state returned from getstate() should be treated as an opaque value (e.g. for the buffered incremental codecs it is the buffered string, for the UTF-16 encoder it's the flag indicating whether a BOM has been written etc.). The codecs try to return None, if they are in some kind of default state (e.g. there's nothing buffered).
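As a rough illustration of treating the state as opaque, the sketch below saves the state of a UTF-16 encoder after it has written its BOM and hands it to a second encoder without inspecting it. It assumes the getstate()/setstate() methods of the incremental encoders in current Python's codecs module, which may not match the patch exactly.

```python
import codecs

make_encoder = codecs.getincrementalencoder("utf-16")

encoder = make_encoder()
first = encoder.encode("a")   # the first call emits the BOM, then 'a'
state = encoder.getstate()    # opaque: passed along, never interpreted

# A second encoder that receives the saved state continues the same
# stream, so it should not write a second BOM.
resumed_encoder = make_encoder()
resumed_encoder.setstate(state)
resumed = resumed_encoder.encode("b")

# A fresh encoder without the saved state starts a new stream with a BOM.
fresh = make_encoder().encode("b")

print(first, resumed, fresh)
```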
I would like to await completion of those unit tests; there seem to be some subtle issues. I wonder if setstate() should call self.reset() first. I'd also like to ask if setstate() could default to "" only if the argument is None, not if it is empty; I'd like to use it to change the buffer to be a bytes object. (And yes, I need to maintain more hacks for that, alas).
--
--Guido van Rossum (home page: http://www.python.org/~guido/)