[Python-3000] [Python-3000-checkins] r54742 - in python/branches/p3yk/Lib: io.py test/test_io.py
Walter Dörwald walter at livinglogic.de
Fri Apr 13 13:49:47 CEST 2007
Guido van Rossum wrote:
>> > I wonder if it would be possible to return the state as a pair
>> > (unread, flags) where unread is a (byte) string of unprocessed bytes
>> > and flags is some other state, with the constraint that in the initial
>> > state the flags must be zero. Then I can optimize the case where flags
>> > is returned as zero by subtracting len(unread) from the current
>> > position and that'd be the correct seek position.
>>
>> I'd say that bytestream.tell() is the correct position.
>>
>> Or should seek() return to the last position where the codec was in a
>> default state without anything buffered? (This can't work for UTF-16,
>> because the codec almost never is in the default state.)
>
> That was my hope, yes (and I realize that UTF-16 is an exception).
We could designate natural endianness as the default state, but that would mean that a codec state can't be transferred to a different machine (or we could declare little (or big) endianness to be the default state).

I think it's okay for file positions involving codec states not to be transferable between platforms. I think they wouldn't even be guaranteed between subsequent runs of the same program.
OK, done in the third version of the patch.
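As a rough illustration of the (unread, flags) idea discussed above, here is a minimal sketch. It assumes the incremental decoder API as it exists in today's standard library `codecs` module, which may not match the patch versions under discussion in this thread. The sample string and the variable names (`chunk`, `byte_pos`, `seek_pos`) are made up for the example.

```python
import codecs

# Assumed sample data: 'a' (1 byte), e-acute (2 bytes), euro sign (3 bytes).
data = "a\u00e9\u20ac".encode("utf-8")
decoder = codecs.getincrementaldecoder("utf-8")()

# Feed a chunk that stops inside the 3-byte euro sign.
chunk = data[:4]
text = decoder.decode(chunk)

# getstate() returns a pair: the bytes the decoder is still holding
# and some additional state information.
unread, flags = decoder.getstate()

# Stand-in for what bytestream.tell() would report after reading `chunk`.
byte_pos = len(chunk)

# When flags is zero, subtracting the buffered bytes gives a position
# that sits on a character boundary. Other flag values are not handled
# in this sketch.
seek_pos = byte_pos - len(unread) if flags == 0 else None

print(repr(text), unread, flags, seek_pos)
```

Only the zero-flags case is covered here; a decoder that reports non-zero flags would need its state stored alongside the position.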
> Consider UTF-8 though. If the chunk we read from the byte stream ended
> in the middle of a multi-byte character, the codec will have the first
> part of that character buffered. In general we want to subtract
> buffered data from the byte stream's position when reporting the
> position of the text stream. The idea is that if we later seek to the
> reported position, we should be reading the same character data. This
> can be accomplished in two ways: by backing up the byte stream to the
> previous character boundary, and resetting the decoder to neutral; or
> by positioning the byte stream to where it was originally and setting
> the state of the decoder to what it was before. However, backing up
> the byte stream has the advantage that no decoder state needs to be
> encoded in the position cookie.
OK, so for decoders getstate() should always return a tuple, with the first entry being the buffered byte string (or bytes object?) and the second being additional state data. Do we need any specification for encoders?

I don't need this for encoders at all -- we don't use incremental encoders, only incremental decoders.
True for reading, but what about writing?
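The two ways of returning to a reported position that are described above can be compared side by side. This is only a sketch under assumptions: it uses `io.BytesIO` as a stand-in for the byte stream and the current standard library UTF-8 incremental decoder, and names such as `via_backup` and `via_state` are invented for the example.

```python
import codecs
import io

# Assumed sample data: 'a', e-acute (2 bytes), euro sign (3 bytes), 'z'.
raw = io.BytesIO("a\u00e9\u20acz".encode("utf-8"))
decoder = codecs.getincrementaldecoder("utf-8")()

# Read a chunk that ends inside the 3-byte euro sign.
decoder.decode(raw.read(4))
saved_state = decoder.getstate()      # (buffered bytes, flags)
unread, flags = saved_state
original_pos = raw.tell()

# Way 1: back the byte stream up to the previous character boundary
# and reset the decoder to its neutral state.
raw.seek(original_pos - len(unread))
decoder.reset()
via_backup = decoder.decode(raw.read(), final=True)

# Way 2: position the byte stream where it was and restore the
# decoder state that was saved at that point.
raw.seek(original_pos)
decoder.setstate(saved_state)
via_state = decoder.decode(raw.read(), final=True)

print(repr(via_backup), repr(via_state))
```

Both ways yield the same character data; the first needs only an integer position, while the second has to carry `saved_state` along with it.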
>> The state returned from getstate() should be treated as an opaque value
>> (e.g. for the buffered incremental codecs it is the buffered string, for
>> the UTF-16 encoder it's the flag indicating whether a BOM has been
>> written etc.). The codecs try to return None, if they are in some kind
>> of default state (e.g. there's nothing buffered).
>
> I would like to await completion of those unit tests;
The second version of the patch includes the unit tests (and fixes the utf-8-sig codec).

> there seem to be
> some subtle issues.

Can you be more concrete?

I think I just meant the str/bytes issue I already mentioned.
Since the new version never sets the buffer to an explicit value except in the constructor this problem should have disappeared.
> I wonder if setstate() should call self.reset() first.
Calling reset() and calling setstate() with the initial state should have the same effect.

OK, I should do that anyway. (I wasn't aware of reset() until I saw your patch. ;-)

> I'd also like to ask if setstate() could default to "" only if
> the argument is None, not if it is empty; I'd like to use it to change
> the buffer to be a bytes object.

I'd say for Python 3000 it should always be a bytes object.

Eventually, yes. But right now we're in a world where sometimes there are bytes and sometimes there are (8-bit) strings -- and I'd like to get as many tests passing with the new IO library without making it the default first.
OK.
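The claim that reset() and setstate() with the initial state have the same effect can be checked with a small sketch. It assumes the UTF-8 incremental decoder from the current standard library `codecs` module, not the patch discussed here; `decoder_with_leftover` is a hypothetical helper written for this example.

```python
import codecs

def decoder_with_leftover():
    """Hypothetical helper: a UTF-8 decoder holding an incomplete character."""
    d = codecs.getincrementaldecoder("utf-8")()
    d.decode(b"\xe2\x82")   # first two bytes of a 3-byte character
    return d

# State of a decoder that has not seen any input yet.
initial_state = codecs.getincrementaldecoder("utf-8")().getstate()

a = decoder_with_leftover()
a.reset()

b = decoder_with_leftover()
b.setstate(initial_state)

# Both decoders should now report the initial state again.
same = a.getstate() == b.getstate() == initial_state
print(same)
```

This only exercises one codec; codecs with extra state beyond the byte buffer would need their own check.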
Will this interoperate seamlessly with the C part of the codec machinery?

It should if it uses the buffer API as it should. When I encounter places where it requires 8-bit strings I'll fix them opportunistically.
> (And yes, I need to maintain more hacks for that, alas).
I'll try to update the patch tomorrow or over the weekend.

Thanks!
Done. I've also added documentation (The description of the constraints on the decoder state sounds quite esoteric ;)).
Servus, Walter