msg315768 - (view) |
Author: Diego Argueta (da) * |
Date: 2018-04-26 01:06 |
It appears that calling readline() on a codecs.EncodedFile stream breaks seeking and causes subsequent attempts to iterate over the lines or call readline() to backtrack and return already consumed lines.

A minimal example:

```
from __future__ import print_function

import codecs
import io

def run(stream):
    offset = stream.tell()
    try:
        stream.seek(0)
        header_row = stream.readline()
    finally:
        stream.seek(offset)

    print('Got header: %r' % header_row)

    if stream.tell() == 0:
        print('Skipping the header: %r' % stream.readline())

    for index, line in enumerate(stream, start=2):
        print('Line %d: %r' % (index, line))

b = io.BytesIO(u'a,b\r\n"asdf","jkl;"\r\n'.encode('utf-16-le'))
s = codecs.EncodedFile(b, 'utf-8', 'utf-16-le')

run(s)
```

Output:

```
Got header: 'a,b\r\n'
Skipping the header: '"asdf","jkl;"\r\n'    <-- this is line 2
Line 2: 'a,b\r\n'                           <-- this is line 1
Line 3: '"asdf","jkl;"\r\n'                 <-- now we're back to line 2
```

As you can see, the line being skipped is actually the second line, and when we try reading from the stream again, the iterator starts from the beginning of the file.

Even weirder, adding a second call to readline() to skip the second line shows it's going **backwards**:

```
Got header: 'a,b\r\n'
Skipping the header: '"asdf","jkl;"\r\n'    <-- this is actually line 2
Skipping the second line: 'a,b\r\n'         <-- this is line 1
Line 2: '"asdf","jkl;"\r\n'                 <-- this is now correct
```

The expected output shows that we got a header, skipped it, and then read one data line.

```
Got header: 'a,b\r\n'
Skipping the header: 'a,b\r\n'
Line 2: '"asdf","jkl;"\r\n'
```

I'm sure this is related to the implementation of readline() because if we change this:

```
header_row = stream.readline()
```

to this:

```
header_row = stream.read().splitlines()[0]
```

then we get the expected output. If, on the other hand, we comment out the seek() in the finally clause, we also get the expected output (minus the "Skipping the header" line).
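A possible call-site workaround, sketched under an assumption the report does not confirm: that readline() leaves decoded data buffered in the recoder, and that a plain seek() on an affected version does not discard it. Calling reset() after each seek() clears those buffers explicitly. The helper name `read_header` is hypothetical. Note that tell() still reports the position of the underlying stream, so restoring an offset other than 0 may not land where the caller expects.

```python
import codecs
import io


def read_header(stream):
    """Return the first line of `stream`, then restore the saved offset.

    `stream` is assumed to be the object returned by codecs.EncodedFile().
    The reset() calls are the assumed workaround: they drop any decoded
    data that an earlier readline() left buffered.
    """
    offset = stream.tell()
    try:
        stream.seek(0)
        stream.reset()
        return stream.readline()
    finally:
        stream.seek(offset)
        stream.reset()


raw = io.BytesIO(u'a,b\r\n"asdf","jkl;"\r\n'.encode('utf-16-le'))
s = codecs.EncodedFile(raw, 'utf-8', 'utf-16-le')

header = read_header(s)
remaining = list(s)  # iteration should start again from the first line

print('Got header: %r' % header)
for index, line in enumerate(remaining, start=1):
    print('Line %d: %r' % (index, line))
```

On a version that already contains the fix, the extra reset() calls should be harmless.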
|
|
msg315788 - (view) |
Author: Elena Oat (Elena.Oat) * |
Date: 2018-04-26 12:16 |
I cannot replicate this when the stream is:

In: stream_ex = io.BytesIO(u"abc\ndef\nghi\n".encode("utf-8"))
In: f = codecs.EncodedFile(stream_ex, 'utf-8')
In: run(f)
Out:
Got header: b'abc\n'
Skipping the header: b'abc\n'
Line 2: b'def\n'
Line 3: b'ghi\n'
|
|
msg315808 - (view) |
Author: Diego Argueta (da) * |
Date: 2018-04-26 18:02 |
That's because the stream isn't transcoding, since UTF-8 is ASCII-compatible. Try using something not ASCII-compatible as the codec, e.g. 'ibm500', and it'll give incorrect results.

```
b = io.BytesIO(u'a,b\r\n"asdf","jkl;"\r\n'.encode('ibm500'))
s = codecs.EncodedFile(b, 'ibm500')
```

```
Got header: '\x81k\x82\r%'
Skipping the header. '\x7f\x81\xa2\x84\x86\x7fk\x7f\x91\x92\x93^\x7f\r%'
Line 2: '\x81k\x82\r%'
Line 3: '\x7f\x81\xa2\x84\x86\x7fk\x7f\x91\x92\x93^\x7f\r%'
```
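The EBCDIC byte strings in that output are hard to read by eye. As a small aid, the sketch below decodes the two distinct values with the same 'ibm500' codec to show which CSV line each one is. The byte literals are copied from the output above and written as Python 3 bytes; the variable names are illustrative only.

```python
import codecs

# Byte strings copied from the output above, as Python 3 bytes literals.
header_bytes = b'\x81k\x82\r%'
row_bytes = b'\x7f\x81\xa2\x84\x86\x7fk\x7f\x91\x92\x93^\x7f\r%'

# Decode with the same codec that produced them.
decoded_header = codecs.decode(header_bytes, 'ibm500')
decoded_row = codecs.decode(row_bytes, 'ibm500')

print(repr(decoded_header))
print(repr(decoded_row))
```

Read this way, the "Skipping the header" value is the second CSV line and "Line 2" is the first, which is the same ordering problem as in the UTF-16 example.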
|
|
msg315809 - (view) |
Author: Diego Argueta (da) * |
Date: 2018-04-26 18:08 |
Update: If I run your exact code it still breaks for me:

```
Got header: 'abc\n'
Skipping the header. 'def\n'
Line 2: 'ghi\n'
Line 3: 'abc\n'
Line 4: 'def\n'
Line 5: 'ghi\n'
```

I'm running Python 2.7.14 and 3.6.5 on OSX 10.13.4. Startup banners:

Python 2.7.14 (default, Feb 7 2018, 14:15:12)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)] on darwin

Python 3.6.5 (default, Apr 2 2018, 14:03:12)
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.1)] on darwin
|
|
msg315835 - (view) |
Author: Elena Oat (Elena.Oat) * |
Date: 2018-04-27 11:53 |
I've tried this with Python 3.6.0 on OSX 10.13.4 |
|
|
msg315841 - (view) |
Author: Elena Oat (Elena.Oat) * |
Date: 2018-04-27 13:46 |
For your specific example I also get a weird result. Tried this in Python 2.7.10 and Python 3.6.0. |
|
|
msg315842 - (view) |
Author: Elena Oat (Elena.Oat) * |
Date: 2018-04-27 13:53 |
I've modified your example a little, and it's clear that readline() moves the cursor.

```
from __future__ import print_function

import codecs
import io

def run(stream):
    offset = stream.tell()
    try:
        stream.seek(0)
        header_row = stream.readline()
    finally:
        stream.seek(offset)

    print(offset)
    print(stream.tell())
    print('Got header: %r' % header_row)

    if stream.tell() == 0:
        print(stream.tell())
        print(stream.readline())
        print('Skipping the header: %r' % stream.readline())

    for index, line in enumerate(stream, start=2):
        print('Line %d: %r' % (index, line))

b = io.BytesIO(u'ab\r\ncd\ndef\n'.encode('utf-16-le'))
s = codecs.EncodedFile(b, 'utf-8', 'utf-16-le')

run(s)
```

The first call to readline() after the seek returns cd instead of ab.
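To make the cursor movement visible, a minimal sketch: read one line through the recoder, then ask the underlying BytesIO where it is. The names `payload`, `raw` and `first_line_size` are illustrative. For this short input the underlying position ends up well past the end of the first line, which suggests the reader pulls in more bytes than it returns and keeps the rest buffered.

```python
import codecs
import io

payload = u'ab\r\ncd\ndef\n'.encode('utf-16-le')
raw = io.BytesIO(payload)
s = codecs.EncodedFile(raw, 'utf-8', 'utf-16-le')

line = s.readline()

# Size of the first line as stored in the UTF-16-LE source.
first_line_size = len(u'ab\r\n'.encode('utf-16-le'))

print('Returned line: %r' % line)
print('Bytes in first line: %d' % first_line_size)
print('Underlying position: %d of %d' % (raw.tell(), len(payload)))
```

Because tell() on the recoder is answered by the underlying stream, a position saved after readline() does not correspond to the end of the line that was returned.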
|
|
msg317245 - (view) |
Author: Diego Argueta (da) * |
Date: 2018-05-21 17:47 |
Update: Tested this on Python 3.5.4, 3.4.8, and 3.7.0b3 on OSX 10.13.4. They also exhibit the bug. Updating the ticket accordingly. |
|
|
msg321634 - (view) |
Author: Diego Argueta (da) * |
Date: 2018-07-13 21:05 |
Bug still present in 3.7.0, now seeing it in 3.8.0a0 as well. |
|
|
msg343407 - (view) |
Author: Josh Rosenberg (josh.r) *  |
Date: 2019-05-24 16:20 |
Possibly related to #8260 ("When I use codecs.open(...) and f.readline() follow up by f.read() return bad result"), which was never fully fixed in that issue, though #32110 ("Make codecs.StreamReader.read() more compatible with read() of other files") may have fixed more (all?) of it. |
|
|
msg343596 - (view) |
Author: Diego Argueta (da) * |
Date: 2019-05-27 01:38 |
> though #32110 ("Make codecs.StreamReader.read() more compatible with read() of other files") may have fixed more (all?) of it.

Still seeing this in 3.7.3 so I don't think so?
|
|
msg344111 - (view) |
Author: Berker Peksag (berker.peksag) *  |
Date: 2019-05-31 19:44 |
New changeset a6ec1ce1ac05b1258931422e96eac215b6a05459 by Berker Peksag (Ammar Askar) in branch 'master': bpo-33361: Fix bug with seeking in StreamRecoders (GH-8278) https://github.com/python/cpython/commit/a6ec1ce1ac05b1258931422e96eac215b6a05459 |
|
|
msg344115 - (view) |
Author: Berker Peksag (berker.peksag) *  |
Date: 2019-05-31 20:03 |
New changeset a6dc5d4e1c9ef465dc1f1ad95c382aa8e32b178f by Berker Peksag (Miss Islington (bot)) in branch '3.7': bpo-33361: Fix bug with seeking in StreamRecoders (GH-8278) https://github.com/python/cpython/commit/a6dc5d4e1c9ef465dc1f1ad95c382aa8e32b178f |
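For readers who want the idea without opening the commit, here is a rough sketch of one way such a fix could look. It assumes the problem is that seek() only reaches the underlying stream, so the reader's buffered data survives the seek. The subclass name `SeekingRecoder` is hypothetical, and the linked commit remains the authoritative reference for what was actually changed.

```python
import codecs
import io


class SeekingRecoder(codecs.StreamRecoder):
    """Hypothetical subclass: forward seek() to the reader and the writer.

    StreamReader.seek() and StreamWriter.seek() reposition the underlying
    stream and can reset their own buffered state, which a seek made
    directly on the underlying stream would leave untouched.
    """

    def seek(self, offset, whence=0):
        self.reader.seek(offset, whence)
        self.writer.seek(offset, whence)


raw = io.BytesIO(u'a,b\r\n"asdf","jkl;"\r\n'.encode('utf-16-le'))
data_info = codecs.lookup('utf-8')      # encoding seen by the caller
file_info = codecs.lookup('utf-16-le')  # encoding of the stored bytes

s = SeekingRecoder(raw, data_info.encode, data_info.decode,
                   file_info.streamreader, file_info.streamwriter)

first = s.readline()
s.seek(0)
again = s.readline()

print('First read:  %r' % first)
print('After seek:  %r' % again)
```

With the seek forwarded this way, both reads should return the header line.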
|
|
msg344116 - (view) |
Author: Berker Peksag (berker.peksag) *  |
Date: 2019-05-31 20:04 |
Thank you for the report, Diego, and thank you for the patch, Ammar! |
|
|