Issue 33361: readline() + seek() on codecs.EncodedFile breaks next readline()

Created on 2018-04-26 01:06 by da, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Pull Requests
URL       Status  Linked
PR 8278   merged  ammar2, 2018-07-14 00:20
PR 13708  merged  miss-islington, 2019-05-31 19:44
Messages (14)
msg315768 - (view) Author: Diego Argueta (da) * Date: 2018-04-26 01:06
It appears that calling readline() on a codecs.EncodedFile stream breaks seeking and causes subsequent attempts to iterate over the lines or call readline() to backtrack and return already consumed lines.

A minimal example:

```
from __future__ import print_function

import codecs
import io


def run(stream):
    offset = stream.tell()
    try:
        stream.seek(0)
        header_row = stream.readline()
    finally:
        stream.seek(offset)

    print('Got header: %r' % header_row)

    if stream.tell() == 0:
        print('Skipping the header: %r' % stream.readline())

    for index, line in enumerate(stream, start=2):
        print('Line %d: %r' % (index, line))


b = io.BytesIO(u'a,b\r\n"asdf","jkl;"\r\n'.encode('utf-16-le'))
s = codecs.EncodedFile(b, 'utf-8', 'utf-16-le')
run(s)
```

Output:

```
Got header: 'a,b\r\n'
Skipping the header: '"asdf","jkl;"\r\n'    <-- this is line 2
Line 2: 'a,b\r\n'                           <-- this is line 1
Line 3: '"asdf","jkl;"\r\n'                 <-- now we're back to line 2
```

As you can see, the line being skipped is actually the second line, and when we try reading from the stream again, the iterator starts from the beginning of the file.

Even weirder, adding a second call to readline() to skip the second line shows it's going **backwards**:

```
Got header: 'a,b\r\n'
Skipping the header: '"asdf","jkl;"\r\n'    <-- this is actually line 2
Skipping the second line: 'a,b\r\n'         <-- this is line 1
Line 2: '"asdf","jkl;"\r\n'                 <-- this is now correct
```

The expected output shows that we got a header, skipped it, and then read one data line.

```
Got header: 'a,b'
Skipping the header: 'a,b\r\n'
Line 2: '"asdf","jkl;"\r\n'
```

I'm sure this is related to the implementation of readline() because if we change this:

```
header_row = stream.readline()
```

to this:

```
header_row = stream.read().splitlines()[0]
```

then we get the expected output. If, on the other hand, we comment out the seek() in the finally clause, we also get the expected output (minus the "Skipping the header" line).
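The report above can be condensed into a small self-check. This is only a sketch: the function name `read_header_twice` is made up for illustration, and the "consistent" result is what an interpreter that includes the fix from this issue should produce. On an affected interpreter, the second readline() is expected to return the second line instead.

```python
import codecs
import io


def read_header_twice(text=u'a,b\r\n"asdf","jkl;"\r\n'):
    """Read the first line, rewind to offset 0, and read the first line again."""
    raw = io.BytesIO(text.encode('utf-16-le'))
    # Data is exposed as UTF-8 while the underlying file is UTF-16-LE.
    stream = codecs.EncodedFile(raw, 'utf-8', 'utf-16-le')
    first = stream.readline()
    stream.seek(0)
    again = stream.readline()
    rest = list(stream)  # whatever lines remain after the second readline()
    return first, again, rest


first, again, rest = read_header_twice()
# If seek(0) really rewound the stream, both reads return the same header line.
print('consistent' if first == again else 'affected')
print('first=%r again=%r rest=%r' % (first, again, rest))
```

Comparing `first` and `again` avoids depending on the `tell()` bookkeeping used in the original `run()` function.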
msg315788 - (view) Author: Elena Oat (Elena.Oat) * Date: 2018-04-26 12:16
I cannot replicate this when the stream is:

In: stream_ex = io.BytesIO(u"abc\ndef\nghi\n".encode("utf-8"))
In: f = codecs.EncodedFile(stream_ex, 'utf-8')
In: run(f)
Out:
Got header: b'abc\n'
Skipping the header: b'abc\n'
Line 2: b'def\n'
Line 3: b'ghi\n'
msg315808 - (view) Author: Diego Argueta (da) * Date: 2018-04-26 18:02
That's because the stream isn't transcoding, since UTF-8 is ASCII-compatible. Try using something not ASCII-compatible as the codec, e.g. 'ibm500', and it'll give incorrect results.

```
b = io.BytesIO(u'a,b\r\n"asdf","jkl;"\r\n'.encode('ibm500'))
s = codecs.EncodedFile(b, 'ibm500')
```

```
Got header: '\x81k\x82\r%'
Skipping the header. '\x7f\x81\xa2\x84\x86\x7fk\x7f\x91\x92\x93^\x7f\r%'
Line 2: '\x81k\x82\r%'
Line 3: '\x7f\x81\xa2\x84\x86\x7fk\x7f\x91\x92\x93^\x7f\r%'
```
msg315809 - (view) Author: Diego Argueta (da) * Date: 2018-04-26 18:08
Update: If I run your exact code it still breaks for me:

```
Got header: 'abc\n'
Skipping the header. 'def\n'
Line 2: 'ghi\n'
Line 3: 'abc\n'
Line 4: 'def\n'
Line 5: 'ghi\n'
```

I'm running Python 2.7.14 and 3.6.5 on OSX 10.13.4. Startup banners:

Python 2.7.14 (default, Feb 7 2018, 14:15:12)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)] on darwin

Python 3.6.5 (default, Apr 2 2018, 14:03:12)
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.1)] on darwin
msg315835 - (view) Author: Elena Oat (Elena.Oat) * Date: 2018-04-27 11:53
I've tried this with Python 3.6.0 on OSX 10.13.4
msg315841 - (view) Author: Elena Oat (Elena.Oat) * Date: 2018-04-27 13:46
For your specific example I also get a weird result. Tried this in Python 2.7.10 and Python 3.6.0.
msg315842 - (view) Author: Elena Oat (Elena.Oat) * Date: 2018-04-27 13:53
I've modified your example a little, and it's clear that readline() moves the cursor.

```
from __future__ import print_function

import codecs
import io


def run(stream):
    offset = stream.tell()
    try:
        stream.seek(0)
        header_row = stream.readline()
    finally:
        stream.seek(offset)
    print(offset)
    print(stream.tell())

    print('Got header: %r' % header_row)

    if stream.tell() == 0:
        print(stream.tell())
        print(stream.readline())
        print('Skipping the header: %r' % stream.readline())

    for index, line in enumerate(stream, start=2):
        print('Line %d: %r' % (index, line))


b = io.BytesIO(u'ab\r\ncd\ndef\n'.encode('utf-16-le'))
s = codecs.EncodedFile(b, 'utf-8', 'utf-16-le')
run(s)
```

The first call to readline() after the seek returns cd instead of ab.
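The cursor movement can be observed directly on the underlying BytesIO. The sketch below is an illustration, not part of the original reports, and its variable names are arbitrary. It compares the byte size of the first line with the position of the wrapped stream after a single readline(). The assumption is that codecs stream readers read a chunk of the file and keep the remainder in an internal buffer, so the underlying position runs ahead of what the caller has consumed.

```python
import codecs
import io

raw = io.BytesIO(u'ab\r\ncd\ndef\n'.encode('utf-16-le'))
stream = codecs.EncodedFile(raw, 'utf-8', 'utf-16-le')

first_line = stream.readline()

# Number of bytes the first line occupies in the UTF-16-LE file.
first_line_size = len(u'ab\r\n'.encode('utf-16-le'))
# Position of the wrapped BytesIO, queried directly rather than via the wrapper.
position = raw.tell()

print('line returned: %r' % first_line)
print('first line size: %d bytes, underlying position: %d' % (first_line_size, position))
```

If the position printed is larger than the size of the first line, the remaining lines are already sitting in the reader's buffer. A seek() that only repositions the underlying stream would leave that buffered data in place.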
msg317245 - (view) Author: Diego Argueta (da) * Date: 2018-05-21 17:47
Update: Tested this on Python 3.5.4, 3.4.8, and 3.7.0b3 on OSX 10.13.4. They also exhibit the bug. Updating the ticket accordingly.
msg321634 - (view) Author: Diego Argueta (da) * Date: 2018-07-13 21:05
Bug still present in 3.7.0, now seeing it in 3.8.0a0 as well.
msg343407 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2019-05-24 16:20
Possibly related to #8260 ("When I use codecs.open(...) and f.readline() follow up by f.read() return bad result"), which was never fully fixed in that issue, though #32110 ("Make codecs.StreamReader.read() more compatible with read() of other files") may have fixed more (all?) of it.
msg343596 - (view) Author: Diego Argueta (da) * Date: 2019-05-27 01:38
> though #32110 ("Make codecs.StreamReader.read() more compatible with read() of other files") may have fixed more (all?) of it.

Still seeing this in 3.7.3 so I don't think so?
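For interpreters that still show the problem, one possible workaround is to discard the buffered data explicitly after seeking. This is a sketch only: `seek_and_reset` is a hypothetical helper, not something proposed in this thread, and it assumes the stale lines come from the reader's internal buffers. It relies on the `stream` attribute and `reset()` method of `codecs.StreamRecoder`, and has not been verified on the affected releases listed above.

```python
import codecs
import io


def seek_and_reset(recoder, offset, whence=0):
    """Hypothetical helper: seek the wrapped stream, then drop buffered data."""
    recoder.stream.seek(offset, whence)
    # StreamRecoder.reset() resets both the internal reader and writer.
    recoder.reset()


raw = io.BytesIO(u'a,b\r\n"asdf","jkl;"\r\n'.encode('utf-16-le'))
stream = codecs.EncodedFile(raw, 'utf-8', 'utf-16-le')

header = stream.readline()
seek_and_reset(stream, 0)
header_again = stream.readline()

# Both reads should now return the same first line.
print(header == header_again)
```

Resetting is only safe when seeking to a character boundary, because any partially decoded bytes held by the reader are thrown away.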
msg344111 - (view) Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2019-05-31 19:44
New changeset a6ec1ce1ac05b1258931422e96eac215b6a05459 by Berker Peksag (Ammar Askar) in branch 'master': bpo-33361: Fix bug with seeking in StreamRecoders (GH-8278) https://github.com/python/cpython/commit/a6ec1ce1ac05b1258931422e96eac215b6a05459
msg344115 - (view) Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2019-05-31 20:03
New changeset a6dc5d4e1c9ef465dc1f1ad95c382aa8e32b178f by Berker Peksag (Miss Islington (bot)) in branch '3.7': bpo-33361: Fix bug with seeking in StreamRecoders (GH-8278) https://github.com/python/cpython/commit/a6dc5d4e1c9ef465dc1f1ad95c382aa8e32b178f
msg344116 - (view) Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2019-05-31 20:04
Thank you for the report, Diego and thank you for the patch, Ammar!
History
Date User Action Args
2022-04-11 14:58:59 admin set github: 77542
2019-05-31 20:04:36 berker.peksag set status: open -> closed; versions: - Python 2.7, Python 3.4, Python 3.5, Python 3.6; messages: +; resolution: fixed; stage: patch review -> resolved
2019-05-31 20:03:28 berker.peksag set messages: +
2019-05-31 19:44:22 miss-islington set pull_requests: + pull_request13595
2019-05-31 19:44:16 berker.peksag set nosy: + berker.peksag; messages: +
2019-05-27 01:38:06 da set messages: +
2019-05-24 16:20:39 josh.r set nosy: + josh.r; messages: +
2018-07-14 00:20:38 ammar2 set keywords: + patch; stage: patch review; pull_requests: + pull_request7813
2018-07-13 21:05:39 da set messages: +; versions: + Python 3.8
2018-05-21 21:33:39 josh.r set title: readline() + seek() on io.EncodedFile breaks next readline() -> readline() + seek() on codecs.EncodedFile breaks next readline()
2018-05-21 17:47:08 da set messages: +; versions: + Python 3.4, Python 3.5, Python 3.7
2018-04-27 13:53:51 Elena.Oat set messages: +
2018-04-27 13:46:27 Elena.Oat set messages: +
2018-04-27 11:53:59 Elena.Oat set messages: +
2018-04-26 18:08:38 da set messages: +
2018-04-26 18:02:41 da set messages: +
2018-04-26 12:16:14 Elena.Oat set nosy: + Elena.Oat; messages: +
2018-04-26 01:06:03 da create