In previous versions of python, when there was a unicode decode error, StreamReader.readline() would advance to the next line. In the current version(2.4.2 and trunk), it doesn't. Testing under Linux AMD64 (Ubuntu 5.10) Attaching an example script. In python2.3 it prints: hi~ hi error: 'utf8' codec can't decode byte 0x80 in position 2: unexpected code byte error: 'utf8' codec can't decode byte 0x81 in position 2: unexpected code byte all done In python2.4 and trunk it prints: hi~ hi error: 'utf8' codec can't decode byte 0x80 in position 0: unexpected code byte error: 'utf8' codec can't decode byte 0x80 in position 0: unexpected code byte error: 'utf8' codec can't decode byte 0x80 in position 0: unexpected code byte [repeats forever] Maybe this isn't actually supposed to work (the docs don't mention what should happen with strict error checking..), but it would be nice, given the alternatives: 1. use errors='replace' and then search the result for the replacement character. (ick) 2. define a custom error handler similar to ignore or replace, that also sets some flag. (slightly less ick, but more work.)
Logged In: YES user_id=89016 IMHO the current behaviour is more consistent. To read the broken utf-8 stream from the test script the appropriate error handler should be used. What is the desired outcome? If only the broken byte sequence should be skipped errors="replace" is appropriate. To skip a complete line that contains a broken byte sequence do something like in the attached skipbadlines.py. The StreamReader can't know which behaviour is wanted.