msg164869
Author: lovelylain (lovelylain)
Date: 2012-07-07 15:18
This is an example where `for line in fp` raises UnicodeDecodeError:

#! -*- coding: utf-8 -*-
import codecs

text = u'\u6731' + u'\U0002a6a5' * 18
print repr(text)
with codecs.open('test.txt', 'wb', 'utf-16-le') as fp:
    fp.write(text)
with codecs.open('test.txt', 'rb', 'utf-16-le') as fp:
    print repr(fp.read())
with codecs.open('test.txt', 'rb', 'utf-16-le') as fp:
    for line in fp:
        print repr(line)

I read the code in codecs.py:

def read(self, size=-1, chars=-1, firstline=False):
    """ Decodes data from the stream self.stream and returns the
        resulting object.
        ...
        If firstline is true, and a UnicodeDecodeError happens
        after the first line terminator in the input only the first line
        will be returned, the rest of the input will be kept until the
        next call to read().
    """
    ...
    try:
        newchars, decodedbytes = self.decode(data, self.errors)
    except UnicodeDecodeError, exc:
        if firstline:
            newchars, decodedbytes = self.decode(data[:exc.start], self.errors)
            lines = newchars.splitlines(True)
            if len(lines) <= 1:
                raise
        else:
            raise
    ...

It seems that the firstline argument is not consistent with its doc
description. I don't know why this argument was added and why the line
count is checked. If it was added for the readline function to work around
some decode errors, the data read may contain no EOLs at all, so it raises
UnicodeDecodeError too. Maybe we should write code like the following to
support readline in codecs:

def read(self, size=-1, chars=-1, autotruncate=False):
    ...
    try:
        newchars, decodedbytes = self.decode(data, self.errors)
    except UnicodeDecodeError, exc:
        if autotruncate and exc.start:
            newchars, decodedbytes = self.decode(data[:exc.start], self.errors)
        else:
            raise
    ...
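[Editor's note: a hedged sketch of the byte arithmetic behind this failure, assuming the 72-byte default chunk size that StreamReader.readline() uses internally; with this file, the first chunk ends right after a high surrogate.]

import codecs

text = u'\u6731' + u'\U0002a6a5' * 18
data = text.encode('utf-16-le')    # 2 + 18 * 4 = 74 bytes
chunk = data[:72]                  # the first chunk read by readline()
try:
    codecs.utf_16_le_decode(chunk, 'strict')
except UnicodeDecodeError, exc:
    # The chunk ends right after a high surrogate; instead of returning
    # the complete characters and keeping the two trailing bytes for the
    # next chunk, the broken decoder raises.
    print exc.start, exc.end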
|
|
msg172372
Author: Marcus Gröber (Marcus.Gröber)
Date: 2012-10-08 11:19
I came across this today as well. A short way of summarizing this error
seems to be: reading a file using readline (or "for line in file") fails
if the following two conditions are true:

• A codec (e.g. UTF-8) for a multi-byte encoding is used, and
• the first line of the file is at least 73 bytes long, and contains a
  multi-byte sequence that starts before offset 72 and ends after offset 72.

At least for UTF-8 input files, it may be possible to work around this by
opening the input file without a codec, and then applying decode("utf-8")
to each line.
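[Editor's note: a minimal sketch of that workaround, assuming a UTF-8 file named test.txt; it is safe for UTF-8 because the newline byte never occurs inside a multi-byte sequence.]

with open('test.txt', 'rb') as fp:
    for raw in fp:
        line = raw.decode('utf-8')   # each complete line decodes on its own
        print repr(line)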
|
|
msg172391
Author: Serhiy Storchaka (serhiy.storchaka) *
Date: 2012-10-08 16:07
This error happens because the utf16* decoders do not properly partially
decode truncated data. An exception is raised if the input data is
truncated on the second surrogate of a surrogate pair. For example,
codecs.utf_16_le_decode(b'\x00\xd8\x00') should return ('', 0), but raises
UnicodeDecodeError.
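[Editor's note: a sketch of the behaviour being asked for, run through the incremental decoder on a Python where this is fixed; the bytes encode U+2A6A5 from the original report, split inside the surrogate pair.]

import codecs

dec = codecs.getincrementaldecoder('utf-16-le')()
part1 = dec.decode(b'\x69\xd8\xa5')  # cut inside the pair: buffered, no error
part2 = dec.decode(b'\xde')          # last byte completes U+2A6A5
assert part1 == u''
assert part1 + part2 == u'\U0002a6a5'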
|
|
msg172392
Author: Serhiy Storchaka (serhiy.storchaka) *
Date: 2012-10-08 16:10
Here are the patches.
|
|
msg172527
Author: STINNER Victor (vstinner) *
Date: 2012-10-09 21:11
This issue may be related to, or a duplicate of, #11461.

> For example codecs.utf_16_le_decode(b'\x00\xd8\x00') should return
> ('', 0), but raises UnicodeDecodeError.

Only an incremental decoder should return partial results. Other decoders
are strict and (usually) stateless.

$ ./python
>>> import codecs
>>> decoder = codecs.getdecoder('utf8')
>>> decoder('\u20ac'.encode('utf8'), 'strict')
('€', 3)
>>> decoder('\u20ac'.encode('utf8')[:2], 'strict')
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data
|
|
msg172529
Author: STINNER Victor (vstinner) *
Date: 2012-10-09 21:17
> with codecs.open('test.txt', 'wb', 'utf-16-le') as fp:

Since Python 2.6 you can use io.open(), which uses the new io library. The
io library uses TextIOWrapper, which uses incremental encoders and decoders
and so handles multibyte encodings such as UTF-16 correctly. Said
differently, this issue is already fixed in the io library.

It reminds me that I should propose my PEP 400 again :-)
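[Editor's note: for comparison, a sketch of the same round-trip through the io library (Python 2.6+; in Python 3 the built-in open() behaves the same way). It reads this file without error, though see the buffer-size caveat later in the thread.]

import io

text = u'\u6731' + u'\U0002a6a5' * 18
with io.open('test.txt', 'w', encoding='utf-16-le') as fp:
    fp.write(text)
with io.open('test.txt', 'r', encoding='utf-16-le') as fp:
    for line in fp:
        print repr(line)   # reads back without UnicodeDecodeError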
|
|
msg172530
Author: STINNER Victor (vstinner) *
Date: 2012-10-09 21:19
> This issue may be related to, or a duplicate of, #11461.

Hum no. The bug is an issue in the design of the codecs.Stream* classes:
incremental decoders and encoders should be used instead of the classic
decoders/encoders. I don't want to fix this issue: it's better to move to
the io library, for the reasons listed in PEP 400.
|
|
msg172532
Author: Serhiy Storchaka (serhiy.storchaka) *
Date: 2012-10-09 21:34
> This issue may be related to, or a duplicate of, #11461.

Oh, yes, it is a duplicate. I totally forgot about it and did the work
again.

> Only an incremental decoder should return partial results. Other decoders
> are strict and (usually) stateless.

Yes, and there is an incremental decoder here.

> >>> decoder('\u20ac'.encode('utf8')[:2], 'strict')
>
> UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1:
> unexpected end of data

>>> codecs.utf_8_decode('\u20ac'.encode('utf8')[:2])
('', 0)
|
|
msg172535
Author: STINNER Victor (vstinner) *
Date: 2012-10-09 21:39
> >>> codecs.utf_8_decode('\u20ac'.encode('utf8')[:2])
> ('', 0)

Oh... so the codecs.CODEC_decode functions are incremental decoders? I
completely misunderstood this.

"The bug is an issue in the design of the codecs.Stream* classes:
incremental decoders and encoders should be used instead of the classic
decoders/encoders."

Hum, I suppose the issue cannot be reproduced with TextIOWrapper just
because io.TextIOWrapper and codecs.StreamReader use different buffer
sizes.
|
|
msg172536
Author: Serhiy Storchaka (serhiy.storchaka) *
Date: 2012-10-09 21:43
> Hum no. The bug is an issue in the design of the codecs.Stream* classes:
> incremental decoders and encoders should be used instead of the classic
> decoders/encoders.

I don't understand you. StreamReader and IncrementalDecoder both use the
same decoder:

class IncrementalDecoder(codecs.BufferedIncrementalDecoder):
    _buffer_decode = codecs.utf_16_le_decode

class StreamReader(codecs.StreamReader):
    decode = codecs.utf_16_le_decode

> I don't want to fix this issue: it's better to move to the io library,
> for the reasons listed in PEP 400.

The bug is in the utf-16 decoder, not in codecs.StreamReader.
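[Editor's note: for context, roughly what codecs.BufferedIncrementalDecoder does with that _buffer_decode hook; this is a paraphrased sketch of the stdlib code, not a verbatim copy. It keeps unconsumed bytes in a buffer, so a decoder that returns partial results is all it needs.]

import codecs

class BufferedIncrementalDecoder(codecs.IncrementalDecoder):
    def __init__(self, errors='strict'):
        codecs.IncrementalDecoder.__init__(self, errors)
        self.buffer = b""

    def decode(self, input, final=False):
        data = self.buffer + input
        # _buffer_decode returns (decoded_text, number_of_bytes_consumed)
        result, consumed = self._buffer_decode(data, self.errors, final)
        # keep the bytes that could not be decoded yet for the next call
        self.buffer = data[consumed:]
        return result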
|
|
msg172537
Author: STINNER Victor (vstinner) *
Date: 2012-10-09 21:46
> I don't understand you.

Read my last message; I was wrong.
|
|
msg172558
Author: Walter Dörwald (doerwalter) *
Date: 2012-10-10 09:49
> >>> codecs.utf_8_decode('\u20ac'.encode('utf8')[:2])
> ('', 0)
>
> Oh... so the codecs.CODEC_decode functions are incremental decoders? I
> completely misunderstood this.

No, those functions are not decoders; they're just helper functions used to
implement the real incremental decoders. That's why they're undocumented.

Whether codecs.utf_8_decode() returns partial results or raises an
exception depends on the final argument::

>>> s = '\u20ac'.encode('utf8')[:2]
>>> codecs.utf_8_decode(s, 'strict')
('', 0)
>>> codecs.utf_8_decode(s, 'strict', False)
('', 0)
>>> codecs.utf_8_decode(s, 'strict', True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data

If you look at encodings/utf_8.py you see that the stateless decoder calls
codecs.utf_8_decode() with final==True::

def decode(input, errors='strict'):
    return codecs.utf_8_decode(input, errors, True)

so the stateless decoder *will* raise exceptions for partial results. The
incremental decoder simply passes on the final argument given to its
decode() method.
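[Editor's note: a small sketch of how that final handling plays out through the public incremental decoder (Python 3 syntax, matching the session above).]

import codecs

dec = codecs.getincrementaldecoder('utf-8')()
s = '\u20ac'.encode('utf8')
print(repr(dec.decode(s[:2])))              # '' - incomplete input is buffered
print(repr(dec.decode(s[2:], final=True)))  # '€' - the sequence is complete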
|
|