Message 109155 - Python tracker

I've found a subtle corner case with 3- and 4-byte-long sequences. For example, according to http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf (pages 94-95, table 3.7), the sequences in the range \xe0\x80\x80-\xe0\x9f\xbf are invalid. That is, if the first byte is \xe0 and the second byte is between \x80 (included) and \xa0 (excluded), then the second byte is invalid. This is because sequences below \xe0\xa0\x80 would decode to codepoints below U+0800, and those codepoints are already represented by two-byte-long sequences (\xdf\xbf decodes to U+07FF), so such three-byte forms would be overlong encodings.
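The range restriction can be checked from the encoder side (a quick sketch, not code from this tracker):

```python
# U+07FF is the largest codepoint that fits in a two-byte sequence:
assert '\u07ff'.encode('utf-8') == b'\xdf\xbf'

# U+0800 is the first codepoint that needs three bytes, and its encoding
# starts with \xe0\xa0 -- so second bytes \x80-\x9f after \xe0 could only
# produce overlong encodings of codepoints below U+0800:
assert '\u0800'.encode('utf-8') == b'\xe0\xa0\x80'

# Consequently a strict decoder rejects e.g. \xe0\x80\x80:
try:
    b'\xe0\x80\x80'.decode('utf-8')
except UnicodeDecodeError:
    pass  # expected: the sequence is ill-formed
```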

Assume that we want to decode the string b'\xe0\x61\x80\x61' (where \xe0 is the start byte of a 3-byte-long sequence, \x61 is the letter 'a', and \x80 is a valid continuation byte). This actually results in:

>>> b'\xe0\x61\x80\x61'.decode('utf-8', 'replace')
'�a�a'

Here \x61 is not a valid continuation byte, so \xe0 is replaced with a single �; the stray \x80 later in the string is likewise replaced, and both \x61 bytes decode to 'a'.
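This first example can be reproduced programmatically; comparing against the escaped form '\ufffd' (the replacement character) avoids any font issues with �:

```python
result = b'\xe0\x61\x80\x61'.decode('utf-8', 'replace')
# \xe0 (a start byte with no valid continuation) and the stray \x80
# each become U+FFFD; both \x61 bytes decode to 'a':
assert result == '\ufffda\ufffda'
```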

Now, assume that we want to do the same with b'\xe0\x80\x81\x61'. This actually results in:

>>> b'\xe0\x80\x81\x61'.decode('utf-8', 'replace')
'��a'

In this case \x80 would be a valid continuation byte in general, but since it's preceded by \xe0 it's not valid here. Since it's not valid, the result should arguably be similar to the previous case, i.e. one � per invalid byte.
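For comparison (an aside, not part of the original report): on recent Python 3 versions, after the fix discussed in this issue, the same call replaces each of the three invalid bytes separately, following the "maximal subpart" approach:

```python
result = b'\xe0\x80\x81\x61'.decode('utf-8', 'replace')
# \xe0 is rejected at the second byte, then \x80 and \x81 are each
# reported as separate errors, giving one U+FFFD per invalid byte:
assert result == '\ufffd\ufffd\ufffda'
```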

This happens because the current algorithm just checks that the second byte (\x80) is in the range \x80-\xbf (i.e. that it is a continuation byte); if it is, it assumes that the invalid byte is the third one (\x81) and replaces the first two bytes (\xe0\x80) with a single �.

That said, the algorithm could be improved to identify the invalid byte more accurately (and that could also be used to give a better error message about decoding surrogates). This shouldn't affect the speed of regular decoding, because the extra check would only run in case of error. Also note that the Unicode standard doesn't seem to mention this case, and that in any case this doesn't "eat" any of the following characters as it did before the patch -- the only difference would be in the number of �.
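A sketch of what such a check could look like (a hypothetical helper in Python, not the actual CPython C code; the ranges come from table 3.7 of the Unicode standard cited above):

```python
def valid_second_byte(lead: int, b2: int) -> bool:
    """Return True if b2 is a valid second byte after the lead byte,
    per the restricted ranges in Unicode 5.2, table 3.7 (a sketch)."""
    if lead == 0xE0:
        return 0xA0 <= b2 <= 0xBF   # rejects overlong 3-byte sequences
    if lead == 0xED:
        return 0x80 <= b2 <= 0x9F   # rejects surrogates U+D800-U+DFFF
    if lead == 0xF0:
        return 0x90 <= b2 <= 0xBF   # rejects overlong 4-byte sequences
    if lead == 0xF4:
        return 0x80 <= b2 <= 0x8F   # rejects codepoints above U+10FFFF
    if 0xC2 <= lead <= 0xF4:
        return 0x80 <= b2 <= 0xBF   # ordinary continuation byte
    return False                    # not a valid multi-byte lead at all

# With this check, \xe0\x80 is flagged at the second byte, so only \xe0
# is replaced; \x80 and \x81 then surface as separate errors.
```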