Message 109155 - Python tracker

I've found a subtle corner case with 3- and 4-byte-long sequences. For example, according to http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf (pages 94-95, table 3.7), the sequences in the range \xe0\x80\x80-\xe0\x9f\xbf are invalid. That is, if the first byte is \xe0 and the second byte is between \x80 (included) and \xa0 (excluded), then the second byte is invalid. This is because sequences below \xe0\xa0\x80 would decode to codepoints below U+0800, and those codepoints are already represented by two-byte-long sequences (\xdf\xbf decodes to U+07FF), so such three-byte forms would be overlong encodings.
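The range restriction can be checked from the encoder side (a quick sketch, not code from this tracker):

```python
# U+07FF is the largest codepoint that fits in a two-byte sequence:
assert '\u07ff'.encode('utf-8') == b'\xdf\xbf'

# U+0800 is the first codepoint that needs three bytes, and its encoding
# starts with \xe0\xa0 -- so second bytes \x80-\x9f after \xe0 could only
# produce overlong encodings of codepoints below U+0800:
assert '\u0800'.encode('utf-8') == b'\xe0\xa0\x80'

# Consequently a strict decoder rejects e.g. \xe0\x80\x80:
try:
    b'\xe0\x80\x80'.decode('utf-8')
except UnicodeDecodeError:
    pass  # expected: the sequence is ill-formed
```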

Assume that we want to decode the string b'\xe0\x61\x80\x61' (where \xe0 is the start byte of a 3-byte-long sequence, \x61 is the letter 'a', and \x80 is a valid continuation byte). This actually results in:

>>> b'\xe0\x61\x80\x61'.decode('utf-8', 'replace')
'�a�a'

Here \x61 is not a valid continuation byte, so \xe0 is replaced with a single �; the stray \x80 later in the string is likewise replaced, and both \x61 bytes decode to 'a'.
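This first example can be reproduced programmatically; comparing against the escaped form '\ufffd' (the replacement character) avoids any font issues with �:

```python
result = b'\xe0\x61\x80\x61'.decode('utf-8', 'replace')
# \xe0 (a start byte with no valid continuation) and the stray \x80
# each become U+FFFD; both \x61 bytes decode to 'a':
assert result == '\ufffda\ufffda'
```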

Now, assume that we want to do the same with b'\xe0\x80\x81\x61'. This actually results in:

>>> b'\xe0\x80\x81\x61'.decode('utf-8', 'replace')
'��a'

In this case \x80 would be a valid continuation byte in general, but since it's preceded by \xe0 it's not valid here. Since it's not valid, the result should arguably be similar to the previous case, i.e. one � per invalid byte.
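For comparison (an aside, not part of the original report): on recent Python 3 versions, after the fix discussed in this issue, the same call replaces each of the three invalid bytes separately, following the "maximal subpart" approach:

```python
result = b'\xe0\x80\x81\x61'.decode('utf-8', 'replace')
# \xe0 is rejected at the second byte, then \x80 and \x81 are each
# reported as separate errors, giving one U+FFFD per invalid byte:
assert result == '\ufffd\ufffd\ufffda'
```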

This happens because the current algorithm just checks that the second byte (\x80) is in the range \x80-\xbf (i.e. that it is a continuation byte); if it is, it assumes that the invalid byte is the third one (\x81) and replaces the first two bytes (\xe0\x80) with a single �.

That said, the algorithm could be improved to identify the invalid byte more accurately (and that could also be used to give a better error message about decoding surrogates). This shouldn't affect the speed of regular decoding, because the extra check would only run in case of error. Also note that the Unicode standard doesn't seem to mention this case, and that in any case this doesn't "eat" any of the following characters as it did before the patch -- the only difference would be in the number of �.
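A sketch of what such a check could look like (a hypothetical helper in Python, not the actual CPython C code; the ranges come from table 3.7 of the Unicode standard cited above):

```python
def valid_second_byte(lead: int, b2: int) -> bool:
    """Return True if b2 is a valid second byte after the lead byte,
    per the restricted ranges in Unicode 5.2, table 3.7 (a sketch)."""
    if lead == 0xE0:
        return 0xA0 <= b2 <= 0xBF   # rejects overlong 3-byte sequences
    if lead == 0xED:
        return 0x80 <= b2 <= 0x9F   # rejects surrogates U+D800-U+DFFF
    if lead == 0xF0:
        return 0x90 <= b2 <= 0xBF   # rejects overlong 4-byte sequences
    if lead == 0xF4:
        return 0x80 <= b2 <= 0x8F   # rejects codepoints above U+10FFFF
    if 0xC2 <= lead <= 0xF4:
        return 0x80 <= b2 <= 0xBF   # ordinary continuation byte
    return False                    # not a valid multi-byte lead at all

# With this check, \xe0\x80 is flagged at the second byte, so only \xe0
# is replaced; \x80 and \x81 then surface as separate errors.
```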