Message 102061 - Python tracker (original) (raw)
@lemburg: "failing byte" seems rather obvious: first byte that you meet that is not valid in the current state. I don't understand your explanation, especially "does not have the high bit set". I think you mean "is a valid starter byte". See example 3 below.
Example 1: F1 80 41 42 43. F1 implies a 4-byte character. 80 is OK. 41 is not in 80-BF. It is the "failing byte"; high bit not set. Required action is to emit FFFD then resync on the 41, causing 0041 0042 0043 to be emitted. Total output: FFFD 0041 0042 0043. Current code emits FFFD 0043.
Example 2: F1 80 FF 42 43. F1 implies a 4-byte character. 80 is OK. FF is not in 80-BF. It is the "failing byte". Required action is to emit FFFD then resync on the FF. FF is not a valid starter byte, so emit FFFD, and resync on the 42, causing 0042 0043 to be emitted. Total output: FFFD FFFD 0042 0043. Current code emits FFFD 0043.
Example 3: F1 80 C2 81 43. F1 implies a 4-byte character. 80 is OK. C2 is not in 80-BF. It is the "failing byte". Required action is to emit FFFD then resync on the C2. C2 and 81 have the high bit set, but C2 is a valid starter byte, and remaining bytes are OK, causing 0081 0043 to be emitted. Total output: FFFD 0081 0043. Current code emits FFFD 0043.