Message 107074 - Python tracker

I added a test for the 'ignore' error handler. I will commit the patch before the RC unless someone has something against it.

To summarize, the patch updates PyUnicode_DecodeUTF8 from RFC 2279 to RFC 3629, so:

  1. Invalid sequences are now handled as described in http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf (pages 94-95);
  2. 5- and 6-byte-long sequences are now invalid (no changes in behavior, I just removed the "default:" of the switch/case and marked them with '0' in the first table);
  3. According to RFC 3629, code points in the surrogate range (U+D800-U+DFFF) should be considered invalid, but this would not be backward compatible, so I added code and tests but left them commented out;
  4. I changed the error message "unexpected code byte" to "invalid start byte" and "invalid data" to "invalid continuation byte";
  5. I added an extensive set of tests in test_unicode;
  6. I fixed test_codeccallbacks because it was failing after this change.
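
As a rough illustration of points 1, 2 and 4, here is a minimal sketch of how the behavior can be observed from Python code. It assumes a current CPython 3 interpreter with this patch in place; the helper name decode_reason is made up for the example and is not part of the patch or of test_unicode.

```python
def decode_reason(data):
    """Return the UnicodeDecodeError reason for data, or None if it decodes.

    Hypothetical helper for illustration only; not part of the patch.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.reason
    return None


# Point 4: a lone continuation byte cannot start a sequence.
print(decode_reason(b"\x80"))

# Point 4: 0xC3 opens a 2-byte sequence, but 0x28 is not a continuation byte.
print(decode_reason(b"\xc3\x28"))

# Point 2: 0xF8 used to open a 5-byte sequence in RFC 2279; it is rejected.
print(decode_reason(b"\xf8\x88\x80\x80\x80"))

# Point 1: a truncated 4-byte sequence (F1 80 80) is treated as one error,
# so 'replace' emits a single U+FFFD and 'ignore' drops it entirely.
print(ascii(b"a\xf1\x80\x80b".decode("utf-8", "replace")))
print(ascii(b"a\xf1\x80\x80b".decode("utf-8", "ignore")))
```

Surrogates (point 3) are left out of the sketch on purpose, since whether they are rejected depends on the Python version.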