The test at Lib/test/test_multibytecodec.py:178 checks for len('\U00012345') == 2, and with PEP 393 this is always False. I ran the tests with a few changes and they seem to work, but the code doesn't raise any exception on c.reset():

---->8-------->8-------->8-------->8----
import io, codecs
s = io.BytesIO()
c = codecs.getwriter('gb18030')(s)
c.write('123'); s.getvalue()
c.write('\U00012345'); s.getvalue()
c.write('\U00012345' + '\uac00\u00ac'); s.getvalue()
c.write('\uac00'); s.getvalue()
c.reset()
s.getvalue()
---->8-------->8-------->8-------->8----

Result:

>>> import io, codecs
>>> s = io.BytesIO()
>>> c = codecs.getwriter('gb18030')(s)
>>> c.write('123'); s.getvalue()
b'123'
>>> c.write('\U00012345'); s.getvalue()
b'123\x907\x959'
>>> # '\U00012345'[0] is the same as '\U00012345' now
>>> c.write('\U00012345' + '\uac00\u00ac'); s.getvalue()
b'123\x907\x959\x907\x959\x827\xcf5\x810\x851'
>>> c.write('\uac00'); s.getvalue()
b'123\x907\x959\x907\x959\x827\xcf5\x810\x851\x827\xcf5'
>>> c.reset()  # is this supposed to raise an error?
>>> s.getvalue()
b'123\x907\x959\x907\x959\x827\xcf5\x810\x851\x827\xcf5'

Victor suggested waiting until multibytecodec gets ported to the new API before fixing this.
Victor, do you know if multibytecodec has been ported to the new API yet? If I remove the "if", I still get a failure:

test test_multibytecodec failed -- Traceback (most recent call last):
  File "/home/wolf/dev/py/py3k/Lib/test/test_multibytecodec.py", line 187, in test_gb18030
    self.assertEqual(s.getvalue(), b'123\x907\x959')
AssertionError: b'123\x907\x959\x907\x959' != b'123\x907\x959'
I think these tests make no sense after PEP 393. They test that StreamWriter works when a non-BMP character is split inside its surrogate pair, i.e. that c.write(s[:i]); c.write(s[i:]) is always the same as c.write(s), even if i splits s inside a surrogate pair. This case is impossible after PEP 393, because a non-BMP character is now a single code point and cannot be sliced in half.
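For illustration, here is a minimal sketch of the property described above, assuming a PEP 393 build (Python 3.3 or later). The helper name encode_in_pieces is made up for this example and is not part of the test suite; the expected bytes are the ones shown in the session earlier in this issue.

```python
import codecs
import io


def encode_in_pieces(text, split_at, encoding="gb18030"):
    """Hypothetical helper: write text[:split_at] and text[split_at:]
    through one StreamWriter and return everything that was written."""
    stream = io.BytesIO()
    writer = codecs.getwriter(encoding)(stream)
    writer.write(text[:split_at])
    writer.write(text[split_at:])
    writer.reset()
    return stream.getvalue()


s = '\U00012345' + '\uac00\u00ac'

# With PEP 393 a non-BMP character is one code point, so no slice
# index can fall inside a surrogate pair.
assert len('\U00012345') == 1
assert len(s) == 3

# Every interior split produces the same bytes as encoding s at once.
whole = s.encode('gb18030')
for i in range(1, len(s)):
    assert encode_in_pieces(s, i) == whole
```

Since every valid index falls between whole characters, the split-write case collapses into the ordinary write case, which is why the surrogate-specific assertions have nothing left to check.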
CJK decoders have used the new Unicode API since changeset bcecf3910162.

"I think these tests make no sense after PEP 393. They test that StreamWriter works when a non-BMP character is split inside its surrogate pair, i.e. that c.write(s[:i]); c.write(s[i:]) is always the same as c.write(s), even if i splits s inside a surrogate pair. This case is impossible after PEP 393."

I re-enabled the tests, but simplified them to remove the parts related to surrogate pairs. The tests are shorter than before, but that is better than no test at all.

Can I close the issue, or does someone want to improve these tests?
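As a rough idea of what a simplified check without the surrogate-pair parts could look like, here is a hypothetical sketch. It is not the committed test: the class name, the test name and the run_sketch helper are invented for this example, and the expected bytes are taken from the session output earlier in this issue.

```python
import codecs
import io
import unittest


class SimplifiedGB18030Sketch(unittest.TestCase):
    # Hypothetical sketch, not the test from the changeset.

    def test_streamwriter_accumulates_output(self):
        s = io.BytesIO()
        c = codecs.getwriter('gb18030')(s)

        c.write('123')
        self.assertEqual(s.getvalue(), b'123')

        # A non-BMP character is written in one piece.
        c.write('\U00012345')
        self.assertEqual(s.getvalue(), b'123\x907\x959')

        c.write('\uac00\u00ac')
        self.assertEqual(s.getvalue(),
                         b'123\x907\x959\x827\xcf5\x810\x851')

        # reset() does not raise and adds no bytes here.
        c.reset()
        self.assertEqual(s.getvalue(),
                         b'123\x907\x959\x827\xcf5\x810\x851')


def run_sketch():
    """Run the sketch and return the unittest result object."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(SimplifiedGB18030Sketch)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling run_sketch().wasSuccessful() reports whether the assertions hold on the running interpreter.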