msg136194 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2011-05-17 23:10 |
Stateful CJK codecs reset the codec at each call to encode() producing a valid but overlong output: >>> import codecs >>> encoder = codecs.getincrementalencoder('hz')() >>> encoder.encode('\u804a') + encoder.encode('\u804a') b'~{AD~}~{AD~}' >>> '\u804a\u804a'.encode('hz') b'~{ADAD~}' Multibyte encodings: HZ and all encodings of the ISO 2022 family (e.g. iso-2022-jp). Attached patch fixes this issue. I don't like how I added the tests, these tests may be moved somewhere else, but HZ codec doesn't have tests today (I opened issue #12057 for that), and ISO 2022 codecs don't have specific tests (test_multibytecodec is "Unit test for multibytecodec itself"). We should maybe also add tests specific to ISO 2022 first? I hesitate to reset the codec on .encode(text, final=True), but UTF-8-SIG or UTF-16 don't reset the codec if final=True. io.TextIOWrapper only calls encoder.reset() on file.seek(0). On a seek to another position, it calls encoder.setstate(0). See also issues #12016 and #12057. |
|
|
msg136534 - (view) |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2011-05-22 15:02 |
I think it's better to use a StringIO instance for the tests. Regarding resetting the incremental codec every time .encode() is called: Hye-Shik will have to comment. Perhaps there's an internal reason why they do this. |
|
|
msg136727 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2011-05-24 09:55 |
> I think it's better to use a StringIO instance for the tests. For which test excatly? An encoder produces bytes, I don't the relation with StringIO. |
|
|
msg136730 - (view) |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2011-05-24 10:17 |
STINNER Victor wrote: > > STINNER Victor <victor.stinner@haypocalc.com> added the comment: > >> I think it's better to use a StringIO instance for the tests. > > For which test excatly? An encoder produces bytes, I don't the relation with StringIO. Sorry, BytesIO in Python3-speak. In Python2 you'd use StringIO. |
|
|
msg136777 - (view) |
Author: Martin (gz) * |
Date: 2011-05-24 18:13 |
Does Victor Stinner have a psychic link with Armin Rigo? :) https://bitbucket.org/pypy/pypy/src/7f593e7877d4/pypy/module/_multibytecodec/app_multibytecodec.py """ # My theory is that they are not widely used on CPython either, because # I found two bugs just by looking at their .c source: they always call # encreset() after a piece of data, even though I think it's wrong --- # it should be called only once at the end; and mbiencoder_reset() calls # decreset() instead of encreset(). """ The answer to Armin's theory is that they're bugs but not ones users are likely to notice? |
|
|
msg136780 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2011-05-24 19:04 |
Le mardi 24 mai 2011 à 18:13 +0000, Martin a écrit : > Martin <gzlist@googlemail.com> added the comment: > > Does Victor Stinner have a psychic link with Armin Rigo? :) > > https://bitbucket.org/pypy/pypy/src/7f593e7877d4/pypy/module/_multibytecodec/app_multibytecodec.py > > """ > # My theory is that they are not widely used on CPython either, because > # I found two bugs just by looking at their .c source Sorry, I only found one bug, and while testing HZ, not while reading the source code. > ... and mbiencoder_reset() calls decreset() instead of encreset() This is a new bug that you should be fixed. Armin did not reported the bug upstream (in this bug tracker)? > The answer to Armin's theory is that they're bugs but not ones users are likely to notice? Ok, I will apply my fix. |
|
|
msg136788 - (view) |
Author: Armin Rigo (arigo) *  |
Date: 2011-05-24 20:08 |
Hi :-) I did not report the two issues I found so far because I didn't finish the PyPy implementation of CJK yet, and I'm very new to anything related to codecs; additionally I didn't check Python 3.x, as I was just following the 2.7 sources. Can someone confirm that the two bugs I suspect are really bugs? And should I open another report to help tracking the 2nd bug? |
|
|
msg136791 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2011-05-24 20:24 |
New changeset bd17396895fb by Victor Stinner in branch '3.1': Issue #12100: Don't reset incremental encoders of CJK codecs at each call to http://hg.python.org/cpython/rev/bd17396895fb New changeset 7f2ab2f95a04 by Victor Stinner in branch '3.2': (Merge 3.1) Issue #12100: Don't reset incremental encoders of CJK codecs at http://hg.python.org/cpython/rev/7f2ab2f95a04 New changeset cb9867dab15e by Victor Stinner in branch 'default': (Merge 3.2) Issue #12100: Don't reset incremental encoders of CJK codecs at http://hg.python.org/cpython/rev/cb9867dab15e |
|
|
msg136792 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2011-05-24 20:29 |
New changeset e789b4cda872 by Victor Stinner in branch '2.7': Issue #12100: Don't reset incremental encoders of CJK codecs at each call to http://hg.python.org/cpython/rev/e789b4cda872 |
|
|
msg136910 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2011-05-25 22:01 |
The initial problem (reset() at each call to .encode()) is fixed in Python 2.7, 3.1, 3.2 and 3.3. I opened a new issue, #12171, for the second problem noticed by Armin (decreset vs encreset). |
|
|