Issue 20574: Implement incremental decoder for cp65001

Created on 2014-02-09 13:18 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description
incremental_cp_utf8.patch vstinner, 2014-02-09 13:18
Messages (9)
msg210759 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-02-09 13:18
(Follow-up to issues #20538 and #20571.)

The attached patch implements incremental decoders for multibyte code pages (on Windows), especially for CP_UTF8, aka "cp65001" in Python.

Code pages 932, 936, 949, 950 and 1361 have had an incremental decoder since:

---
changeset: 38817:549c547700af
branch: legacy-trunk
user: Martin v. Löwis <martin@v.loewis.de>
date: Wed Jun 14 05:21:04 2006 +0000
files: Doc/api/concrete.tex Include/unicodeobject.h Lib/encodings/mbcs.py Misc/NEWS Modules/_codecsmodule.c Objects/unicodeobject.c
description: Patch #1455898: Incremental mode for "mbcs" codec.
---

Python currently uses IsDBCSLeadByteEx(): http://msdn.microsoft.com/en-us/library/windows/desktop/dd318667%28v=vs.85%29.aspx
and CharPrevA(): http://msdn.microsoft.com/en-us/library/windows/desktop/ms647471%28v=vs.85%29.aspx

But IsDBCSLeadByteEx() only supports code pages 932, 936, 949, 950 and 1361.

Python has supported code page 65001 (codec "cp65001") since Python 3.3. New tests on incremental decoders were added in Python 3.4; I added a skip for cp65001 since it was not supported (#20571). This issue implements the incremental decoder and so removes the skip.

I prefer to wait for Python 3.5 rather than rush this new feature in after 3.4 beta 3. cp65001 is mostly used for output (sys.stdout/sys.stderr) on Windows, not for input.
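For readers unfamiliar with incremental decoding, the following is a minimal sketch of the behaviour the patch is meant to provide. It is not the patch itself: it uses the portable "utf-8" codec as a stand-in for cp65001 (an assumption, since cp65001 is Windows-only), and the helper name decode_chunks is hypothetical. It shows a decoder holding back an incomplete multibyte sequence until the remaining bytes arrive.

```python
import codecs


def decode_chunks(chunks, encoding="utf-8"):
    """Decode byte chunks that may split a multibyte sequence.

    `encoding` defaults to "utf-8" as a portable stand-in for cp65001
    (assumption: cp65001 is only available on Windows).
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    # Each call returns only the characters that are complete so far;
    # trailing partial sequences are buffered inside the decoder.
    pieces = [decoder.decode(chunk) for chunk in chunks]
    # final=True flushes the decoder and raises if bytes are left over.
    pieces.append(decoder.decode(b"", final=True))
    return pieces


# U+20AC (euro sign) is b"\xe2\x82\xac" in UTF-8; split it across two reads.
result = decode_chunks([b"\xe2\x82", b"\xac"])
```

The first chunk produces an empty string because the sequence is incomplete; the character only appears once the last byte has been fed in. This is the situation a text stream reading input in fixed-size blocks runs into.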
msg210764 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-02-09 13:50
Nice. Could you please also add test_partial for CP65001 (if this makes sense)? What is the performance regression of this patch? I consider this issue a bug, and if the performance regression is not too big, I think the patch can be applied to 3.3+. Otherwise a warning should be added that CP65001 does not work with input text streams.
msg210783 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-02-09 19:47
It might be faster, or (more likely) have zero impact on performance.
msg213905 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2014-03-17 22:12
New changeset 08f9b881f78c by Victor Stinner in branch 'default': Issue #20574: Implement incremental decoder for cp65001 code http://hg.python.org/cpython/rev/08f9b881f78c
msg213906 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2014-03-17 22:17
New changeset 85b87789f048 by Victor Stinner in branch 'default': Issue #20574: Add more tests for cp65001 http://hg.python.org/cpython/rev/85b87789f048
msg213907 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-03-17 22:27
> Could you please also add test_partial for CP65001 (if this will make sense)?

I added CP65001Test, which inherits from UTF8Test and so runs all UTF-8 tests on the cp65001 codec. I'm surprised that the tests pass.
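The inheritance approach described above can be sketched as follows. This is a simplified illustration, not the code in test_codecs.py: the class names, the test body, and the use of the "utf8" alias in place of cp65001 are all assumptions made so the example runs on any platform.

```python
import unittest


class UTF8RoundTripTest(unittest.TestCase):
    # Hypothetical base class: subclasses override `encoding` to rerun
    # the same test methods against another codec.
    encoding = "utf-8"

    def test_roundtrip(self):
        text = "abc\u20ac\U0001f600"
        encoded = text.encode(self.encoding)
        self.assertEqual(encoded.decode(self.encoding), text)


class AliasRoundTripTest(UTF8RoundTripTest):
    # Stand-in for a cp65001 subclass (assumption: "utf8" is used here
    # because cp65001 is only available on Windows).
    encoding = "utf8"


def run_case(case):
    """Run every test method of `case` and return the TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    result = unittest.TestResult()
    suite.run(result)
    return result


outcome = run_case(AliasRoundTripTest)
```

The subclass defines no test methods of its own; it inherits all of them and only changes the codec under test, so any behavioural difference between the two codecs shows up as a failure in the subclass.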
msg213908 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-03-17 22:28
I don't feel the need to backport the new feature, so I'm closing the issue.
msg213923 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2014-03-18 00:40
New changeset f6794a0fb2b3 by Victor Stinner in branch 'default': Issue #20574: Remove duplicated test failing on Windows XP http://hg.python.org/cpython/rev/f6794a0fb2b3
msg213926 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-03-18 00:51
I removed the test because there were two classes testing the same codec and the tests were failing. I need to refactor the tests, so I am reopening the issue.

http://buildbot.python.org/all/builders/x86%20XP-4%203.x/builds/10291/steps/test/logs/stdio

======================================================================
FAIL: test_lone_surrogates (test.test_codecs.CP65001Test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\test\test_codecs.py", line 773, in test_lone_surrogates
    super().test_lone_surrogates()
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\test\test_codecs.py", line 349, in test_lone_surrogates
    self.assertRaises(UnicodeEncodeError, "\ud800".encode, self.encoding)
AssertionError: UnicodeEncodeError not raised by encode
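The failing assertion expects a codec to refuse a lone surrogate when encoding. A minimal sketch of that check, using the portable "utf-8" codec rather than cp65001 (an assumption, as cp65001 is Windows-only; the helper name rejects_lone_surrogate is hypothetical):

```python
def rejects_lone_surrogate(encoding):
    """Return True if `encoding` refuses to encode the lone surrogate U+D800."""
    try:
        "\ud800".encode(encoding)
    except UnicodeEncodeError:
        return True
    return False


# The strict UTF-8 codec rejects the lone surrogate.
strict_utf8 = rejects_lone_surrogate("utf-8")

# The "surrogatepass" error handler explicitly opts in to encoding it.
passed_through = "\ud800".encode("utf-8", "surrogatepass")
```

In the log above, the cp65001 codec on the Windows XP buildbot did not raise for this input, which is why the inherited test failed.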
History
Date User Action Args
2022-04-11 14:57:58 admin set github: 64773
2015-03-18 13:22:37 vstinner set status: open -> closed; resolution: fixed
2014-03-18 00:51:37 vstinner set status: closed -> open; resolution: fixed -> (no value); messages: +
2014-03-18 00:40:31 python-dev set messages: +
2014-03-17 22:28:26 vstinner set status: open -> closed; resolution: fixed; messages: +
2014-03-17 22:27:50 vstinner set messages: +
2014-03-17 22:17:43 python-dev set messages: +
2014-03-17 22:12:28 python-dev set nosy: + python-dev; messages: +
2014-02-09 19:47:43 vstinner set messages: +
2014-02-09 13:50:56 serhiy.storchaka set messages: +
2014-02-09 13:18:25 vstinner create