Issue 15379: Charmap decoding of non-BMP characters

Created on 2012-07-17 08:15 by serhiy.storchaka, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name | Uploaded | Description
decode_charmap_maxchar-3.3_2.patch | serhiy.storchaka, 2012-09-21 20:14 | Patch for 3.3
decode_charmap_maxchar-3.2_2.patch | serhiy.storchaka, 2012-09-21 20:14 | Patch for 3.2
decode_charmap_maxchar-2.7.patch | serhiy.storchaka, 2012-10-02 16:39 | Patch for 2.7
Messages (16)
msg165688 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-07-17 08:15
Yet another inconsistency in the charmap codec.

>>> import codecs
>>> codecs.charmap_decode(b'\x00', 'strict', '\U0002000B')
('𠀋', 1)
>>> codecs.charmap_decode(b'\x00', 'strict', {0: '\U0002000B'})
('𠀋', 1)
>>> codecs.charmap_decode(b'\x00', 'strict', {0: 0x2000B})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: character mapping must be in range(65536)

The suggested patch removes this unnecessary limitation in the charmap decoder.
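For readers following along, here is a minimal sketch of how the three mapping forms from the report can be compared side by side. It assumes an interpreter that already contains the fix discussed in this issue (for example Python 3.3 or later); the helper name `decode_zero_byte` is made up for illustration and is not part of the `codecs` module.

```python
import codecs

# U+2000B lies outside the Basic Multilingual Plane (code points above U+FFFF).
NON_BMP = 0x2000B


def decode_zero_byte(mapping):
    # Hypothetical helper: decode the single byte 0x00 through `mapping`.
    return codecs.charmap_decode(b'\x00', 'strict', mapping)


# The three mapping forms from the report: a string table, an int -> str
# dict, and an int -> int dict. With the fix applied they should all agree.
results = [
    decode_zero_byte(chr(NON_BMP)),
    decode_zero_byte({0: chr(NON_BMP)}),
    decode_zero_byte({0: NON_BMP}),
]
assert all(result == (chr(NON_BMP), 1) for result in results)

# Integers past the last valid code point (0x10FFFF) are still rejected.
try:
    decode_zero_byte({0: 0x110000})
except TypeError as exc:
    print('rejected:', exc)
```

On an interpreter without the fix, the third call is the one expected to raise the TypeError shown in the report.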
msg165690 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-07-17 08:54
Could you add a test to your patch? Is the issue 3.3-specific?
msg165710 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-07-17 11:36
Fixing this for 3.2 and earlier is possible, but expensive, because of the narrow build limitation. If necessary, I will provide a patch, but it is easier to mark it as "won't fix" for pre-3.3 versions. Here are tests for charmap decoding. The tests cover not only this issue, but all previously uncovered cases with int2str and int2int mappings.
msg165753 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2012-07-18 11:02
In 3.2, the narrow build is also broken when the "charmap" is a string:

>>> codecs.charmap_decode(b'\0', 'strict', '\U0002000B')

returns ('𠀋', 1) with a wide unicode build, but ('\ud840', 1) with a narrow build.
3.2 could be fixed to allow characters up to sys.maxunicode, though.
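The '\ud840' in the narrow-build result is the high half of the UTF-16 surrogate pair for U+2000B, which suggests the decoder kept only the first code unit. A small sketch of the standard UTF-16 surrogate arithmetic shows where that value comes from; the function name `utf16_surrogates` is hypothetical and does not appear in the patches.

```python
def utf16_surrogates(code_point):
    """Return the (high, low) UTF-16 surrogate pair for a non-BMP code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('code point is not in a supplementary plane')
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the 20-bit offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the 20-bit offset
    return high, low


high, low = utf16_surrogates(0x2000B)
print(hex(high), hex(low))  # 0xd840 0xdc0b

# Cross-check against Python's own UTF-16 encoder.
encoded = '\U0002000B'.encode('utf-16-be')
assert encoded == high.to_bytes(2, 'big') + low.to_bytes(2, 'big')
```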
msg165786 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-07-18 15:48
Well, here is a patch for 3.2.
msg165796 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2012-07-18 20:26
About the patch for 3.2: "needed = 6 - extrachars" Where does this 6 come from? There is another part which uses this "extrachars". Why is the allocation strategy different here?
msg165798 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-07-18 20:33
It's the same strategy. "needed = (targetsize - extrachars) + (targetsize << 2)". targetsize == 2.
msg165801 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2012-07-18 21:07
Ah, I was worried by the possible quadratic behavior. So the other (existing) case is quadratic as well (I was misled by the <<, which made me think there is something clever there). That's good enough for 3.2, I guess.
msg170567 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-09-16 18:43
Ping.
msg170913 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-09-21 20:14
Patches updated: added a few new tests, used MAX_UNICODE, and slightly changed the extrachars growth step.
msg171069 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-09-23 18:01
New changeset 620d23f7ad41 by Antoine Pitrou in branch '3.2':
Issue #15379: Fix passing of non-BMP characters as integers for the charmap decoder (already working as unicode strings).
http://hg.python.org/cpython/rev/620d23f7ad41

New changeset c64dec45d46f by Antoine Pitrou in branch 'default':
Issue #15379: Fix passing of non-BMP characters as integers for the charmap decoder (already working as unicode strings).
http://hg.python.org/cpython/rev/c64dec45d46f
msg171070 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-09-23 18:02
Thank you, I've committed the patches. There was a test failure in test_codeccallbacks in 3.2, which I fixed simply by replacing sys.maxunicode with a hardcoded 0x110000.
msg171814 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-02 16:39
We forgot about 2.7 (because I had not expected the fix to be applied even to 3.2). Here is a backported patch.
msg173356 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-19 19:05
The 2.7 patch is just a backport of the 3.2 patch (including Antoine's last fix). Please review and commit.
msg175802 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-11-17 20:17
New changeset c7ce91756472 by Antoine Pitrou in branch '2.7':
Issue #15379: Fix passing of non-BMP characters as integers for the charmap decoder (already working as unicode strings).
http://hg.python.org/cpython/rev/c7ce91756472
msg175803 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-11-17 20:17
Thanks for the backport, committed!
History
Date User Action Args
2022-04-11 14:57:32 admin set github: 59584
2012-11-17 20:17:42 pitrou set stage: commit review -> resolved
2012-11-17 20:17:36 pitrou set status: open -> closed; messages: +
2012-11-17 20:17:14 python-dev set messages: +
2012-10-24 09:10:51 serhiy.storchaka set stage: resolved -> commit review; versions: - Python 3.2, Python 3.3
2012-10-19 19:05:41 serhiy.storchaka set messages: +
2012-10-02 16:40:09 serhiy.storchaka set versions: + Python 2.7
2012-10-02 16:39:25 serhiy.storchaka set status: closed -> open; files: + decode_charmap_maxchar-2.7.patch; messages: +
2012-09-23 18:02:19 pitrou set stage: patch review -> resolved
2012-09-23 18:02:05 pitrou set status: open -> closed; resolution: fixed; messages: +
2012-09-23 18:01:12 python-dev set nosy: + python-dev; messages: +
2012-09-21 20:16:26 serhiy.storchaka set files: - decode_charmap_maxchar-3.2.patch
2012-09-21 20:16:16 serhiy.storchaka set files: - decode_charmap_tests.patch
2012-09-21 20:16:09 serhiy.storchaka set files: - decode_charmap_maxchar.patch
2012-09-21 20:14:09 serhiy.storchaka set files: + decode_charmap_maxchar-3.3_2.patch, decode_charmap_maxchar-3.2_2.patch; messages: +
2012-09-16 18:43:26 serhiy.storchaka set messages: +
2012-08-05 10:48:52 serhiy.storchaka set stage: needs patch -> patch review
2012-08-05 10:47:32 serhiy.storchaka set keywords: + needs review; priority: normal -> low; stage: patch review -> needs patch
2012-07-18 21:07:56 amaury.forgeotdarc set messages: +
2012-07-18 20:33:37 serhiy.storchaka set messages: +
2012-07-18 20:26:53 amaury.forgeotdarc set messages: +
2012-07-18 15:48:30 serhiy.storchaka set files: + decode_charmap_maxchar-3.2.patch; messages: +; versions: + Python 3.2
2012-07-18 11:02:19 amaury.forgeotdarc set nosy: + amaury.forgeotdarc; messages: +
2012-07-17 11:36:02 serhiy.storchaka set files: + decode_charmap_tests.patch; messages: +
2012-07-17 08:54:29 pitrou set nosy: + lemburg, pitrou, vstinner, benjamin.peterson, ezio.melotti; messages: +; stage: patch review
2012-07-17 08:15:57 serhiy.storchaka create