Issue 1313939: Speedup PyUnicode_DecodeCharmap - Python tracker

Created on 2005-10-05 15:01 by doerwalter, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
diff.txt doerwalter,2005-10-05 15:01
diff2.txt doerwalter,2005-10-06 14:45
diff3.txt doerwalter,2005-10-06 15:50
Messages (13)
msg48824 - Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2005-10-05 15:01
This patch speeds up PyUnicode_DecodeCharmap() as discussed in the thread: http://mail.python.org/pipermail/python-dev/2005-October/056958.html It makes it possible to pass a unicode string to PyUnicode_DecodeCharmap() in addition to the dictionary, which is still supported. The unicode character at position i in the string is used as the decoded value for byte i. Byte values greater than or equal to the length of the string and u"\ufffd" characters in the string are treated as "maps to undefined".
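The lookup described in this message can be sketched in pure Python. This is only an illustration of the idea, not the C code from the patch; the function name `decode_charmap` and its `undefined` parameter are hypothetical, and the marker defaults to U+FFFD as in this first version of the patch.

```python
def decode_charmap(data: bytes, table: str, undefined: str = "\ufffd") -> str:
    """Decode bytes by using each byte value as an index into table."""
    out = []
    for pos, byte in enumerate(data):
        # A byte beyond the end of the table, or one whose entry is the
        # marker character, counts as "maps to undefined".
        if byte >= len(table) or table[byte] == undefined:
            raise UnicodeDecodeError(
                "charmap", data, pos, pos + 1, "character maps to <undefined>"
            )
        out.append(table[byte])
    return "".join(out)


# Hypothetical four-entry table: bytes 0..2 map to "a", "b", "c";
# byte 3 is explicitly undefined, and bytes 4 and up fall off the end.
example_table = "abc\ufffd"
decoded = decode_charmap(b"\x00\x02\x01", example_table)
```

A string lookup like this replaces a dictionary lookup per byte with a plain index operation, which is presumably where the speedup comes from.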
msg48825 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2005-10-05 17:50
The patch looks good, but I'd still like to see whether Hye-Shik's fastmap codec wouldn't be a better and more general solution since it seems to also provide good performance for encoding Unicode strings. That said, you should use a non-code point such as 0xFFFE for meaning "undefined mapping". The Unicode replacement character is not a good choice as this is a very valid character which often actually gets used to replace characters for which no Unicode code point is known.
msg48826 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2005-10-05 18:36
For decoding, Walter's code is nearly identical to the fastmap decoder: both use a Py_UNICODE array to represent the map, and both use REPLACEMENT CHARACTER to denote a missing target code. I find the use of U+FFFD highly appropriate, and not at all debatable. None of the existing codecs maps any of its characters to U+FFFD, and I would consider it a bug if one did. REPLACEMENT CHARACTER should only be used if there is no appropriate character, so no charmap should claim that the appropriate mapping for some byte is that character. That you often use U+FFFD in output to denote unmappable characters is a different issue; indeed, Python's "replace" mode does so. It would continue to do so under this patch.
msg48827 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2005-10-05 20:39
The whole point is finding a replacement code point for the None value in dictionaries. Since None is not a character, a code point should be chosen that is guaranteed to never be assigned. FFFE is such a code point, hence the choice. FFFD is an assigned code point. Note that a mapping to FFFE will always raise an exception and the codec user can then decide to use the replace error handler to have the codec use FFFD instead. It is also very reasonable for a codec to map some characters to FFFD to avoid invoking the exception handling in those cases.
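The behaviour argued for here can be tried out in a current CPython 3, assuming that `codecs.charmap_decode` accepts a `str` decoding table and treats U+FFFE as the undefined marker, which is how the generated codecs in `Lib/encodings` call it today. The three-entry table is made up for illustration, and the sketch shows present-day behaviour rather than the state of the patch at the time of this message.

```python
import codecs

# Hypothetical table: byte 0 -> "A", byte 1 -> undefined (U+FFFE),
# byte 2 -> a deliberate mapping to REPLACEMENT CHARACTER (U+FFFD).
table = "A\ufffe\ufffd"

# A defined entry decodes normally.
text, consumed = codecs.charmap_decode(b"\x00", "strict", table)

# An entry of U+FFFE raises under the strict error handler.
try:
    codecs.charmap_decode(b"\x01", "strict", table)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# The caller can opt into the replace handler, which substitutes U+FFFD.
replaced, _ = codecs.charmap_decode(b"\x00\x01", "replace", table)

# A table entry of U+FFFD is an ordinary mapping and involves no error handler.
direct, _ = codecs.charmap_decode(b"\x02", "strict", table)
```

Keeping the marker distinct from U+FFFD is what lets a codec author choose between "raise" and "silently substitute" on a per-byte basis.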
msg48828 - Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2005-10-06 14:45
OK, I've updated the patch to include an update of the documentation and a few tests and I've simplified codecs.make_maps a bit, since we'll always have a decoding string. 0xFFFD is still used as the undefined marker value.
msg48829 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2005-10-06 15:24
Thanks, Walter. However, you won't get my approval with the choice of FFFD as meaning "undefined mapping" - that code point is defined. Please choose a different code that is documented to never be defined by the Unicode standard. Also, please explain the new alias 'unicode_1_1_utf_7' : 'utf_7'. About the make_maps() function: the decoding maps should be generated by the gencodec.py script instead of doing this at import time. The dictionaries can then be removed from the codecs.
msg48830 - Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2005-10-06 15:50
> Thanks, Walter. However, you won't get my approval with the choice of FFFD as meaning "undefined mapping" - that code point is defined. Please choose a different code that is documented to never be defined by the Unicode standard.
OK, I've updated the patch to use 0xfffe instead. Note that this only works as long as u"\ufffe" is a legal Unicode literal.
> Also, please explain the new alias 'unicode_1_1_utf_7' : 'utf_7'.
Oops, that was for bug #1245379. Removed.
> About the make_maps() function: the decoding maps should be generated by the gencodec.py script instead of doing this a import time. The dictionaries can then be removed from the codecs.
That's true. I'm still working on that. Do you have any tips on how to do that (which files do I have to download, where do I have to put them, and how (and from where) do I have to call gencodec.py)? Is this documented somewhere?
msg48831 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2005-10-06 16:02
Thanks for the change. I can regenerate the codecs using gencodec.py, no problem. I can also change it to create the string mapping. For reference: the mapping files can be downloaded from ftp.unicode.org. The gencodec.py script then takes the mapping filename as argument and creates a codec .py file from it. Special care has to be taken in that some codecs contain hand-edited details. Note that it's likely that some codecs will have additions or removals - the files on ftp.unicode.org are updated every now and then and usually contain more up-to-date mappings. For some codecs, you won't find mapping files on the Unicode site - these were then contributed by 3rd parties.
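As a rough illustration of what "creating the string mapping" from a mapping file involves, the sketch below turns mapping-file text into a 256-character decoding string. It is not gencodec.py and makes no claim about how that script works; the function name `build_decoding_table` is hypothetical, and the input format (a hex byte value, a hex code point, then a `#` comment, with undefined bytes listed without a target) is an assumption based on the common layout of the Unicode mapping files.

```python
def build_decoding_table(mapping_text: str, undefined: str = "\ufffe") -> str:
    """Build a 256-entry decoding string from mapping-file text."""
    table = [undefined] * 256
    for line in mapping_text.splitlines():
        # Drop the trailing comment and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # byte listed without a target: stays undefined
        byte = int(fields[0], 16)
        if 0 <= byte < 256:
            table[byte] = chr(int(fields[1], 16))
    return "".join(table)


# Made-up excerpt in the assumed format.
sample = (
    "# Name: example mapping\n"
    "0x41\t0x0041\t# LATIN CAPITAL LETTER A\n"
    "0x80\t0x20AC\t# EURO SIGN\n"
    "0x81\t      \t#UNDEFINED\n"
)
decoding_table = build_decoding_table(sample)
```

Generating the string once and storing it in the codec module avoids rebuilding the map at import time, which is the point raised about make_maps() earlier in the thread.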
msg48832 - Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2005-10-06 16:59
> I can regenerate the codecs using gencodec.py, no problem. I can also change it to create the string mapping.
That would be great. So should I check in everything else (i.e. unicodeobject.c and the doc and test changes)?
msg48833 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2005-10-06 18:37
Yes, please.
msg48834 - Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2005-10-06 20:32
Checked in as:
Objects/unicodeobject.c 2.232
Lib/test/test_codecs.py 1.27
Doc/api/concrete.tex 1.68
Misc/NEWS 1.1387
msg48835 - Author: Walter Dörwald (doerwalter) * (Python committer) Date: 2005-10-09 19:23
Assigning to MAL for the codec update.
msg48836 - Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2005-11-02 07:12
MAL, didn't you update the codecs? Is there anything left to do or can this be closed?
History
Date User Action Args
2022-04-11 14:56:13 admin set github: 42448
2005-10-05 15:01:51 doerwalter create