Issue 19539: The 'raw_unicode_escape' codec buggy + not appropriate for Python 3.x

Created on 2013-11-10 02:51 by zuo, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (8)
msg202505 - (view) Author: Jan Kaliszewski (zuo) Date: 2013-11-10 02:51
It seems that the 'raw_unicode_escape' codec:

1) produces data that could be suitable for Python 2.x raw unicode string literals but not for Python 3.x raw string literals (in Python 3.x \u... escapes are also treated literally);

2) seems to be buggy anyway: bytes in the range 128-255 are encoded with the 'latin-1' encoding (in Python 3.x it is definitely a bug; and even in Python 2.x the feature is dubious, although at least Py2's eval() and compile() functions officially accept 'latin-1'-encoded byte strings...).

Python 3.3:

>>> b = "zażółć".encode('raw_unicode_escape')
>>> literal = b'r"' + b + b'"'
>>> literal
b'r"za\\u017c\xf3\\u0142\\u0107"'
>>> eval(literal)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1
SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xf3 in position 8: invalid continuation byte
>>> b'\xf3'.decode('latin-1')
'ó'
>>> b = "zaż".encode('raw_unicode_escape')
>>> literal = b'r"' + b + b'"'
>>> literal
b'r"za\\u017c"'
>>> eval(literal)
'za\\u017c'
>>> print(eval(literal))
za\u017c

I believe that the 'raw_unicode_escape' codec should either be deprecated and later removed or be modified to accept only printable ASCII characters.

PS. Also, as a side note: neither 'raw_unicode_escape' nor 'unicode_escape' escapes quotes (see issue #7615) -- shouldn't it be at least documented explicitly?
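The split behaviour reported above can be sketched per character. This is a minimal illustration assuming Python 3; the sample characters and the variable names (samples, encoded, mixed, is_valid_utf8) are chosen here and are not part of the report.

```python
# Sample characters chosen for illustration (not taken from the report).
samples = ["a", "ó", "ż", "\U0001F600"]

# Map each character to the bytes the codec produces for it:
# ASCII and Latin-1 characters become single bytes, everything else
# becomes a \uXXXX or \UXXXXXXXX escape written in ASCII.
encoded = {ch: ch.encode("raw_unicode_escape") for ch in samples}
for ch, data in encoded.items():
    print(f"U+{ord(ch):04X} -> {data!r}")

# The mixed output is generally not valid UTF-8, which is consistent
# with the SyntaxError that eval() raises on bytes built from it.
mixed = "zażółć".encode("raw_unicode_escape")
try:
    mixed.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False
print("valid UTF-8:", is_valid_utf8)
```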
msg202507 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-11-10 07:05
The 'raw_unicode_escape' codec can be neither removed nor changed because it is used in the pickle protocol. Just don't use it if its behavior looks weird to you.

The right way to decode raw_unicode_escape-encoded data is to use the 'raw_unicode_escape' decoder. If a string doesn't contain quotes, you can use eval(), but you should first decode the data from latin1 and encode it to UTF-8:

>>> literal = ('r"%s"' % "zażółć".encode('raw_unicode_escape').decode('latin1')).encode()
>>> literal
b'r"za\\u017c\xc3\xb3\\u0142\\u0107"'
>>> eval(literal)
'za\\u017có\\u0142\\u0107'
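A minimal sketch of the two points in this message, assuming Python 3: decoding with the codec's own decoder restores the text, and a protocol 0 pickle of the same string carries the raw_unicode_escape bytes. The variable names (text, data, roundtrip, payload) are illustrative only.

```python
import pickle

text = "zażółć"

# Encode, then decode with the matching decoder instead of eval().
data = text.encode("raw_unicode_escape")
roundtrip = data.decode("raw_unicode_escape")
print(roundtrip == text)

# Protocol 0 is the text-based pickle protocol; for this string its
# payload contains the same bytes the codec produced above.
payload = pickle.dumps(text, protocol=0)
print(payload)
print(data in payload)
```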
msg202591 - (view) Author: Jan Kaliszewski (zuo) Date: 2013-11-11 00:22
Which means that the description "Produce a string that is suitable as raw Unicode literal in Python source code" is (in Python 3.x) no longer true. So, if change/removal is not possible because of internal significance of the codec, I believe that the description should be changed to something like: "For internal use. This codec *does not* produce anything suitable as a raw string literal in Python 3.x source code."
msg202643 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-11-11 19:19
Jan, the codec implements an encoding which has certain characteristics just like any other codec. It works both in Python 2 and 3 without problems. The documentation is no longer true, though. Ever since we added encoding markers to source files, the raw Unicode string literals depended on this encoding setting. Before this change the docs were fine, since Unicode literals were interpreted as Latin-1 encoded. More correct would be: "Produce a string that uses Unicode escapes to encode non-Latin-1 code points. It is used in the Python pickle protocol."
msg232851 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2014-12-18 02:27
I included the proposed doc fix in my patch for Issue 19548.
msg233010 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2014-12-22 06:01
Re-reading the suggested description, it struck me that for encoding, this is redundant with the “backslashreplace” error handler:

>>> import sys
>>> test = "".join(map(chr, range(sys.maxunicode + 1)))
>>> test.encode("raw-unicode-escape") == test.encode("latin-1", "backslashreplace")
True

However, decoding also seems similar to “unicode_escape”, except that only \uXXXX and \UXXXXXXXX seem to be supported. Maybe there should be a warning that backslashes are not escaped:

>>> "\\u005C".encode("raw-unicode-escape").decode("raw-unicode-escape")
'\\'
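The consequence of unescaped backslashes is that a round trip is not guaranteed when the text itself contains something that looks like an escape. A minimal sketch assuming Python 3, with 'unicode_escape' shown for contrast; the sample string and variable names are chosen here for illustration.

```python
# Six characters: a backslash followed by 'u0041' (not the letter 'A').
original = "\\u0041"

# raw_unicode_escape leaves the backslash alone on encoding, so the
# decoder later reads the sequence as an escape for 'A'.
raw = original.encode("raw_unicode_escape")
raw_roundtrip = raw.decode("raw_unicode_escape")
print(raw, repr(raw_roundtrip))

# unicode_escape doubles the backslash on encoding, so the original
# six characters survive the round trip.
esc = original.encode("unicode_escape")
esc_roundtrip = esc.decode("unicode_escape")
print(esc, repr(esc_roundtrip))
```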
msg233102 - (view) Author: Jan Kaliszewski (zuo) Date: 2014-12-26 00:33
My concerns are now being addressed in the .
msg233147 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-12-28 08:48
This issue is just a documentation issue. The doc must be more explicit, explain that the codec is only used internally by the pickle module, and that its output cannot be used anymore by eval().
History
Date User Action Args
2022-04-11 14:57:53 admin set github: 63738
2014-12-28 08:48:33 vstinner set messages: +
2014-12-27 21:04:02 berker.peksag set superseder: 'codecs' module docs improvements; stage: needs patch -> resolved
2014-12-26 00:33:00 zuo set messages: +
2014-12-26 00:31:41 zuo set status: open -> closed; resolution: duplicate
2014-12-22 06:01:56 martin.panter set messages: +
2014-12-18 02:27:23 martin.panter set nosy: + martin.panter; messages: +
2013-11-16 00:42:02 terry.reedy set nosy: + terry.reedy
2013-11-11 19:19:23 lemburg set nosy: + lemburg; messages: +; title: The 'raw_unicode_escape' codec buggy + not apropriate for Python 3.x -> The 'raw_unicode_escape' codec buggy + not appropriate for Python 3.x
2013-11-11 18:23:39 serhiy.storchaka set versions: - Python 3.2, Python 3.5; nosy: - serhiy.storchaka; components: - Library (Lib); type: enhancement; stage: needs patch
2013-11-11 00:22:07 zuo set versions: + Python 3.2, Python 3.3; nosy: + docs@python; messages: +; assignee: docs@python; components: + Documentation
2013-11-10 07:05:29 serhiy.storchaka set nosy: + serhiy.storchaka; messages: +
2013-11-10 02:51:45 zuo create