Issue 7615: unicode_escape codec does not escape quotes (original) (raw)

Issue7615

Created on 2010-01-01 03:28 by rhansen, last changed 2022-04-11 14:56 by admin.

Files
File name Uploaded Description Edit
unicode_escape_tests.patch rhansen,2010-01-10 01:43 unit tests
unicode_escape_1_minimal.patch rhansen,2010-01-10 01:50 fix escaping of quotes and backslashes for unicode_escape and raw_unicode_escape
unicode_escape_2_utf-16_invalid_read.patch rhansen,2010-01-10 01:52 fix UTF-16 invalid read bug
unicode_escape_3_check_size_nonnegative.patch rhansen,2010-01-10 01:55 make sure size is nonnegative
unicode_escape_4_eliminate_duplicate_code.patch rhansen,2010-01-10 02:00 eliminate duplicate code
Messages (27)
msg97112 - (view) Author: Richard Hansen (rhansen) * Date: 2010-01-01 03:28
The description of the unicode_escape codec says that it produces "a string that is suitable as Unicode literal in Python source code." [1] Unfortunately, this is not true as it does not escape quotes. For example: print u'a\'b"c\'\'\'d"""e'.encode('unicode_escape') outputs: a'b"c'''d"""e I have attached a patch that fixes this issue by escaping single quotes. With the patch applied, the output is: a\'b"c\'\'\'d"""e I chose to only escape single quotes because: 1. it simplifies the patch, and 2. it matches string_escape's behavior. [1] http://docs.python.org/library/codecs.html#standard-encodings
msg97130 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-01-02 14:46
Richard Hansen wrote: > > New submission from Richard Hansen <rhansen@bbn.com>: > > The description of the unicode_escape codec says that it produces "a > string that is suitable as Unicode literal in Python source code." [1] > Unfortunately, this is not true as it does not escape quotes. For example: > > print u'a\'b"c\'\'\'d"""e'.encode('unicode_escape') > > outputs: > > a'b"c'''d"""e Indeed. Python only uses the decoder of that codec internally. > I have attached a patch that fixes this issue by escaping single quotes. > With the patch applied, the output is: > > a\'b"c\'\'\'d"""e > > I chose to only escape single quotes because: > 1. it simplifies the patch, and > 2. it matches string_escape's behavior. If we change this, the encoder should quote both single and double quotes - simply because it is not known whether the literal will use single or double quotes. The raw_unicode_escape codec would have to be fixed as well.
msg97183 - (view) Author: Richard Hansen (rhansen) * Date: 2010-01-03 23:13
> If we change this, the encoder should quote both single and double > quotes - simply because it is not known whether the literal > will use single or double quotes. Or document that single quotes are always escaped so that the user knows he/she can safely use u''. I'm not sure if there is a use case where both would *need* to be escaped, and escaping both has a size penalty. I've attached an untested patch that escapes both. If both are escaped, then the behavior of the string_escape codec should also be changed for consistency (it only escapes single quotes). > The raw_unicode_escape codec would have to be fixed as well. I'm not sure there's anything to fix. Adding backslashes to quotes in raw strings changes the value of the string -- the backslashes prevent the quotes from ending the string literal, but they are not removed when the raw literal is evaluated. Perhaps raw_unicode_escape should be "fixed" by raising an exception when it contains any quotes.
msg97237 - (view) Author: Richard Hansen (rhansen) * Date: 2010-01-04 23:56
I thought about raw_unicode_escape more, and there's a way to escape quotes: use unicode escape sequences (e.g., ur'\u0027'). I've attached a new patch that does the following: * backslash-escapes single quotes when encoding with the unicode_escape codec (the original subject of this bug) * replaces single quotes with \u0027 when encoding with the raw_unicode_escape codec (a separate bug not related to the original report, but brought up in comments) * replaces backslashes with \u005c when encoding with the raw_unicode_escape codec (a separate bug not related to the original report) * fixes a corner-case bug where the UTF-16 surrogate pair decoding logic could read past the end of the provided Py_UNICODE character array (a separate bug not related to the original report) * eliminates redundant code in PyUnicode_EncodeRawUnicodeEscape() and unicodeescape_string() * general cleanup in unicodeescape_string() The changes in the patch are non-trivial and have only been lightly tested.
msg97269 - (view) Author: Richard Hansen (rhansen) * Date: 2010-01-05 17:51
Attached is a patch to the unicode unit tests. It adds tests for the following: * unicode_escape escapes single quotes * raw_unicode_escape escapes single quotes * raw_unicode_escape escapes backslashes
msg97279 - (view) Author: Richard Hansen (rhansen) * Date: 2010-01-05 21:28
Attaching updated unicode_escape_reorg.patch. This fixes two additional issues: * don't call _PyString_Resize() on an empty string because there is only one empty string instance, and that instance is returned when creating an empty string * make sure the size parameter is nonnegative
msg97345 - (view) Author: Richard Hansen (rhansen) * Date: 2010-01-07 08:42
Attaching updated unicode_escape_reorg.patch. This addresses two additional issues: * removes pickle's workaround of raw-unicode-escape's broken escaping * eliminates duplicated code (the raw escape encode function was copied with only a slight modification in cPickle.c) With this, all regression tests pass.
msg97346 - (view) Author: Richard Hansen (rhansen) * Date: 2010-01-07 09:03
I believe this issue is ready to be bumped to the "patch review" stage. Thoughts?
msg97352 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-01-07 13:02
Does the last patch obsolete the first two? If so please delete the obsolete ones. I imagine there are might be small doc updates required, as well. I haven't looked at the patch itself, but concerning your test patch: your try/except style is unnecessary, I think. Better to just let the syntax error bubble up on its own. After all, you don't *know* that the SyntaxError is because the quotes aren't escaped. I've run into other unit tests in the test suite where the author made such an assumption, which turned out to be false and confused me for a bit while debugging the failure. I had to remove the code producing the mistaken reason message in order to see the real problem. So it's better to just let the real error show up in the first place, in my experience.
msg97363 - (view) Author: Richard Hansen (rhansen) * Date: 2010-01-07 19:30
> Does the last patch obsolete the first two? If so please delete the > obsolete ones. Yes and no -- it depends on what the core Python developers want and are comfortable with: * unicode_escape_single_quotes.patch: Only escapes single quotes, simple patch. * unicode_escape_single_and_double_quotes: Superset of the above (also escapes double quotes), but probably unnecessary. Still a relatively simple patch. * unicode_escape_reorg.patch: Superset of unicode_escape_single_quotes.patch that also fixes raw_unicode_escape and other small issues. It's a bigger patch with a greater potential for backwards-compatibility issues. (Pickle is an example: It implemented its own workaround to address raw_unicode_escape's broken escaping, so fixing raw_unicode_escape ended up breaking pickle. The reorg patch removes pickle's workaround, but there will probably be similar workarounds in other existing code.) My preference is to have unicode_escape_reorg.patch committed, but I'm not sure how conservative the core developers are. The release of Python 2.7 is approaching, and they may not want to take on the risk right now. If that's the case, I'd be happy with applying unicode_escape_single_quotes.patch for now and moving unicode_escape_reorg.patch to a new issue report. > I imagine there are might be small doc updates required, as well. Certainly Misc/NEWS will need to be patched. I'm unfamiliar with what else the devs might want for documentation, so I'd love to get some additional guidance. I would also appreciate additional feedback on the technical merits of the reorg patch before investing too much time on updating documentation. > I haven't looked at the patch itself, but concerning your test > patch: your try/except style is unnecessary, I think. Better to > just let the syntax error bubble up on its own. OK, I'll make that change. I added the try/except as an attempt to convert a known ERROR to a FAIL in case that was important for some reason. Thanks for the feedback!
msg97364 - (view) Author: Richard Hansen (rhansen) * Date: 2010-01-07 19:39
Attaching updated unit tests for the unicode_escape codec. I removed the raw_unicode_escape tests and will attach a separate patch for those.
msg97365 - (view) Author: Richard Hansen (rhansen) * Date: 2010-01-07 19:40
Attaching updated unit tests for the raw_unicode_escape codec.
msg97375 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-01-07 21:23
Richard Hansen wrote: > > Richard Hansen <rhansen@bbn.com> added the comment: > >> Does the last patch obsolete the first two? If so please delete the >> obsolete ones. > > Yes and no -- it depends on what the core Python developers want and are comfortable with: > * unicode_escape_single_quotes.patch: Only escapes single quotes, simple patch. > * unicode_escape_single_and_double_quotes: Superset of the above (also escapes double quotes), but probably unnecessary. Still a relatively simple patch. > * unicode_escape_reorg.patch: Superset of unicode_escape_single_quotes.patch that also fixes raw_unicode_escape and other small issues. It's a bigger patch with a greater potential for backwards-compatibility issues. (Pickle is an example: It implemented its own workaround to address raw_unicode_escape's broken escaping, so fixing raw_unicode_escape ended up breaking pickle. The reorg patch removes pickle's workaround, but there will probably be similar workarounds in other existing code.) > > My preference is to have unicode_escape_reorg.patch committed, but I'm not sure how conservative the core developers are. The release of Python 2.7 is approaching, and they may not want to take on the risk right now. If that's the case, I'd be happy with applying unicode_escape_single_quotes.patch for now and moving unicode_escape_reorg.patch to a new issue report. We'll need a patch that implements single and double quote escaping for unicode_escape and a \uXXXX style escaping of quotes for the raw_unicode_escape encoder. Other changes are not necessary. The pickle copy of the codec can be left untouched (both cPickle.c and pickle.py) - it doesn't matter whether quotes are escaped or not in the pickle data stream. It's only important that the decoders can reliably decode the additional escapes (which AFAIK, they do automatically anyway). >> I imagine there are might be small doc updates required, as well. > > Certainly Misc/NEWS will need to be patched. I'm unfamiliar with what else the devs might want for documentation, so I'd love to get some additional guidance. I would also appreciate additional feedback on the technical merits of the reorg patch before investing too much time on updating documentation. Misc/NEWS needs an entry which makes the changes clear. The codecs' encode direction is not defined anywhere in the documentation, AFAIK, and basically an implementation detail. The codecs were originally only meant for Python's internal use to decode Python literal byte strings into Unicode. In PEP 100, I only defined the decoding direction. The main idea was to have a set of codecs which read and produce Latin-1 compatible text - Python 2.x used to accept Latin-1 Unicode literals without source code encoding marker. >> I haven't looked at the patch itself, but concerning your test >> patch: your try/except style is unnecessary, I think. Better to >> just let the syntax error bubble up on its own. > > OK, I'll make that change. I added the try/except as an attempt to convert a known ERROR to a FAIL in case that was important for some reason. I'll have to have a look at the patch itself as well. No time for that now, perhaps tomorrow. Thanks, -- Marc-Andre Lemburg eGenix.com ________________________________________________________________________ ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/
msg97385 - (view) Author: Richard Hansen (rhansen) * Date: 2010-01-07 22:42
> We'll need a patch that implements single and double quote escaping > for unicode_escape and a \uXXXX style escaping of quotes for the > raw_unicode_escape encoder. OK, I'll remove unicode_escape_single_quotes.patch and update unicode_escape_reorg.patch. > Other changes are not necessary. Would you please clarify? There are a few other (minor) bugs that were discovered while writing unicode_escape_reorg.patch that I think should be fixed: * the UTF-16 surrogate pair decoding logic could read past the end of the provided Py_UNICODE character array if the last character is between 0xD800 and 0xDC00 * _PyString_Resize() will be called on an empty string if the size argument of unicodeescape_string() is 0. This will raise a SystemError because _PyString_Resize() can only be called if the object's ref count is 1 (even if no resizing is to take place) yet PyString_FromStringAndSize() returns a shared empty string instance if size is 0. * it is unclear what unicodeescape_string() should do if size < 0 Beyond those issues, I'm worried about manageability stemming from the amount of code duplication. If a bug is found in one of those encoding functions, the other two will likely need updating. > The pickle copy of the codec can be left untouched (both cPickle.c > and pickle.py) - it doesn't matter whether quotes are escaped or not > in the pickle data stream. Unfortunately, pickle.py must be modified because it does its own backslash escaping before encoding with the raw_unicode_escape codec. This means that backslashes would become double escaped and the decoded value would differ (confirmed by running the pickle unit tests). The (minor) bugs in PyUnicode_EncodeRawUnicodeEscape() are also present in cPickle.c, so they should probably be fixed as well. > The codecs' encode direction is not defined anywhere in the > documentation, AFAIK, and basically an implementation detail. I read the escape codec documentation (see the original post) as implying that the encoders can generate eval-able string literals. I'll add some clarifying statements. Thanks for the feedback!
msg97485 - (view) Author: Richard Hansen (rhansen) * Date: 2010-01-10 01:42
Attaching updated unit tests. The tests now check to make sure that single and double quotes are escaped.
msg97486 - (view) Author: Richard Hansen (rhansen) * Date: 2010-01-10 01:50
Attaching a minimal patch: * unicode_escape now backslash-escapes single and double quotes * raw_unicode_escape now unicode-escapes single and double quotes * raw_unicode_escape now unicode-escapes backslashes * removes pickle's escaping workarounds so that it continues to work * updates documentation
msg97487 - (view) Author: Richard Hansen (rhansen) * Date: 2010-01-10 01:52
Attaching a patch for an issue discovered while looking at the code: * The UTF-16 decode logic in the Unicode escape encoders no longer reads past the end of the provided Py_UNICODE buffer if the last character's value is between 0xD800 and 0xDC00. This patch is meant to be applied after unicode_escape_1_minimal.patch.
msg97488 - (view) Author: Richard Hansen (rhansen) * Date: 2010-01-10 01:55
Attaching a patch for another issue discovered while looking at the code: * The Unicode escape encoders now check to make sure that the provided size is nonnegative. * C API documentation updated to make it clear that size must be nonnegative. This patch is meant to be applied after unicode_escape_2_utf-16_invalid_read.patch.
msg97490 - (view) Author: Richard Hansen (rhansen) * Date: 2010-01-10 02:00
Attaching a patch that eliminates duplicate code: * Merge unicodeescape_string(), PyUnicode_EncodeRawUnicodeEscape(), and modified_EncodeRawUnicodeEscape() into one function called _PyUnicode_EncodeCustomUnicodeEscape(). This patch is meant to be applied after unicode_escape_3_check_size_nonnegative.patch.
msg98265 - (view) Author: Richard Hansen (rhansen) * Date: 2010-01-25 05:44
Any comments on the patches? I'd love to see at least patches 1-3 make it into Python 2.7. :)
msg98275 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-01-25 09:21
Richard Hansen wrote: > > Richard Hansen <rhansen@bbn.com> added the comment: > > Any comments on the patches? I'd love to see at least patches 1-3 make it into Python 2.7. :) Sorry, I haven't had a chance to review them yet. Will try today.
msg98288 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-01-25 13:54
I feel uneasy to change the default unicode-escape encoding. I think that we mix two features here; to transfer a unicode string between two points, programs must agree on where the data ends, and how characters are represented as bytes. All codecs including unicode-escape only dealt with byte conversion; (unicode-escape converts everything to printable 7bit ascii); these patches want to add a feature related to the "where does the string end" issue, and is only aimed at "python code" containers. Other transports and protocols may choose different delimiters. My point is that unicode-escape used to not change printable 7-bit ascii characters, and the patches will change this. And actually this will break existing code. It did not take me long to find two examples of programs which embed unicode_escape-encoded text between quotes, and take care themselves of escaping quotes. First example generates javascript code, the second generates SQL statements: http://github.com/chriseppstein/pywebmvc/blob/master/src/code/pywebmvc/tools/searchtool.py#L450 http://gitweb.sabayon.org/?p=entropy.git;a=blob;f=libraries/entropy/db/__init__.py;h=2d818455efa347f35b2e96d787fefd338055d066;hb=HEAD#l6463 This does not prevent the creation of a new codec, let's call it 'python-unicode-escape' [ or 'repr' :-) ]
msg98327 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-01-26 11:48
Amaury Forgeot d'Arc wrote: > I feel uneasy to change the default unicode-escape encoding. > I think that we mix two features here; to transfer a unicode string between two points, programs must agree on where the data ends, and how characters are represented as bytes. > All codecs including unicode-escape only dealt with byte conversion; (unicode-escape converts everything to printable 7bit ascii); > these patches want to add a feature related to the "where does the string end" issue, and is only aimed at "python code" containers. Other transports and protocols may choose different delimiters. > > My point is that unicode-escape used to not change printable 7-bit ascii characters, and the patches will change this. > > And actually this will break existing code. It did not take me long to find two examples of programs which embed unicode_escape-encoded text between quotes, and take care themselves of escaping quotes. First example generates javascript code, the second generates SQL statements: > http://github.com/chriseppstein/pywebmvc/blob/master/src/code/pywebmvc/tools/searchtool.py#L450 > http://gitweb.sabayon.org/?p=entropy.git;a=blob;f=libraries/entropy/db/__init__.py;h=2d818455efa347f35b2e96d787fefd338055d066;hb=HEAD#l6463 Ouch... these codecs should not have been used outside Python. I wonder why these applications don't use repr(text) to format the JavaScript/SQL strings. I guess this is the result of documenting them in http://docs.python.org/library/codecs.html#standard-encodings Too bad that the docs actually say "Produce a string that is suitable as Unicode literal in Python source code." The codecs main intent was to *decode* Unicode literals in Python source code to Unicode objects... The naming in the examples you mention also suggest that the programmers used the table from the docs - they use "unicode_escape" as codec name, not the standard "unicode-escape" name which we use throughout the Python code. The fact that the demonstrated actual use already does apply the extra quote escaping suggests that we cannot easily add this to the existing codecs. It would break those applications, since they'd be applying double-escaping. > This does not prevent the creation of a new codec, let's call it 'python-unicode-escape' [ or 'repr' :-) ] I think that's a good idea to move forward. Python 3.x comes with a new Unicode repr() format which we could turn into a new codec: it automatically adds the quotes, processes the in-string quotes and backslashes and also escapes \t, \n and \r as well as all non-printable code points. As for naming the new codec, I'd suggest "unicode-repr" since that's what it implements.
msg111826 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-07-28 15:51
Could we please have some responses to as there are some very positive comments there.
msg111836 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-07-28 18:47
Mark Lawrence wrote: > > Mark Lawrence <breamoreboy@yahoo.co.uk> added the comment: > > Could we please have some responses to as there are some very positive comments there. A patch implementing the suggestions would be even better :-)
msg399287 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-08-09 19:30
Looks like this has been fixed by now: >>> print(u'a\'b"c\'\'\'d"""e'.encode('unicode_escape')) b'a\'b"c\'\'\'d"""e' Let me know if there is a reason not to close this issue.
msg399288 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021-08-09 19:45
Nothing was changed. Backslashes in your output are backslashes in the bytes object repr. >>> print('a\'b"c\'\'\'d"""e'.encode('unicode_escape').decode()) a'b"c'''d"""e
History
Date User Action Args
2022-04-11 14:56:56 admin set github: 51864
2021-08-09 20:06:19 iritkatriel set resolution: out of date -> versions: + Python 3.9, Python 3.10, Python 3.11, - Python 2.6, Python 2.7
2021-08-09 19:45:44 serhiy.storchaka set status: pending -> opennosy: + serhiy.storchakamessages: +
2021-08-09 19:30:03 iritkatriel set status: open -> pendingnosy: + iritkatrielmessages: + resolution: out of date
2014-02-03 19:19:32 BreamoreBoy set nosy: - BreamoreBoy
2010-07-28 18:47:56 lemburg set messages: +
2010-07-28 15:51:57 BreamoreBoy set nosy: + BreamoreBoymessages: +
2010-01-26 11:48:54 lemburg set messages: +
2010-01-25 13:54:01 amaury.forgeotdarc set nosy: + amaury.forgeotdarcmessages: +
2010-01-25 09:46:43 flox set nosy: + flox
2010-01-25 09:21:37 lemburg set messages: +
2010-01-25 05:44:10 rhansen set messages: +
2010-01-10 02:00:06 rhansen set files: + unicode_escape_4_eliminate_duplicate_code.patchmessages: +
2010-01-10 01:55:16 rhansen set files: + unicode_escape_3_check_size_nonnegative.patchmessages: +
2010-01-10 01:52:41 rhansen set files: + unicode_escape_2_utf-16_invalid_read.patchmessages: +
2010-01-10 01:50:15 rhansen set files: + unicode_escape_1_minimal.patchmessages: +
2010-01-10 01:43:00 rhansen set files: + unicode_escape_tests.patchmessages: +
2010-01-10 01:39:24 rhansen set files: - raw_unicode_escape_tests.patch
2010-01-10 01:39:21 rhansen set files: - unicode_escape_tests.patch
2010-01-10 01:39:19 rhansen set files: - unicode_escape_reorg.patch
2010-01-10 01:39:15 rhansen set files: - unicode_escape_single_and_double_quotes.patch
2010-01-07 22:43:15 rhansen set files: - unicode_escape_single_quotes.patch
2010-01-07 22:42:57 rhansen set messages: +
2010-01-07 21:23:44 lemburg set messages: +
2010-01-07 19:40:04 rhansen set files: + raw_unicode_escape_tests.patchmessages: +
2010-01-07 19:39:19 rhansen set files: + unicode_escape_tests.patchmessages: +
2010-01-07 19:34:33 rhansen set files: - unicode_escape_tests.patch
2010-01-07 19:30:39 rhansen set messages: +
2010-01-07 13:02:06 r.david.murray set nosy: + r.david.murraymessages: + stage: test needed -> patch review
2010-01-07 09:03:11 rhansen set messages: +
2010-01-07 08:42:05 rhansen set files: + unicode_escape_reorg.patchmessages: +
2010-01-07 08:19:51 rhansen set files: - unicode_escape_reorg.patch
2010-01-05 21:28:07 rhansen set files: + unicode_escape_reorg.patchmessages: +
2010-01-05 20:46:47 rhansen set files: - unicode_escape_reorg.patch
2010-01-05 17:51:53 rhansen set files: + unicode_escape_tests.patchmessages: +
2010-01-04 23:56:31 rhansen set files: + unicode_escape_reorg.patchmessages: +
2010-01-03 23:13:06 rhansen set files: + unicode_escape_single_and_double_quotes.patchmessages: +
2010-01-02 14:46:42 lemburg set messages: +
2010-01-01 14:08:37 pitrou set nosy: + lemburg
2010-01-01 05:04:30 ezio.melotti set priority: normalnosy: + ezio.melottistage: test needed
2010-01-01 03:28:23 rhansen create