msg165829 - (view) |
Author: Antti Laine (aalien) |
Date: 2012-07-19 10:09 |
raw_decode on json.JSONDecoder does not handle leading whitespace. According to RFC 4627, section 2, whitespace can precede an object. With json.loads leading whitespace is handled just fine. d = json.JSONDecoder() d.raw_decode(' {}') ValueError: No JSON object could be decoded Tested with versions 2.6.7, 2.7.3, 3.1.3 and 3.2.3 |
|
|
msg165830 - (view) |
Author: Antti Laine (aalien) |
Date: 2012-07-19 10:16 |
My coworker just submitted a pull request for a possible fix to simplejson on github. https://github.com/simplejson/simplejson/pull/38 |
|
|
msg165831 - (view) |
Author: Jesús Cea Avión (jcea) *  |
Date: 2012-07-19 10:37 |
Please, post a patch for 2.7 and 3.2/3.3, with a test. Seems quite easy. If you hurry, this could go in 3.3.0. |
|
|
msg165832 - (view) |
Author: Antti Laine (aalien) |
Date: 2012-07-19 11:40 |
Suggestion for a patch for 3.3.0. I wasn't quite sure how the testcases were supposed to be loaded. Sorry if I made a mess ;) |
|
|
msg165834 - (view) |
Author: Jesús Cea Avión (jcea) *  |
Date: 2012-07-19 11:54 |
Antti, you are changing the signature of "decode()" and that would be a compatibility problem. Can you rewrite the patch to be more "compatible"? In your test, please, use a bit more complicated object than "{}" :-) |
|
|
msg165835 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2012-07-19 11:56 |
IMO this is not a bug, according to the current documentation it is working as designed. raw_decode says it decodes a string that *starts with* a json document. To my understanding, raw_decode is designed to be used when parsing a stream containing more than just one json document, including possibly non-json text, in which case it makes sense that the calling application would be responsible for (and want to) handle the whitespace around any such document. I recommend closing this as invalid. If the consensus is otherwise and a change is made, I think it would be an enhancement, not a bug fix. |
|
|
msg165836 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2012-07-19 12:03 |
Ah, I see, you are thinking that "json document" includes the possibility of leading whitespace. That is a reasonable interpretation...just make sure that we don't break backward compatibility. |
|
|
msg165918 - (view) |
Author: Antti Laine (aalien) |
Date: 2012-07-20 12:41 |
> you are changing the signature of "decode()" and that would be a compatibility problem I was changing the signature of raw_decode(), by adding a(n almost) private keyword. I really don't see how that would affect compatibility. > you are thinking that "json document" includes the possibility of leading whitespace Yes. While JSON does not have a real standard to follow, I believe that going by the RFC is the best thing to do, and the RFC very clearly states, that objects can be surrounded with whitespace. decode() depends on the functionality of raw_decode(), and its own parameter _w. I have no idea why _w is a parameter. It is not used anywhere in CPython, and, beginning with an underscore, it shouldn't be used from anywhere else. Still it offers the users a possibility to break the behaviour of decode(). If it's a performance hack, making the variable being initialized only once, then I think it's a very poor one. Now the patch leaves decode() as it is, and checks for whitespace in raw_decode(), which leads to checking for whitespace twice when using decode(). I think that a more extensive cleanup would be in order. |
|
|
msg168936 - (view) |
Author: Petri Lehtinen (petri.lehtinen) *  |
Date: 2012-08-23 12:03 |
I think skipping preceding whitespace in raw_decode() wouldn't hurt, but skipping following whitespace would be wrong. David: Any examples of how this could break backwards compatibility? |
|
|
msg168939 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2012-08-23 12:56 |
I didn't have anything specific in mind, just making a general comment about the care that needs to be taken in crafting the fix. |
|
|
msg276143 - (view) |
Author: Bayard Randel (Bayard Randel) * |
Date: 2016-09-13 00:28 |
I've provided an updated patch for 3.6. Additionally I've updated the docstring for decoder.raw_decode to explain that whitespace at the beginning of the document will be ignored. I believe the concerns around the backwards incompatible change to the function signature were addressed by aalien's second patch. |
|
|
msg276161 - (view) |
Author: Robert Collins (rbcollins) *  |
Date: 2016-09-13 04:16 |
I think the patch should either be rejected, or also handle trailing spaces: if we're taking the RFC definition of whitespace not being structural then we should also eat trailing space, which will change the check for extra data in decode to just checking end == s.len(). Obviously a test for the trailing case needs to be added as well. |
|
|
msg276291 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2016-09-13 15:24 |
Robert: as noted, skipping trailing whitespace would be breaking raw_decode's contract. It is designed to parse a json document and return the position of the last character of that document. If there is then another json document in the stream, parsing the remainder of the stream would skip that whitespace. If it is some other data, then the whitespace should be preserved for whatever is parsing the remainder of the stream (if anything). In any case that behavior cannot be changed because it would break backward compatibility. I would not object to rejecting the change, but it does make raw_decode slightly more awkward to use if it does not skip the leading whitespace. So I guess I'm +0.5 on accepting it as is. |
|
|
msg415523 - (view) |
Author: Daniël van Noord (danielnoord) * |
Date: 2022-03-18 20:05 |
@Bayard Randel, do you want to make this patch into a GitHub PR? If not, I could help by doing so and try to get this landed. |
|
|