msg279521 - (view) |
Author: Eric Appelt (Eric Appelt) * |
Date: 2016-10-27 03:06 |
Increase test coverage of the json library, specifically the detect_encoding() function in the __init__ module, which is used to handle the autodetection of the encoding of a bytes object passed to json.loads(). This function was added in issue 17909 which extended the json.loads() function to accept bytes. Note that this is a small patch as I am following section 5 of the developer guide and I am trying to acquaint myself with the process as a first time contributor. I found this missing coverage just by semi-randomly looking at individual modules of interest. Please let me know if I have made any general mistakes in constructing this ticket. Thanks! |
|
|
msg279531 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2016-10-27 08:40 |
Isn't the test added in enough? |
|
|
msg279537 - (view) |
Author: Eric Appelt (Eric Appelt) * |
Date: 2016-10-27 12:29 |
I looked back and something is clearly wrong with my coverage reporting setup, sorry :( When I move the test introduced in issue 17909, 'test_bytes_decode' from the module Lib/test/test_json/test_unicode.py to Lib/test/test_json/test_decode.py that particular test is properly traced and I see that the function is mostly covered, so I need to figure out what is going wrong in my setup. The test already present is more comprehensive as it includes BOM, so I believe what I wrote is generally not helpful. I did notice that the existing test does not cover the edge-case of a 2-byte utf-16 sequence that can be generated for a valid JSON object represented by a single codepoint, for example '5' or '7'. I created a new patch to augment the existing test to cover this special case rather than adding a test. |
|
|
msg279543 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2016-10-27 15:55 |
The patch in was written to implement encoding detecting described in RFC 4627 [1]. And the test uses RFC 4627 conforming data. A single codepoint "5" is not valid in RFC 4627, but is valid in RFC 7159 [2]. The comment in your patch is not accurate since '5' is encoded to 1 byte with utf-8 and 4 bytes with utf-32*. [1] https://tools.ietf.org/html/rfc4627 [2] https://tools.ietf.org/html/rfc7159 |
|
|
msg279573 - (view) |
Author: Eric Appelt (Eric Appelt) * |
Date: 2016-10-28 02:28 |
Thanks for the feedback. I agree that the comment is incorrect for several iterations of the loop that really don't need to be tested at all for '5'. I read the previous issue 17909 more carefully along with RFC 4627, 7159, and EMCA 404 to properly understand the context here. I notice that although the issue 17909 patch was written in order to implement the patterns in RFC 4627, it is also valid for the additional pattern introduced for utf-16-le in RFC 7159 as described in [1]. Specifically, in RFC 4627 the only valid pattern for utf-16-le begins with XX 00 XX 00 since the first two characters must represent ASCII codepoints. With strings the second character can be higher codepoints allowing for XX 00 XX XX or XX 00 00 XX. In the implementation from issue 17909 the pattern XX 00 XX is first detected and results in utf-16-le, and then the 4th byte is checked for the pattern XX 00 00 XX or XX 00 00 00 to result in utf-16-le or utf-32 respectively. In the issue 17909 patch the special case of a single character JSON document (i.e. '5') is also specifically accounted for. This case is not mentioned in [1]. So for everything I can try (or think of), this implementation can correctly determine the encoding of any valid JSON document according to RFC 7159. This is good since the documentation [2] makes no distinction that json.loads() would only accept JSON documents in bytes if they adhere to 4627. The encoding failure mode described in issue 17909 is still an invalid JSON document according to RFC 7159 as the control character \u0000 is not escaped. So I think overall the implementation is perfectly valid for RFC 7159 / EMCA 404 as suggested by the standard library documentation [2]. To me it seems reasonable to have a unit test to specifically check that the pattern XX 00 00 XX works. For the goal of hitting 100% test coverage as measured by line, I believe that the 2-byte case still needs to be tested. I have attached a patch that covers these cases along with (maybe too) verbose comments explaining the edge cases. I also changed the comments in the json/__init__.py module itself to clarify the detection patterns as implemented. I'm not sure how helpful this is to the improvement in of the standard library, but it has been very educational to be so far. [1] http://www.ietf.org/mail-archive/web/json/current/msg01959.html [2] https://docs.python.org/3.7/library/json.html |
|
|
msg279591 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2016-10-28 07:37 |
In general LGTM, but I added few minor comments on Rietveld. Click on the "review" near your patch. You should also receive an email notification, but it can be moved to your Spam folder. |
|
|
msg279628 - (view) |
Author: Eric Appelt (Eric Appelt) * |
Date: 2016-10-28 21:16 |
I was able to inspect the review, and implemented your suggestions. This is attached as a new patch. Please let me know if this is not the correct workflow. Thank you for the prompt review and help as I learn the python tracking and review process! |
|
|
msg279748 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2016-10-30 21:00 |
New changeset ea4cc65fc0fc by Serhiy Storchaka in branch '3.6': Issue #28541: Improve test coverage for encoding detection in json library. https://hg.python.org/cpython/rev/ea4cc65fc0fc New changeset e9169a8c0692 by Serhiy Storchaka in branch 'default': Issue #28541: Improve test coverage for encoding detection in json library. https://hg.python.org/cpython/rev/e9169a8c0692 |
|
|
msg279749 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2016-10-30 21:01 |
Thank you for your contribution Eric. |
|
|