msg83434 - (view) |
Author: Dan Mahn (dmahn) |
Date: 2009-03-10 14:45 |
urllib.parse.urlencode() uses quote_plus() extensively to create a complete query string, but doesn't take full advantage of the flexibility built into quote_plus(). Namely: 1) Instances of type "bytes" are not properly encoded, because str() is applied before the value is passed to quote_plus(). This creates a nonsensical string such as b'1234', while quote_plus() can handle these types properly if they are passed intact. The ability to encode this type is particularly useful for putting binary data into the query string, or for pre-encoded text which you may want to encode in a non-standard character encoding. 2) Sometimes it would be desirable to encode query strings entirely in "latin-1" or possibly "ascii" instead of "utf-8". Adding the extra parameters now present on quote_plus() can easily give that extra functionality. I have attached a new version of urlencode() that provides both of the above fixes/enhancements. Additionally, an unused codepath in the existing function has been eliminated. Some doctests are included as well. |
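The first point can be illustrated with a minimal sketch that calls quote_plus() directly, once on the result of str() and once on the intact bytes object. The sample value b'\xa0\x24' is the one discussed later in this thread; the variable names are only illustrative.

```python
from urllib.parse import quote_plus

raw = b'\xa0\x24'  # non-UTF-8 bytes, the value discussed later in the thread

# Passing the value through str() first quotes the repr of the bytes
# object, including the b'...' wrapper, rather than the octets themselves.
via_str = quote_plus(str(raw))

# Passing the bytes object intact percent-encodes each octet.
direct = quote_plus(raw)

print(via_str)
print(direct)
```

The first result carries the quoted b'...' wrapper into the query string, which is the behavior the report describes as nonsensical.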
|
|
msg83448 - (view) |
Author: Dan Mahn (dmahn) |
Date: 2009-03-10 22:20 |
I also made some tests for the new code that could be added to the unit tests in test_urllib.py |
|
|
msg84216 - (view) |
Author: Jeremy Hylton (jhylton)  |
Date: 2009-03-26 20:57 |
I'm not sure I understand the part of the code that deals with binary strings. I agree the current behavior is odd. RFC 2396 says that non-ascii characters must be encoded as utf-8 and then percent escaped. In the test case you started with, you encoded b'\xa0\x24'. It doesn't seem like this should be allowed, since it is not valid utf-8. |
|
|
msg84228 - (view) |
Author: Dan Mahn (dmahn) |
Date: 2009-03-26 22:27 |
Hello. Thanks for the feedback. With regard to RFC 2396, I see this: http://www.ietf.org/rfc/rfc2396.txt ==== There is a second translation for some resources: the sequence of octets defined by a component of the URI is subsequently used to represent a sequence of characters. A 'charset' defines this mapping. There are many charsets in use in Internet protocols. For example, UTF-8 [UTF-8] defines a mapping from sequences of octets to sequences of characters in the repertoire of ISO 10646. ==== To me, that text does not indicate that URLs are always encoded in UTF-8. It indicates that URL information may be encoded in character sets ('charset') other than ASCII, and when it is, the values must be sent as escaped values. Here, I note the specific words "many charsets in use" and "For example", before the reference to UTF-8. I have also done a few tests, and have found that in practice, browsers do not always encode URLs as UTF-8. This actually seems to differ depending on which part of the URL is being encoded. For instance, my Firefox will encode the path portion of a URL as UTF-8, but encode the query string as Latin-1. I think that the general idea is that URL data must be encoded into ASCII, but the data being encoded may be of some "charset" which may be application-defined. And in the most general sense, I would argue that the data could simply be binary data. (Actually, Latin-1 pretty much uses all the codes from 0 to 255, so it's very much like plain binary data anyway.) I hope that clarifies what I am reading in RFC 2396. In addition, quote_plus() already handles all the cases I placed into urlencode(). I suppose the actual test cases may be debatable, but I did specifically choose tests with data which would be recognized as something other than UTF-8. |
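As a rough illustration of the charset argument, the sketch below quotes the same text under two encodings using the encoding parameter of quote_plus(). The sample string is a hypothetical value chosen for illustration, not one taken from the attached tests.

```python
from urllib.parse import quote_plus

text = 'caf\u00e9'  # hypothetical sample value with one non-ASCII character

as_utf8 = quote_plus(text)                         # default encoding is UTF-8
as_latin1 = quote_plus(text, encoding='latin-1')   # one octet per character

# Latin-1 maps code points 0-255 onto the same octet values, so quoting a
# Latin-1 string gives the same result as quoting the equivalent raw bytes.
same_as_bytes = as_latin1 == quote_plus(text.encode('latin-1'))

print(as_utf8)
print(as_latin1)
print(same_as_bytes)
```

Both results are plain ASCII; only the octets behind the percent escapes differ, which is the sense in which the charset is left to the application.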
|
|
msg84260 - (view) |
Author: Jeremy Hylton (jhylton)  |
Date: 2009-03-27 14:50 |
Indeed, I think I confused some other character encoding issues related to HTTP with the URI issue. The discussion in RFC 3986 is lengthy and only occasionally clarifying for this issue. That is, it doesn't say anything definitive, such as that applications are free to use any character encoding when decoding a URI. But I think it agrees with your assessment that an application is free to interpret the binary data however it wants, e.g. http://tools.ietf.org/html/rfc3986#section-2.1 |
|
|
msg89416 - (view) |
Author: Miles Kaufmann (milesck) |
Date: 2009-06-15 21:50 |
parse_qs and parse_qsl should also grow encoding and errors parameters to pass to the underlying unquote(). |
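For reference, current Python 3 releases do accept encoding and errors keyword arguments on parse_qs(). The sketch below shows the difference they make when decoding; the query string is a hypothetical example assumed to have been percent-encoded from Latin-1.

```python
from urllib.parse import parse_qs

query = 'name=caf%E9'  # hypothetical query string, encoded from Latin-1

# Decoding with the charset the data was actually encoded in.
decoded_latin1 = parse_qs(query, encoding='latin-1')

# Defaults are encoding='utf-8' and errors='replace', so the lone 0xE9
# octet is not valid UTF-8 and becomes U+FFFD.
decoded_default = parse_qs(query)

print(decoded_latin1)
print(decoded_default)
```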
|
|
msg92029 - (view) |
Author: Miles Kaufmann (milesck) |
Date: 2009-08-28 09:38 |
I've attached a patch that provides similar functionality to Dan Mahn's urlencode(), as well as providing encoding and errors parameters to parse_qs and parse_qsl, updating the documentation to reflect the added parameters, and adding test cases. The implementation of urlencode() is not the same as dmahn's, and has a more straightforward control flow and less code duplication than the current implementation. (For the tests, I tried to match the style of the file I was adding to with regard to (expect, result) order, which is why it's inconsistent.) |
|
|
msg108290 - (view) |
Author: Terry J. Reedy (terry.reedy) *  |
Date: 2010-06-21 17:23 |
The question of whether % escape should be limited to utf-8 or not was discussed and decided in favor of 'not' in #3300, quote and unquote. Last December, a websig post (referenced yesterday on pydev) reported a 'problem' that would be solved by Miles' suggestion to include parse_qs and parse_qsl. |
|
|
msg109101 - (view) |
Author: Senthil Kumaran (orsenthil) *  |
Date: 2010-07-02 11:45 |
I see no problem in going ahead with the suggestion proposed and the patch. - I checked with RFC3986 Section 2.5 http://labs.apache.org/webarch/uri/rfc/rfc3986.html#identifying-data Relevant line: When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded. - This is done already in quote and quote_plus. - It just boils down to urlencode also providing the same facility for query strings and that was the point of this bug report. Jeremy, I shall go ahead with this and do the modifications, if required. |
|
|
msg109187 - (view) |
Author: Senthil Kumaran (orsenthil) *  |
Date: 2010-07-03 17:59 |
Fixed and committed in revision 82510 (py3k) and revision 82511 (release31-maint). This fixes the urlencode issue. parse_qs and parse_qsl can have the same capabilities; that will be done subsequently (in another commit or issue). Thanks, Dan, for the bug report and patch. |
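A short sketch of the resulting behavior, assuming a Python 3 release that includes this change. The dictionary keys and the text value are hypothetical; the bytes value is the one discussed earlier in the thread.

```python
from urllib.parse import urlencode

# Bytes values are percent-encoded octet by octet instead of via str().
binary = urlencode({'data': b'\xa0\x24'})

# Text values are encoded with the requested charset before quoting.
latin1 = urlencode({'q': 'caf\u00e9'}, encoding='latin-1')
utf8 = urlencode({'q': 'caf\u00e9'})  # UTF-8 remains the default

print(binary)
print(latin1)
print(utf8)
```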
|
|