msg114884 - (view) |
Author: Martin v. Löwis (loewis) *  |
Date: 2010-08-25 07:39 |
Copy of issue 1027206; support in the socket module was provided, but this request remains: other modules should also support unicode hostnames. httplib already does, but urllib and urllib2 don't. |
|
|
msg114886 - (view) |
Author: Martin v. Löwis (loewis) *  |
Date: 2010-08-25 07:47 |
From : it's not clear to me what this request really means. It could mean that Python should support IRIs, but then, I'm not sure whether this support can be in urllib, or whether a separate library would be needed. |
|
|
msg114899 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2010-08-25 13:08 |
There was a discussion about IRI on python-dev in the middle of a discussion about adding a coercible bytes type, but I can't find it. I believe the conclusion was that the best solution for IRI support was a new library that implements the full IRI spec. It is possible that we could just add IDNA support to urllib, but it isn't clear that the work would be worth it when what is really needed is full IRI support. See also , though my guess based on the python-dev discussion and my experience with email is that an IRI library will need to be carefully designed with the py3k bytes/string separation in mind. |
|
|
msg162722 - (view) |
Author: John Nagle (nagle) |
Date: 2012-06-13 18:51 |
An "IRI library" is not needed to fix this problem. It's already fixed in the sockets library and the http library. We just need consistency in urllib2. urllib2 functions which take a "url" parameter should apply "encodings.idna.ToASCII" to each label of the domain name. urllib2 functions which return a "url" value (such as "geturl()") should apply "encodings.idna.ToUnicode" to each label of the domain name. Note that in both cases, the conversion function must be applied to each label (field between "."s) of the domain name only. Applying it to the entire domain name or the entire URL will not work. If there are future changes to domain syntax, those should go into "encodings.idna", which is the proper library for domain syntax issues. |
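
A minimal sketch of the per-label conversion described above, written for Python 3. It is not code from urllib2 or from any patch on this issue; the helper names `host_to_ascii` and `host_to_unicode` are hypothetical. Empty labels (for example from a trailing dot) are passed through unchanged, which is an assumption of this sketch.

```python
from encodings import idna


def host_to_ascii(host: str) -> str:
    """Apply ToASCII to each dot-separated label of a host name."""
    labels = []
    for label in host.split("."):
        # ToASCII rejects empty labels, so pass them through untouched.
        labels.append(idna.ToASCII(label).decode("ascii") if label else label)
    return ".".join(labels)


def host_to_unicode(host: str) -> str:
    """Apply ToUnicode to each dot-separated label of a host name."""
    return ".".join(
        idna.ToUnicode(label) if label else label
        for label in host.split(".")
    )
```

For whole host names, the built-in "idna" codec (`host.encode("idna")`) performs the same label splitting internally, so the explicit loop is mainly useful when the labels need to be handled one at a time.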
|
|
msg162723 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2012-06-13 19:10 |
I doubt that unicode domain support in urllib would be of much use without full IRI support. I would think that a domain that uses unicode is highly likely to have URLs that use unicode. However that doesn't mean a patch along the lines you suggest would be rejected out of hand, especially if someone can provide a real web site where it would be helpful. |
|
|
msg162752 - (view) |
Author: John Nagle (nagle) |
Date: 2012-06-14 05:07 |
The current convention is that domains go into DNS lookup as punycode, and the path, query, and fragment fields of the URL are encoded with percent-escapes (see http://lists.w3.org/Archives/Public/ietf-http-wg/2011OctDec/0155.html). Python needs to get with the program here. |
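
A rough sketch of that convention using only the Python 3 standard library: the host is converted with the "idna" codec, and the remaining components are percent-escaped as UTF-8. The function name `iri_to_uri` and the choice of "safe" characters are assumptions made for illustration, not a full IRI implementation. The sketch does not escape non-ASCII userinfo and does not handle bracketed IPv6 literals.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left unescaped in addition to letters, digits and "_.-~".
# "%" is kept so that existing percent-escapes are not escaped twice.
_PATH_SAFE = "/%:@!$&'()*+,;="
_QUERY_SAFE = _PATH_SAFE + "?"


def iri_to_uri(iri: str) -> str:
    """Convert a URL with Unicode parts to an ASCII-only form."""
    parts = urlsplit(iri)
    netloc = parts.netloc
    if parts.hostname:
        # The "idna" codec encodes each label of the host name.
        host = parts.hostname.encode("idna").decode("ascii")
        userinfo, at, _ = parts.netloc.rpartition("@")
        port = f":{parts.port}" if parts.port is not None else ""
        netloc = f"{userinfo}{at}{host}{port}"
    return urlunsplit((
        parts.scheme,
        netloc,
        quote(parts.path, safe=_PATH_SAFE),
        quote(parts.query, safe=_QUERY_SAFE),
        quote(parts.fragment, safe=_QUERY_SAFE),
    ))
```

With this sketch, `iri_to_uri("http://bücher.example/straße")` yields `http://xn--bcher-kva.example/stra%C3%9Fe`, and a URL that is already pure ASCII comes back unchanged.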
|
|
msg162780 - (view) |
Author: R. David Murray (r.david.murray) *  |
Date: 2012-06-14 12:49 |
As I said, patches to improve the situation are welcome, and if they match with current internet practices they will likely be accepted. It is still the case that such URLs are likely to require extra work on the part of the application to deal with the other unicode parts (your linked reference reinforces that). So, IMO it would be *better* if someone would do an IRI module. But the fact that nobody has stepped up for that should not prevent us from improving the situation in other ways. |
|
|
msg162974 - (view) |
Author: Florent Xicluna (flox) *  |
Date: 2012-06-16 14:08 |
The werkzeug.urls module has examples of such IRI-to-URI conversion: https://github.com/mitsuhiko/werkzeug/blob/master/werkzeug/urls.py#L109,L205 |
|
|
msg237426 - (view) |
Author: John Nagle (nagle) |
Date: 2015-03-07 07:45 |
Three years later, I'm converting to Python 3. Did this get fixed in Python 3? |
|
|
msg238060 - (view) |
Author: Demian Brecht (demian.brecht) *  |
Date: 2015-03-13 22:27 |
Here's a simple patch that adds functionality matching that in http.client to urllib.request. As pointed out by John, I see no reason why urllib and http.client shouldn't have consistent handling of IDNs independent of IRIs (although IRI encoding would be a nice addition as well). |
|
|