Issue 9679: unicode DNS names in urllib, urlopen

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	baikie, berker.peksag, christian.heimes, cvrebert, demian.brecht, flox, gdamjan, loewis, nagle, ncoghlan, orsenthil, r.david.murray, vstinner
Priority:	normal	Keywords:	patch

Created on 2010-08-25 07:39 by loewis, last changed 2022-04-11 14:57 by admin.

Files
File name	Uploaded	Description
issue9679.patch	demian.brecht, 2015-03-13 22:27	review

Messages (10)
msg114884 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2010-08-25 07:39
Copy of issue 1027206; support in the socket module was provided, but this request remains: Other modules should also support unicode hostnames (httplib already does), but urllib and urllib2 don't.
msg114886 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2010-08-25 07:47
From : it's not clear to me what this request really means. It could mean that Python should support IRIs, but then, I'm not sure whether this support can be in urllib, or whether a separate library would be needed.
msg114899 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-08-25 13:08
There was a discussion about IRI on python-dev in the middle of a discussion about adding a coercible bytes type, but I can't find it. I believe the conclusion was that the best solution for IRI support was a new library that implements the full IRI spec. It is possible that we could just add IDNA support to urllib, but it isn't clear that that work would be worth it when what is really needed is full IRI support. See also , though my guess based on the python-dev discussion and my experience with email is that an IRI library will need to be carefully designed with the py3k bytes/string separation in mind.
msg162722 - (view)	Author: John Nagle (nagle)	Date: 2012-06-13 18:51
An "IRI library" is not needed to fix this problem. It's already fixed in the sockets library and the http library. We just need consistency in urllib2. urllib2 functions which take a "url" parameter should apply "encodings.idna.ToASCII" to each label of the domain name. urllib2 functions which return a "url" value (such as "geturl()") should apply "encodings.idna.ToUnicode" to each label of the domain name. Note that in both cases, the conversion function must be applied to each label (field between "."s) of the domain name only. Applying it to the entire domain name or the entire URL will not work. If there are future changes to domain syntax, those should go into "encodings.idna", which is the proper library for domain syntax issues.
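As a minimal sketch of the per-label conversion described above, the existing encodings.idna functions could be applied roughly like this. The helper names host_to_ascii and host_to_unicode are hypothetical; they are not part of urllib2 or of any patch attached to this issue.

```python
from encodings import idna


def host_to_ascii(host):
    """Apply IDNA ToASCII to each dot-separated label of a host name.

    Hypothetical helper for illustration only.
    """
    # Empty labels (e.g. from a trailing dot) are passed through unchanged,
    # because ToASCII rejects an empty label.
    return ".".join(
        idna.ToASCII(label).decode("ascii") if label else ""
        for label in host.split(".")
    )


def host_to_unicode(host):
    """Apply IDNA ToUnicode to each dot-separated label of a host name.

    Hypothetical helper for illustration only.
    """
    return ".".join(
        idna.ToUnicode(label) if label else ""
        for label in host.split(".")
    )


if __name__ == "__main__":
    print(host_to_ascii("bücher.example"))
    print(host_to_unicode("xn--bcher-kva.example"))
```

Only the host name is passed to these helpers; the rest of the URL would have to be split off first, for example with urllib.parse.urlsplit.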
msg162723 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2012-06-13 19:10
I doubt that unicode domain support in urllib would be of much use without full IRI support. I would think that a domain that uses unicode is highly likely to have URLs that use unicode. However that doesn't mean a patch along the lines you suggest would be rejected out of hand, especially if someone can provide a real web site where it would be helpful.
msg162752 - (view)	Author: John Nagle (nagle)	Date: 2012-06-14 05:07
The current convention is that domains go into DNS lookup as punycode, and the path, query, and fragment fields of the URL are encoded with percent-escapes. See http://lists.w3.org/Archives/Public/ietf-http-wg/2011OctDec/0155.html. Python needs to get with the program here.
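A rough sketch of that convention, assuming UTF-8 percent-escapes for the components after the host and the standard "idna" codec for the host itself, might look like the following. The function name iri_to_uri is hypothetical, and the sketch deliberately ignores userinfo and IPv6 literals; it is not a full IRI implementation.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri):
    """Convert a URL with non-ASCII parts to an ASCII-only URL.

    Hypothetical helper for illustration only. Userinfo is dropped and
    IPv6 literals are not handled.
    """
    parts = urlsplit(iri)

    # Host name: the "idna" codec encodes each label separately.
    netloc = (parts.hostname or "").encode("idna").decode("ascii")
    if parts.port is not None:
        netloc += ":%d" % parts.port

    # Remaining components: UTF-8 percent-escapes. "%" is kept safe so that
    # escapes already present in the input are not encoded a second time.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


if __name__ == "__main__":
    print(iri_to_uri("http://bücher.example/straße?q=ä#ü"))
```

The result is plain ASCII and can be passed to urllib.request.urlopen as an ordinary URL.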
msg162780 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2012-06-14 12:49
As I said, patches to improve the situation are welcome, and if they match with current internet practices they will likely be accepted. It is still the case that such URLs are likely to require extra work on the part of the application to deal with the other unicode parts (your linked reference reinforces that). So, IMO it would be better if someone would do an IRI module. But the fact that nobody has stepped up for that should not prevent us from improving the situation in other ways.
msg162974 - (view)	Author: Florent Xicluna (flox) *	Date: 2012-06-16 14:08
The werkzeug.urls module has examples of such conversion IRI-to-URI: https://github.com/mitsuhiko/werkzeug/blob/master/werkzeug/urls.py#L109,L205
msg237426 - (view)	Author: John Nagle (nagle)	Date: 2015-03-07 07:45
Three years later, I'm converting to Python 3. Did this get fixed in Python 3?
msg238060 - (view)	Author: Demian Brecht (demian.brecht) *	Date: 2015-03-13 22:27
Here's a simple patch that adds functionality matching that in http.client to urllib.request. As pointed out by John, I see no reason why urllib and http.client shouldn't have consistent handling of IDNs independent of IRIs (although IRI encoding would be a nice addition as well).

History
Date	User	Action	Args
2022-04-11 14:57:05	admin	set	github: 53888
2017-01-18 11:40:38	martin.panter	link	issue3991 dependencies
2015-03-13 22:32:09	berker.peksag	set	nosy: + berker.peksag; versions: + Python 3.5, - Python 3.3, Python 3.4
2015-03-13 22:27:48	demian.brecht	set	stage: patch review
2015-03-13 22:27:27	demian.brecht	set	files: + issue9679.patch; keywords: + patch; messages: +
2015-03-13 10:18:44	demian.brecht	set	nosy: + demian.brecht
2015-03-07 07:45:53	nagle	set	messages: +
2013-07-05 23:02:39	christian.heimes	set	nosy: + christian.heimes; versions: + Python 3.4
2012-06-16 14:08:23	flox	set	messages: +
2012-06-14 12:49:59	r.david.murray	set	messages: +
2012-06-14 05:07:21	nagle	set	messages: +
2012-06-13 20:19:50	cvrebert	set	nosy: + cvrebert
2012-06-13 19:10:14	r.david.murray	set	messages: +; versions: + Python 3.3, - Python 3.2
2012-06-13 18:51:10	nagle	set	nosy: + nagle; messages: +
2010-08-25 13:09:05	r.david.murray	set	keywords: - patch, buildbot
2010-08-25 13:08:40	r.david.murray	set	nosy: + r.david.murray, ncoghlan; messages: +; stage: patch review -> (no value)
2010-08-25 07:47:53	loewis	set	messages: +
2010-08-25 07:44:42	loewis	link	issue1027206 superseder
2010-08-25 07:39:27	loewis	create