[Python-Dev] urllib unicode handling
Tom Pinckney thomaspinckney3 at gmail.com
Wed May 7 15:19:41 CEST 2008
I may be missing something, but it seems that RFC 3987 (which is about
IRIs) basically says:
- IRIs are identical to URIs except they may have unicode characters
  in them
- IRIs must be converted to URIs before being used in HTTP
- The way to convert IRIs to URIs is to UTF-8 encode the unicode
  characters in the IRI and then percent-encode the resulting octets
  that are unsafe to have in a URI
- There's some ambiguity over what to do with the hostname portion of
  the URI if it has one (IDN, replace non-ASCII characters with dashes,
  etc.)
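As a rough sketch of the conversion step in the third point (Python 3 syntax; iri_component_to_uri is a hypothetical helper name, not an existing urllib function, and treating everything outside the RFC 3986 "unreserved" set plus "/" as unsafe is my assumption):

```python
import string

# RFC 3986 "unreserved" characters, as ASCII octets.
UNRESERVED = (string.ascii_letters + string.digits + "-._~").encode("ascii")


def iri_component_to_uri(text, safe="/"):
    """Hypothetical helper: UTF-8 encode, then percent-encode unsafe octets."""
    keep = UNRESERVED + safe.encode("ascii")
    out = []
    for octet in text.encode("utf-8"):   # iterating bytes yields ints
        if octet in keep:
            out.append(chr(octet))       # safe ASCII octet, copy through
        else:
            out.append("%{:02X}".format(octet))
    return "".join(out)


print(iri_component_to_uri("/wiki/café"))
```

The non-ASCII character becomes two UTF-8 octets, and each octet is percent-encoded separately.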
If this is indeed the case, it sounds perfectly legal (according to
the RFC) and perfectly practical (as required by numerous popular
websites) to have urllib.quote and urllib.quote_plus do an automatic
UTF-8 encoding of unicode strings before percent encoding them.
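To make the proposal concrete, here is a minimal sketch written against Python 3's urllib.parse, where quote and quote_plus accept byte strings. The wrapper names quote_utf8 and quote_plus_utf8 are hypothetical and only illustrate the suggested behaviour:

```python
from urllib.parse import quote, quote_plus


def quote_utf8(text, safe="/"):
    # Hypothetical wrapper: encode to UTF-8 first, then percent-encode.
    return quote(text.encode("utf-8"), safe=safe)


def quote_plus_utf8(text, safe=""):
    # Same idea, but spaces become "+" as in form-encoded query strings.
    return quote_plus(text.encode("utf-8"), safe=safe)


print(quote_utf8("/search/naïve"))
print(quote_plus_utf8("Zoë Smith"))
```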
It's not entirely clear to me if people should be calling urllib.quote
on hostnames and expecting them to be encoded properly if the hostname
contains non-ascii characters. Perhaps the docs should be clarified on
this matter?
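To illustrate why the hostname case differs, the sketch below (Python 3, using the built-in "idna" codec; the hostname is just an example value) compares percent-encoding a hostname with IDNA-encoding it. The two give different results, which is the ambiguity I mean:

```python
from urllib.parse import quote

host = "bücher.example"

# Percent-encoding the UTF-8 octets, as quote would for a path segment.
percent_encoded = quote(host)

# IDNA (Punycode) encoding, as used for internationalized domain names.
idna_encoded = host.encode("idna").decode("ascii")

print(percent_encoded)
print(idna_encoded)
```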
Similarly, urllib.unquote should percent-decode characters and then
attempt to convert the resulting octets from UTF-8 to unicode. If
that conversion fails, we can assume the octets should be returned as
a byte string rather than a unicode string.
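A minimal sketch of that fallback, again in Python 3 terms (unquote_prefer_text is a hypothetical name; unquote_to_bytes is the existing urllib.parse function that does the percent-decoding):

```python
from urllib.parse import unquote_to_bytes


def unquote_prefer_text(value):
    """Hypothetical helper: percent-decode, then try UTF-8, else keep bytes."""
    octets = unquote_to_bytes(value)
    try:
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8 (e.g. Latin-1 escapes), so hand back the raw octets.
        return octets


print(repr(unquote_prefer_text("caf%C3%A9")))
print(repr(unquote_prefer_text("caf%E9")))
```

One consequence of this design is that the return type depends on the input, so callers would have to handle both text and byte strings.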
On May 7, 2008, at 8:12 AM, Armin Ronacher wrote:
> Hi,
>
> Jeroen Ruigrok van der Werven <asmodai in-nomine.org> writes:
> > Would people object if such functionality got added to urllib?
>
> I would ;-) There are IRIs, just that nobody wrote a useful module
> for that. There are algorithms in the RFC that can convert URIs to
> IRIs and the other way round. IMO that's the way to go.
>
> Regards,
> Armin