[Python-Dev] urllib.quote and unquote - Unicode issues
Matt Giuca matt.giuca at gmail.com
Thu Aug 7 12:37:39 CEST 2008
- Previous message: [Python-Dev] urllib.quote and unquote - Unicode issues
- Next message: [Python-Dev] urllib.quote and unquote - Unicode issues
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Wow ... a lot of replies today!
On Thu, Aug 7, 2008 at 2:09 AM, "Martin v. Löwis" <martin at v.loewis.de> wrote:
>> It hasn't been given priority
> There are currently 606 patches in the tracker, many fixing bugs of some sort. It's not clear (to me, at least) why this should be given priority over all the other things such as interpreter crashes.
Sorry ... when I said "it hasn't been given priority" I meant "it hasn't been given a priority" - as in, nobody's assigned a priority to it, whatever that priority should rightfully be.
>> We all agree it's a bug
> no, I don't. I think it's a missing feature, at best, but I'm staying out of the discussion. As-is, urllib only supports ASCII in URLs, and that is fine for most purposes.
Seriously, Mr. L%C3%B6wis, that's a tremendously na%C3%AFve statement.
> URLs are just not made for non-ASCII characters. Implement IRIs if you want non-ASCII characters; the rules are much clearer for these.
Python 3.0 fully supports Unicode. URIs support encoding of arbitrary characters (as of more recent revisions). The difference is that URIs may only consist of ASCII characters (even though they can encode Unicode characters), while IRIs may also consist of Unicode characters. It's our responsibility to implement URIs here ... IRIs are a separate issue.
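To make the URI/IRI distinction concrete, here is a minimal sketch using `urllib.parse` as it exists in current Python 3. That module layout is an assumption on my part, since the API was still being settled when this thread was written. It shows a non-ASCII string being percent-encoded into a string that consists only of ASCII characters, and then decoded back.

```python
from urllib.parse import quote, unquote

name = "Löwis"

# Percent-encode the UTF-8 bytes of the string; the result is pure ASCII.
encoded = quote(name, encoding="utf-8")
print(encoded)                      # L%C3%B6wis
print(encoded.isascii())            # True

# The same text encoded as Latin-1 gives a different (still ASCII) URI string.
print(quote(name, encoding="latin-1"))   # L%F6wis

# Decoding with the matching encoding recovers the original characters.
print(unquote(encoded, encoding="utf-8"))  # Löwis
```

The encoded form is a valid URI component; the unencoded `Löwis` would only be valid in an IRI.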
Having said this, I'm pretty sure Martin can't be convinced, so I'll leave that alone.
On Thu, Aug 7, 2008 at 3:34 AM, M.-A. Lemburg <mal at egenix.com> wrote:
> So unquote() should probably try to decode using UTF-8 first and then fall back to Latin-1 if that doesn't work.
That's an interesting proposal, but I don't think I like it. For a user application that's a good policy, but a programming language library should not do guesswork: it should use the encoding supplied, and have a single default. I'd be interested to hear if anyone else wants this, though.
As-is, it passes 'replace' to the errors argument, so encoding errors get replaced by '�' characters.
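The two behaviours can be compared with a short sketch. It assumes the current `urllib.parse` functions `unquote` and `unquote_to_bytes`. The helper `unquote_with_fallback` is a hypothetical name used only to illustrate the "UTF-8 first, then Latin-1" proposal; it is not part of the library or of the patch under review.

```python
from urllib.parse import unquote, unquote_to_bytes

latin1_quoted = "na%EFve"      # 'naïve' percent-encoded from Latin-1 bytes
utf8_quoted = "na%C3%AFve"     # 'naïve' percent-encoded from UTF-8 bytes

# Single default (UTF-8) with errors='replace': the stray byte 0xEF is not
# valid UTF-8 here, so it is replaced by U+FFFD instead of raising.
print(repr(unquote(latin1_quoted)))   # 'na\ufffdve'
print(unquote(utf8_quoted))           # naïve


def unquote_with_fallback(s):
    """Hypothetical helper: decode as UTF-8, falling back to Latin-1."""
    raw = unquote_to_bytes(s)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never fails,
        # which is also why the result is only a guess.
        return raw.decode("latin-1")


print(unquote_with_fallback(latin1_quoted))  # naïve
print(unquote_with_fallback(utf8_quoted))    # naïve
```

The fallback version never produces a replacement character, but it cannot tell whether the Latin-1 interpretation is what the sender intended, which is the guesswork objected to above.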
OK, I haven't looked at the review yet ... guess it's off to the tracker :)
Matt