[Python-Dev] urllib.quote and unquote - Unicode issues
Bill Janssen janssen at parc.com
Mon Jul 14 19:39:42 CEST 2008
Clearly the unquote is str->bytes. You can't pass a Unicode string back as the result of unquote without passing in an encoding specifier, because the character set is application-specific.

So for unquote you're suggesting that it always return a bytes object UNLESS an encoding is specified? As in:

>>> urllib.parse.unquote('h%C3%BCllo')
b'h\xc3\xbcllo'
Yes, that's correct. That's what the RFC says we have to do.
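To illustrate why the character set matters here, a minimal sketch follows. It assumes the `urllib.parse.unquote_to_bytes` function that ships with released Python 3 (that name does not appear in this message), and the variable names are only illustrative. The same percent-decoded bytes give different text depending on which encoding the application assumes.

```python
from urllib.parse import unquote_to_bytes

# Percent-decoding alone yields raw octets, with no character set attached.
raw = unquote_to_bytes('h%C3%BCllo')

# The application has to choose an encoding; the choice changes the text.
as_utf8 = raw.decode('utf-8')      # two octets read as one character
as_latin1 = raw.decode('latin-1')  # each octet read as its own character

print(raw)
print(as_utf8)
print(as_latin1)
```

Both decodes succeed, so nothing in the bytes themselves says which reading was intended.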
I would object to that on two grounds. Firstly, I wouldn't expect or desire a bytes object. The vast majority of uses for unquote will be to get a character string out, not bytes. Secondly, there is a mountain of code (including about 12 modules in the standard library) which call unquote and don't give the user the encoding option, so it's best if we pick a default that is what the majority of users will expect. I argue that that's UTF-8.
Unfortunately, despite your expectations or desires, the spec doesn't allow us that luxury. It's bytes out, and they may even be in a non-standard (not registered with IANA) encoding. There's no way to safely and correctly turn that sequence of bytes into a string. If other modules have been mis-using the interface, they are buggy and should be fixed. There's a lot of buggy stdlib code in Python around the older Web standards.
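As a hedged example of the failure mode, the sketch below takes a URL whose escapes were produced from Latin-1 text and tries to read it as UTF-8. It again assumes `urllib.parse.unquote_to_bytes` from released Python 3, which is not part of this message, and the sample string is made up for illustration.

```python
from urllib.parse import unquote_to_bytes

# 'hüllo' percent-encoded by a sender that used Latin-1 (one octet for 'ü').
raw = unquote_to_bytes('h%FCllo')

# A strict UTF-8 decode rejects the octets outright.
try:
    raw.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# A lenient decode "succeeds" but silently replaces the original character.
lossy = raw.decode('utf-8', errors='replace')

# Only the encoding the sender actually used recovers the text.
recovered = raw.decode('latin-1')
```

A default encoding therefore either raises or loses data whenever the sender used something else.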
I think it would be great to have another function, unquote_to_string, which took an extra "encoding" parameter, and returned a string. It would also be OK to add a keyword parameter to "unquote", I think, which provides an encoding, and causes unquote to return a string. But the standard behavior has to be to return bytes.
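A rough sketch of what such a helper might look like is below. The name `unquote_to_string` comes from the paragraph above. The `errors` parameter and the use of Python 3's `urllib.parse.unquote_to_bytes` for the bytes step are assumptions made for this illustration, not a description of any agreed API.

```python
from urllib.parse import unquote_to_bytes


def unquote_to_string(quoted, encoding, errors='strict'):
    """Hypothetical helper: percent-decode, then decode with a caller-chosen encoding.

    The encoding is a required argument, so the caller must state which
    character set the application expects.
    """
    raw = unquote_to_bytes(quoted)  # bytes out, as the default behavior
    return raw.decode(encoding, errors)


# The caller supplies the application-specific character set explicitly.
from_utf8 = unquote_to_string('h%C3%BCllo', 'utf-8')
from_latin1 = unquote_to_string('h%FCllo', 'latin-1')
```

Keeping the bytes-returning step separate means code that does not know the encoding can still pass the raw octets along untouched.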
I'd prefer having a separate unquoteraw function which is str->bytes, and the unquote function performs the same role as it always has, which is str->str.
Actually, it was originally bytes->bytes, because there was no notion of Unicode strings when it was added. It perhaps got misunderstood during the addition of Unicode support to Python; many people have had trouble wrapping their heads around all this, myself included.
Bill