Issue 216716: urllib.quote and Unicode

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/33339

classification

Title: urllib.quote and Unicode
Type: enhancement Stage:
Components: Extension Modules Versions:

process

Status: closed Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: doerwalter, gvanrossum, lemburg
Priority: normal Keywords:

Created on 2000-10-12 15:58 by doerwalter, last changed 2022-04-10 16:02 by admin. This issue is now closed.

Messages (3)
msg2032 - Author: Walter Dörwald (doerwalter) (Python committer) Date: 2000-10-12 15:58
Currently urllib.quote does not handle Unicode strings. urllib should be able to handle those. According to http://www.w3.org/TR/html4/appendix/notes.html#h-B.2.1 what is required is:

1. Represent each character in UTF-8 (see [RFC2279]) as one or more bytes.
2. Escape these bytes with the URI escaping mechanism (i.e., by converting each byte to %HH, where HH is the hexadecimal notation of the byte value).

urllib.quote already does 2. For Unicode strings it should do 1. too. This changes the meaning of urllib.quote slightly: an 8-bit string would now be interpreted as being UTF-8 encoded. To fix this, an 8-bit string should be transcoded from the default encoding to UTF-8 first, i.e. what should be inserted at the beginning of quote is:

    if type(s) == types.StringType:
        s = unicode(s, sys.getdefaultencoding())
    s = s.encode("utf8")
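For readers on current Python, the two steps from the HTML 4 note can be sketched roughly as follows. This is an illustration only, not urllib's implementation: the helper name quote_utf8, the default_encoding parameter, and the set of characters left unescaped are all assumptions made for the example.

```python
import string

# Bytes left unescaped. This particular set is an assumption for the sketch.
SAFE = frozenset((string.ascii_letters + string.digits + "_.-/").encode("ascii"))


def quote_utf8(s, default_encoding="ascii"):
    """Hypothetical helper: UTF-8 encode, then percent-escape each byte."""
    if isinstance(s, bytes):
        # An 8-bit string is first decoded from the (assumed) default encoding,
        # mirroring the transcoding step proposed in the message above.
        s = s.decode(default_encoding)
    data = s.encode("utf-8")  # step 1: represent each character as UTF-8 bytes
    # step 2: escape every byte outside the safe set as %HH
    return "".join(chr(b) if b in SAFE else "%%%02X" % b for b in data)


print(quote_utf8("D\u00f6rwald"))  # D%C3%B6rwald
print(quote_utf8(b"a b"))          # a%20b
```

Decoding 8-bit input before encoding keeps the result the same whether the caller passes text or bytes, which is the point of the proposed transcoding step.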
msg2033 - Author: Guido van Rossum (gvanrossum) (Python committer) Date: 2000-10-12 16:13
Sure. Added to PEP-42. I have a feeling that there are probably a lot of places in the standard library where decisions like this may have to be made...! (Exercise for the reader: code this so that it works with JPython too...)
msg2034 - Author: Marc-Andre Lemburg (lemburg) (Python committer) Date: 2000-10-12 17:04
FYI: To recode an 8-bit string using the default encoding into a string using a different encoding, you only have to call the .encode() method on the string object, e.g.

>>> "abc".encode('utf-16')
'\377\376a\000b\000c\000'
>>> "abc".encode('utf-8')
'abc'

But with ASCII as default encoding there's nothing much to recode into UTF-8 anyway ;-) Still, the method makes writing polymorphic code which produces 8-bit strings as output a tad easier. Perhaps JPython's strings should have a .encode() method too...
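In current Python, bytes objects have no .encode() method, so the polymorphic behaviour described in this message would need a small wrapper. The sketch below is one possible shape for it. The name recode and the default_encoding parameter are hypothetical, chosen only for illustration.

```python
def recode(s, target, default_encoding="ascii"):
    """Hypothetical helper: return s as bytes in the target encoding."""
    if isinstance(s, bytes):
        # Treat 8-bit input as text in the assumed default encoding.
        s = s.decode(default_encoding)
    return s.encode(target)


print(recode(b"abc", "utf-8"))     # b'abc'
print(recode("abc", "utf-16-le"))  # b'a\x00b\x00c\x00'
```

The example uses utf-16-le rather than utf-16 so that the output does not depend on the platform's byte order or include a byte-order mark.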
History
Date User Action Args
2022-04-10 16:02:30 admin set github: 33339
2000-10-12 15:58:18 doerwalter create