[Python-Dev] Bug? http.client assumes iso-8859-1 encoding of HTTP headers
Xavier Morel catch-all at masklinn.net
Sat Jan 4 17:50:23 CET 2014
- Previous message: [Python-Dev] Bug? http.client assumes iso-8859-1 encoding of HTTP headers
- Next message: [Python-Dev] Bug? http.client assumes iso-8859-1 encoding of HTTP headers
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On 2014-01-04, at 17:24 , Chris Angelico <rosuav at gmail.com> wrote:
> On Sun, Jan 5, 2014 at 2:36 AM, Hugo G. Fierro <hugo at gfierro.com> wrote:
>> I am trying to download an HTML document. I get an HTTP 301 (Moved Permanently) with a UTF-8 encoded Location header and http.client decodes it as iso-8859-1. When there's a non-ASCII character in the redirect URL then I can't download the document.
>>
>> In client.py def parse_headers() I see the call to decode('iso-8859-1'). My personal hack is to use whatever charset is defined in the Content-Type HTTP header (utf8) or fall back into iso-8859-1. At this point I am not sure where/how a fix should occur so I thought I'd run it by you in case I should file a bug. Note that I don't use http.client directly, but through the python-requests library.
>
> I'm not 100% sure, but I believe non-ASCII characters are outright forbidden in a Location: header. It's possible that an RFC2047 tag might be used, but my reading of RFC2616 is that that's only for text fields, not for Location. These non-ASCII characters ought to be percent-encoded, and anything doing otherwise is buggy.
That is also my reading: the Location field’s value is defined as an absoluteURI (RFC 2616, section 14.30):
Location = "Location" ":" absoluteURI
Section 3.2.1 indicates that "absoluteURI" (and other related concepts) are used as defined by RFC 2396 "Uniform Resource Identifiers (URI): Generic Syntax", that is:
absoluteURI = scheme ":" ( hier_part | opaque_part )
Both "hier_part" and "opaque_part" consist of some punctuation characters, "escaped" and "unreserved". "escaped" is %-encoded characters, which leaves "unreserved", defined as "alphanum | mark". "mark" is more punctuation and "alphanum" is ASCII's alphanumeric ranges, so none of these productions admits a non-ASCII character.
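To make the character-level argument concrete, here is a rough Python sketch that tests whether a string uses only the characters RFC 2396 allows in a URI: "reserved", "unreserved" and %XX escapes. It is a simplification, since it checks the character classes only and not the full absoluteURI grammar, and the function name `uses_only_rfc2396_chars` is made up for this illustration.

```python
import re
import string

# Character classes from RFC 2396, section 2. All of them are ASCII.
ALPHANUM = string.ascii_letters + string.digits
MARK = "-_.!~*'()"
RESERVED = ";/?:@&=+$,"
UNRESERVED = ALPHANUM + MARK

# Any run of reserved/unreserved characters or %XX escapes.
# Assumption: this ignores the structure of absoluteURI (scheme, hier_part, ...)
# and only looks at which characters are present.
_URI_CHARS = re.compile(
    "(?:[" + re.escape(RESERVED + UNRESERVED) + "]|%[0-9A-Fa-f]{2})*"
)


def uses_only_rfc2396_chars(uri: str) -> bool:
    """Return True if every character is reserved, unreserved or part of a %XX escape."""
    return _URI_CHARS.fullmatch(uri) is not None
```

With this check, a percent-encoded URL such as `http://example.com/caf%C3%A9` passes, while the same URL with a literal `é` does not.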
Furthermore, although RFC 3986 moves some stuff around and renames some production rules, it seems to have kept this limitation.
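For a client that nevertheless wants to tolerate such a server, one possible workaround is sketched below. It relies on iso-8859-1 mapping every byte to exactly one code point, so the header value that http.client decoded can be turned back into the original bytes and those bytes percent-encoded. This is only a sketch under that assumption, not something http.client or python-requests is known to do; `percent_encode_location` and `_SAFE` are hypothetical names.

```python
from urllib.parse import quote

# Characters left untouched: RFC 2396 "reserved" and "mark", plus "%" so that
# escapes already present in the value are not encoded a second time.
_SAFE = ";/?:@&=+$,-_.!~*'()%"


def percent_encode_location(location: str) -> str:
    """Percent-encode the non-ASCII bytes of a header value decoded as iso-8859-1."""
    # Assumption: `location` came from decode('iso-8859-1'), so this
    # round trip recovers the exact bytes the server sent.
    raw = location.encode("iso-8859-1")
    # quote() accepts bytes and escapes every byte outside the safe set.
    return quote(raw, safe=_SAFE)


# A UTF-8 "é" (bytes C3 A9) shows up as "Ã©" after iso-8859-1 decoding.
mangled = b"http://example.com/caf\xc3\xa9".decode("iso-8859-1")
fixed = percent_encode_location(mangled)
```

Here `fixed` is `http://example.com/caf%C3%A9`, and a value that is already percent-encoded comes back unchanged.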