Issue 4773: HTTPMessage not documented and has inconsistent API across Py2/Py3

The file-like object u returned by the urlopen() function in both Python 2.6 and 3.0 has a method info() that returns an 'HTTPMessage' object. For example:

::: Python 2.6

>>> from urllib2 import urlopen
>>> u = urlopen("http://www.python.org")
>>> u.info()
<httplib.HTTPMessage instance at 0xce5738>

::: Python 3.0

>>> from urllib.request import urlopen
>>> u = urlopen("http://www.python.org")
>>> u.info()
<http.client.HTTPMessage object at 0x4bfa10>

So far, so good. HTTPMessage is defined in two different modules, but that's fine (it's just library reorganization).

Two major problems:

  1. There is no documentation whatsoever on HTTPMessage. There is no description in the docs for httplib (Python 2.6) or http.client (Python 3.0).

  2. The HTTPMessage object in Python 2.6 derives from mimetools.Message and has a totally different programming interface from HTTPMessage in Python 3.0, which derives from email.message.Message. Check it out:

::: Python 2.6

>>> dir(u.info())
['__contains__', '__delitem__', '__doc__', '__getitem__', '__init__', '__iter__', '__len__', '__module__', '__setitem__', '__str__', 'addcontinue', 'addheader', 'dict', 'encodingheader', 'fp', 'get', 'getaddr', 'getaddrlist', 'getallmatchingheaders', 'getdate', 'getdate_tz', 'getencoding', 'getfirstmatchingheader', 'getheader', 'getheaders', 'getmaintype', 'getparam', 'getparamnames', 'getplist', 'getrawheader', 'getsubtype', 'gettype', 'has_key', 'headers', 'iscomment', 'isheader', 'islast', 'items', 'keys', 'maintype', 'parseplist', 'parsetype', 'plist', 'plisttext', 'readheaders', 'rewindbody', 'seekable', 'setdefault', 'startofbody', 'startofheaders', 'status', 'subtype', 'type', 'typeheader', 'unixfrom', 'values']

::: Python 3.0

>>> dir(u.info())
['__class__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_charset', '_default_type', '_get_params_preserve', '_headers', '_payload', '_unixfrom', 'add_header', 'as_string', 'attach', 'defects', 'del_param', 'epilogue', 'get', 'get_all', 'get_boundary', 'get_charset', 'get_charsets', 'get_content_charset', 'get_content_maintype', 'get_content_subtype', 'get_content_type', 'get_default_type', 'get_filename', 'get_param', 'get_params', 'get_payload', 'get_unixfrom', 'getallmatchingheaders', 'is_multipart', 'items', 'keys', 'preamble', 'replace_header', 'set_boundary', 'set_charset', 'set_default_type', 'set_param', 'set_payload', 'set_type', 'set_unixfrom', 'values', 'walk']

I know that getting rid of mimetools was desired, but I have no idea if changing the API on HTTPMessage was intended or not. In any case, it's one of the only cases in the entire library where the programming interface to an object radically changes from 2.6 -> 3.0.

I ran into this problem with code that was trying to properly determine the charset encoding of the byte string returned by urlopen().
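For illustration, here is a minimal Python 3 sketch of that kind of charset lookup, using the email.message.Message interface that HTTPMessage inherits. The helper names charset_of and decode_body and the ISO-8859-1 fallback are assumptions made for this example, not part of the library, and the headers are built in memory rather than fetched with urlopen(). In Python 2.6 the nearest equivalent would presumably be u.info().getparam('charset'), since get_content_charset() does not exist on the mimetools-based object.

```python
from http.client import HTTPMessage


def charset_of(headers, default="iso-8859-1"):
    """Return the charset declared in Content-Type, or a fallback.

    The fallback value is an assumption for this sketch.
    """
    return headers.get_content_charset() or default


def decode_body(body, headers):
    """Decode a response body (bytes) using the declared charset."""
    return body.decode(charset_of(headers))


# Stand-in for what u.info() returns in Python 3; no network access needed.
headers = HTTPMessage()
headers["Content-Type"] = "text/html; charset=utf-8"

text = decode_body("caf\u00e9".encode("utf-8"), headers)
```

With a real response, the same helper would be called as decode_body(u.read(), u.info()).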

I haven't checked whether 2to3 deals with this or not, but it might be something for someone to look at in their copious amounts of spare time.

There is a difference in what HTTPResponse.getheaders() returns.

Python 2.7.2 (default, Jun 12 2011, 14:24:46) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import httplib
>>> c = httplib.HTTPConnection('www.joelverhagen.com')
>>> c.request('GET', '/sandbox/tests/cookies.php')
>>> c.getresponse().getheaders()
[('content-length', '0'), ('set-cookie', 'test_cookie1=foobar; expires=Fri, 02-Mar-2012 16:54:15 GMT, test_cookie2=barfoo; expires=Fri, 02-Mar-2012 16:54:15 GMT'), ('vary', 'Accept-Encoding'), ('server', 'Apache'), ('date', 'Fri, 02 Mar 2012 16:53:15 GMT'), ('content-type', 'text/html')]

Python 3.2.2 (default, Sep 4 2011, 09:07:29) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from http import client
>>> c = client.HTTPConnection('www.joelverhagen.com')
>>> c.request('GET', '/sandbox/tests/cookies.php')
>>> c.getresponse().getheaders()
[('Date', 'Fri, 02 Mar 2012 16:56:40 GMT'), ('Server', 'Apache'), ('Set-Cookie', 'test_cookie1=foobar; expires=Fri, 02-Mar-2012 16:57:40 GMT'), ('Set-Cookie', 'test_cookie2=barfoo; expires=Fri, 02-Mar-2012 16:57:40 GMT'), ('Vary', 'Accept-Encoding'), ('Content-Length', '0'), ('Content-Type', 'text/html')]

As you can see, HTTPResponse.getheaders() in 2.7.2 joins headers with the same name using ", ". In 3.2.2, the headers are kept separate, as two or more 2-tuples.

This causes problems if you convert the list of 2-tuples to a dict, because the keys collide (causing all but one of the values associated with the non-unique keys to be overwritten). It looks like this problem is caused by using the email header parser (which keeps the keys and values as separate 2-tuples). In Python 2.7.2, the HTTPMessage.addheader(...) function does the comma-separating.
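The collision, and one possible workaround, can be sketched as follows. The helper merge_headers is hypothetical (it is not a library function); it lowercases names and joins repeated values with ", ", roughly mimicking the 2.7.2 output shown above. The sample pairs are taken from the 3.2.2 session.

```python
def merge_headers(pairs):
    """Fold a list of (name, value) 2-tuples into a dict.

    Values of repeated header names are joined with ", " and names are
    lowercased, as in the Python 2.7.2 output above (hypothetical helper).
    """
    merged = {}
    for name, value in pairs:
        key = name.lower()
        if key in merged:
            merged[key] += ", " + value
        else:
            merged[key] = value
    return merged


pairs = [
    ("Set-Cookie", "test_cookie1=foobar; expires=Fri, 02-Mar-2012 16:57:40 GMT"),
    ("Set-Cookie", "test_cookie2=barfoo; expires=Fri, 02-Mar-2012 16:57:40 GMT"),
    ("Content-Type", "text/html"),
]

naive = dict(pairs)            # the first Set-Cookie value is silently lost
merged = merge_headers(pairs)  # both Set-Cookie values survive, comma-joined
```

Note that comma-joining is lossy for Set-Cookie in particular, because the expires date itself contains a comma; code that needs the individual cookies is probably better off keeping the values in a list.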

Is this API change intentional? Should HTTPResponse.getheaders() comma-separate the values like the HTTPResponse.getheader(...) function (in both 2.7.2 and 3.2.2)?
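For comparison, the email.message.Message interface that the Python 3 HTTPMessage inherits already exposes both views of a repeated header. The sketch below builds the message in memory (the cookie values are shortened from the session above) and joins the values by hand; the join is only an illustration of the comma-separated form described for getheader(...), not the library's own code.

```python
from http.client import HTTPMessage

msg = HTTPMessage()
# Assigning the same header twice appends a second header; it does not replace the first.
msg["Set-Cookie"] = "test_cookie1=foobar"
msg["Set-Cookie"] = "test_cookie2=barfoo"

first = msg.get("Set-Cookie")      # only the first matching header
every = msg.get_all("Set-Cookie")  # list of all matching values
joined = ", ".join(every)          # comma-separated form, built by hand
```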

See also: https://github.com/shazow/urllib3/issues/3#issuecomment-3008415