Issue 1479977: Heavy revisions to urllib2 howto

Lots of people have been complaining about lack of urllib2 docs (though I'm never quite sure what people are looking for, being too familiar with all the details), so a tutorial may well be a useful addition. I'm sure you'll understand that my brutal criticism :-) is intended to make it even more useful.

Michael: feel free to make further revisions, but unless you have major objections I suggest that this is checked in first, then we make any further changes after that by uploading patches on SF for review (I haven't stepped back and re-read it with a fresh mind, and no doubt it would be useful for somebody to do that). Editing this took me quite a while, and if I can help it I don't want to go through too many revisions or argue about the details before anything gets fixed!-). I've taken the liberty of mentioning myself as a reviewer somewhere at the end of the document :-)

Important: I reformatted paragraphs to max 70 character width (it's conventional, and plain-text diffs are especially painful to read otherwise, though admittedly diffs are never great for paragraphs anyway... I hope emacs didn't muck up any ReST syntax). I've uploaded just that formatting change as reformatted.rst (which also removes trailing whitespace from all lines). This should be done in a separate initial commit of course. For this reason, I've uploaded the whole document for both reformatted (reformatted.rst) and edited versions (edited.rst) rather than using patches.

I've made all of the changes I discuss below, with the exception of the missing example of GET with urlencoded data that's really needed (search for XXX in the comments below) -- that should just need a few lines.

BTW, it would be a really fantastic idea to turn the whole document into a valid doctest (I know I'm myself almost incapable of writing correct examples unless I do something like that). All that would require of course is adding a few >>>s and ...s and running it through doctest.testfile until it stops complaining ;-)
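A rough sketch of the doctest idea, assuming a current Python 3 interpreter (so urlencode is imported from urllib.parse rather than urllib). HOWTO_FRAGMENT and check_document are made-up names for illustration; the real thing would point doctest.testfile at the howto file itself.

```python
import doctest
import os
import tempfile

# Hypothetical fragment standing in for the howto's text.
HOWTO_FRAGMENT = """\
Form data is encoded before it is sent:

    >>> from urllib.parse import urlencode
    >>> urlencode([("name", "Somebody Here"), ("language", "Python")])
    'name=Somebody+Here&language=Python'
"""


def check_document(text):
    """Write text to a temporary file and run it through doctest.testfile."""
    handle, path = tempfile.mkstemp(suffix=".rst")
    try:
        with os.fdopen(handle, "w") as f:
            f.write(text)
        # module_relative=False: treat path as an ordinary filesystem path.
        return doctest.testfile(path, module_relative=False)
    finally:
        os.remove(path)


results = check_document(HOWTO_FRAGMENT)
print(results)
```

doctest.testfile returns a (failed, attempted) pair, so a non-zero failed count means an example in the document has gone stale.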

Now a list explaining and justifying the changes I made:

Spelling / paragraph structure etc. fixes. I won't list these.

Most importantly, you seem a bit unsure who your audience is. For example, on headers -- you explain that "HTTP is based on requests and responses", but dive into User-Agent without actually mentioning what a header is. In my changes, I ended up adding brief explanations of the concepts for people new to or fuzzy about HTTP, but didn't go into details of implementation. For example, introducing the concept of "HTTP header", but not explaining how HTTP implements them "on the wire" (though in fact I think it would be a good thing to add one example that showed an HTTP request and pointed out the request line, the headers and the data, since that makes everything very concrete and easy to grasp for newbies).
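Something like the following would do for the "on the wire" example. It is only a sketch: the host, path and form field are placeholders, and a real client sends a few more headers than this. It is written with the Python 3 name urllib.parse.urlencode (urllib.urlencode in 2.x).

```python
from urllib.parse import urlencode

# The data (request body): one urlencoded form field.
data = urlencode([("name", "value")])

# The request line: method, path, protocol version.
request_line = "POST /cgi-bin/register.cgi HTTP/1.1"

# The headers: one "Name: value" pair per line.
headers = [
    "Host: www.example.com",
    "Content-Type: application/x-www-form-urlencoded",
    "Content-Length: %d" % len(data),
]

# A blank line separates the headers from the data.
request_text = "\r\n".join([request_line] + headers) + "\r\n\r\n" + data
print(request_text)
```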

Removed link to external howto on cookie handling. Despite the description ("How to handle cookies, when fetching web pages with Python."), this actually spends most of its time discussing what conditional imports are needed if you want to be maximally compatible across libraries and older versions of Python. While that is certainly useful for people who need that, I think this is a rather obscure and distracting detail that seems out of place being referenced from the Python 2.5 documentation, even in a howto. Perhaps some general statement that further tutorials are available on your site? Referencing your basic auth tutorial seems fine.

You limit mention of urllib2.urlopen(url) to a footnote, and in the text of the tutorial itself, you say: """urllib2 mirrors this by having you form a Request""". That's not true: a string URL is fine, as you explain in the footnote. That seems an inaccuracy with no obvious didactic payoff. In the footnote, you say:

"""You can fetch URLs directly with urlopen, without using a request object. It's more explicit, and therefore more Pythonic, to use urllib2.Request though. It also makes it easier to add headers to your request."""

I find that bizarre! Why is urlopen(url) unpythonic?? On the contrary, using an extra object for no reason does seem unpythonic to me. I rewrote this a bit.
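To make the point concrete, here is a small sketch showing that urlopen takes either form. It uses the Python 3 spelling (urllib.request; urllib2 in 2.x), and a data: URL purely so that it runs without a network. That is an assumption of the sketch: data: URLs are handled by urllib.request but not by urllib2.

```python
from urllib.request import Request, urlopen

# A data: URL keeps the sketch offline; an http: URL is passed the same way.
url = "data:text/plain,hello"

# Plain string URL: no Request object needed.
with urlopen(url) as response:
    direct = response.read()

# Explicit Request: only worth it when there is something to attach to it,
# such as data or headers.
with urlopen(Request(url)) as response:
    via_request = response.read()

print(direct == via_request)
```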

You needlessly assign the_url = "http:...", then request = Request(the_url) -- why not a single line? Where it's useful to do that (i.e. in the more complicated examples), I've s/the_url/url/, since I object to chaff like "the_" in variable names ;-)

Your discussion of Request implies that it only represents HTTP requests. Fixed that.
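A quick illustration, using the Python 3 class urllib.request.Request (urllib2.Request in 2.x, where the scheme is read with get_type() rather than the type attribute). The hosts are placeholders and nothing is fetched.

```python
from urllib.request import Request

# Request just describes "a URL plus optional data and headers";
# the scheme decides which handler eventually deals with it.
http_request = Request("http://www.example.com/index.html")
ftp_request = Request("ftp://ftp.example.com/pub/README")

print(http_request.type, ftp_request.type)
```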

Use of the word "handle" to talk about response objects is unfortunate for two reasons: First, many objects in Python are "handles" in some sense ("object reference" semantics), so it's too vague to be a helpful name. Second, it's particularly unfortunate to use the word "handle" when urllib2 makes heavy use of "handler" objects that "handle" requests. The fact that methods on these handlers often return your "handles" only makes things more confusing! s/handle/response/

"""Sometimes you want to POST data to a CGI (Common Gateway Interface) [#]_ or other web application"""

It's clear to us old hands what you mean here, but in a tutorial at the level you seem to have picked we probably shouldn't expect the reader to have all these concepts straight, so being sloppy here is bad.

I rewrote this bit to try to address those points.

Re POST: """This is what your browser does when you fill in a FORM on the web"""

That needed qualifying: form submission can also result in a GET.

I added a bit on side-effects and GET/POST.

"""You may be mimicking a FORM submission, or transmitting data to your own application."""

This reads oddly to me. I know what you're getting at (forms are not part of HTTP), but surely if you are submitting form data you're not "mimicking" form submission, you are submitting a form. And in an English sentence the "or" reads as an "exclusive or"; with that in mind: In what sense does form submission not involve "transmitting data to your own application"? Reworded and s/FORM/HTML form/, since we're talking about the abstract thing rather than specifically about the HTML element.

"""In either case the data needs to be encoded for safe transmission over HTTP"""

Arbitrary binary data does not need to be URL-encoded. Rephrased.

"""The encoding is done using a function from the urllib library not from urllib2. ::"""

This is not true in general even for HTML forms. For example, HTML form file upload data is not encoded in this way. There are more obscure cases, too. Noted this.
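For instance, a POST body can be arbitrary bytes sent as-is, with a Content-Type that says so. A minimal sketch with the Python 3 Request class (urllib2.Request in 2.x); the URL and payload are invented, and this is not how HTML form file upload is encoded (that uses multipart/form-data).

```python
from urllib.request import Request

# Arbitrary binary data: sent unchanged, no urlencode step.
payload = bytes([0, 255, 16, 32])

request = Request(
    "http://www.example.com/upload",  # placeholder URL, never contacted here
    data=payload,
    headers={"Content-Type": "application/octet-stream"},
)

# Supplying data makes this a POST.
print(request.get_method(), len(request.data))
```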

The quoted User-Agent string was out-of-date. Fixed, noting that it changes with each minor Python version.
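Rather than quoting a fixed string, one can ask an opener what it will send. A sketch with the Python 3 module name (urllib.request; urllib2 in 2.x):

```python
import sys
from urllib.request import build_opener

# Every opener starts out with a default User-agent header
# that embeds the running interpreter's major.minor version.
opener = build_opener()
user_agent = dict(opener.addheaders)["User-agent"]

print(user_agent)
print("running on Python %d.%d" % sys.version_info[:2])
```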

Headers / data : I added a bit of explanatory context to tell people what we're about to explain, and break up paragraphs / add sections to clarify the structure. Also explained the concept of "HTTP header", as I noted above.

XXX example needed on GET with urlencoded data (as it's written ATM, this would go immediately before the "Headers" section).
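The missing example could look roughly like this. It is a sketch, not the text that went into the howto: the URL and field names are placeholders, and it uses the Python 3 names (urllib.parse.urlencode and urllib.request.Request; urllib.urlencode and urllib2.Request in 2.x, where the POST data would be passed as a str).

```python
from urllib.parse import urlencode
from urllib.request import Request

url = "http://www.example.com/cgi-bin/search.cgi"  # placeholder
query = urlencode([("name", "Somebody Here"), ("language", "Python")])

# GET: the encoded data goes into the URL itself, after a "?".
get_request = Request(url + "?" + query)

# POST: the same encoded data is passed separately as the request data.
post_request = Request(url, data=query.encode("ascii"))

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.full_url)
```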

"""Coping With Errors"""

"Handling exceptions" seems more accurate. Not all HTTP status codes for which urllib2 raises an exception involve HTTP error responses. The text is also confused on this point, so I rewrote it.

Errors: I believe urlopen can still actually raise socket.error. This is a bug, but I haven't dared to submit a patch to fix it, fearing backwards-compatibility issues. I guess it should probably be documented :-( But I suppose we should discuss that in a separate tracker item, rather than adding it to your howto straight away.

You mention IOError. Without a motivating use case I don't know why you mention this. Since I'm not really sure what the use case for this subclassing was ever intended to be :-) I removed this example: feel free to add it back if you know of a use or can get Jeremy Hylton to explain it to you ;-)

Re URLError : you imply that the only reason for URLError to be raised is failure to connect to the server. This is often the cause, but certainly not always.
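A sketch of the exception-handling pattern, with one case where URLError has nothing to do with connecting to a server. It uses the Python 3 names (urllib.error and urllib.request; both exceptions live in urllib2 in 2.x). fetch is a made-up helper, and the data: URL is there only so the success path runs offline (urllib2 itself does not handle data: URLs).

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url):
    """Return a (kind, detail) pair describing what happened."""
    try:
        with urlopen(url) as response:
            return ("ok", response.read())
    except HTTPError as e:
        # HTTPError is a subclass of URLError, so it must be caught first.
        return ("http-error", e.code)
    except URLError as e:
        return ("url-error", str(e.reason))


# No handler knows this scheme, so URLError is raised without any
# connection being attempted.
outcome = fetch("no-such-scheme://example.com/")
print(outcome)
```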

For HTTP status codes, you refer to a document that states "This is a historic document and is not accurate anymore". RFC 2616 is authoritative, and IMHO fairly readable on error codes. Removed the reference to the other document.

"""As of Python 2.5 a dictionary like this one has become part of urllib2."""

In fact, this was moved to httplib. The reference to "HTTPBaseServer" (sic) is interesting: I think the copy in httplib should be removed, since it's already there in BaseHTTPServer (albeit missing 306, but that is unused) -- would you mind filing a patch, Michael?

Your listing differed from BaseHTTPServer and from RFC 2616, so I replaced it with the BaseHTTPServer copy.
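For reference, both copies can be inspected directly. This sketch uses the Python 3 module names, http.client and http.server (httplib and BaseHTTPServer in 2.5).

```python
from http.client import responses
from http.server import BaseHTTPRequestHandler

# The httplib copy maps a status code to its reason phrase.
reason = responses[404]

# The BaseHTTPServer copy maps a status code to a
# (short message, long message) pair.
short, long_ = BaseHTTPRequestHandler.responses[404]

print(reason)
print(short)
```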

"""shows all the defined response codes"""

These are only those defined by RFC 2616 of course: other standards can and do define other response status codes (e.g. DAV). Clarified this.

"""When an error is raised the server responds by returning an http error code and an error page."""

This is sloppy: HTTP doesn't define "raising" an error, so it can't respond to one. Fixed.

httplib.HTTPMessage

Reworded to avoid implying it's always going to be this concrete class.

"""In versions of Python prior to 2.3.4 it wasn't safe to iterate over the object directly, so you should iterate over the list returned by msg.keys() instead."""

Is this appropriate advice in the 2.5 docs? I removed this (am I too harsh on this point?).
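For what it's worth, on a current interpreter the headers object can simply be treated as a mapping without caring about its concrete class. A sketch using Python 3's urllib.request (urllib2 in 2.x); the data: URL is only there so the example runs offline, and urllib2 does not support it.

```python
from urllib.request import urlopen

with urlopen("data:text/plain,hello") as response:
    headers = response.info()

# Use the mapping interface; don't rely on which class info() returns.
for name, value in headers.items():
    print("%s: %s" % (name, value))

content_type = headers["Content-Type"]  # lookup is case-insensitive
```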

"""Openers and handlers are slightly esoteric parts of urllib2."""

I don't want to scare people off: they're easy to use (if not to write). Removed this.

I added a tiny bit more on what handlers do.

Changed the text to avoid implying that build_opener() is the only way to create openers.

Don't refer to opener in those typewriter-font ReST backticks, since that seems a little misleading: it's not a Python class name (unfortunately the class is named OpenerDirector, which rather clashes with the use of the name "opener" of course, but personally I'm with you in preferring "opener").

Wrote a bit more about opener construction.
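A sketch of the two routes to an opener, using the Python 3 module name (urllib.request; urllib2 in 2.x). The variable names are only for illustration.

```python
from urllib.request import HTTPHandler, OpenerDirector, build_opener

# The convenient route: build_opener() installs the default handlers.
default_opener = build_opener()

# The manual route: start from an empty OpenerDirector and add
# exactly the handlers wanted.
manual_opener = OpenerDirector()
http_handler = HTTPHandler()
manual_opener.add_handler(http_handler)

print(len(default_opener.handlers), len(manual_opener.handlers))
```

Either object is "an opener" in the howto's sense: something with an open() method that delegates to its handlers.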

Changed realm name to make it clear it may contain spaces.

Changed references to URI to URL in discussion of authentication -- seems an irrelevant and distracting distinction here.

I edited the basic auth description a little.

Comments conventionally come before the code they refer to, not after. Fixed that, removed an over-obvious comment or two (even in docs, "create the handler" seems redundant if that's all it says), and fixed the curious line breaks.
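Roughly the shape I have in mind for the basic auth example, as a sketch with the Python 3 module name (urllib.request; urllib2 in 2.x). The realm, URL and credentials are invented, and nothing is fetched. Note that the URL passed to add_password includes the scheme.

```python
from urllib.request import HTTPBasicAuthHandler, HTTPPasswordMgr, build_opener

password_mgr = HTTPPasswordMgr()

# The realm is whatever the server announces; it may contain spaces.
password_mgr.add_password(
    "Example Realm With Spaces",
    "http://example.com/foo/",
    "username",
    "password",
)

# An opener that answers basic auth challenges using the manager above.
opener = build_opener(HTTPBasicAuthHandler(password_mgr))

credentials = password_mgr.find_user_password(
    "Example Realm With Spaces", "http://example.com/foo/"
)
print(credentials)
```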

"""The only reason to explicitly supply these to build_opener (which chains handlers provided as a list), would be to change the order they appear in the chain."""

I don't know of a use case for that in the case of the handlers you list. Also, that doesn't actually work: handler ordering is determined by sorting. Removed this.
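A sketch that shows the sorting, using the Python 3 module name (urllib.request; urllib2 in 2.x). The proxy address is a placeholder and is never contacted; make_proxy is a made-up helper.

```python
from urllib.request import HTTPHandler, ProxyHandler, build_opener


def make_proxy():
    # proxy.example.com is a placeholder; nothing is contacted here.
    return ProxyHandler({"http": "http://proxy.example.com:3128/"})


# Same handlers, opposite argument order.
http_first = build_opener(HTTPHandler(), make_proxy())
proxy_first = build_opener(make_proxy(), HTTPHandler())

# Handlers are kept sorted by their handler_order attribute, so the
# position in the argument list does not decide who goes first.
for opener in (http_first, proxy_first):
    first = opener.handlers[0]
    print(type(first).__name__, first.handler_order)
```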

"""One thing not to get bitten by is that the top_level_url in the code above must not contain the protocol - the http:// part. So if the URL we are trying to access is"""

This is not correct usage (though I can see why it worked); removed it. Admittedly, urllib2 auth was the subject of quite a few bug fixes recently (I seem to have just found yet another one five minutes ago, in fact :-( ), so the situation pre-2.5 was certainly messy. However, I advise against trying to document the old bugs! Note that I haven't given examples of "sub-URLs" since the RFC (2617) isn't clear to me on this point, and I haven't yet tested whether urllib2 gets it right according to de-facto standards (as defined by browsers, Apache, etc.) for "sub-URLs" of the one passed to .add_password(). It's on the list...

In your note explaining that HTTPS proxies are not supported, you use "caution" rather than "note", which conveys the strange implication to me that this lack of support is somehow a consequence of using your previous recipe for switching off proxy handling (or am I weird in reading it that way??). s/caution/note/

""".. [#] Possibly some of this tutorial will make it into the standard library docs for versions of Python after 2.4.1."""

Removed this.

Whew!