[Python-Dev] Supporting raw bytes data in urllib.parse.* (was Re: Polymorphic best practices)

Nick Coghlan ncoghlan at gmail.com
Sun Sep 19 04:03:03 CEST 2010


On Sun, Sep 19, 2010 at 4:18 AM, John Nagle <nagle at animats.com> wrote:

On 9/18/2010 2:29 AM, python-dev-request at python.org wrote:

Polymorphic best practices [was: (Not) delaying the 3.2 release]

If you're hung up on this, try writing the user-level documentation first. Your target audience is a working-level Web programmer, not someone who knows six programming languages and has a CS degree. If the explanation is too complex, so is the design.

Coding in this area is quite hard to do right. There are issues with character set, HTML encoding, URL encoding, and internationalized domain names. It's often done wrong; I recently found a Google service which botched it.

Python libraries should strive to deliver textual data to the programmer in clean Unicode. If someone needs the underlying wire representation it should be available, but not the default.

Even though URL byte sequences are defined as using only an ASCII subset, I'm currently inclined to add raw bytes support to urllib.parse by providing parallel APIs (i.e. urllib.parse.urlsplitb, etc) rather than doing it implicitly in the normal functions.
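To make the parallel-API idea concrete, here is a rough sketch of what a bytes-only counterpart might look like. The name urlsplitb comes from the proposal above; it is not an existing standard library function, and the body (strict ASCII decode, reuse urlsplit, re-encode each component) is just one assumed way to implement it, not a settled design.

```python
from urllib.parse import urlsplit


def urlsplitb(url):
    """Hypothetical bytes-only counterpart to urlsplit (not in the stdlib).

    Accepts only bytes and returns a 5-tuple of bytes:
    (scheme, netloc, path, query, fragment).
    """
    if not isinstance(url, (bytes, bytearray)):
        raise TypeError(
            "urlsplitb() expects bytes, got %s" % type(url).__name__
        )
    # Wire-format URLs are limited to an ASCII subset, so a strict ASCII
    # decode either round-trips exactly or fails loudly on bad input.
    parts = urlsplit(bytes(url).decode("ascii"))
    return tuple(part.encode("ascii") for part in parts)


if __name__ == "__main__":
    print(urlsplitb(b"http://example.com/a%20b?x=1#frag"))
```

With this split, passing text to urlsplitb is an immediate TypeError, so the caller always knows which of the two representations is in hand.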

My rationale is as follows:

Essentially, while I can see strong use cases for wanting to manipulate URLs in wire format, I don't see strong use cases for manipulating URLs without knowing whether they're in wire format (encoded bytes) or display format (Unicode text). For some APIs that work for arbitrary encodings (e.g. os.listdir), switching based on argument type seems like a reasonable idea. For those that may silently produce incorrect output for ASCII-incompatible encodings, the os.environ/os.environb approach of separate APIs seems better.
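A small illustration of the distinction, as a sketch only. The first part shows os.listdir switching on argument type. The second part uses a made-up helper, split_on_slash, and UTF-16-LE as an assumed example of an ASCII-incompatible encoding, to show how byte-level processing that assumes ASCII delimiters can quietly return wrong pieces instead of raising an error.

```python
import os

# os.listdir switches on argument type: str in -> str out, bytes in -> bytes out.
text_names = os.listdir(".")
byte_names = os.listdir(b".")


def split_on_slash(data):
    """Illustrative helper: split raw bytes on the ASCII '/' byte."""
    return data.split(b"/")


url = "http://example.com/path"
expected_text = url.split("/")  # ['http:', '', 'example.com', 'path']

# ASCII-compatible wire data: byte-level splitting matches the text result.
ascii_pieces = split_on_slash(url.encode("ascii"))
ascii_ok = ascii_pieces == [p.encode("ascii") for p in expected_text]

# ASCII-incompatible wire data (UTF-16-LE here): every '/' is followed by a
# NUL byte, so the split lands mid-character. No exception is raised; the
# pieces are simply wrong.
utf16_pieces = split_on_slash(url.encode("utf-16-le"))
utf16_ok = utf16_pieces == [p.encode("utf-16-le") for p in expected_text]

if __name__ == "__main__":
    print("ASCII split correct: ", ascii_ok)
    print("UTF-16 split correct:", utf16_ok)
```

The silent failure in the second case is the kind of outcome that a separate, explicitly named bytes API is meant to keep visible to the caller.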

I could probably be persuaded to merge the APIs, but the email6 precedent suggests to me that separating the APIs better reflects the mental model we're trying to encourage in programmers manipulating text (i.e. the difference between the raw octet sequence and the text character sequence/parsed data).

Cheers, Nick.

-- Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia


