[Python-Dev] Supporting raw bytes data in urllib.parse.* (was Re: Polymorphic best practices)

Nick Coghlan ncoghlan at gmail.com
Sun Sep 19 04:03:03 CEST 2010


On Sun, Sep 19, 2010 at 4:18 AM, John Nagle <nagle at animats.com> wrote:

On 9/18/2010 2:29 AM, python-dev-request at python.org wrote:

Polymorphic best practices [was: (Not) delaying the 3.2 release]

If you're hung up on this, try writing the user-level documentation first. Your target audience is a working-level Web programmer, not someone who knows six programming languages and has a CS degree. If the explanation is too complex, so is the design.

Coding in this area is quite hard to do right. There are issues with character set, HTML encoding, URL encoding, and internationalized domain names. It's often done wrong; I recently found a Google service which botched it. Python libraries should strive to deliver textual data to the programmer in clean Unicode. If someone needs the underlying wire representation it should be available, but not the default.

Even though URL byte sequences are defined as using only an ASCII subset, I'm currently inclined to add raw bytes support to urllib.parse by providing parallel APIs (i.e. urllib.parse.urlsplitb, etc.) rather than doing it implicitly in the normal functions.
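To make the parallel-API idea concrete, here is a rough sketch of what a bytes-only counterpart to urlsplit might look like. The name urlsplitb comes from the proposal above; the body is purely an illustrative assumption (a Latin-1 round trip around the existing text function, chosen because it maps every byte to a code point and back without loss), not an actual standard library implementation.

```python
from urllib.parse import urlsplit


def urlsplitb(url):
    """Hypothetical bytes-only variant of urlsplit (illustrative sketch)."""
    # Refuse text outright: the caller must declare they hold wire data.
    if not isinstance(url, (bytes, bytearray)):
        raise TypeError("urlsplitb() expects bytes, got %s" % type(url).__name__)
    # Latin-1 is an assumption of this sketch: it round-trips arbitrary
    # bytes, so badly encoded URLs still survive the trip.
    parts = urlsplit(bytes(url).decode("latin-1"))
    # Return the five components (scheme, netloc, path, query, fragment)
    # re-encoded as bytes.
    return tuple(part.encode("latin-1") for part in parts)


components = urlsplitb(b"http://example.com/caf\xe9?q=1#frag")
```

With this shape, passing a str to urlsplitb fails immediately with TypeError, so the choice between wire format and text is always explicit at the call site.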

My rationale is as follows:

- while URLs are meant to be encoded correctly as an ASCII subset, the real world isn't always quite so tidy (i.e. applications treat as URLs things that technically are not because the encoding is wrong)
- separating the APIs forces the programmer to declare that they know they're working with the raw bytes off the wire to avoid the decode/encode overhead that comes with working in the Unicode domain
- easier to change our minds later. Adding implicit bytes support to the normal names can be done any time, but removing it would require an extensive deprecation period

Essentially, while I can see strong use cases for wanting to manipulate URLs in wire format, I don't see strong use cases for manipulating URLs without knowing whether they're in wire format (encoded bytes) or display format (Unicode text). For some APIs that work for arbitrary encodings (e.g. os.listdir), switching based on argument type seems like a reasonable idea. For those that may silently produce incorrect output for ASCII-incompatible encodings, the os.environ/os.environb approach of separate names seems better.
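The two styles contrasted above, and the failure mode that motivates the second, can be sketched as follows. The variable name URLDEMO_VAR and the example URL are made up for illustration; the UTF-16 case is just one assumed example of an ASCII-incompatible encoding.

```python
import os

# Style 1: os.listdir switches on argument type (str in, str out;
# bytes in, bytes out). This is safe because it never interprets the
# contents of the names it returns.
text_names = os.listdir(".")
byte_names = os.listdir(b".")

# Style 2: os.environ / os.environb are separately named text and bytes
# views of the same data. os.environb only exists on platforms where
# os.supports_bytes_environ is true.
if os.supports_bytes_environ:
    os.environ["URLDEMO_VAR"] = "value"  # hypothetical variable name
    bytes_view = os.environb[b"URLDEMO_VAR"]

# The failure mode: parsing that assumes ASCII-compatible bytes silently
# gives a wrong answer on UTF-16 data instead of raising an error.
wire = "http://example.com/a".encode("utf-16-le")
scheme = wire.split(b":", 1)[0]
looks_like_http = scheme == b"http"  # False: NUL bytes are interleaved
```

Nothing in the last three lines raises an exception, which is the point: a function that implicitly accepted any bytes could return a plausible-looking but wrong result.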

I could probably be persuaded to merge the APIs, but the email6 precedent suggests to me that separating the APIs better reflects the mental model we're trying to encourage in programmers manipulating text (i.e. the difference between the raw octet sequence and the text character sequence/parsed data).

Cheers, Nick.

-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia
