[Python-Dev] Supporting raw bytes data in urllib.parse.* (was Re: Polymorphic best practices)

Chris McDonough chrism at plope.com
Mon Sep 20 23:39:13 CEST 2010


On Tue, 2010-09-21 at 07:12 +1000, Nick Coghlan wrote:

> On Tue, Sep 21, 2010 at 4:30 AM, Chris McDonough <chrism at plope.com> wrote:
> > Existing APIs save for "quote" don't really need to deal with charset
> > encodings at all, at least on any level that Python needs to care about.
> > The potential already exists to emit garbage which will turn into
> > mojibake from almost all existing APIs. The only remaining issue seems
> > to be fear of making a design mistake while designing APIs.
> >
> > IMO, having a separate module for all urllib.parse APIs, each designed
> > for only bytes input is a design mistake greater than any mistake that
> > could be made by allowing for both bytes and str input to existing APIs
> > and returning whatever type was passed. The existence of such a module
> > will make it more difficult to maintain a codebase which straddles
> > Python 2 and Python 3.

> Failure to use quote/unquote correctly is a completely different problem from using bytes with an ASCII incompatible encoding, or mixing bytes with different encodings. Yes, if you don't quote your URLs you may end up with mojibake. That's not a justification for creating a new way to accidentally create mojibake.

As far as I can tell, allowing bytes input creates no new way to accidentally produce mojibake.

If a user passes something that has character data outside the range 0-127 to an API that expects a URL or a "component" (in the sense that urllib.parse.urlparse uses "component") of a URI, he can keep both pieces when it breaks. Whether that data is represented as bytes or as text is not relevant. He provided bad input, and he is going to lose one way or another.
If a user passes a bytestring to quote, the case is already handled, because quote is implemented in terms of quote_from_bytes, which makes no attempt to decode non-ASCII bytes.
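
A minimal sketch of the quote behaviour described above, assuming the urllib.parse module as it ships in current Python 3 releases. The sample string and the variable names are illustrative only and do not come from the thread.

```python
from urllib.parse import quote, quote_from_bytes

text = "caf\u00e9"            # illustrative sample, not from the thread
utf8 = text.encode("utf-8")   # b'caf\xc3\xa9'
latin1 = text.encode("latin-1")  # b'caf\xe9'

# Bytes input is handed straight to quote_from_bytes: each non-ASCII
# byte is percent-escaped as-is, with no attempt to decode it.
escaped_utf8 = quote(utf8)
escaped_latin1 = quote(latin1)
same_as_from_bytes = escaped_utf8 == quote_from_bytes(utf8)

# Text input is encoded first (UTF-8 by default), then escaped.
escaped_text = quote(text)

# quote() refuses an explicit encoding for bytes, since there is
# nothing left to encode.
try:
    quote(utf8, encoding="utf-8")
    rejected_encoding = False
except TypeError:
    rejected_encoding = True
```

Whatever encoding the caller used for the bytes is preserved byte for byte in the escaped output, so quote itself adds no new way to mix encodings.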

What are the cases you believe will cause new mojibake?

> Separating the APIs means that application programmers will be expected to know whether they are working with data formatted for display to the user (i.e. Unicode text) or transfer over the wire (i.e. ASCII compatible bytes).
>
> Can you give me a concrete use case where the application programmer won't know which format they're working with? Py3k made the conscious decision to stop allowing careless mixing of encoded and unencoded text. This is just taking that philosophy and propagating it further up the API stack (as has already been done with several OS facing APIs for 3.2).

Yes: code that is meant to straddle both Python 2 and Python 3 and must explicitly deal with bytes input and output. Please try to write some code which 1) uses the same codebase to straddle Python 2.6 and Python 3.2 and 2) passes bytes input to, and expects bytes output from, say, urlsplit. It becomes complex very quickly. A proposal to create yet another bytes-only API only makes it more complex, AFAICT.
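
A rough sketch of the kind of shim this implies, under the assumption that the available urlsplit only accepts text. The name urlsplit_bytes is hypothetical, and the latin-1 round trip is just one possible way to keep every byte intact; neither is an API proposed in this thread.

```python
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit      # Python 2


def urlsplit_bytes(url):
    """Hypothetical helper: split a bytes URL and return bytes components.

    Assumes urlsplit only handles text, so the URL is decoded as latin-1
    (which maps every byte to exactly one code point) and each component
    is encoded back the same way.
    """
    if not isinstance(url, bytes):
        raise TypeError("expected bytes, got %s" % type(url).__name__)
    parts = urlsplit(url.decode("latin-1"))
    return tuple(part.encode("latin-1") for part in parts)


# Illustrative input, not taken from the thread.
components = urlsplit_bytes(b"http://example.com/a%20b?x=1#frag")
```

Every caller that handles bytes URLs across both Python versions would need a wrapper of this sort around each parsing function it uses, which is where the complexity comes from.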
