[Python-Dev] Supporting raw bytes data in urllib.parse.* (was Re: Polymorphic best practices)

Chris McDonough chrism at plope.com
Tue Sep 21 00:28:44 CEST 2010


On Tue, 2010-09-21 at 08:19 +1000, Nick Coghlan wrote:

On Tue, Sep 21, 2010 at 7:39 AM, Chris McDonough <chrism at plope.com> wrote:
> On Tue, 2010-09-21 at 07:12 +1000, Nick Coghlan wrote:
>> On Tue, Sep 21, 2010 at 4:30 AM, Chris McDonough <chrism at plope.com> wrote:
>> > Existing APIs save for "quote" don't really need to deal with charset
>> > encodings at all, at least on any level that Python needs to care about.
>> > The potential already exists to emit garbage which will turn into
>> > mojibake from almost all existing APIs. The only remaining issue seems
>> > to be fear of making a design mistake while designing APIs.
>> >
>> > IMO, having a separate module for all urllib.parse APIs, each designed
>> > for only bytes input is a design mistake greater than any mistake that
>> > could be made by allowing for both bytes and str input to existing APIs
>> > and returning whatever type was passed. The existence of such a module
>> > will make it more difficult to maintain a codebase which straddles
>> > Python 2 and Python 3.
>>
>> Failure to use quote/unquote correctly is a completely different
>> problem from using bytes with an ASCII incompatible encoding, or
>> mixing bytes with different encodings. Yes, if you don't quote your
>> URLs you may end up with mojibake. That's not a justification for
>> creating a new way to accidentally create mojibake.
>
> There's no new way to accidentally create new mojibake here by allowing
> bytes input, as far as I can tell.
>
> - If a user passes something that has character data outside the range
> 0-127 to an API that expects a URL or a "component" (in the
> definition that urllib.parse.urlparse uses for "component") of a URI,
> he can keep both pieces when it breaks. Whether that data is
> represented via bytes or text is not relevant. He provided
> bad input, he is going to lose one way or another.
>
> - If a user passes a bytestring to quote, because quote is
> implemented in terms of quote_to_bytes the case is already
> handled by quote_to_bytes implicitly failing to convert non-ASCII
> characters.
>
> What are the cases you believe will cause new mojibake?
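(To make the quote behaviour described in the quoted text concrete, here is a
small Python 3 illustration; note that in the current stdlib the bytes-level
helper is spelled quote_from_bytes, and this snippet is illustrative only:

    from urllib.parse import quote, quote_from_bytes

    # str input: characters outside the unreserved set are encoded
    # (UTF-8 by default) and then percent-escaped.
    quote("caf\u00e9")                        # -> 'caf%C3%A9'

    # bytes input: the bytes are percent-escaped as-is, with no idea of
    # (or interest in) the encoding that produced them.
    quote("caf\u00e9".encode("latin-1"))      # -> 'caf%E9'
    quote_from_bytes(b"caf\xe9")              # -> 'caf%E9'

    # Note that the result is a str in every case; the return type does
    # not follow the argument type, the asymmetry discussed further down.
)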

Calling operations like urlsplit on byte sequences in non-ASCII-compatible encodings, and operations like urljoin on byte sequences that are encoded with different encodings. These errors differ from the URL escaping errors you cite, since they can produce true mojibake (i.e. a byte sequence without a single consistent encoding), rather than merely non-compliant URLs. However, if someone has let their encodings get that badly out of whack in URL manipulation they're probably doomed anyway...
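(A byte-level sketch of that hazard, using plain concatenation to stand in for
a hypothetical bytes-in/bytes-out join rather than urllib.parse itself:
splicing together data produced by different codecs yields a byte sequence
with no single consistent encoding.

    base = "/caf\u00e9/".encode("utf-8")      # b'/caf\xc3\xa9/'
    rel  = "men\u00fc".encode("latin-1")      # b'men\xfc'

    joined = base + rel                       # b'/caf\xc3\xa9/men\xfc'

    # The result no longer decodes cleanly under either encoding:
    joined.decode("utf-8")                    # raises UnicodeDecodeError
    joined.decode("latin-1")                  # '/cafÃ©/menü' -- mojibake
)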

Right, the bytes issue here is really a red herring in both the urlsplit and urljoin cases, I think.

It's certainly possible I hadn't given enough weight to the practical issues associated with migration of existing code from 2.x to 3.x (particularly with the precedent of some degree of polymorphism being set back when Issue 3300 was dealt with).

Given that a separate API still places the onus on the developer to manage their encodings correctly, I'm beginning to lean back towards the idea of a polymorphic API rather than separate functions. (The quote/unquote legacy becomes somewhat unfortunate in that situation, as they always return str objects rather than allowing the type of the result to be determined by the type of the argument. Something like quote_p/unquote_p may prove necessary in order to work around that situation and provide a bytes->bytes, str->str API.)

Yay, sounds much, much better!
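(For illustration only, a rough sketch of what the quote_p/unquote_p idea
mentioned above might look like when layered over the existing functions; the
names and the wrapping approach here are hypothetical, not an actual stdlib
API:

    from urllib.parse import quote, unquote, unquote_to_bytes

    def quote_p(string, safe='/'):
        """Hypothetical polymorphic quote: bytes -> bytes, str -> str."""
        if isinstance(string, bytes):
            # quote() accepts bytes but always returns str; re-encode so
            # the result type follows the argument type.  quote() output
            # is pure ASCII, so the round-trip is lossless.
            return quote(string, safe=safe).encode('ascii')
        return quote(string, safe=safe)

    def unquote_p(string):
        """Hypothetical polymorphic unquote: bytes -> bytes, str -> str."""
        if isinstance(string, bytes):
            return unquote_to_bytes(string)
        return unquote(string)
)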


