[Python-Dev] Supporting raw bytes data in urllib.parse.* (was Re: Polymorphic best practices) (original) (raw)

Ian Bicking ianb at colorstudy.com
Tue Sep 21 17:57:11 CEST 2010

Previous message: [Python-Dev] Supporting raw bytes data in urllib.parse.* (was Re: Polymorphic best practices)
Next message: [Python-Dev] Supporting raw bytes data in urllib.parse.* (was Re: Polymorphic best practices)
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Mon, Sep 20, 2010 at 6:19 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:

> What are the cases you believe will cause new mojibake?

Calling operations like urlsplit on byte sequences in non-ASCII compatible encodings and operations like urljoin on byte sequences that are encoded with different encodings. These errors differ from the URL escaping errors you cite, since they can produce true mojibake (i.e. a byte sequence without a single consistent encoding), rather than merely non-compliant URLs. However, if someone has let their encodings get that badly out of whack in URL manipulation they're probably doomed anyway...

FWIW, while I understand the problems non-ASCII-compatible encodings can create, I've never encountered them, perhaps because ASCII-compatible encodings are so dominant.

There are ways you can get a URL (HTTP specifically) where there is no notion of Unicode. I think the use case everyone has in mind here is where you get a URL from one of these sources, and you want to handle it. I have a hard time imagining the sequence of events that would lead to mojibake. Naive parsing of a document in bytes couldn't do it, because if you have a non-ASCII-compatible document your ASCII-based parsing will also fail (e.g., looking for b'href="(.*?)"'). I suppose if you did urlparse.urlsplit(user_input.encode(sys.getdefaultencoding())) you could end up with the problem.

All this is unrelated to the question, though -- a separate byte-oriented function won't help any case I can think of. If the programmer is implementing something like urlparse.urlsplit(user_input.encode(sys.getdefaultencoding())), it's because they want to get bytes out. So if it's named urlparse.urlsplit_bytes() they'll just use that, with the same corruption. Since bytes and text don't interact well, the choice of bytes in and bytes out will be a deliberate one. Or, bytes will unintentionally come through, but that will just delay the error a while when the bytes out don't work (e.g., urlparse.urljoin(text_url, urlparse.urlsplit(byte_url).path). Delaying the error is a little annoying, but a delayed error doesn't lead to mojibake.

Mojibake is caused by allowing bytes and text to intermix, and the polymorphic functions as proposed don't add new dangers in that regard.

-- Ian Bicking | http://blog.ianbicking.org -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.python.org/pipermail/python-dev/attachments/20100921/a07f73ad/attachment-0001.html>

Previous message: [Python-Dev] Supporting raw bytes data in urllib.parse.* (was Re: Polymorphic best practices)
Next message: [Python-Dev] Supporting raw bytes data in urllib.parse.* (was Re: Polymorphic best practices)
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

More information about the Python-Dev mailing list