On Mon, Sep 20, 2010 at 6:19 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:

> What are the cases you believe will cause new mojibake?

Calling operations like urlsplit on byte sequences in non-ASCII
compatible encodings and operations like urljoin on byte sequences
that are encoded with different encodings. These errors differ from
the URL escaping errors you cite, since they can produce true mojibake
(i.e. a byte sequence without a single consistent encoding), rather
than merely non-compliant URLs. However, if someone has let their
encodings get that badly out of whack in URL manipulation they're
probably doomed anyway...
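
A minimal sketch of the second case, assuming a byte-level join that simply glues its inputs together. `naive_join` and `decodes_as` are hypothetical helpers written for this illustration, not standard-library functions, and the URLs are made up:

```python
def naive_join(base: bytes, rel: bytes) -> bytes:
    # Hypothetical stand-in for a byte-oriented join: drop the last path
    # segment of the base and append the relative reference, byte for byte.
    return base.rsplit(b"/", 1)[0] + b"/" + rel


def decodes_as(data: bytes, encoding: str) -> bool:
    # True if the whole byte sequence is valid in the given encoding.
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# The same kind of text, encoded two different ways before being joined.
base = "http://example.com/caf\u00e9/index.html".encode("utf-8")
rel = "r\u00e9sum\u00e9.html".encode("latin-1")

joined = naive_join(base, rel)

# The result is not valid UTF-8, and reading it as Latin-1 garbles the
# UTF-8 half, so no single encoding recovers the original text.
utf8_ok = decodes_as(joined, "utf-8")
latin1_text = joined.decode("latin-1")
```

Each input was well-formed on its own; the inconsistency only appears once the two are combined.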

FWIW, while I understand the problems non-ASCII-compatible encodings can create, I've never encountered them, perhaps because ASCII-compatible encodings are so dominant.


There are ways you can get a URL (HTTP specifically) where there is no notion of Unicode. I think the use case everyone has in mind here is where you get a URL from one of these sources, and you want to handle it. I have a hard time imagining the sequence of events that would lead to mojibake. Naive parsing of a document in bytes couldn't do it, because if you have a non-ASCII-compatible document your ASCII-based parsing will also fail (e.g., looking for b'href="(.*?)"'). I suppose if you did urlparse.urlsplit(user_input.encode(sys.getdefaultencoding())) you could end up with the problem.
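
To illustrate the naive-parsing point, here is a small sketch assuming the same markup arrives once as UTF-8 and once as UTF-16. The sample document and the name `HREF` are made up for the example:

```python
import re

# ASCII-based byte pattern, as in the href example above.
HREF = re.compile(rb'href="(.*?)"')

html = '<a href="http://example.com/">x</a>'

# In an ASCII-compatible encoding the pattern finds the URL bytes.
utf8_match = HREF.search(html.encode("utf-8"))

# In UTF-16 every ASCII character is padded with a NUL byte, so the
# literal bytes 'href="' never appear and the search finds nothing.
utf16_match = HREF.search(html.encode("utf-16-le"))
```

The parse fails before any URL is extracted, so there is nothing left to hand to `urlsplit` and corrupt.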


All this is unrelated to the question, though -- a separate byte-oriented function won't help any case I can think of. If the programmer is implementing something like urlparse.urlsplit(user_input.encode(sys.getdefaultencoding())), it's because they *want* to get bytes out. So if it's named urlparse.urlsplit_bytes() they'll just use that, with the same corruption. Since bytes and text don't interact well, the choice of bytes in and bytes out will be a deliberate one. *Or*, bytes will unintentionally come through, but that will just delay the error until the bytes that come out fail to work (e.g., urlparse.urljoin(text_url, urlparse.urlsplit(byte_url).path)). Delaying the error is a little annoying, but a delayed error doesn't lead to mojibake.
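
A short sketch of the delayed-error case, using Python 3's `urllib.parse` (the successor to the `urlparse` module named above). It shows how the standard library behaves as released, which may differ in detail from the proposal under discussion; the URLs are placeholders:

```python
from urllib.parse import urljoin, urlsplit

text_url = "http://example.com/a/b"
byte_url = b"http://example.org/c/d"

# Bytes in, bytes out: the path component comes back as bytes.
path = urlsplit(byte_url).path

# Feeding that bytes result into a call that also has text raises an
# error instead of silently combining the two.
try:
    urljoin(text_url, path)
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

The failure surfaces later than the original mistake, but it surfaces as an exception rather than as corrupted output.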


Mojibake is caused by allowing bytes and text to intermix, and the polymorphic functions as proposed don't add new dangers in that regard.

--
Ian Bicking | http://blog.ianbicking.org