On Tue, Jun 22, 2010 at 6:31 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:

Toshio Kuratomi writes:

> I'll definitely buy that. Would urljoin(b_base, b_subdir) => bytes and
> urljoin(u_base, u_subdir) => unicode be acceptable though?

Probably.

But it doesn't matter what I say, since Guido has defined that as
"polymorphism" and approved it in principle.

> (I think, given other options, I'd rather see two separate
> functions, though.

Yes.

> If you want to deal with things like this::
> http://host/café

Yes.

Just for perspective, I don't know if I've ever wanted to deal with a URL like that. I know how it is supposed to work, and I know what a browser does with that, but so many tools will clean that URL up *or* won't be able to deal with it at all that it's not something I'll be passing around. So from a practical point of view this really doesn't come up, and if it did it would be in a situation where you could easily do something ad hoc (though there is not currently a routine to quote unsafe characters in a URL... that would be helpful, though maybe urllib.quote(url.encode('utf8'), '%/:') would do it).

Also while it is problematic to treat the URL-unquoted value as text (because it has an unknown encoding, no encoding, or regularly a mixture of encodings), the URL-quoted value is pretty easy to pass around, and normalization (in this case to http://host/caf%C3%A9) is generally fine.
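As a rough sketch of the ad hoc quoting mentioned above, here is the same idea using Python 3's urllib.parse.quote (the successor to the Python 2 urllib.quote call in the message). The helper name quote_unsafe is hypothetical, and the safe set '%/:' is simply the one suggested above, not a complete URL normalizer.

```python
from urllib.parse import quote


def quote_unsafe(url):
    """Percent-encode unsafe characters in an already-assembled URL.

    Hypothetical helper: '%', '/' and ':' are left alone so that the
    scheme, path separators and existing %-escapes survive unchanged.
    """
    # For str input, quote() encodes to UTF-8 before escaping (its default).
    return quote(url, safe="%/:")


print(quote_unsafe("http://host/café"))
# Already-quoted input passes through untouched because '%' is in the safe set.
print(quote_unsafe("http://host/caf%C3%A9"))
```

With this safe set, characters such as '?', '=', '&' and '#' would also be escaped, so the sketch only suits URLs without a query string or fragment; a fuller version would have to widen the safe set or quote each component separately.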

While it's nice to be correct about encodings, sometimes it is impractical. And it is far nicer to avoid the situation entirely. That is, decoding content you don't care about isn't just inefficient, it's complicated and can introduce errors. The encoding of the underlying bytes of a %-decoded URL is largely uninteresting. Browsers (whose behavior drives a lot of convention) don't touch any of that encoding except lately occasionally to *display* some data in a more friendly way. But it's only display, and errors just make it revert to the old encoded display.
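The display-only behaviour described above might look something like this in Python 3. It is only a sketch: display_url is a hypothetical name, and UTF-8 is an assumed guess at the encoding, not something the URL itself guarantees.

```python
from urllib.parse import unquote


def display_url(quoted_url):
    """Return a friendlier form of a %-quoted URL for display only.

    Hypothetical helper: the quoted form stays the canonical value that
    gets passed around; this result should only be shown to a person.
    """
    try:
        # errors="strict" makes invalid UTF-8 raise instead of being replaced.
        return unquote(quoted_url, encoding="utf-8", errors="strict")
    except UnicodeDecodeError:
        # Unknown or mixed encoding: fall back to the encoded display.
        return quoted_url


print(display_url("http://host/caf%C3%A9"))  # valid UTF-8, shown decoded
print(display_url("http://host/caf%E9"))     # Latin-1 byte, left quoted
```

The point of the fallback is that a wrong guess about the encoding costs nothing: the program never depends on the decoded text.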

Similarly I'd expect (from experience) a programmer using Python to want to take the same approach, sticking with unencoded data in nearly all situations.

--
Ian Bicking | http://blog.ianbicking.org