[Python-Dev] bytes / unicode

Stephen J. Turnbull stephen at xemacs.org
Mon Jun 21 18:08:53 CEST 2010

Lennart Regebro writes:

> 2010/6/21 Stephen J. Turnbull <stephen at xemacs.org>:
> > IMO, the UI is right. "Something" like the above "ought" to work.
>
> Right. That said, many times when you want to do urlparse etc they might be binary, and you might want binary. So maybe the methods should work with both?

First, a caveat: I'm a Unicode/encodings person, not an experienced web programmer. My opinions on whether this would work well in practice should be taken with a grain of salt.

Speaking for myself, I live in a country where the natives have saddled themselves with no fewer than four encodings in common use, and I would never want "binary" since none of them would display as anything useful in a traceback. Wherever possible, I decode "blobs" into structured objects, I do it as soon as possible, and if for efficiency reasons I want to do this lazily, I store the blob in a separate .raw_object attribute. If they're textual, I decode them to text. I can't see an efficiency argument for decoding URIs lazily in most applications.

In the case of structured text like URIs, I would create a separate class for handling them with string-like operations. Internally, all text would be raw Unicode (ie, not url-encoded); repr(uri) would use some kind of readable quoting convention (not url-encoding) to disambiguate random reserved characters from separators, while str(uri) would produce a url-encoded string. Converting to and from wire format is just .encode and .decode, then, and in this country you need to be flexible about which encoding you use.
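A minimal sketch of what such a class might look like, limited to path segments. The name TextURI and the methods from_wire and to_wire are hypothetical, invented for this illustration; only quote and unquote come from urllib.parse. Segments are stored as plain Unicode, str() yields the url-encoded form, and repr() shows the segments as a list so that a "/" inside a segment cannot be mistaken for a separator.

```python
from urllib.parse import quote, unquote


class TextURI:
    """Hypothetical wrapper: a URI path held as decoded Unicode segments."""

    def __init__(self, segments):
        # Plain text, not url-encoded.
        self.segments = list(segments)

    @classmethod
    def from_wire(cls, raw, encoding="utf-8"):
        # Wire bytes are ASCII with %XX escapes. Split on the real
        # separators first, then unquote each segment with the
        # encoding the caller names explicitly.
        text = raw.decode("ascii")
        return cls(unquote(part, encoding=encoding, errors="strict")
                   for part in text.split("/"))

    def to_wire(self, encoding="utf-8"):
        # safe="" so that a "/" inside a segment is escaped as %2F.
        quoted = (quote(seg, safe="", encoding=encoding)
                  for seg in self.segments)
        return "/".join(quoted).encode("ascii")

    def __str__(self):
        # url-encoded form, using UTF-8 as an assumed default.
        return self.to_wire().decode("ascii")

    def __repr__(self):
        # Readable form: the list keeps separators and data apart.
        return "TextURI(%r)" % (self.segments,)


uri = TextURI.from_wire(b"/docs/%E6%97%A5%E6%9C%AC%E8%AA%9E/a%2Fb")
print(repr(uri))              # readable text, segment boundaries explicit
print(str(uri))               # url-encoded, UTF-8
print(uri.to_wire("euc_jp"))  # same URI, different wire encoding
```

The wire encoding is an argument rather than a fixed choice, which is the flexibility described above; a real class would also need to handle scheme, host, query and fragment.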

Agreed, this stuff is really annoying. But I think that just comes with the territory. PJE reports that folks don't like doing encoding and decoding all over the place. I understand that, but if they're doing a lot of that, I have to wonder why. Why not define the one-line function and get on with life?

The thing is, where I live, it's not going to be a one-line function. I'm going to be dealing with URLs that are url-encoded representations of UTF-8, Shift-JIS, EUC-JP, and occasionally RFC 2047! So I need an API that explicitly encodes and decodes. And I need an API that presents Japanese as Japanese rather than as line noise.
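To make that concrete, here is a small sketch using urllib.parse.quote and unquote, which both accept an explicit encoding. The helper names encode_component and decode_component are made up for this example. The same three-character name has three different url-encoded forms, and decoding only gives Japanese back when the caller names the right encoding.

```python
from urllib.parse import quote, unquote


def encode_component(text, encoding):
    """Url-encode one component using an explicitly named encoding."""
    return quote(text, safe="", encoding=encoding, errors="strict")


def decode_component(wire, encoding):
    """Url-decode one component; raise instead of guessing on mismatch."""
    return unquote(wire, encoding=encoding, errors="strict")


name = "日本語"
wire_forms = {enc: encode_component(name, enc)
              for enc in ("utf-8", "shift_jis", "euc_jp")}

for enc, wire in wire_forms.items():
    # Each form round-trips only with the encoding that produced it.
    print(enc, wire, decode_component(wire, enc))

try:
    # Shift-JIS bytes are not valid UTF-8, so strict decoding refuses.
    decode_component(wire_forms["shift_jis"], "utf-8")
except UnicodeDecodeError as exc:
    print("wrong encoding:", exc.reason)
```

With errors="strict" a mismatch fails loudly here; unquote's default of errors="replace" would instead hand back replacement characters, which is the "line noise" case.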

Eg, PJE writes

> Ugh. I meant:
>
> newurl = urljoin(str(base, 'latin-1'), 'subdir').encode('latin-1')
>
> Which just goes to the point of how ridiculous it is to have to convert things to strings and back again to use APIs that ought to just handle bytes properly in the first place.

But if you need that "everywhere", what's so hard about

    def urljoin_wrapper(base, subdir):
        return urljoin(str(base, 'latin-1'), subdir).encode('latin-1')

Now, note how that pattern fails as soon as you want to use non-ISO-8859-1 languages for subdir names. In Python 3, the code above is just plain buggy, IMHO. The original author probably will never need the generalization. But her name will be cursed unto the nth generation by people who use her code on a different continent.
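The failure is easy to reproduce. The sketch below repeats the wrapper so that it runs on its own; the example.com base URL and the Japanese directory name are assumptions chosen for illustration.

```python
from urllib.parse import urljoin


def urljoin_wrapper(base, subdir):
    # The Latin-1 round trip from the message above.
    return urljoin(str(base, 'latin-1'), subdir).encode('latin-1')


base = b"http://example.com/a/"

# Works while the subdirectory name fits in ISO-8859-1.
ok = urljoin_wrapper(base, "subdir")
print(ok)

# Fails as soon as the name falls outside Latin-1.
try:
    urljoin_wrapper(base, "日本語")
    failed = False
except UnicodeEncodeError as exc:
    failed = True
    print("cannot encode:", exc.reason)
```

The join itself succeeds in both cases; it is the final .encode('latin-1') that raises, so the bug only surfaces once someone passes a name the original author never tried.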

The net result is that bytes are not a programmer- or user-friendly way to do this, except for the minority of the world for whom Latin-1 is a good approximation to their daily-use unibyte encoding (eg, it's probably usable for debugging in Dansk, but you won't win any popularity contests in Tel Aviv or Shanghai).
