[Python-Dev] len(chr(i)) = 2?

James Y Knight foom at fuhm.net
Wed Nov 24 07:27:52 CET 2010


On Nov 24, 2010, at 12:07 AM, Stephen J. Turnbull wrote:

> By the way, to send the ball back into your court, I have this feeling that the demand for UTF-8 is once again driven by native English speakers who are very shortly going to find themselves, and the data they are most familiar with, very much in the minority. Of course the market that benefits from UTF-8 compression will remain very large for the immediate future, but in the grand scheme of things, most of the world is going to prefer UTF-16 by a substantial margin.

No, the demand for UTF-8 exists because that's what much of the internet (and, not coincidentally, the Unix) world has standardized on. The main pieces of software using UTF-16 (Windows, Java) started doing so before it became apparent that 16 bits weren't enough to hold a Unicode codepoint, so they were actually implementing UCS-2. In those days, UCS-2 was a fairly sensible choice.

But now, if your choices are UTF-8 or UTF-16, UTF-8 is clearly superior. Not because it's smaller -- it's pretty much a toss-up -- but because it is an ASCII superset, and thus more easily compatible with other software. That also makes it the encoding most commonly used for internet communication. (So there's a huge advantage to using it internally as well: no transcoding is necessary when writing your HTML output.) UTF-16 is incompatible with ASCII, and furthermore, it's still a variable-width encoding, with all the issues that entails. As such, there's really very little to be said in favor of it.
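To make the two properties above concrete, here is a minimal sketch, assuming a current Python 3 interpreter. The helper name utf16_units and the sample strings are mine, chosen only for illustration.

```python
# Illustrative sketch: names and sample strings are assumptions, not from the thread.

ascii_text = "<p>hello</p>"

# UTF-8 is an ASCII superset: ASCII text encodes to the very same bytes.
assert ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# UTF-16 is not ASCII-compatible: every ASCII character becomes two bytes.
utf16_bytes = ascii_text.encode("utf-16-le")
assert utf16_bytes != ascii_text.encode("ascii")
assert len(utf16_bytes) == 2 * len(ascii_text)


def utf16_units(text):
    """Return how many 16-bit code units UTF-16 needs for `text`."""
    return len(text.encode("utf-16-le")) // 2


bmp = "\u20ac"          # EURO SIGN, inside the Basic Multilingual Plane
astral = "\U0001D11E"   # MUSICAL SYMBOL G CLEF, above U+FFFF

# UTF-16 is variable-width: one codepoint may need one or two code units.
print(utf16_units(bmp), utf16_units(astral))
```

The second codepoint needs a surrogate pair, which is exactly the case that UCS-2-era code tends to overlook.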

If you really want a fixed-width encoding, you have to go to UTF-32, which is excessively large. UTF-32 is a losing choice, simply because of the wasted memory.
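A rough way to see the size trade-off is to encode the same text three ways and count bytes. This is only a sketch, assuming Python 3. The function encoded_sizes and the two sample strings are hypothetical, and real-world ratios depend heavily on how much ASCII markup surrounds the text.

```python
# Illustrative sketch: encoded_sizes and the samples are assumptions for this example.

def encoded_sizes(text):
    """Return the encoded length in bytes of `text` under each encoding."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: len(text.encode(name)) for name in encodings}


english = "hello"
# Five hiragana characters, written as escapes to avoid any source-encoding issues.
japanese = "\u3053\u3093\u306b\u3061\u306f"

for label, sample in (("english", english), ("japanese", japanese)):
    print(label, encoded_sizes(sample))
```

For pure ASCII, UTF-8 is half the size of UTF-16; for pure kana, UTF-16 is two thirds the size of UTF-8; UTF-32 is the largest in both cases.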

But that's all a side issue: even if you do choose UTF-16 as your underlying encoding, you still need to provide iterators that work by "byte" (only now bytes are 16 bits), by codepoint, and by grapheme. Of course, people who implement UTF-16 (such as Python, Java, and Windows) often pretend they're still implementing UCS-2, and don't even bother providing their users with the necessary APIs to do things correctly. You can often get away with that... just so long as you don't mind that you sometimes end up splitting a string in the middle of a codepoint and causing a Unicode error!
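The difference between the iteration levels, and the failure mode of splitting inside a codepoint, can be sketched as follows. This assumes Python 3.3 or later, where str is indexed by codepoint, so UTF-16 code units are simulated by encoding to bytes. The function names are hypothetical, and the standard library offers no grapheme iterator, so that level is only hinted at.

```python
# Illustrative sketch: all function names here are assumptions, not a real API.
import struct


def utf16_code_units(text):
    """Return the 16-bit code units of `text` as a list of ints."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def codepoints(text):
    """Return the codepoints of `text` as a list of ints."""
    return [ord(ch) for ch in text]


def slice_by_code_unit(text, start, stop):
    """Slice at UTF-16 code-unit offsets, the way UCS-2-style code would."""
    data = text.encode("utf-16-le")[2 * start:2 * stop]
    return data.decode("utf-16-le")  # raises if a surrogate pair was split


s = "a\U0001D11Eb"  # 'a', MUSICAL SYMBOL G CLEF, 'b'

print([hex(u) for u in utf16_code_units(s)])  # four code units
print([hex(c) for c in codepoints(s)])        # three codepoints

try:
    slice_by_code_unit(s, 0, 2)  # cuts between the two surrogates
except UnicodeDecodeError as exc:
    print("split inside a codepoint:", exc)

# Graphemes are a further level: 'e' plus COMBINING ACUTE ACCENT is one
# user-perceived character but two codepoints.
print(len("e\u0301"))
```

A slice that respects pair boundaries, such as slice_by_code_unit(s, 1, 3), decodes cleanly; only offsets that land between the two surrogates fail.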

James
