[Python-Dev] Python in Unicode context
"Martin v. Löwis" martin at v.loewis.de
Thu Aug 5 09:21:37 CEST 2004
François Pinard wrote:
However, and I shall have the honesty to state it, this is not respectful of the general Unicode spirit: the Python implementation allows for independently addressable surrogate halves
This is only a problem if you have data which require surrogates (which I claim are rather uncommon at the moment), and you don't have a UCS-4 build of Python (in which surrogates don't exist). As more users demand convenient support for non-BMP characters, you'll find that more builds of Python become UCS-4. In fact, you might find that the build you are using already has sys.maxunicode > 65535.
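For instance, here is a quick way to tell which kind of build you have (a minimal sketch, in the Python 2 syntax of the day):

    import sys

    print sys.maxunicode     # 65535 on a narrow build, 1114111 on UCS-4
    s = u"\U00010000"        # the first non-BMP code point
    print len(s)             # 2 on a narrow build (a surrogate pair), 1 on UCS-4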
combining zero-width diacritics
Indeed. However, it is not clear to me how this problem could be addressed, and I'm not aware of any API (in any language) that addresses it.
Typically, people need things like this:
- in a fixed-width terminal, which characters occupy which columns. Notice that this involves East Asian wide characters, where a single Unicode character (a "wide" character) occupies two columns. OTOH, with combining characters, a sequence of characters might map to a single column. Furthermore, some code points might not be associated with any column at all (a rough sketch follows this list).
- for a given font, how many points a string occupies, horizontally and vertically.
- where the next word break is.
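For the first of these, a crude approximation is possible with the unicodedata module; the sketch below assumes unicodedata.east_asian_width, which only becomes available in Python 2.4, and deliberately ignores control characters and other complications:

    import unicodedata

    def columns(u):
        """Estimate the number of terminal columns occupied by u."""
        width = 0
        for ch in u:
            if unicodedata.combining(ch):
                continue                    # combining marks take no column
            elif unicodedata.east_asian_width(ch) in ('W', 'F'):
                width += 2                  # wide and fullwidth characters
            else:
                width += 1
        return width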
I don't know what your application is, but I somewhat doubt it is as simple as "give me a thing describing the nth character, including combining diacritics".
However, it is certainly possible to implement libraries on top of the existing code, and if there is a real need for that, somebody will contribute it.
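As a very naive illustration of what such a library might offer, one could group a base character with the combining marks that follow it; this is nowhere near the full grapheme rules, just a sketch:

    import unicodedata

    def graphemes(u):
        group = u""
        for ch in u:
            if group and not unicodedata.combining(ch):
                yield group
                group = u""
            group += ch
        if group:
            yield group

    print list(graphemes(u"e\u0301le\u0300ve"))
    # [u'e\u0301', u'l', u'e\u0300', u'v', u'e']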
normal and decomposed forms,
Terminology alert: there are multiple normal forms in Unicode, and some of them are decomposed (e.g. NFD and NFKD).
I fail to see a problem with that. There are applications for all normal forms, and many applications don't need the overhead of normalization. It might be that the code for your languages becomes simpler when always assuming NFC, but this hardly holds for all languages, or all applications.
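The unicodedata.normalize function (available since Python 2.3) gives access to all the forms; for example:

    import unicodedata

    s = u"\u00e9"                                # precomposed 'é' (already NFC)
    d = unicodedata.normalize('NFD', s)          # decomposed: u'e' + u'\u0301'
    print len(s), len(d)                         # 1 2
    print unicodedata.normalize('NFC', d) == s   # True: the round trip is lossless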
directional marks, linguistic marks and various other such complexities.
Same comment as above: if this becomes a real problem, people will contribute code to deal with it.
But in our case, where applications already work in Latin-1, abusing our Unicode luck, UTF-8 may not be used as is; we ought to use Unicode or wide strings as well, to preserve S[N] addressability. So changing source encodings may be intimately tied to going Unicode whenever UTF-8 (or any other variable-length encoding) gets into the picture.
Yes. There is not much Python can do about this. UTF-8 is very nice for transfer of character data, but it does have most of the problems of a multi-byte encoding. I still prefer it over UTF-16 or UTF-32 for transfer, though.
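The addressability problem is easy to demonstrate; once you decode back to a Unicode object, indexing works per character again:

    s = u"caf\u00e9"
    b = s.encode("utf-8")
    print len(s), len(b)               # 4 5 -- the é became two bytes
    print repr(b[3])                   # '\xc3', half a character
    print repr(b.decode("utf-8")[3])   # u'\xe9', a whole character again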
I hope that my explanation above helps in seeing that source encoding and choice of string literals are not as independent as one may think.
It really depends on your processing needs. But yes, my advice still stands: convert to Unicode objects as early as possible in the processing. For source code involving non-ASCII characters, this means you really should use Unicode literals.
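Concretely, the pattern is to decode at the input boundary and encode only at the output boundary; the file names and encoding below are just illustrative assumptions:

    f = open("letter.txt", "rb")
    text = f.read().decode("iso-8859-1")    # decode as early as possible
    f.close()

    text = text.upper()                     # all processing on unicode objects

    out = open("letter.out", "wb")
    out.write(text.encode("iso-8859-1"))    # encode only on the way out
    out.close()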
Of course, my other advice also applies: if you have a program that deals with multiple languages, use only ASCII in the source, and use gettext for the messages.
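A minimal sketch of that arrangement, assuming a catalog for a hypothetical 'myapp' domain has been installed under 'locale/':

    import gettext

    t = gettext.translation('myapp', 'locale', fallback=True)
    _ = t.ugettext              # ugettext returns unicode objects

    print _("Hello, world")     # source stays pure ASCII; translation happens at runtime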
There ought to be a way to maintain a single Python source that works dependably through re-encoding of the source, without uselessly relying on wide strings when there is no need for them. That is, without marking all literal strings as Unicode. Changing the encoding from ISO 8859-1 to UTF-8 should not be a one-way, no-return ticket.
But it is not: as you say, you have to add u prefixes when going to UTF-8. But then you can go back to Latin-1 with no change other than recoding the file and changing the encoding declaration. The string literals can all stay Unicode literals; the conversion to Latin-1 then really has no effect on the runtime semantics.
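To make this concrete, a file like the following can be recoded between Latin-1 and UTF-8 at will, provided only that the declaration is updated to match the bytes on disk; the literal denotes the same unicode object either way:

    # -*- coding: iso-8859-1 -*-
    name = u"François"
    assert name == u"Fran\u00e7ois"   # holds under either source encoding
    print name.encode("utf-8")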
Of course, it is very normal that sources may have to be adapted for the possibility of a Unicode context. There should be some good style and habits for writing re-encodable programs; hence this exchange of thoughts.
If that is the goal, you really need Unicode literals - everything else will break under re-encoding.
Regards, Martin