[Python-Dev] Python in Unicode context

François Pinard pinard at iro.umontreal.ca
Tue Aug 3 21:35:16 CEST 2004


[Martin von Löwis]

François Pinard wrote:

> maybe some kind of `module.__coding__' next to `module.__file__', saving the coding effectively used while compilation was going on.

That would be possible to implement. Feel free to create a patch.

I might try, and it would be my first Python patch. But please, please tell me if the idea is not welcome, as my free time is rather short and I already have a lot of things waiting for me! :-).
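As a rough illustration of the information such an attribute would carry (this is not the proposed patch, and no `__coding__' attribute is assumed to exist), the sketch below recovers the declared coding from raw source bytes. It assumes a modern Python 3 interpreter and its standard `tokenize.detect_encoding' function; the helper name `declared_coding' and the two sample sources are made up for the example.

```python
import io
import tokenize


def declared_coding(source_bytes):
    """Return the encoding the tokenizer would pick for this source text."""
    # detect_encoding() wants a readline-style callable over bytes.
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding


# Two hypothetical module sources, identical except for their coding cookie.
latin1_source = b"# -*- coding: iso-8859-1 -*-\nname = 'Fran\xe7ois'\n"
utf8_source = b"# -*- coding: utf-8 -*-\nname = 'Fran\xc3\xa7ois'\n"

print(declared_coding(latin1_source))
print(declared_coding(utf8_source))
```

A module-level attribute would save callers from re-reading the file like this, which is the point of the suggestion above.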

> I wonder if some other cookie, next to the `coding:' cookie, could not be used to declare that all strings in this module only should be interpreted as Unicode by default, but without the need of resorting to `u' prefix all over.

[...] if you know a syntax which you like, propose a patch. Be prepared to also write a PEP defending that syntax.

There is surely no particular syntax that I like enough to defend. Anything reasonable would do as far as I am concerned, so I might propose a reasonable patch without involving myself in a crusade. Yet I may try to assemble and edit together the ideas of others, if it serves a purpose.

> Right now, my feeling is that Python asks a bit too much of a programmer, in terms of commitment, if we only consider the editing work required on sources to use it, or not.

Not sure what you are referring to here.

There is currently a lot of effort invested in Python so that Unicode strings and usual strings inter-operate correctly and automatically, while hiding from the unwilling user, as much as is reasonable, whether characters are wide or narrow: s/he uses about the same code no matter what. The way Python does it is rather lovely, in fact. :-)

I'm going to transform a flurry of Latin-1 Python scripts to UTF-8, but not all of them, as I'm not going to impose Unicode in our team where it is not wanted. For French, German and many other languages, we are lucky enough to have one code point per character in Unicode, so we can hope that programs assuming that S[N] addresses the N'th (0-based) character of string S will work the same way regardless of whether strings are narrow or wide. However, and I shall have the honesty to state it, this is not respectful of the general Unicode spirit: the Python implementation allows for independently addressable surrogate halves, combining zero-width diacritics, normal and decomposed forms, directional marks, linguistic marks and various other such complexities.
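A small sketch of why the one-code-point-per-character assumption is luck rather than a guarantee. It assumes a modern Python 3 interpreter, where `str' plays the role of the wide string, and uses the standard `unicodedata' module; the sample word is arbitrary.

```python
import unicodedata

# Precomposed form: each accented letter is a single code point.
composed = unicodedata.normalize("NFC", "d\u00e9j\u00e0")      # 'déjà'
# Decomposed form: base letter followed by a combining accent.
decomposed = unicodedata.normalize("NFD", composed)

print(len(composed), len(decomposed))
# S[1] is the whole accented letter in one form, only its base in the other.
print(repr(composed[1]), repr(decomposed[1]), repr(decomposed[2]))
```

Both strings display identically, yet S[N] lands on different things as soon as the text is in decomposed form.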

But in our case, where applications already work in Latin-1, abusing our Unicode luck, UTF-8 may not be used as is; we ought to use Unicode or wide strings as well, to preserve S[N] addressability. So changing source encodings may be intimately tied to going Unicode whenever UTF-8 (or any other variable-length encoding) gets into the picture.
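The loss of S[N] addressability can be sketched as follows. This assumes a modern Python 3 interpreter, with `bytes' standing in for the narrow strings discussed here and `str' for the wide ones; slices are used instead of plain indexing only because indexing `bytes' yields an integer in Python 3.

```python
text = "h\u00e9las"                      # 'hélas' as a wide (Unicode) string
as_latin1 = text.encode("iso-8859-1")    # narrow: one byte per character
as_utf8 = text.encode("utf-8")           # narrow: the accented letter takes two bytes

# Character number 2 (0-based) of the word is 'l'.
print(text[2:3])         # 'l'
print(as_latin1[2:3])    # b'l', same position as in the wide string
print(as_utf8[2:3])      # b'\xa9', the second byte of the accented letter
```

With Latin-1 the narrow and wide strings agree on every index; with UTF-8 they stop agreeing right after the first non-ASCII character.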

You do have the choice of source encodings, and, in fact, "Unicode" is not a valid source encoding. "UTF-8" is [...]

Guess that I know! :-) :-)

[...] from a Python point of view, there is absolutely no difference between [UTF-8] and, say, "ISO-8859-15". Choice of source encoding is different from the choice of string literals. You can use Unicode strings, or byte strings, or mix them. It really is your choice.

I hope that my explanation above helps in seeing that the source encoding and the choice of string literals are not as independent as one may think. What I surely cannot accept is seeing bugs appear in programs merely because I changed the source encoding. Going from ISO 8859-1 to ISO 8859-15 for a Python source is probably fairly safe, because there is no need for switching the narrowness of strings. Going from ISO 8859-1 to UTF-8 is very unsafe, and editing all literal strings from narrow to wide, using `u' prefixes, becomes almost unavoidable.
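To make the difference between the two re-encodings concrete, here is a minimal sketch that measures how many bytes one and the same literal occupies under each source encoding. It assumes a modern Python 3 interpreter; the sample literal and the name `sizes' are made up for the example.

```python
literal = "cr\u00e8me br\u00fbl\u00e9e"   # 'crème brûlée', 12 characters

sizes = {}
for coding in ("iso-8859-1", "iso-8859-15", "utf-8"):
    raw = literal.encode(coding)          # bytes a narrow literal would hold
    assert raw.decode(coding) == literal  # each encoding round-trips the text
    sizes[coding] = len(raw)

for coding, size in sizes.items():
    print(coding, size)
```

The two ISO 8859 variants keep one byte per character, so code written against narrow strings sees the same lengths and offsets; under UTF-8 the narrow literal grows by one byte per accented letter.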

There ought to be a way to maintain a single Python source that would work dependably through re-encoding of the source, but not uselessly relying on wide strings when there is no need for them. That is, without marking all literal strings as being Unicode. Changing encoding from ISO 8859-1 to UTF-8 should not be a one-way, no-return ticket.

Of course, it is very normal that sources may have to be adapted for the possibility of a Unicode context. There should be some good style and habits for writing re-encodable programs. Hence this exchange of thoughts.

-- François Pinard http://www.iro.umontreal.ca/~pinard


