[Python-Dev] Re: [I18n-sig] Re: Unicode debate

Toby Dickenson tdickenson@geminidataloggers.com
Tue, 02 May 2000 14:46:44 +0100


On Tue, 02 May 2000 08:31:55 -0400, Guido van Rossum <guido@python.org> wrote:

> > No automatic conversions between 8-bit "strings" and Unicode strings.
> >
> > If you want to turn UTF-8 into a Unicode string, say so.
> > If you want to turn Latin-1 into a Unicode string, say so.
> > If you want to turn ISO-2022-JP into a Unicode string, say so.
> > Adding a Unicode string and an 8-bit "string" gives an exception.
>
> I'd accept this, with one change: mixing Unicode and 8-bit strings is
> okay when the 8-bit strings contain only ASCII (byte values 0 through
> 127). That does the right thing when the program is combining ASCII
> data (e.g. literals or data files) with Unicode and warns you when you
> are using characters for which the encoding matters.
>
> I believe that this is important because much existing code dealing
> with strings can in fact deal with Unicode just fine under these
> assumptions. (E.g. I needed only 4 changes to htmllib/sgmllib to make
> it deal with Unicode strings -- those changes were all getattr() and
> setattr() calls.)
>
> When comparing 8-bit and Unicode strings, the presence of non-ASCII
> bytes in either should make the comparison fail; when ordering is
> important, we can make an arbitrary choice e.g. "\377" < u"\200".

I assume 'fail' means 'non-equal', rather than 'raises an exception'?
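
To make the question concrete, here is a minimal sketch of the rule as I read it, written in present-day Python where `bytes` stands in for the 8-bit string and `str` for the Unicode string. The helper names `concat` and `equal` are hypothetical and not part of any proposal in this thread, and the sketch assumes the 'non-equal' reading of 'fail'.

```python
def concat(text: str, raw: bytes) -> str:
    """Join a Unicode string and an 8-bit string under the ASCII-only rule.

    Hypothetical helper: mixing is allowed only when every byte is 0..127.
    """
    if any(b > 127 for b in raw):
        # The encoding matters here, so the caller must decode explicitly.
        raise ValueError("8-bit string contains non-ASCII bytes")
    return text + raw.decode("ascii")


def equal(text: str, raw: bytes) -> bool:
    """Compare under the assumed 'non-equal' reading of 'fail'.

    Non-ASCII content on either side makes the result False instead of
    raising an exception.
    """
    if any(b > 127 for b in raw) or any(ord(c) > 127 for c in text):
        return False
    return text == raw.decode("ascii")
```

Under this reading, `equal("\u0080", b"\x80")` simply returns False, whereas the 'raises an exception' reading would behave like `concat` does for non-ASCII input.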

Toby Dickenson tdickenson@geminidataloggers.com