[I18n-sig] Re: [Python-Dev] Unicode debate

Paul Prescod paul@prescod.net
Mon, 01 May 2000 15:38:29 -0500

Previous message: [I18n-sig] Re: [Python-Dev] Unicode debate
Next message: [I18n-sig] Re: [Python-Dev] Unicode debate
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Uche asked for a summary so I cc:ed the xml-sig.

Guido van Rossum wrote:

> ... OK. I really meant recoding in UTF-8 -- I maintain that there are lots of forces that prevent recoding most ISO-2022-JP documents in UTF-8.

Absolutely agree.

> Are you sure you understand what we are arguing about?

Here's what I thought we were arguing about:

If you put a bunch of "funny characters" into a Python string literal, and then compare that string literal against a Unicode object, should those funny characters be treated as logical units of text (characters) or as bytes? And if bytes, should some transformation automatically reinterpret those bytes as characters according to some particular encoding scheme (probably UTF-8)?
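To make the question concrete, here is a minimal sketch written in later Python syntax, where byte strings and character strings are separate types (an assumption; the names `raw`, `as_characters`, `as_utf8`, and `target` are made up for illustration). It reads the same two bytes both ways and shows that only one reading compares equal to the one-character Unicode string.

```python
# Two bytes that happen to be the UTF-8 encoding of U+00E9 (e with acute).
raw = b"\xc3\xa9"

# Reading 1: each byte is a character with the same ordinal (Latin-1 view).
as_characters = raw.decode("latin-1")

# Reading 2: the bytes are reinterpreted as characters via UTF-8.
as_utf8 = raw.decode("utf-8")

target = "\u00e9"  # the Unicode object being compared against

print(len(as_characters), as_characters == target)  # 2 False
print(len(as_utf8), as_utf8 == target)              # 1 True
```

Which of the two answers the comparison should give is exactly the point under dispute.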

I claim that we should, as far as possible, treat strings as character lists and not add any new functionality that depends on them being byte lists. Ideally, we could add a byte array type and start deprecating the use of strings in that manner. Yes, it will take a long time to fix this bug, but that's what happens when good software lives a long time and the world changes around it.
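A rough sketch of what such a split could look like, using the `bytearray` and `str` types of later Python versions as stand-ins (an assumption; no particular byte array type is being specified here, and `packet` and `word` are illustrative names). Binary data is indexed as integers, text is indexed as characters, and the two are not combined silently.

```python
# Binary data lives in a byte array: its elements are small integers.
packet = bytearray([0x89, 0x50, 0x4E, 0x47])
packet[0] = 0x00            # mutable, like an array of bytes
print(packet[1])            # 80

# Text lives in a string: its elements are characters.
word = "na\u00efve"
print(len(word), word[2] == "\u00ef")  # 5 True

# The two do not mix silently; turning one into the other is explicit.
try:
    word + packet
except TypeError:
    print("no implicit mixing")
```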

> Earlier, you quoted some reference documentation that defines 8-bit strings as containing characters. That's taken out of context -- this was written in a time when there was (for most people anyway) no difference between characters and bytes, and I really meant bytes.

Actually, I think that that was Fredrik.

Anyhow, you wrote the documentation that way because it was the most intuitive way of thinking about strings. It remains the most intuitive way. I think that that was the point Fredrik was trying to make.

We can't make "byte-list" strings go away soon, but we can start moving people towards the "character-list" model. In concrete terms, I would suggest that old-fashioned "byte-list" strings be automatically coerced to Unicode by interpreting each byte as a Unicode character. Trying to go the other way could cause the moral equivalent of an OverflowError, but that's not a problem.

    >>> a=1000000000000000000000000000000000000L
    >>> int(a)
    Traceback (innermost last):
      File "", line 1, in ?
    OverflowError: long int too long to convert
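The suggested coercion could be sketched roughly as follows, in later Python syntax. The function names `widen` and `narrow` are hypothetical, and raising `OverflowError` itself is only one way to model the "moral equivalent" mentioned above. Widening maps byte value n to code point n, which is the same mapping as decoding Latin-1; narrowing refuses any character that does not fit in one byte.

```python
def widen(byte_string: bytes) -> str:
    """Coerce bytes to text: byte value n becomes code point n."""
    return "".join(chr(b) for b in byte_string)


def narrow(text: str) -> bytes:
    """Go the other way; fail if any character does not fit in one byte."""
    for ch in text:
        if ord(ch) > 0xFF:
            # The analogue of the OverflowError from int() above.
            raise OverflowError(f"character U+{ord(ch):04X} too large for a byte")
    return bytes(ord(ch) for ch in text)


print(widen(b"caf\xe9") == "caf\u00e9")   # True
print(narrow("caf\u00e9") == b"caf\xe9")  # True
try:
    narrow("\u20ac")  # EURO SIGN does not fit in one byte
except OverflowError as exc:
    print("refused:", exc)
```

Widening never fails, so it is safe to do automatically; only the narrowing direction can lose information, which is why it is the one that raises.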

And just as with ints and longs, we would expect to eventually unify strings and unicode strings (but not byte arrays).

--
Paul Prescod - ISOGEN Consulting Engineer speaking for himself
It's difficult to extract sense from strings, but they're the only communication coin we can count on.
    - http://www.cs.yale.edu/~perlis-alan/quotes.html
