[Python-Dev] Unicode compromise?

Guido van Rossum <guido@python.org>
Tue, 02 May 2000 16:47:30 -0400


> I could live with this compromise as long as we document that a future version may use the "character is a character" model. I just don't want people to start depending on a catchable exception being thrown, because that would stop us from ever unifying unmarked literal strings and Unicode strings.

Agreed (as I've said before).

--

> Are there any steps we could take to make a future divorce of strings and byte arrays easier? What if we added a binaryread() function that returned some form of byte array? The byte array type could be just like today's string type, except that its type object would be distinct, it wouldn't have as many string-ish methods, and it wouldn't have any auto-conversion to Unicode at all.
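With hindsight, modern Python (3.x) adopted essentially this model: bytes is a distinct type with no implicit conversion to text. A minimal sketch of the behavior being proposed here:

```python
# Sketch of the proposed "distinct byte array type" as it later
# appeared in Python 3: bytes has no auto-conversion to text.
b = b"caf\xc3\xa9"               # raw bytes, UTF-8 encoded
print(type(b) is bytes)          # True: a type distinct from str
print(b.decode("utf-8"))         # conversion must be explicit

try:
    b + "text"                   # mixing bytes and str is an error
except TypeError:
    print("no auto-conversion")
```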

You can do this now with the array module, although clumsily:

    >>> import array
    >>> f = open("/core", "rb")
    >>> a = array.array('B', [0]) * 1000
    >>> f.readinto(a)
    1000
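A self-contained, runnable version of the same preallocate-and-readinto technique in modern Python; a temporary file stands in for "/core", which is an assumption for illustration:

```python
import array
import os
import tempfile

# The same technique, updated for Python 3: preallocate a buffer
# and fill it in place with readinto().
fd, path = tempfile.mkstemp()
os.write(fd, b"\x00\x01" * 500)          # 1000 bytes of sample data
os.close(fd)

a = array.array('B', [0]) * 1000         # preallocated 1000-byte buffer
with open(path, "rb") as f:
    n = f.readinto(a)                    # fills the buffer in place
print(n)                                 # 1000 bytes read
os.unlink(path)
```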

Or if you wanted to read raw Unicode (UTF-16):

    >>> a = array.array('H', [0]) * 1000
    >>> f.readinto(a)
    2000
    >>> u = unicode(a, "utf-16")
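The modern counterpart is a straight decode of the raw bytes; unicode() is gone and bytes.decode() takes its place (the sample data below is made up for illustration):

```python
# Modern (Python 3) counterpart: read raw bytes, then decode as UTF-16.
data = "h\u00e9llo".encode("utf-16")     # sample UTF-16 data, with BOM
u = data.decode("utf-16")
print(u)
```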

There are some performance issues, e.g. you have to initialize the buffer somehow, which seems a bit wasteful.
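This initialization cost is exactly the kind of thing a dedicated mutable byte type could avoid; Python later grew bytearray, which allocates a zero-filled buffer directly:

```python
import io

# bytearray(n) allocates n zero bytes up front, avoiding the
# array.array('B', [0]) * 1000 initialization dance above.
buf = bytearray(1000)
stream = io.BytesIO(b"\x07" * 250)       # stand-in for a real file
n = stream.readinto(buf)                 # fills only the bytes available
print(n, buf[0], buf[999])               # 250 7 0
```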

> People could start to transition code that reads non-ASCII data to the new function. We could put big warning labels on read() to state that it might not always be able to read data that is not in some small set of recognized encodings (probably UTF-8 and UTF-16).

> Or perhaps binaryopen(). Or perhaps both. I do not suggest just using the text/binary flag on the existing open() function, because we cannot immediately change its behavior without breaking code.
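For what it's worth, this is roughly where things eventually settled: after the compatibility break of Python 3, the existing open() did take on the distinction, with the "b" flag selecting bytes and text mode decoding with an explicit encoding. A sketch:

```python
import os
import tempfile

# How the split eventually landed (Python 3): one open(), where mode
# "rb" yields raw bytes and text mode decodes with a named encoding.
fd, path = tempfile.mkstemp()
os.write(fd, "h\u00e9llo".encode("utf-8"))
os.close(fd)

with open(path, "rb") as f:
    raw = f.read()                       # bytes, no decoding
with open(path, "r", encoding="utf-8") as f:
    text = f.read()                      # str, decoded as UTF-8
print(type(raw).__name__, type(text).__name__)   # bytes str
os.unlink(path)
```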

A new method makes most sense -- there are definitely situations where you want to read in text mode for a while and then switch to binary mode (e.g. HTTP).
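A self-contained sketch of that HTTP-style pattern, reading header lines as text and then the body as raw bytes from one stream; the wire data and the byte-at-a-time header loop are illustrative choices, not a prescribed implementation:

```python
import io

# Read text-ish header lines, then switch to raw bytes for the body,
# all from one binary stream.  Byte-at-a-time header reads keep the
# stream position exact at the blank line, so no body bytes are lost.
raw = io.BytesIO(b"Content-Length: 3\r\n\r\nabc")

header = b""
while not header.endswith(b"\r\n\r\n"):
    header += raw.read(1)

length = int(header.split(b":")[1].strip())   # parse Content-Length
body = raw.read(length)                       # binary from here on
print(header.decode("ascii").strip())
print(body)                                   # b'abc'
```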

I'd like to put this off until after Python 1.6 -- but it deserves attention.

--Guido van Rossum (home page: http://www.python.org/~guido/)