[Python-Dev] Printing and unicode

Guido van Rossum guido@python.org
Thu, 14 Nov 2002 08:48:58 -0500


Martin v. Loewis wrote:
> "M.-A. Lemburg" <mal@lemburg.com> writes:
>
>>The fact that StringIO works with Unicode (and then only in the
>>case where you only pass Unicode to it) is more an implementation
>>detail than a true feature.
>
> It's a true feature. You explicitly fixed that feature in
>
> revision 1.20
> date: 2002/01/06 17:15:05; author: lemburg; state: Exp; lines: +8 -5
> Restore Python 2.1 StringIO.py behaviour: support concatenating
> Unicode string snippets to larger Unicode strings.
>
> This fix should also go into Python 2.2.1.
>
> after you broke it in
>
> revision 1.19
> date: 2001/09/24 17:34:52; author: lemburg; state: Exp; lines: +4 -1
> branches: 1.19.12;
> StringIO patch #462596: let's [c]StringIO accept read buffers on
> input to .write() too.

I doubt that it's a true feature. The fact that I broke it in the above patch by introducing the str(data) call in StringIO.py suggests that whoever complained about this change was using an implementation detail rather than a documented and originally intended feature of StringIO. If you need something like StringIO for Unicode, then I would suggest creating a similar object that deals only with Unicode, e.g. UnicodeIO.

But since StringIO already works for Unicode, why bother?

cStringIO could then be extended to also support such an object by using the same trick as SRE does to support two native types (putting the code into a .h file and then including it twice).

(Off-topic: each time I fix a bug twice, once in stringobject.c and once in unicodeobject.c, I wish we'd done that for string and unicode objects. But it's too late now, and also may not be realistic given some different implementation choices.)

Back to the original question. I don't have a problem with leaving in the Unicode support in StringIO's .write() method, but the introduction of the Unicode print support should not rely on this detail.

Agreed.

Instead someone wanting to write Unicode only to a StringIO like object should be directed to UnicodeIO.

Now, to satisfy the request of the poster who wanted support for unicode in PyFile_WriteObject() we need to add something which lets PyFile_WriteObject() determine whether to look for unicode or not (by default, it passes through Unicode objects as-is and applies str() to all other objects). I like the idea of using the .encoding attribute as a flag for this. What I don't like is that setting it to None should be used for Unicode-only streams (ones that take Unicode on input and use Unicode on output). To me, .encoding = None would signal: this stream doesn't do anything to the input data and passes it to the output stream as-is.

But I'm not sure that's a useful feature. Maybe encoding=None could mean the current StringIO behavior. <0.5 wink>
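For readers who want the .encoding idea in concrete form, here is a minimal sketch of the dispatch under discussion. It is written in modern Python 3 rather than in the C of the actual file-writing code, so str stands in for the 2.x unicode type and bytes for the 2.x str type. The names Sink and write_object are hypothetical, and the handling of a named encoding (encode before writing) is an assumption about what the original poster wanted, not something settled in this thread.

```python
class Sink:
    """Hypothetical stream; it has an .encoding attribute only if one is given."""

    def __init__(self, encoding=None):
        if encoding is not None:
            self.encoding = encoding
        self.chunks = []

    def write(self, data):
        self.chunks.append(data)


def write_object(obj, stream):
    """Write obj to stream, consulting stream.encoding when it exists."""
    encoding = getattr(stream, "encoding", None)
    # Text passes through unchanged; every other object goes through str().
    text = obj if isinstance(obj, str) else str(obj)
    if encoding is None or encoding == "unicode":
        # No declared encoding, or a Unicode-only stream: write the text as-is.
        stream.write(text)
    else:
        # Assumption: a stream naming a real codec wants encoded bytes.
        stream.write(text.encode(encoding))
```

With this sketch, a Sink("utf-8") receives encoded bytes, while a Sink("unicode") or a Sink() with no attribute receives the text untouched.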

Much better, IMHO, would be to use .encoding = 'unicode' on Unicode-only streams such as the mentioned UnicodeIO object.

Yes. (Except 'unicode' is not an encoding name, right? Maybe it should be?)

In summary, StringIO objects should not implement .encoding while a new Unicode-only stream-like object UnicodeIO should have .encoding = 'unicode'.
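A rough illustration of what such a Unicode-only object might look like, again in modern Python 3 where str plays the role of the 2.x unicode type. Only the name UnicodeIO and the 'unicode' marker come from this thread; the methods shown and the choice to raise TypeError on non-text input are assumptions made for the sketch.

```python
class UnicodeIO:
    """Hypothetical in-memory stream that accepts and returns text only."""

    # Marker value proposed in the thread; it is not a registered codec name.
    encoding = "unicode"

    def __init__(self):
        self._parts = []

    def write(self, text):
        # Assumption: reject anything that is not text instead of coercing it.
        if not isinstance(text, str):
            raise TypeError(
                "UnicodeIO accepts only text, got %s" % type(text).__name__
            )
        self._parts.append(text)

    def getvalue(self):
        # Join the written snippets into one larger text string.
        return "".join(self._parts)
```

Rejecting byte strings outright keeps the object from silently mixing types, which is the behaviour the plain StringIO was criticised for relying on.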

The same could then be done with the corresponding cStringIO objects.

PS: Some may not know, but the obvious way of fixing printing of Unicode by adding a tp_print slot implementation does not work, since that slot takes a FILE* pointer as file "object" which, of course, cannot include any additional information such as the encoding.

Yes, tp_print is only an optimization for tp_repr and tp_str when writing to a "real" file object.

--Guido van Rossum (home page: http://www.python.org/~guido/)