[Python-Dev] Divorcing str and unicode (no more implicit conversions).

Jim Fulton jim at zope.com
Mon Oct 3 16:49:44 CEST 2005


Martin Blais wrote:

Hi.

Like a lot of people (or so I hear in the blogosphere...), I've been experiencing some friction in my code with unicode conversion problems. Even when I am extra careful about whether my variables hold str or unicode objects, there is always some case or oversight where something unexpected happens and an implicit conversion triggers a decode error. Call str.join() on a list of strs in which one unicode object appears unexpectedly, and voilà! Exceptions galore. Sometimes the problem shows up late because your test code doesn't always contain accented characters. I'm sure many of you have experienced that or some variant at some point. I came to realize recently that this problem shares a strong similarity with the problem of implicit type conversions in C++, or at least it feels the same: stuff just happens implicitly, and it's hard to track down where and when it happens by just looking at the code. Part of the problem is that the unicode object acts a lot like a str, which is convenient, but...

I agree. I think it was a mistake to implicitly convert mixed string expressions to unicode.
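To make the failure mode concrete, here is a rough sketch that imitates the implicit promotion in present-day Python, where bytes stands in for the old str and str for the old unicode. The helper name implicit_join is made up for illustration and is not a real API; it assumes the ASCII default codec that the implicit conversion used.

```python
def implicit_join(sep, parts):
    """Imitate the old str.join() on a mixed list (hypothetical helper).

    bytes plays the role of the old str, str the role of the old unicode.
    """
    if any(isinstance(p, str) for p in parts):
        # One text object is enough to promote the whole result to text,
        # decoding every byte string with the ASCII default codec.
        as_text = [p if isinstance(p, str) else p.decode("ascii") for p in parts]
        return sep.decode("ascii").join(as_text)
    # No text object present: plain byte-string join, no conversion at all.
    return sep.join(parts)


# ASCII-only test data hides the problem: the promotion succeeds silently.
ok = implicit_join(b" ", [b"hello", "world"])

# Real data with an accented character (UTF-8 bytes) fails inside the join,
# far away from wherever the stray text object entered the list.
try:
    implicit_join(b" ", [b"caf\xc3\xa9", "world"])
    failed = False
except UnicodeDecodeError:
    failed = True
```

The point of the sketch is only that the same call succeeds or fails depending on the data, which is why tests without accented characters do not catch it.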

What if we could completely disable the implicit conversions between unicode and str? In other words, if you were ALWAYS forced to call either .encode() or .decode() to convert between one and the other... wouldn't that help a lot in dealing with that issue?

Perhaps.

How hard would that be to implement?

Not hard. We considered doing it for Zope 3, but ...

Would it break a lot of code?

Yes.

Would some people want that?

No, I wouldn't want lots of code to break. ;)

(I know I would, at least for some of my code.) It seems to me that this would make the code more explicit and force the programmer to become more aware of those conversions. Any opinions welcome.

I think it's too late to change this. I wish it had been done differently. (OTOH, I'm very happy we have Unicode support, so I'm not really complaining. :)

I'll note that this hasn't been that much of a problem for us in Zope. We follow the strategy:

Antoine Pitrou wrote: ...

A good rule of thumb is to convert to unicode everything that is semantically textual, and to only use str for what is to be semantically treated as a string of bytes (network packets, identifiers...). This is also, AFAIU, the semantic model which is favoured for a hypothetical future version of Python.

This approach has worked pretty well for us. Still, when there is a problem, it's a real pain to debug because the error occurs too late, as you point out.
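As an illustration only, the rule of thumb might look like the following in present-day Python, where bytes and str are already separate types with no implicit conversion between them. The function names and the choice of UTF-8 are assumptions made for the example, not anything taken from Zope.

```python
def read_name(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode once, at the input boundary (hypothetical helper)."""
    return raw.decode(encoding)


def greet(name: str) -> str:
    """Internal code only ever handles text."""
    return "Hello, " + name + "!"


def write_reply(text: str, encoding: str = "utf-8") -> bytes:
    """Encode once, at the output boundary (hypothetical helper)."""
    return text.encode(encoding)


wire_in = b"Andr\xc3\xa9"  # UTF-8 bytes as they might arrive from a socket
wire_out = write_reply(greet(read_name(wire_in)))

# Forgetting to decode fails at the point of the mistake, regardless of
# whether the data happens to be ASCII, instead of much later.
try:
    greet(wire_in)
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

With the conversions confined to read_name and write_reply, a stray byte string in the text-handling code is reported where it is used rather than when accented data finally shows up.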

Jim

-- 
Jim Fulton           mailto:jim at zope.com       Python Powered!
CTO                  (540) 361-1714            http://www.python.org
Zope Corporation     http://www.zope.com       http://www.zope.org


