Issue 964929: Unicode String formatting does not correctly handle objects

I have a problem with the way '%s' is handled in unicode strings when formatted. The Python Language Reference states that a unicode serialisation of an object should be in __unicode__, and I have seen Python break down if unicode data is returned in __str__.

The problem is that there does not appear to be a way to interpolate the results from __unicode__ within a string:

>>> class EuroHolder:
...     def __init__(self, price):
...         self._price = price
...     def __str__(self):
...         return "%.02f euro" % self._price
...     def __unicode__(self):
...         return u"%.02f\u20ac" % self._price
...
>>> e = EuroHolder(123.45)
>>> str(e)
'123.45 euro'
>>> unicode(e)
u'123.45\u20ac'
>>> "%s" % e
'123.45 euro'
>>> u"%s" % e  # this is wrong
u'123.45 euro'
>>> u"%s" % unicode(e)  # This is silly
u'123.45\u20ac'

The first case is wrong, as I actually could cope with unicode data in the string I was substituting into, and I should be able to request the unicode data be put in.

The second case is silly, as the whole point of string substitution variables such as %s, %d and %f is to remove the need for coercion on the right of the %.

Proposed solution #1: Make %s in unicode string substitution automatically check __unicode__() of the rvalue before trying __str__(). This is the most logical behaviour to expect of %s, given that it is already overloaded so that a unicode object in the rvalue makes the result unicode.
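The lookup order proposed in solution #1 can be sketched in pure Python. This is only an illustration of the idea, not how the interpreter implements %s; `to_text` and `format_text` are hypothetical helper names, the helper handles %s placeholders only, and decoding a plain str() result as ASCII is an assumption made for the sketch.

```python
# Sketch of solution #1: prefer __unicode__ over __str__ when the
# template is a unicode string. Helper names are hypothetical.

text_type = type(u"")  # `unicode` on Python 2, `str` on Python 3


def to_text(obj):
    """Return a text rendering of obj, trying __unicode__ before __str__."""
    if isinstance(obj, text_type):
        return obj
    method = getattr(obj, "__unicode__", None)
    if method is not None:
        return method()
    # Fall back to str(). On Python 2 this gives an 8-bit string, which
    # is decoded as ASCII here (an assumption for this sketch).
    result = str(obj)
    if isinstance(result, bytes):
        result = result.decode("ascii")
    return result


def format_text(template, *args):
    """Apply template % args after converting every argument with to_text.

    Only suitable for templates that use %s placeholders.
    """
    return template % tuple(to_text(a) for a in args)


class EuroHolder(object):
    def __init__(self, price):
        self._price = price

    def __str__(self):
        return "%.02f euro" % self._price

    def __unicode__(self):
        return u"%.02f\u20ac" % self._price


result = format_text(u"%s", EuroHolder(123.45))
```

With this ordering, `result` holds the value from `__unicode__` rather than the `__str__` fallback, which is the behaviour requested for `u"%s" % e`.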

Proposed solution #2: Make a new string conversion operator, such as %S or %U, which will explicitly call unicode() on the rvalue even if the lvalue is a non-unicode string.

Solution #2 has the advantage that it does not break any previous behaviour of %s, and also allows for explicit conversion to unicode of 8-bit strings in the lvalue.
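To make solution #2 concrete, here is a rough emulation of a %U operator as an ordinary function. No such operator exists; `format_u` and `to_text` are hypothetical names, the sketch ignores mapping keys and `*` widths, and it assumes an 8-bit template is converted with the ASCII codec.

```python
import re

text_type = type(u"")  # `unicode` on Python 2, `str` on Python 3

# Matches "%%" or a simple conversion such as %s, %5d, %.02f,
# or the hypothetical %U.
_SPEC = re.compile(r"%(?:%|[-#0 +]*\d*(?:\.\d+)?[a-zA-Z])")


def to_text(obj):
    """Text rendering of obj, trying __unicode__ before __str__."""
    if isinstance(obj, text_type):
        return obj
    if hasattr(obj, "__unicode__"):
        return obj.__unicode__()
    result = str(obj)
    return result.decode("ascii") if isinstance(result, bytes) else result


def _rewrite(match):
    """Turn a %U conversion into %s and leave other conversions alone."""
    spec = match.group(0)
    return spec[:-1] + "s" if spec.endswith("U") else spec


def format_u(template, args):
    """Emulate a hypothetical %U that forces unicode conversion."""
    if isinstance(template, bytes):
        # An 8-bit template is promoted to text (ASCII is an assumption).
        template = template.decode("ascii")
    remaining = list(args)
    converted = []
    for match in _SPEC.finditer(template):
        spec = match.group(0)
        if spec == "%%":
            continue  # a literal percent sign consumes no argument
        arg = remaining.pop(0)
        if spec.endswith("U"):
            arg = to_text(arg)
        converted.append(arg)
    return _SPEC.sub(_rewrite, template) % tuple(converted)


class EuroHolder(object):
    def __init__(self, price):
        self._price = price

    def __str__(self):
        return "%.02f euro" % self._price

    def __unicode__(self):
        return u"%.02f\u20ac" % self._price


result = format_u(b"price: %U (x%d)", (EuroHolder(123.45), 2))
```

Here the template starts out as an 8-bit string and the result is still unicode, while %d keeps its existing meaning.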

I prefer solution #1 as I feel that the current operation of %s is incorrect, and it's unlikely to break much, whereas the "advantage" of converting 8-bit strings in the lvalue to unicode which solution #2 advocates will just lead to encoding problems and sloppy code.
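The kind of encoding problem meant here can be illustrated as follows. An 8-bit string carries no encoding information, so any implicit conversion has to guess a codec. The sketch assumes two guesses, ASCII (Python 2's default for implicit str-to-unicode conversion) and Latin-1; what a %U operator would actually do is not specified in this report.

```python
# The euro sign encoded as UTF-8: three bytes, none of them valid ASCII.
euro_utf8 = u"\u20ac".encode("utf-8")

# Guess 1: ASCII. The conversion fails outright.
try:
    euro_utf8.decode("ascii")
    outcome = "decoded"
except UnicodeDecodeError:
    outcome = "UnicodeDecodeError"

# Guess 2: Latin-1. The conversion succeeds but silently produces
# three unrelated characters instead of the euro sign.
mojibake = euro_utf8.decode("latin-1")
```

Either the formatting raises at run time, or it returns the wrong text without any error.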