[Python-Dev] str.translate vs unicode.translate (was: Re: str object going in Py3K)

Bengt Richter bokr at oz.net
Fri Feb 17 03:25:25 CET 2006


If str becomes unicode for Py3K, and we then have bytes as our encoding-agnostic byte type, then I think bytes should have the str translate method, with a tweak that I would hope could also be done to str now.

BTW, str.translate will presumably become unicode.translate, so perhaps unicode.translate should grow a compatible deletechars parameter.

But that's not the tweak. The tweak is to eliminate the currently unavoidable pre-conversion to unicode in str(something).translate(u'...', delchars) (and, preemptively, in bytes(something).translate(u'...', delchars)).

E.g. suppose you now want to write:

s_str.translate(table, delch).encode('utf-8')

Note that s_str has no encoding information, and translate is conceptually just a 1:1 substitution, minus the characters in delch. But if we want to do a one-chr:one-unichr substitution by specifying a 256-entry table of unicode characters, we cannot. It would be simple to allow it, and that's the tweak I would like. It would allow easy custom decodes.
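The proposed semantics can be sketched in today's Python 3, using bytes to stand in for the byte-oriented str of Python 2; the helper name and sample data here are invented for illustration:

```python
# Sketch of the proposed tweak: map each byte through a 256-entry
# unicode table (a custom one-byte:one-character decode), skipping
# any bytes listed in delchars.

def translate_to_unicode(data, table, delchars=b''):
    """data: bytes; table: 256-character str; delchars: bytes to drop."""
    assert len(table) == 256
    return ''.join(table[b] for b in data if b not in delchars)

# A toy "decode": treat each byte as the same-ord character
# (a latin-1-style identity table).
identity = ''.join(chr(i) for i in range(256))
print(translate_to_unicode(b'sch\xf6n!', identity, delchars=b'!'))  # schön
```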

At the moment, if you want to write the above, you have to introduce a phony latin-1 decoding and write it as (not typo-proof)

s_str.translate(table, delch).decode('latin-1').encode('utf-8')     # use str.translate

or s_str.decode('latin-1').translate(mapping).encode('utf-8') # use unicode.translate also for delch

to avoid exceptions if you have non-ASCII bytes in your s_str (even if delch would have removed them!).
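For comparison, the "phony latin-1" workaround can be recast in Python 3, where bytes.translate already accepts a 256-byte table plus a delete argument; the swap table and sample data below are invented:

```python
# The latin-1 round-trip workaround, Python 3 spelling.
table = bytes.maketrans(b'ab', b'ba')          # swap 'a' and 'b'
s_str = b'abc\xf6'                             # contains a non-ASCII byte
out = s_str.translate(table, b'c').decode('latin-1').encode('utf-8')
print(out)  # b'ba\xc3\xb6'
```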

It seems s_str.translate(table, delchars) wants to convert s_str to unicode if table is unicode, and then use unicode.translate (which bombs on delchars!) instead of just effectively defining str.translate as

def translate(self, table, deletechars=None):
    return ''.join((table or isinstance(table, unicode) and uidentity or sidentity)[ord(x)]
                   for x in self
                   if not deletechars or x not in deletechars)

# For convenience in just pruning with deletechars, s_str.translate('', deletechars) deletes without translating,
# and s_str.translate(u'', deletechars) does the same and then maps to same-ord unicode characters,
# given
#     sidentity = ''.join(chr(i) for i in xrange(256))
# and
#     uidentity = u''.join(unichr(i) for i in xrange(256)).
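The sketch above is Python 2; a runnable Python 3 rendering, with bytes in place of str and str in place of unicode (identity tables as in the comments above), might look like this:

```python
# Python 3 rendering of the translate() sketch: bytes plays the role
# of Python 2's str, and str plays the role of unicode.
sidentity = bytes(range(256))                    # byte identity table
uidentity = ''.join(chr(i) for i in range(256))  # same-ord unicode table

def translate(data, table, deletechars=None):
    """data: bytes; table: 256-byte bytes, 256-char str, or empty."""
    if not table:
        # Empty str table means "prune, then map to same-ord unicode";
        # empty bytes table means "prune only".
        table = uidentity if isinstance(table, str) else sidentity
    kept = (table[b] for b in data
            if not deletechars or b not in deletechars)
    if isinstance(table, str):
        return ''.join(kept)
    return bytes(kept)

print(translate(b'abc', b'', b'b'))   # prune only: b'ac'
print(translate(b'abc', '', b'b'))    # prune and map to unicode: 'ac'
```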

IMO, if you want unicode.translate, then it doesn't hurt to write unicode(s_str).translate and use that.

Let str.translate just use the str ords, so simple custom decodes can be written without the annoyance of e.g.,

UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 3: ordinal not in range(128)

Can we change this for bytes? And why couldn't we change this for str.translate now? Or what am I missing? I certainly would like to miss the above message for str.translate :-(
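The failure mode above can be reproduced explicitly; shown here in Python 3, where Python 2's implicit ASCII coercion has to be spelled out (sample bytes chosen so the offending byte sits at position 3, matching the message):

```python
# Reproduce the error: the ASCII codec cannot decode byte 0xf6.
try:
    b'str\xf6m'.decode('ascii')          # byte 0xf6 at position 3
except UnicodeDecodeError as e:
    print(e)
```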

BTW, this would also allow taking advantage of features of both translates if desired, e.g. by s_str.translate(unichartable256, strdelchrs).translate(uniord_to_ustr_or_uniord_mapping). (E.g., the latter permits single-to-multiple-character substitution.)
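The chaining can be demonstrated with Python 3's two translates; the table names echo the ones above, and the data and mappings are invented:

```python
# Chain both kinds of translate: a 256-entry byte pass with deletion
# first, then unicode.translate's mapping form, which also permits
# one-to-many substitution.
unichartable256 = ''.join(chr(i) for i in range(256))      # identity decode
step1 = b'abc\xf6'.translate(bytes(range(256)), b'c')      # drop b'c'
step2 = ''.join(unichartable256[b] for b in step1)         # bytes -> unicode
result = step2.translate({ord('a'): 'aa', 0xf6: 'oe'})     # one-to-many
print(result)  # aaboe
```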

I think at least a tweaked translate method for bytes would be good for Py3K, and I hope we can do it for str.translate now. It is just too handy a high-speed conversion goodie to forgo, IMO.

Regards, Bengt Richter


