[Python-Dev] Relaxing Unicode error handling

Phillip J. Eby pje at telecommunity.com
Sat Jan 3 10:51:51 EST 2004


At 12:44 PM 1/3/04 +0100, Martin v. Loewis wrote:

People keep complaining that they get Unicode errors when funny characters start showing up in their inputs.

In some cases, these people would apparently be happy if Python would just "keep going", IOW, they want to get moji-bake (garbled characters), instead of exceptions that abort their applications. I'd like to add a static property unicode.errorhandling, which defaults to "strict". Applications could set this to "replace" or "ignore", silencing the errors, and risking loss of data. What do you think?

When I've gotten UnicodeErrors, they pointed out an error in my programming that needed to be fixed - i.e., that I forgot what kind of strings I was dealing with, and needed to be explicit about it, or at least use the replace/ignore option at the point of decoding. (Errors should not pass silently, unless explicitly silenced.)
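For illustration, here is a minimal sketch of choosing the error handling at the point of decoding. It is written in the Python 3 spelling, bytes.decode(encoding, errors), which is an assumption on the part of this example; the thread itself predates Python 3 and refers to str.decode. The sample bytes are made up.

```python
# Latin-1 bytes for "naive" with a diaeresis; 0xEF is not valid UTF-8 here.
raw = b"na\xefve"

strict_failed = False
try:
    raw.decode("utf-8")  # errors="strict" is the default
except UnicodeDecodeError:
    strict_failed = True  # the error points at the wrong codec assumption

# Explicitly silenced, at the one place where the decision is made.
replaced = raw.decode("utf-8", "replace")  # bad byte becomes U+FFFD
ignored = raw.decode("utf-8", "ignore")    # bad byte is dropped
```

With "strict" the wrong codec assumption surfaces immediately; with "replace" or "ignore" the program keeps going, but a character has been lost.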

A global setting makes it possible to create code that relies on the setting being one way or the other, and those pieces of code will not then work together. (Only one obvious way to do it.)
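A rough sketch of that conflict follows. The `settings` dict, `decode_with_global`, and the two "library" functions are hypothetical names invented for this example; they only stand in for a process-wide switch like the proposed unicode.errorhandling, which does not exist.

```python
# Hypothetical process-wide switch, standing in for the proposed global.
settings = {"errors": "strict"}

def decode_with_global(data, codec):
    # Every caller shares whatever the global currently says.
    return data.decode(codec, settings["errors"])

def is_valid_upload(data):
    # Library A (hypothetical): relies on "strict" to reject bad input.
    try:
        decode_with_global(data, "utf-8")
        return True
    except UnicodeDecodeError:
        return False

def render_log_line(data):
    # Library B (hypothetical): relies on "replace" so it never raises.
    return decode_with_global(data, "utf-8")

bad = b"na\xefve"
valid_under_strict = is_valid_upload(bad)

settings["errors"] = "replace"  # the application flips the switch for B
valid_under_replace = is_valid_upload(bad)  # A now accepts garbage
logged = render_log_line(bad)
```

Neither library is wrong on its own, but only one of them can get the setting it was written against.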

Admittedly, my experience with using Unicode is very limited, dealing primarily with the ISO-8859-x and Japanese language codecs, with decoding fairly centralized. It's possible that there are use cases I'm unfamiliar with that would scatter decode()'s all over the place, and that would make adding the "ignore" parameter to each use unbearably tedious. OTOH, I don't think that adding more stateful globals to Python is a good idea, and what's the harm of having somebody write:

def garble(s, codec): return s.decode(codec, 'ignore')

Or, if it's desired that this be available as part of Python, perhaps adding 'decode_replace' and 'decode_ignore' staticmethods to the Unicode class?
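As a sketch of what such helpers might look like, here they are as plain functions rather than staticmethods. The names decode_replace and decode_ignore come from the suggestion above; the signatures are an assumption, and the code uses the Python 3 bytes.decode spelling rather than the str.decode of the time.

```python
def decode_replace(data, codec):
    """Decode bytes, substituting U+FFFD for anything undecodable."""
    return data.decode(codec, "replace")

def decode_ignore(data, codec):
    """Decode bytes, silently dropping anything undecodable."""
    return data.decode(codec, "ignore")

sample = b"na\xefve"  # made-up input: Latin-1 bytes, not valid UTF-8
lenient = decode_replace(sample, "utf-8")
lossy = decode_ignore(sample, "utf-8")
```

Either way, the loss of data stays visible at the call site instead of depending on state set elsewhere.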

Or, am I missing the point entirely, and there's some other circumstance where one gets UnicodeErrors besides .decode()? If the use case is mixing strings and unicode objects (i.e. adding, joining, searching, etc.), then I'd have to say a big fat -1, as opposed to merely a -0 for having other ways to spell .decode(codec,"ignore"). If I in my youth had seen such a flag as you describe, I'd have used it, and then missed out on lots of very educational error messages.
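For the mixing case, a small sketch may help. In the Python 2 of this thread, combining a byte string with a unicode object triggered an implicit ASCII decode, which is where the UnicodeError came from. The example below is written for Python 3 (an assumption of this sketch), where the mix is rejected outright, and it spells out the implicit ASCII step by hand. The sample values are made up.

```python
text = "price: "
raw = b"10 \xa3"  # pound sign encoded as Latin-1

mixing_rejected = False
try:
    text + raw  # Python 3 refuses to mix str and bytes
except TypeError:
    mixing_rejected = True

implicit_ascii_failed = False
try:
    # What Python 2 effectively did behind the scenes when mixing.
    text + raw.decode("ascii")
except UnicodeDecodeError:
    implicit_ascii_failed = True

# Being explicit about the codec is the actual fix.
combined = text + raw.decode("latin-1")
```

Silencing the error in the middle step would yield "price: 10 " with the currency symbol gone, which is the kind of quiet data loss being objected to.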


