[Python-Dev] Normalizing unicode? (was: Re: test_unicode_file failing on Mac OS X)

Guido van Rossum guido at python.org
Wed Dec 10 12:39:16 EST 2003


Before we start considering how it's possible to make unicode.equal act encoding-insensitively[1], I think we need to consider whether that's really the behavior we want. In some ways, this seems like case-insensitive equality to me: it's certainly a useful operation, but I don't think it should be the object's builtin notion of equality.

- I think people will be confused if s1==s2 but s1[0]!=s2[0].
- Sometimes you might want to distinguish different encodings of the "same" string; a "normalized" equality test makes that very difficult.

Right. Couldn't have said it better myself.
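
To make the two objections concrete, here is a minimal sketch written for current Python, where the text type is `str` rather than the `unicode` type discussed in this thread. The helper name `normalized_equal` and the choice of NFC as the normalization form are assumptions made for illustration; neither comes from the thread.

```python
import unicodedata

# Two encodings of the "same" string: precomposed vs. decomposed e-acute.
s1 = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE
s2 = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT


def normalized_equal(a, b, form="NFC"):
    """Hypothetical helper: compare two strings after normalizing both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


plain_equal = (s1 == s2)               # builtin equality compares code points
norm_equal = normalized_equal(s1, s2)  # explicit, opt-in normalized comparison
first_chars_equal = (s1[0] == s2[0])   # indexing still exposes the difference
lengths = (len(s1), len(s2))
```

If `==` itself behaved like `normalized_equal`, the two strings would compare equal while their first characters and lengths still differed, which is the confusion described in the first bullet. Keeping the normalized comparison as a separate, explicit operation avoids that.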

And if you do want unicode objects to act normalized, then I think that the right way to do it is to normalize them at creation time. Then all the right hash/eq/cmp stuff just falls out.

Exactly.

But since some people may want to distinguish different encodings of the same string, I think that the most sensible alternative is to add a new subclass to unicode -- something like "normalizedunicode". It would normalize itself at construction time; and when combined with other unicode strings (e.g. by +), the result would be normalized (so unicode+normalizedunicode -> normalizedunicode). It's possible that the normalized unicode class would be more useful to people (and therefore more widely used?), but the non-normalized version would still be available for people who want it.

Works for me. I recommend that someone try this approach as a user subclass first -- this should be easy enough, right?
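
A user subclass along these lines might look like the sketch below. It targets current Python, so it subclasses `str` instead of `unicode`; the class name `NormalizedStr` and the fixed NFC form are assumptions for illustration, not anything proposed in the thread.

```python
import unicodedata


class NormalizedStr(str):
    """Hypothetical str subclass that normalizes itself at construction time."""

    FORM = "NFC"  # assumed normalization form; the thread does not pick one

    def __new__(cls, value=""):
        # Normalize once, up front, so the inherited hash and comparison
        # methods operate on the normalized text without further work.
        return super().__new__(cls, unicodedata.normalize(cls.FORM, value))

    def __add__(self, other):
        if not isinstance(other, str):
            return NotImplemented
        # Concatenation can create new combining sequences, so re-normalize.
        return type(self)(str.__add__(self, other))

    def __radd__(self, other):
        if not isinstance(other, str):
            return NotImplemented
        # Covers plain str + NormalizedStr, which also yields a NormalizedStr.
        return type(self)(str.__add__(other, self))


composed = NormalizedStr("e\u0301")          # decomposed input
left = "e" + NormalizedStr("\u0301")         # plain str on the left
right = NormalizedStr("e") + "\u0301"        # plain str on the right
```

Only construction and `+` are handled here. Other operations inherited from `str` (slicing, `join`, formatting, and so on) would return plain, possibly non-normalized strings, so a complete version would need to wrap those as well.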

(or we could just leave things as they are now, and force people to do any normalization themselves. :) )

Do we even have normalization code in core Python?

--Guido van Rossum (home page: http://www.python.org/~guido/)


