[Python-Dev] Normalizing unicode? (was: Re: test_unicode_file failing on Mac OS X)

Edward Loper edloper at gradient.cis.upenn.edu
Wed Dec 10 13:32:22 EST 2003


Scott David Daniels wrote:

I naïvely wrote:
> Could we perhaps use a comparison that, in effect, did:
>     def uniequal(first, second):
>         if first == second:
>             return True
>         return first.normalize() == second.normalize()
> That is, take advantage of the fact that normalization is often
> unnecessary for "trivial" reasons.

[...]

Before we start considering how it's possible to make unicode equality act encoding-insensitively[1], I think we need to consider whether that's really the behavior we want. In some ways, this seems analogous to case-insensitive equality: it's certainly a useful operation, but I don't think it should be the object's builtin notion of equality.
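(Spelled out with the real API -- strings have no .normalize method, so Scott's sketch is pseudocode; the actual function is unicodedata.normalize -- and using modern Python's str in the role of unicode, the comparison might look roughly like this:

    import unicodedata

    def uniequal(first, second):
        # Fast path: identical code point sequences are certainly equal.
        if first == second:
            return True
        # Slow path: compare canonical (NFC) normal forms.
        return (unicodedata.normalize('NFC', first) ==
                unicodedata.normalize('NFC', second))

    composed = '\u00e9'        # 'é' as a single precomposed code point
    decomposed = 'e\u0301'     # 'e' followed by a combining acute accent
    assert composed != decomposed       # builtin == is form-sensitive
    assert uniequal(composed, decomposed)

The assertion pair is exactly the behavior under discussion: the builtin == distinguishes the two forms, while the normalizing comparison treats them as equal.)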

And if you do want unicode objects to act normalized, then I think that the right way to do it is to normalize them at creation time. Then all the right hash/eq/cmp stuff just falls out.

But since some people may want to distinguish different encodings of the same string, I think that the most sensible alternative is to add a new subclass of unicode -- something like "normalized_unicode". It would normalize itself at construction time; and when combined with other unicode strings (e.g. by +), the result would be normalized (so unicode + normalized_unicode -> normalized_unicode). It's possible that the normalized unicode class would be more useful to people (and therefore more widely used?), but the non-normalized version would still be available for people who want it.
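(A minimal sketch of what such a subclass might look like, assuming NFC as the normal form and with modern Python's str standing in for unicode -- the name and the details are hypothetical, since the post doesn't pin down an implementation:

    import unicodedata

    class normalized_unicode(str):
        """A str subclass whose contents are always in NFC normal form."""

        def __new__(cls, value=''):
            # Normalize once, at construction time; hash/eq/cmp then
            # behave consistently for free, since str's own methods
            # operate on the already-normalized data.
            return super().__new__(cls, unicodedata.normalize('NFC', value))

        def __add__(self, other):
            # normalized_unicode + str -> normalized_unicode
            return normalized_unicode(str(self) + str(other))

        def __radd__(self, other):
            # str + normalized_unicode -> normalized_unicode
            return normalized_unicode(str(other) + str(self))

    s = normalized_unicode('e\u0301')   # built from decomposed input
    assert s == '\u00e9'                # stored in composed (NFC) form
    assert type('abc' + s) is normalized_unicode

Defining __radd__ on the subclass is what makes plain-string + normalized-string yield a normalized result: because normalized_unicode is a subclass of str and provides the reflected method, Python tries it before str's own __add__.)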

(or we could just leave things as they are now, and force people to do any normalization themselves. :) )

-Edward

[1] I don't think that "encoding" is the right technical term here; "canonical equivalence" is probably closer. I mean insensitive to the difference between decomposed diacritics (a base character plus combining marks) and precomposed characters.


