[Python-Dev] Re: test_unicode_file failing on Mac OS X
Scott David Daniels Scott.Daniels at Acm.Org
Thu Dec 11 09:27:59 EST 2003
I naïvely wrote:
Could we perhaps use a comparison that, in effect, did:

    def uni_equal(first, second):
        if first == second:
            return True
        return first.normalize() == second.normalize()

That is, take advantage of the fact that normalization is often unnecessary for "trivial" reasons.
This works, and a similar "unequal" trick may be constructible. Ordering is certainly trickier: we would have to assure a total order given the new equalities, so that we cannot choose a, b, and c where: a < b = c > a is True.
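
As a rough illustration of the comparison above, here is a minimal runnable sketch. It assumes the standard-library function unicodedata.normalize stands in for the normalize() method of the pseudocode, and that NFC is the form wanted; the form parameter is my own addition for illustration.

```python
import unicodedata

def uni_equal(first, second, form="NFC"):
    """Compare two strings, normalizing only when the cheap test fails."""
    # Fast path: identical strings are equal without any normalization.
    if first == second:
        return True
    # Slow path: compare the normalized forms (form is an assumed parameter).
    return unicodedata.normalize(form, first) == unicodedata.normalize(form, second)

composed = "\u00e9"      # e with acute accent as a single code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

print(composed == decomposed)            # plain comparison
print(uni_equal(composed, decomposed))   # normalization-aware comparison
```

The fast path means strings that are already identical never pay for normalization; only the mismatching pairs do.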
But, Martin v. Löwis points out:
It also affects hashing, if Unicode objects are used as dictionary keys. Objects that compare equal need to hash equal.
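
To make the constraint concrete, here is a small sketch of a key type whose equality and hash agree. NormalizedKey is a hypothetical wrapper invented for this example, not anything in Python itself, and NFC is an assumed choice of form.

```python
import unicodedata

class NormalizedKey:
    """Hypothetical dictionary key that compares and hashes by normalized text."""

    def __init__(self, text):
        self.text = text
        # Cache the normalized form so __eq__ and __hash__ use the same data.
        self._canonical = unicodedata.normalize("NFC", text)

    def __eq__(self, other):
        if not isinstance(other, NormalizedKey):
            return NotImplemented
        return self._canonical == other._canonical

    def __hash__(self):
        # Equal keys share a canonical form, so they share a hash.
        return hash(self._canonical)

table = {NormalizedKey("\u00e9"): "found"}
print(table[NormalizedKey("e\u0301")])   # lookup by the decomposed spelling
```

If __hash__ used the raw text instead, the two spellings would compare equal yet usually land in different hash buckets, and the lookup would fail.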
Still not disgusting, but unicode strings must hash equal to the corresponding "plain" string. I am not certain about this requirement for non-ASCII characters, but I expect we are stuck with matching hashes in the range ord(' ') through ord('~') and probably for all character values from 0 through 127. We might be able to classify UTF-16 code units into three groups:
1. matches base ASCII character
2. diacritical or combining
3. definitely distinct from any ASCII or combining form.

If we map the group 1 entries to the corresponding ASCII code, skip the group 2s, and take the group 3s separately (probably remapping to another set), we might come up with a hash that used only the map results as elements contributing to the hash.
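
The three-group idea can be sketched roughly as follows. This is only one possible reading, under several assumptions: it works on code points rather than UTF-16 code units, it decomposes the text with unicodedata.normalize("NFD", ...) so that precomposed characters expose their base character, it treats any code point with a nonzero canonical combining class as group 2, and it leaves group 3 characters unremapped. The names classify and folded_hash are hypothetical.

```python
import unicodedata

def classify(ch):
    """Return the group (1, 2 or 3) of an already decomposed code point."""
    if ord(ch) < 128:
        return 1                      # base ASCII character
    if unicodedata.combining(ch) != 0:
        return 2                      # combining mark (nonzero combining class)
    return 3                          # anything else

def folded_hash(text):
    """Hash only the group 1 and group 3 code points of the decomposed text."""
    kept = []
    for ch in unicodedata.normalize("NFD", text):
        if classify(ch) == 2:
            continue                  # skip diacriticals and combining marks
        kept.append(ch)               # group 3 is kept as-is (no remapping here)
    # A pure-ASCII string is unchanged, so it hashes like the plain string.
    return hash("".join(kept))

print(folded_hash("\u00e9") == folded_hash("e\u0301"))
print(folded_hash("abc") == hash("abc"))
```

Note that this deliberately makes "e" and its accented forms collide; that is legal for a hash, though it may cost some dictionary performance for heavily accented text.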
Are we stuck with the current hash for unicode? If so, there is little hope. If not, this might bear further investigation.
-Scott David Daniels Scott.Daniels at Acm.Org