[Python-Dev] test_unicode_file failing on Mac OS X

Martin v. Löwis martin at v.loewis.de
Sun Dec 7 12:56:54 EST 2003


Jack Jansen <Jack.Jansen at cwi.nl> writes:

This is probably related to the two flavors of unicode there are, one which prefers to have all accents separately from the letters as much as possible and one which prefers the reverse. I keep forgetting the names of the two, they're somewhat silly.

OS X uses what is called the "decomposed normal form" (NFD), splitting precomposed characters into the base character and the combining accent.

Python supports either form, but will use precomposed characters more often than not.

And while there are algorithms to convert the combined form of unicode to the uncombined form and vice versa there are no Python codecs to do this.

Not as a codec, but as unicodedata.normalize. If you do

unicodedata.normalize("NFD", composed_string)

you get the string that OS X wants you to use.
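As a minimal sketch of what that call does (assuming Python 3 string literals; the sample character is only an illustration), note that unicodedata.normalize takes the form name first and the string second:

```python
import unicodedata

# Precomposed form: a single code point, LATIN SMALL LETTER E WITH ACUTE.
composed = "\u00e9"

# NFD splits it into the base letter followed by the combining accent.
decomposed = unicodedata.normalize("NFD", composed)
print([hex(ord(c)) for c in decomposed])  # ['0x65', '0x301']

# NFC goes the other way and recombines the pair into one code point.
recomposed = unicodedata.normalize("NFC", decomposed)
print(recomposed == composed)  # True
```

The two strings render identically but compare unequal until both are brought to the same normal form.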

Of course, with Unicode on Windows, the situation is mostly the reverse. NTFS/Win32 does not perform any normalization, so you can actually store the precomposed and the decomposed name simultaneously in the same directory (which is confusing). The platform codecs always generate the precomposed form, though, so you are more likely to find the precomposed form on disk.

For the test, it would be best to compare normal forms, and have the test pass if the normal forms (NFD) are equal.
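A rough sketch of such a comparison follows. The helper names same_name and name_in_listing are hypothetical and are not taken from test_unicode_file; the listing is assumed to be a list of file names such as os.listdir might return.

```python
import unicodedata


def same_name(a, b):
    """Return True if a and b are equal once both are normalized to NFD."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def name_in_listing(name, listing):
    """Return True if name matches any entry in listing, ignoring
    differences between precomposed and decomposed spellings."""
    target = unicodedata.normalize("NFD", name)
    return any(unicodedata.normalize("NFD", entry) == target
               for entry in listing)


# The file was created with a precomposed name, but the file system
# (as on OS X) reports the decomposed spelling.
created = "caf\u00e9.txt"
reported = ["cafe\u0301.txt"]
print(created in reported)                 # False: raw comparison fails
print(name_in_listing(created, reported))  # True: normal forms agree
```

Normalizing both sides keeps the test independent of which form the platform happens to store.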

Regards, Martin


