[Python-ideas] adding a casefold() method to str

Steven D'Aprano steve at pearwood.info
Sun Jan 8 17:58:17 CET 2012


Benjamin Peterson wrote:

> Hi,
> Casefolding (Unicode Standard 3.13) is a more aggressive version of lowercasing. Its purpose is to assist in the implementation of caseless matching. For example, under lowercase "ß" -> "ß" but under casefolding "ß" -> "ss".
>
> I propose we add a casefold() method. So, case-insensitive matching should really be "one.casefold() == two.casefold()" rather than "one.lower() == two.lower()".

+1 in principle, but in practice case folding is more complicated than a single method might imply. The most obvious complication is treatment of dotted and dotless I.
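To make the dotted/dotless I problem concrete, here is a small sketch. It assumes Python 3.3 or later, where str.casefold() exists and, as far as I can tell, applies full folding without the Turkic mappings. The variable names are only for illustration.

```python
# Assumes Python 3.3+, where str.casefold() is available.
dotless_i = "\u0131"  # LATIN SMALL LETTER DOTLESS I
dotted_I = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE

# The motivating case: lower() leaves sharp s alone, casefold() expands it.
sharp_s_lower = "ß".lower()
sharp_s_folded = "ß".casefold()

# Dotted capital I folds to "i" followed by COMBINING DOT ABOVE (U+0307),
# so it does not compare equal to a plain folded "i".
dotted_folded = dotted_I.casefold()
dotted_matches_plain_i = dotted_folded == "i".casefold()

# Dotless i uppercases to plain "I", but has no default folding of its own,
# so a round trip through upper() changes what it matches.
dotless_folded = dotless_i.casefold()
dotless_roundtrip = dotless_i.upper().casefold()
roundtrip_matches = dotless_roundtrip == dotless_folded

print(repr(sharp_s_lower), repr(sharp_s_folded))
print(ascii(dotted_folded), dotted_matches_plain_i)
print(ascii(dotless_folded), ascii(dotless_roundtrip), roundtrip_matches)
```

For Turkish or Azeri text one would instead want "I" to fold to dotless i and dotted capital I to fold to plain "i", which is what the T entries in CaseFolding.txt describe.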

See, for example:

http://unicode.org/Public/UNIDATA/CaseFolding.txt
http://www.w3.org/International/wiki/Case_folding
http://en.wikipedia.org/wiki/Letter_case#Unicode_case_folding_and_script_identification

So while having proper Unicode case-folding is desirable, I don't know how simple it is to implement.

Would it be appropriate for casefold() to take an optional argument as to which mappings to use? E.g. something like:

    str.casefold()  # defaults to simple folding
    str.casefold(string.SIMPLE & string.TURKIC)
    str.casefold(string.FULL)

or should str.casefold() only apply simple folding, with the other combinations relegated to a function in a module somewhere?

I count 4 possible functions:

    simple casefolding, without Turkic I
    full casefolding, without Turkic I
    simple casefolding, with Turkic I
    full casefolding, with Turkic I
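A rough sketch of how those four variants could fall out of the status column in CaseFolding.txt (C = common, S = simple, F = full, T = Turkic). The table below is only a hand-copied excerpt of a few entries, and the names parse_table, TABLE and fold, along with the full/turkic keyword arguments, are hypothetical; none of this is a proposed API.

```python
# A few entries in CaseFolding.txt format: code; status; mapping; # name
# This is an excerpt for illustration, not the full data file.
CASEFOLDING_EXCERPT = """\
0041; C; 0061; # LATIN CAPITAL LETTER A
0049; C; 0069; # LATIN CAPITAL LETTER I
0049; T; 0131; # LATIN CAPITAL LETTER I
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""


def parse_table(text):
    """Map (character, status) to its folded string."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        code, status, mapping = [f.strip() for f in line.split(";")[:3]]
        source = chr(int(code, 16))
        target = "".join(chr(int(cp, 16)) for cp in mapping.split())
        table[(source, status)] = target
    return table


TABLE = parse_table(CASEFOLDING_EXCERPT)


def fold(text, full=False, turkic=False):
    """Fold text using the excerpt table (hypothetical helper).

    Turkic mappings, when requested, take priority; then F or S
    depending on `full`; then the common C mappings. Characters that
    are not in the excerpt are passed through unchanged.
    """
    statuses = (["T"] if turkic else []) + (["F"] if full else ["S"]) + ["C"]
    out = []
    for ch in text:
        for status in statuses:
            if (ch, status) in TABLE:
                out.append(TABLE[(ch, status)])
                break
        else:
            out.append(ch)
    return "".join(out)


# The four combinations applied to capital I:
for full in (False, True):
    for turkic in (False, True):
        print(full, turkic, ascii(fold("I", full=full, turkic=turkic)))
```

One thing this makes visible: simple folding never changes the length of the string, while full folding can ("ß" becomes "ss"), which may matter for whatever the implementation does with buffers.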

-- Steven


