Issue 13828: Further improve casefold documentation

Created on 2012-01-19 17:06 by Jim.Jewett, last changed 2022-04-11 14:57 by admin.

Messages (11)
msg151644 - (view)	Author: Jim Jewett (Jim.Jewett) *	Date: 2012-01-19 17:06
> http://hg.python.org/cpython/rev/0b5ce36a7a24
> changeset: 74515:0b5ce36a7a24
> + Casefolding is similar to lowercasing but more aggressive because it is
> + intended to remove all case distinctions in a string. For example, the German
> + lowercase letter ``'ß'`` is equivalent to ``"ss"``. Since it is already
> + lowercase, :meth:`lower` would do nothing to ``'ß'``; :meth:`casefold`
> + converts it to ``"ss"``.

Perhaps add the recommendation to canonicalize as well. A complete, but possibly too long, try is below:

Casefolding is similar to lowercasing but more aggressive because it is intended to remove all case distinctions in a string. For example, the German lowercase letter ``'ß'`` is equivalent to ``"ss"``. Since it is already lowercase, :meth:`lower` would do nothing to ``'ß'``; :meth:`casefold` converts it to ``"ss"``.

Note that most case-insensitive matches should also match compatibility equivalent characters.

The casefolding algorithm is described in section 3.13 of the Unicode Standard. Per D146, a compatibility caseless match can be achieved by

    from unicodedata import normalize

    def caseless_compat(string):
        nfd_string = normalize("NFD", string)
        nfkd1_string = normalize("NFKD", nfd_string.casefold())
        return normalize("NFKD", nfkd1_string.casefold())
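As a rough illustration of why the compatibility step might matter, the sketch below reuses the `caseless_compat` recipe from the message above and compares it with plain `str.casefold()` on a fullwidth letter. The helper name `compat_caseless_equal` and the choice of test character are assumptions made for this example only; they are not part of the proposed documentation.

```python
from unicodedata import normalize


def caseless_compat(string):
    """Compatibility caseless form, following the recipe quoted above."""
    nfd_string = normalize("NFD", string)
    nfkd1_string = normalize("NFKD", nfd_string.casefold())
    return normalize("NFKD", nfkd1_string.casefold())


def compat_caseless_equal(a, b):
    # Hypothetical helper (not an existing stdlib function): two strings
    # match if their compatibility caseless forms are identical.
    return caseless_compat(a) == caseless_compat(b)


fullwidth_a = "\uff21"  # FULLWIDTH LATIN CAPITAL LETTER A

# casefold() alone only maps the fullwidth capital to the fullwidth small
# letter, so it does not compare equal to an ASCII "a".
plain_casefold_matches = fullwidth_a.casefold() == "a".casefold()

# The NFKD steps additionally replace the fullwidth form with ASCII "a".
compat_matches = compat_caseless_equal(fullwidth_a, "a")

print(plain_casefold_matches, compat_matches)
```

Under these assumptions, the plain casefold comparison fails while the compatibility comparison succeeds, which is the gap the proposed note is meant to point out.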
msg151645 - (view)	Author: Jim Jewett (Jim.Jewett) *	Date: 2012-01-19 17:09
Frankly, I do think that sample code is too long, but correctness matters. Perhaps a better solution would be to add either a method or a unicodedata function that does the work; then the extra note could just say:

Note that most case-insensitive matches should also match compatibility equivalent characters; see unicodedata.compatibility_casefold
msg151665 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2012-01-20 01:12
It's a bit unfriendly to launch into a discussion of "compatibility caseless matching" when the new reader probably has no idea what "compatibility-equivalence" is.
msg253662 - (view)	Author: Mark Summerfield (mark) *	Date: 2015-10-29 07:14
I think the str.casefold() docs are fine as far as they go, rightly covering what it _does_ rather than _how_, yet providing a reference for the details. But what they lack is more complete information. For example I discovered this:

>>> x = "ﬁles and shuﬄes"
>>> x
'ﬁles and shuﬄes'
>>> x.casefold()
'files and shuffles'

In view of this I would add one sentence:

In addition to lowercasing, this function also expands ligatures, for example, "ﬁ" becomes "fi".
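The session above can be extended into a small script contrasting `lower()` with `casefold()` on the same string. This is only a sketch; the ligatures are written as escapes so that the code points are unambiguous, and the variable names are illustrative.

```python
# "\ufb01" is the "fi" ligature and "\ufb04" is the "ffl" ligature.
text = "\ufb01les and shu\ufb04es"

lowered = text.lower()      # the ligatures are already lowercase
folded = text.casefold()    # full case folding expands them

print(lowered == text)      # lower() leaves the string unchanged
print(folded)
print(len(text), len(folded))
```

Because the folded string is longer than the input, code that relies on character positions being preserved across `casefold()` would need to take that into account.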
msg253797 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2015-10-31 15:36
> In addition to lowercasing, this function also expands ligatures, for example, "ﬁ" becomes "fi".

+1 I would have found that sentence to be helpful.
msg327334 - (view)	Author: Marc Richter (Marc Richter)	Date: 2018-10-08 09:33
+1 as well. To be honest, I still do not understand in detail what this function does. Not too long ago (2017), an uppercase variant of the special letter from this function's example (ß) was added to the official German orthography [1]. Does this function's behavior need to change now, or does the current behavior remain expected? I'm still puzzled, and I think the whole function should get a clearer description.

[1]: https://en.wikipedia.org/wiki/Capital_%E1%BA%9E
msg338689 - (view)	Author: Cheryl Sabella (cheryl.sabella) *	Date: 2019-03-23 16:53
Assigning to @Mariatta for the sprints.
msg375842 - (view)	Author: Thorsten (MrSupertash)	Date: 2020-08-24 13:48
The German example in casefolding is plain incorrect.

#Casefolding is similar to lowercasing but more aggressive because it is
#intended to remove all case distinctions in a string. For example, the
#German lowercase letter 'ß' is equivalent to "ss". Since it is already
#lowercase, lower() would do nothing to 'ß'; casefold() converts it to
#"ss".

It is not true that "ß" is equivalent to "ss", and it has not been since the orthography reform of 1996. The two are used in distinct cases: "ß" after a diphthong or a long/open vowel, "ss" after a short/closed vowel. The documentation correctly describes (in this case) how Python handles .casefold() for this letter, although the behavior itself is incorrect. As mentioned before, in 2017 an official upper-case version of "ß" was introduced into German orthography: "ẞ". The German example should be stated as current incorrect behavior in the documentation.

+1 to adding the previously mentioned sentence: In addition to lowercasing, this function also expands ligatures, for example, "ﬁ" becomes "fi".
msg375844 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2020-08-24 13:52
Correctness of casefolding is defined by the Unicode standard, which currently states that "ß" folds to "ss".
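For readers following the "ß" discussion, the sketch below shows how the string methods in question behave on both the lowercase letter and the capital "ẞ" added to German orthography in 2017. It only reports what Python returns for these inputs and takes no position on the orthography question; the variable names are illustrative.

```python
sharp_s = "\u00df"          # ß, LATIN SMALL LETTER SHARP S
capital_sharp_s = "\u1e9e"  # ẞ, LATIN CAPITAL LETTER SHARP S

results = {
    "ß.lower()": sharp_s.lower(),
    "ß.casefold()": sharp_s.casefold(),
    "ß.upper()": sharp_s.upper(),
    "ẞ.lower()": capital_sharp_s.lower(),
    "ẞ.casefold()": capital_sharp_s.casefold(),
}

for call, value in results.items():
    print(f"{call:14} -> {value!r}")

# Caseless comparison of two spellings of the same word.
same_word = "Stra\u00dfe".casefold() == "STRASSE".casefold()
print(same_word)
```

The point of folding is comparison rather than display: both spellings reduce to the same key, even though the folded form is not itself a correct German spelling.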
msg375847 - (view)	Author: Thorsten (MrSupertash)	Date: 2020-08-24 15:01
I see. I found the documents. That's an issue. That usage is incorrect. It is still valid to upper case "ß" to SS since "ẞ" is fairly new as an official German character, but the other way around is not valid. As such the current sentence in documentation also just does not make sense.

> "Since it is already lowercase, lower() would do nothing to 'ß'"

Exactly. Why would it? It is nonsensical to change an already lowercase character with a lowercase function.

Suggest to update to:

"For example, the Unicode standard for German lower case letter 'ß' prescribes full casefolding to 'ss'. Since it is already lowercase, lower() would do nothing to 'ß'; casefold() converts it to 'ss'. In addition to full lowercasing, this function also expands ligatures, for example, 'ﬁ' becomes 'fi'."
msg375858 - (view)	Author: Jim Jewett (Jim.Jewett) *	Date: 2020-08-24 17:39
Unicode probably won't make the correction, because of backwards compatibility. I do support the sentence suggested in Thorsten's most recent reply. Is expanding ligatures the only other normalization it does? Ideally, we should also mention that it shifts to the canonical case, which is usually (but not always) lowercase. I think Cherokee is one that folds to the upper case.
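The Cherokee remark can be checked with a short sketch. It assumes a Python build whose Unicode database includes the Cherokee small letters (they were added in Unicode 8.0); the choice of the letter A is arbitrary and the variable names are illustrative.

```python
import unicodedata

small_a = "\uab70"    # Cherokee small letter A
capital_a = "\u13a0"  # Cherokee (capital) letter A

print(unicodedata.name(small_a))
print(unicodedata.name(capital_a))

# lower() maps the capital letter to the small one, as usual.
lower_of_capital = capital_a.lower()

# casefold() goes the other way: the small letter folds to the capital,
# and the capital letter is left unchanged.
fold_of_small = small_a.casefold()
fold_of_capital = capital_a.casefold()

print(lower_of_capital == small_a)
print(fold_of_small == capital_a, fold_of_capital == capital_a)
```

So a description of `casefold()` as "lowercasing plus extras" would not hold for every script, which supports describing the result as a canonical caseless form instead.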

History
Date	User	Action	Args
2022-04-11 14:57:25	admin	set	github: 58036
2020-08-24 17:39:41	Jim.Jewett	set	messages: +
2020-08-24 15:01:42	MrSupertash	set	messages: +
2020-08-24 13:52:36	benjamin.peterson	set	messages: +
2020-08-24 13:48:30	MrSupertash	set	nosy: + MrSupertash; messages: +
2019-03-23 16:53:57	cheryl.sabella	set	versions: + Python 3.7, Python 3.8, - Python 3.3; nosy: + Mariatta, cheryl.sabella; messages: +; assignee: docs@python -> Mariatta; stage: needs patch
2018-10-08 09:33:46	Marc Richter	set	nosy: + Marc Richter; messages: +
2015-10-31 15:36:13	rhettinger	set	nosy: + rhettinger; messages: +
2015-10-29 07:14:19	mark	set	nosy: + mark; messages: +
2012-01-20 01:12:41	benjamin.peterson	set	messages: +
2012-01-19 17:09:52	Jim.Jewett	set	messages: +
2012-01-19 17:06:02	Jim.Jewett	create