[Python-Dev] Unicode mapping tables

Tim Peters tim_one@email.msn.com
Wed, 1 Mar 2000 01:50:44 -0500


[M.-A. Lemburg]

... Currently, mapping tables map characters to Unicode characters and vice-versa. Now the .translate method will use a different kind of table: mapping integer ordinals to integer ordinals.

You mean that if I want to map u"a" to u"A", I have to set up some sort of dict mapping ord(u"a") to ord(u"A")? I simply couldn't follow this.
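For concreteness, the kind of ordinal-to-ordinal table being asked about might look like the sketch below. It relies on how str.translate behaves in present-day Python 3, which is an assumption relative to the API under discussion in this thread; the name `table` is purely illustrative.

```python
# Sketch: a translate table keyed by integer ordinals rather than characters.
# Assumes present-day Python 3 str.translate semantics, not the API proposed here.
table = {ord("a"): ord("A")}

result = "banana".translate(table)
print(result)  # bAnAnA

# str.maketrans builds the same kind of ordinal-keyed dict from characters.
assert str.maketrans("a", "A") == table
```

So the answer to the question, at least in that later API, is yes: the keys and values are ordinals, and a helper hides the ord() calls.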

Question: What is more efficient: having lots of integers in a dictionary or lots of characters?

My bet is "lots of integers", to reduce both space use and comparison time.

... Something else that changed is the way .capitalize() works. The Unicode version uses the Unicode algorithm for it (see TechRep. 13 on the www.unicode.org site).

#13 is "Unicode Newline Guidelines". I assume you meant #21 ("Case Mappings").

Here's the new doc string:

S.capitalize() -> unicode

Return a capitalized version of S, i.e. words start with title case characters, all remaining cased characters have lower case.

Note that all characters are touched, not just the first one. The change was needed to get it in sync with the .iscapitalized() method, which is based on the Unicode algorithm too. Should this change be propagated to the string implementation?

Unicode makes distinctions among "upper case", "lower case" and "title case", and you're trying to get away with a single "capitalize" function. Java has separate toLowerCase, toUpperCase and toTitleCase methods, and that's the way to do it.

Whatever you do, leave .capitalize alone for 8-bit strings -- there's no reason to break code that currently works. "capitalize" seems a terrible choice of name for a titlecase method anyway, because of the baggage it carries from 8-bit strings.

Since this stuff is complicated, I say it would be much better to use the same names for these things as the Unicode and Java folk do: there's excellent documentation elsewhere for all this stuff, and it's Bad to make users mentally translate unique Python terminology to make sense of the official docs.
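To see why three separate operations are needed, consider a character that has three distinct case forms. The sketch below uses the DZ-with-caron digraph and the str methods of present-day Python 3 (an assumption; those methods postdate this message), and the variable names are illustrative only.

```python
import unicodedata

# U+01C4, U+01C5 and U+01C6 are the upper, title and lower case forms
# of the same DZ-with-caron digraph.
upper_dz, title_dz, lower_dz = "\u01c4", "\u01c5", "\u01c6"

# Each mapping lands on a different code point.
assert lower_dz.upper() == upper_dz
assert lower_dz.title() == title_dz
assert upper_dz.lower() == lower_dz

# The Unicode general categories confirm the three-way split.
categories = [unicodedata.category(c) for c in (upper_dz, title_dz, lower_dz)]
print(categories)  # ['Lu', 'Lt', 'Ll']
```

A single "capitalize" operation cannot express both the upper case and the title case result for such a character.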

So my vote is: leave capitalize the hell alone. Do not implement capitalize for Unicode strings. Introduce a new titlecase method for Unicode strings. Add a new titlecase method to 8-bit strings too. Unicode strings should also have methods to get at uppercase and lowercase (as Unicode defines those).
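As a rough illustration of the difference between the two behaviors, here is how capitalize and a separate title-casing method compare in present-day Python 3. This is an assumption relative to this thread: the method there is spelled title(), and bytes stands in for the 8-bit string type.

```python
s = "hello wORLD"

# capitalize: only the first character is raised, the rest are lowered.
capitalized = s.capitalize()
# title: every word starts with a title case character.
titlecased = s.title()

print(capitalized)  # Hello world
print(titlecased)   # Hello World

# The 8-bit analogue (bytes in Python 3) keeps the same split.
b = b"hello wORLD"
assert b.capitalize() == b"Hello world"
assert b.title() == b"Hello World"
```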