UTR#21: Case Mappings

Unicode Standard Annex #21

Version 3.2.0
Authors Mark Davis (mark.davis@us.ibm.com, home)
Date 2001.03.26
This Version http://www.unicode.org/unicode/reports/tr21/tr21-5
Previous Version http://www.unicode.org/unicode/reports/tr21/tr21-4.3
Latest Version http://www.unicode.org/unicode/reports/tr21
Tracking Number 5

Summary

This document presents requirements for default case operations: case conversion, case detection, and caseless matching. These are the default definitions to be used in the absence of tailoring for particular languages and environments.

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved by the Unicode Technical Committee as a Unicode Standard Annex. It is a stable document and may be used as reference material or cited as a normative reference from another document.

A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published as a separate document. Note that conformance to a version of the Unicode Standard includes conformance to its Unicode Standard Annexes. The version number of a UAX document corresponds to the version number of the Unicode Standard at the last point that the UAX document was updated.

A list of current Unicode Technical Reports is found on http://www.unicode.org/unicode/reports/. For more information about versions of the Unicode Standard, see http://www.unicode.org/unicode/standard/versions/.

The References provide related information that is useful in understanding this document. Please mail corrigenda and other comments to the author(s).

1 Introduction

Case is a normative property of characters in specific alphabets (Latin, Greek, Cyrillic, Armenian, and archaic Georgian) whereby characters are considered to be variants of a single letter. These variants, which may differ markedly in shape and size, are called the uppercase letter (also known as capital or majuscule) and the lowercase letter (also known as small or minuscule). The uppercase letter is generally larger than the lowercase letter. Alphabets with case differences are called bicameral; those without are called unicameral.

Note: while the archaic Georgian script contained uppercase and lowercase pairs, they are not used as such in modern Georgian.

Because of the inclusion of certain composite characters for compatibility, such as U+01F1 "DZ" LATIN CAPITAL LETTER DZ, there is a third case, called titlecase, which is used where the first character of a word is to be capitalized. An example of such a character is: U+01F2 "Dz" LATIN CAPITAL LETTER D WITH SMALL LETTER Z.

Thus the three case forms for characters are UPPERCASE, Titlecase, and lowercase.

Note: The term titlecase can also be used to refer to words where the first letter is an uppercase or titlecase letter, and the rest of the letters are lowercase. However, not all words in the title of a document or first words in a sentence will be titlecase.

The choice of which words to titlecase is language-dependent. For example, "Taming of the Shrew" would be the appropriate capitalization in English, not "Taming Of The Shrew". Moreover, the determination of what actually constitutes a word is also language-dependent. For example, l'arbre might be considered two words in French, while can't is considered one word in English.

There are a number of complications to case mappings that occur once the repertoire of characters is expanded beyond ASCII.

1.1 Reversibility

It is important to note that no casing operations on strings are reversible. For example,

toUppercase(toLowercase(“John Brown”)) → “JOHN BROWN”

toLowercase(toUppercase(“John Brown”)) → “john brown”.

There are even single words like vederLa in Italian or the name McGowan in English, which are neither upper, lower, nor titlecase. This format is sometimes called innerCaps, and is often used in programming and in Web names. Once the string "McGowan" has been uppercased, lowercased or titlecased, the original cannot be recovered by applying another uppercase, lowercase, or titlecase operation. There are also single characters that do not have reversible mappings, such as the Greek sigmas: both σ (U+03C3) and final ς (U+03C2) uppercase to Σ (U+03A3).
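The loss of information can be observed with any implementation of the casing operations. The following is a minimal sketch that uses Python's built-in `str.upper`, `str.lower`, and `str.title` as stand-ins for toUppercase, toLowercase, and toTitlecase. That substitution is an assumption made for illustration; these methods are not the operations specified in this document.

```python
original = "McGowan"

# Each operation discards the original pattern of capitals.
upper = original.upper()   # "MCGOWAN"
lower = original.lower()   # "mcgowan"
title = original.title()   # "Mcgowan"

# Applying any further casing operation never brings the original back.
recovered = {f(s) for s in (upper, lower, title)
             for f in (str.upper, str.lower, str.title)}
print(original in recovered)   # False

# Single characters can be irreversible too: both sigmas uppercase to U+03A3.
print("\u03c3".upper() == "\u03c2".upper())   # True
```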

For word processors that use a single command-key sequence to toggle the selection through different casings, it is recommended to save the original string, and return to it in the sequence of keys. The user interface would produce the following results in response to a series of command-keys. Notice that the original string is restored every fourth time.

  1. The quick brown
  2. THE QUICK BROWN
  3. the quick brown
  4. The Quick Brown
  5. The quick brown (repeating from here on)
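The toggle behavior listed above might be sketched as follows. The class name `CaseToggle` and its `press` method are hypothetical, and Python's `str.title` is only an approximation of a titlecase operation.

```python
class CaseToggle:
    """Cycles a selection through casings while keeping the original string."""

    def __init__(self, original):
        self.original = original
        self.step = 0
        # Every form is derived from the saved original, never from the
        # previous form, so the cycle returns to the original exactly.
        self.forms = [lambda s: s, str.upper, str.lower, str.title]

    def press(self):
        """Advance one step in the cycle and return the text to display."""
        self.step = (self.step + 1) % len(self.forms)
        return self.forms[self.step](self.original)


toggle = CaseToggle("The quick brown")
for _ in range(4):
    print(toggle.press())
```

Deriving each form from the stored original, rather than from whatever is currently displayed, is what guarantees that the fourth key press restores the text unchanged.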

Uppercase, titlecase, and lowercase can be represented in a word processor by using a character style. Removing the character style restores the text to its original state. However, if this approach is taken, any spell-checking software needs to be aware of the case style so that it can check the spelling according to the actual appearance.

1.2 Data

The Unicode Character Database contains four files with information that is relevant to case mapping:

[UnicodeData] Contains the case mappings that map to a single character. These do not increase the length of strings, and do not contain context-dependent mappings. Only legacy implementations that cannot handle case mappings that increase string lengths use UnicodeData case mappings alone. The single-character mappings are insufficient for languages such as German.
[SpecialCasing] Contains additional case mappings that map to more than one character, such as "ß" to "SS". It also contains context-dependent mappings, with flags to distinguish them from the normal mappings. There are some characters that have a "best" single-character mapping in UnicodeData and also have a full mapping in SpecialCasing.
[CaseFolding] Contains data for performing locale-independent case-folding, as described in 1.3 Caseless Matching.
[CoreProps] Contains definitions of the properties Lowercase and Uppercase.

A set of charts that show the latest case mappings is also available online.

In addition, Normalization Form D (NFD) from UAX #15, "Unicode Normalization Forms", is used in the definitions for case mapping.

The full case mappings for Unicode characters are obtained by using the mappings from SpecialCasing plus the mappings from UnicodeData, excluding any latter mappings that would conflict. Any character that does not have a mapping in these files is considered to map to itself. In this document, the full case mappings of a character C are referred to as UCD_lower(C), UCD_title(C), and UCD_upper(C). The full case folding of a character C is referred to as UCD_fold(C).
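As an illustration of how the two sources combine, here is a minimal sketch of a context-free UCD_upper lookup. The two dictionaries are tiny hand-written excerpts standing in for the real data files, and the names `SIMPLE_UPPER`, `SPECIAL_UPPER`, and `ucd_upper` are hypothetical. Context-dependent conditions are deliberately ignored.

```python
# Illustrative excerpts only; a real implementation loads the full data files.
SIMPLE_UPPER = {"a": "A", "\u00ff": "\u0178"}   # UnicodeData-style, one-to-one
SPECIAL_UPPER = {                               # SpecialCasing-style, one-to-many
    "\u00df": "SS",          # LATIN SMALL LETTER SHARP S
    "\u01f0": "J\u030c",     # LATIN SMALL LETTER J WITH CARON
    "\ufb01": "FI",          # LATIN SMALL LIGATURE FI
}


def ucd_upper(ch):
    """Full uppercase mapping of one character, ignoring context."""
    if ch in SPECIAL_UPPER:            # a SpecialCasing entry wins over UnicodeData
        return SPECIAL_UPPER[ch]
    return SIMPLE_UPPER.get(ch, ch)    # unmapped characters map to themselves


print(ucd_upper("\u00df"), ucd_upper("a"), ucd_upper("3"))   # SS A 3
```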

When used in case operations, these mappings may depend on the context around each character in the original string. There are very few mappings that require the context, but they are required for correct operation. Because there are very few context-dependent case mappings, implementations may choose to hard-code the treatment of these characters rather than use data-driven code based on the UCD. When this is done, every time the implementation is upgraded to a new version of Unicode, the code must be checked for consistency with the updated data.

1.3 Caseless Matching

Caseless matching is implemented using case-folding. The latter is the process of mapping strings to a canonical form where case differences are erased. Case-folding allows for fast caseless matches in lookups, since only binary comparison is required. Case-folding is more than just conversion to lowercase. For example, it handles cases such as the Greek sigma, so that "Μάϊος" and "ΜΆΪΟΣ" will match correctly.

Note: normally the original source string is not replaced by the folded string, since that may erase important information. For example, the name "Marco di Silva" would be folded to "marco di silva", losing the information as to which letters are capitalized. What is typically done is that the original string is stored along with a case-folded version for fast comparisons.

The [CaseFolding] file in the Unicode Character Database is used for performing locale-independent case-folding. This file is generated from the case mappings in the Unicode Character Database, using both the single-character mappings and the multi-character mappings. It folds all characters having different case forms together into a common form. To compare two strings for caseless matching, you can fold each string using this data, and then use a binary comparison.
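A minimal sketch of this fold-then-compare approach, assuming Python's `str.casefold` as the folding function (the helper name `fold_key` is hypothetical). It also follows the practice noted above of keeping the original string next to its folded form.

```python
def fold_key(s):
    """Return a case-folded key suitable for binary comparison."""
    return s.casefold()


# Keep each original string, indexed by its folded form.
names = {}
for name in ["Marco di Silva", "Stra\u00dfe"]:
    names[fold_key(name)] = name

print(names.get(fold_key("MARCO DI SILVA")))   # Marco di Silva
print(names.get(fold_key("STRASSE")))          # Straße

# Lowercasing alone is not enough: sharp s stays distinct from "ss".
print("Stra\u00dfe".lower() == "STRASSE".lower())   # False
```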

For those concerned with the details, case-folding logically involves a set of equivalence classes, constructed from the Unicode Character Database case mappings as follows.

For each character X in Unicode:

  1. If X is already in an equivalence class, continue to next character.
  2. Otherwise, form a new equivalence class, and add X.
  3. Then add whatever upper-, lower- or titlecases to anything in the set.
  4. Then add whatever anything in the set upper-, lower- or titlecases to.
  5. Repeat #3 and #4 until nothing further is added.

Each equivalence class is completely disjoint from all the others, and together they form a partition of the entire Unicode code space. From each class, one representative element (a single lowercase letter where possible) is chosen to be the common form. [CaseFolding] thus contains the mappings from other characters in the equivalence classes to their common forms.
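The construction of the equivalence classes can be sketched as a closure computation. This is a simplified illustration under several assumptions: Python's `str.upper`, `str.lower`, and `str.title` stand in for the UCD mappings, only single-character results are followed, and the search is limited to a caller-supplied repertoire rather than all of Unicode. The function name `case_classes` is hypothetical.

```python
from collections import defaultdict


def case_classes(repertoire):
    """Map each character to the frozenset of characters in its case class."""
    forward = {}                    # char -> chars it upper/lower/titlecases to
    inverse = defaultdict(set)      # char -> chars that case-map to it
    for ch in repertoire:
        targets = {m for m in (ch.upper(), ch.lower(), ch.title()) if len(m) == 1}
        forward[ch] = targets
        for t in targets:
            inverse[t].add(ch)

    classes = {}
    for ch in repertoire:
        if ch in classes:           # step 1: already placed in a class
            continue
        members = {ch}              # step 2: start a new class with X
        frontier = [ch]
        while frontier:             # steps 3-5: repeat until nothing is added
            cur = frontier.pop()
            for nxt in forward.get(cur, set()) | inverse.get(cur, set()):
                if nxt not in members:
                    members.add(nxt)
                    frontier.append(nxt)
        cls = frozenset(members)
        for m in members:
            classes[m] = cls
    return classes


classes = case_classes([chr(c) for c in range(0x250)])
print(sorted(classes["\u01c6"]))    # the three case forms of the DZ-with-caron digraph
```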

Generally, where case distinctions are not important, other distinctions between Unicode characters (in particular, compatibility distinctions) are ignored as well. In such circumstances, text can be normalized to Normalization Form KC or KD after case-folding, to produce a normalized form that erases both compatibility distinctions and case distinctions. (See UTR #15: Unicode Normalization Forms for more information.) However, such normalization should generally only be done on a restricted repertoire, such as identifiers (alphanumerics).
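For a restricted repertoire such as identifiers, the combination of case-folding and compatibility normalization might look like the following sketch. It uses Python's `str.casefold` and `unicodedata.normalize`; the function name `identifier_key` is hypothetical.

```python
import unicodedata


def identifier_key(s):
    """Case-fold, then apply NFKC, giving a loose matching key."""
    return unicodedata.normalize("NFKC", s.casefold())


# FULLWIDTH LATIN CAPITAL LETTERS A, B, C compare equal to ASCII "abc".
print(identifier_key("\uff21\uff22\uff23") == identifier_key("abc"))   # True
```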

Caseless matching itself is only an approximation to the language-specific rules governing the strength of comparisons. Where language-specific case matching is used, this information can be derived from the collation data for the language, where only the first and second level differences are used. For more information, see UTR #10: Unicode Collation Algorithm.

However, in most environments, such as in file systems, text is not and cannot be tagged with language-specific information. In such cases, the language-specific mappings must not be used. Otherwise, data structures such as B-trees might be built based on one set of case-foldings and used based on a different set. This will cause those data structures to become corrupt. For such environments, a constant, language-independent, default case-folding is required.

1.4 Normalization

Casing operations as defined below do not preserve normalization form. That is, there are strings in a particular normalization form (e.g. NFC) that will no longer be in that form after the casing operation is performed. For example, consider the following strings:

Original (NFC) | ǰ◌̣ | U+01F0 LATIN SMALL LETTER J WITH CARON, U+0323 COMBINING DOT BELOW
Uppercased | J◌̌◌̣ | U+004A LATIN CAPITAL LETTER J, U+030C COMBINING CARON, U+0323 COMBINING DOT BELOW
Uppercased NFC | J◌̣◌̌ | U+004A LATIN CAPITAL LETTER J, U+0323 COMBINING DOT BELOW, U+030C COMBINING CARON

The original string is in NFC format. When uppercased, the small j with caron turns into an uppercase J with a separate caron. If followed by a BELOW combining mark, it is denormalized. The combining marks have to be put in canonical order for it to be normalized.

If text in a particular system is to be consistently normalized to a particular form such as NFC, then the casing operators should be modified to normalize after performing their core function. The actual process can be optimized; there are only a few instances where a casing operation causes a string to become denormalized. If those instances are specifically checked for, then normalization can be avoided where not needed.
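The example above, together with a casing operator that normalizes after performing its core function, can be sketched as follows. This assumes Python's `str.upper` as the uppercasing operation; the wrapper name `upper_nfc` is hypothetical, and no attempt is made at the optimization of skipping normalization when it is not needed.

```python
import unicodedata


def upper_nfc(s):
    """Uppercase, then restore NFC."""
    return unicodedata.normalize("NFC", s.upper())


original = "\u01f0\u0323"          # j with caron, then combining dot below
raw = original.upper()             # "J\u030c\u0323": marks out of canonical order

print(unicodedata.normalize("NFC", original) == original)   # True
print(unicodedata.normalize("NFC", raw) == raw)             # False
print(upper_nfc(original) == "J\u0323\u030c")               # True
```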

Normalization also interacts with case folding. For any string X, let Q(X) = NFC(toCasefold(X)). In other words, Q is the result of casefolding X, then putting the result into NFC format. Because of the way normalization and case folding are defined, Q(Q(X)) = Q(X). Thus repeatedly applying Q does not change the result; case folding is closed under canonical normalization (either NFC or NFD).

Case folding is not, however, closed under compatibility normalization (either NFKD or NFKC). That is, given R(X) = NFKC(toCasefold(X)), there are some strings such that R(R(X)) != R(X). There is a derived property, FC_NFKC_Closure, that contains the additional mappings that can be used to produce a compatibility-closed case folding. This set of mappings is found in [DNormProps].
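A concrete string for which folding followed by compatibility normalization is not stable is shown in the sketch below, taking R to be case folding followed by NFKC and using Python's `str.casefold` as the folding. The helper `closed_fold` simply reapplies R until a fixed point is reached; it is an illustrative alternative and does not use the FC_NFKC_Closure data.

```python
import unicodedata


def R(s):
    """Case folding followed by compatibility composition (NFKC)."""
    return unicodedata.normalize("NFKC", s.casefold())


def closed_fold(s):
    """Reapply R until the result stops changing."""
    prev, cur = s, R(s)
    while cur != prev:
        prev, cur = cur, R(cur)
    return cur


tm = "\u2122"                  # TRADE MARK SIGN, compatibility-equivalent to "TM"
print(R(tm), R(R(tm)))         # TM tm
print(closed_fold(tm))         # tm
```

The trade mark sign has no case mapping, so folding leaves it alone; NFKC then exposes the uppercase letters "TM", which only a second folding pass lowers.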

2 Operations

The following section specifies the default operations for case conversion, case detection, and caseless matching.

2.1 Conformance

C1 An implementation that purports to support the default casing operations of case conversion, case detection, and caseless matching shall do so in accordance with the definitions and specifications below.

The default casing operations are to be used in the absence of tailoring for particular languages and environments. Where a particular environment (such as a Slovak locale) requires tailoring, that can be done without breaking conformance.

All the specifications are logical specifications; particular implementations can optimize the processes as long as they provide the same results.

2.2 Definitions

Detection of case and case mapping requires more than just the general category values (Lu, Lt, Ll). The following definitions are used:

D1. A character C is defined to be cased if it meets any of the following criteria:

D2. A character C is defined to be case-ignorable if it meets either of the following criteria:

D3. A case-ignorable sequence is a sequence of zero or more case-ignorable characters.

D4. A character C is in a particular casing context if and only if it matches the corresponding specification given by the following table:

Context Specification

Context | Specification | Regular Expression
Final_Sigma | C is preceded by a sequence consisting of a cased letter and a case-ignorable sequence, and C is not followed by a sequence consisting of a case-ignorable sequence and then a cased letter. | Before: <cased> <case-ignorable>*; After: !(<case-ignorable>* <cased>)
More_Above | C is followed by one or more characters of combining class 230 (ABOVE) in the combining character sequence. | After: <cc!=0>* <cc=230>
After_Soft_Dotted | The last preceding character with a combining class of zero before C was Soft_Dotted, and there is no intervening character of combining class 230 (ABOVE). | Before: <Soft_Dotted> (<cc!=230> & <cc!=0>)*
Before_Dot | C is followed by combining dot above (U+0307). Any sequence of characters with a combining class that is neither 0 nor 230 may intervene between the current character and the combining dot above. | After: (<cc!=230> & <cc!=0>)* U+0307

The regular expression column provides an equivalent formulation to the specification for those who find it more clear. The syntax uses <...> to indicate a character that matches the specified property.
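Two of the contexts, More_Above and Before_Dot, depend only on combining classes and can be sketched directly from their regular expressions. The sketch uses Python's `unicodedata.combining`, which returns a character's canonical combining class. The function names and the `(s, i)` calling convention (string plus index of C) are assumptions made for illustration.

```python
import unicodedata


def more_above(s, i):
    """After C: <cc!=0>* <cc=230>"""
    for ch in s[i + 1:]:
        cc = unicodedata.combining(ch)
        if cc == 0:            # reached the next base character: no match
            return False
        if cc == 230:          # found a mark of class ABOVE
            return True
    return False


def before_dot(s, i):
    """After C: (<cc!=230> & <cc!=0>)* U+0307"""
    for ch in s[i + 1:]:
        if ch == "\u0307":     # COMBINING DOT ABOVE (itself class 230)
            return True
        cc = unicodedata.combining(ch)
        if cc == 0 or cc == 230:   # a base or another ABOVE mark intervenes
            return False
    return False


print(more_above("i\u0323\u0301", 0))   # True: dot below, then acute above
print(before_dot("I\u0301\u0307", 0))   # False: the acute (class 230) intervenes
```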

2.3 Case Conversion of Strings

The following specify the default case conversion operations for Unicode strings, in the absence of tailoring. In each instance, there are two variants: simple case conversion and full case conversion. In the full case conversion, the context-dependent mappings mentioned above must be used.

S1. toUppercase(X)

S2. toLowercase(X)

S3. toTitlecase(X)

S4. toCasefold(X)

2.4 Case Detection for Strings

The specification of the case of a string is based upon the case conversion operations.

Given a string X, and a string Y = NFD(X), then:

Examples:

Lowercase | a | john smith | a2 | 3
Uppercase | A | JOHN SMITH | A2 | 3
Titlecase | A | John Smith | A2 | 3

As seen from the examples, these conditions are not exclusive. "A2" is both uppercase and titlecase; "3" is uncased, so it is lowercase, uppercase and titlecase.
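The examples can be reproduced with a small sketch. It rests on an assumption that is consistent with the examples but is not quoted from this document: a string is treated as being in a given case when the corresponding conversion leaves its NFD form unchanged. Python's `str.lower`, `str.upper`, and `str.title` stand in for the conversion operations, and `str.title` uses its own word segmentation.

```python
import unicodedata


def _nfd(s):
    return unicodedata.normalize("NFD", s)


def is_lowercase(x):
    y = _nfd(x)
    return y.lower() == y      # assumed definition: lowercasing is a no-op


def is_uppercase(x):
    y = _nfd(x)
    return y.upper() == y      # assumed definition: uppercasing is a no-op


def is_titlecase(x):
    y = _nfd(x)
    return y.title() == y      # assumed definition: titlecasing is a no-op


for s in ["a", "john smith", "JOHN SMITH", "John Smith", "A2", "3"]:
    print(repr(s), is_lowercase(s), is_uppercase(s), is_titlecase(s))
```

Note that this differs from Python's built-in `str.islower` and `str.isupper`, which return False for a string such as "3" that contains no cased characters.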

2.5 Caseless Matching

Default caseless matching is specified by the following:

As described above, normally caseless matching should also use normalization, thus one of the following operations:

References

Modifications

The following summarizes modifications from the previous versions of this document.

5 Expanded definitions to take the new Lowercase and Titlecase properties into account. This also allowed the definitions to be simplified. Added conformance and definitions sections. Moved conditions in from SpecialCasing.txt. Added a discussion of Normalization. Minor editing.
4.3 Defined the sets lower, title, upper, and uniqueUpper instead of relying on the general category. Introduced UCD_title, UCD_upper, UCD_lower notation. Reordered sections of text for clarity. Minor editing.
4.2 Fixed pointer for CaseFolding.txt to point to the UCD. Added text to describe the CaseFolding.txt generation in terms of equivalence classes. Added Modification section. Minor editing.

Copyright © 1999-2002 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.