MySQL :: MySQL 5.7 Reference Manual :: 10.10.1 Unicode Character Sets (original) (raw)

10.10.1 Unicode Character Sets

This section describes the collations available for Unicode character sets and their differentiating properties. For general information about Unicode, seeSection 10.9, “Unicode Support”.

MySQL supports multiple Unicode character sets:

utf8mb4: A UTF-8 encoding of the Unicode character set using one to four bytes per character.
utf8mb3: A UTF-8 encoding of the Unicode character set using one to three bytes per character.
utf8: An alias forutf8mb3.
ucs2: The UCS-2 encoding of the Unicode character set using two bytes per character.
utf16: The UTF-16 encoding for the Unicode character set using two or four bytes per character. Like ucs2 but with an extension for supplementary characters.
utf16le: The UTF-16LE encoding for the Unicode character set. Like utf16 but little-endian rather than big-endian.
utf32: The UTF-32 encoding for the Unicode character set using four bytes per character.

utf8mb4, utf16,utf16le, and utf32 support Basic Multilingual Plane (BMP) characters and supplementary characters that lie outside the BMP. utf8 anducs2 support only BMP characters.

Most Unicode character sets have a general collation (indicated by _general in the name or by the absence of a language specifier), a binary collation (indicated by_bin in the name), and several language-specific collations (indicated by language specifiers). For example, for utf8mb4,utf8mb4_general_ci andutf8mb4_bin are its general and binary collations, and utf8mb4_danish_ci is one of its language-specific collations.

Collation support for utf16le is limited. The only collations available areutf16le_general_ci andutf16le_bin. These are similar toutf16_general_ci andutf16_bin.

Unicode Collation Algorithm (UCA) Versions

MySQL implements the_`xxx`__unicode_ci collations according to the Unicode Collation Algorithm (UCA) described athttp://www.unicode.org/reports/tr10/. The collation uses the version-4.0.0 UCA weight keys:http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt. The_`xxx`__unicode_ci collations have only partial support for the Unicode Collation Algorithm. Some characters are not supported, and combining marks are not fully supported. This affects primarily Vietnamese, Yoruba, and some smaller languages such as Navajo. A combined character is considered different from the same character written with a single unicode character in string comparisons, and the two characters are considered to have a different length (for example, as returned by theCHAR_LENGTH() function or in result set metadata).

Unicode collations based on UCA versions higher than 4.0.0 include the version in the collation name. Thus,utf8_unicode_520_ci is based on UCA 5.2.0 weight keys (http://www.unicode.org/Public/UCA/5.2.0/allkeys.txt).

The LOWER() andUPPER() functions perform case folding according to the collation of their argument. A character that has uppercase and lowercase versions only in a Unicode version higher than 4.0.0 is converted by these functions only if the argument collation uses a high enough UCA version.

Language-Specific Collations

MySQL implements language-specific Unicode collations if the ordering based only on the Unicode Collation Algorithm (UCA) does not work well for a language. Language-specific collations are UCA-based, with additional language tailoring rules. Examples of such rules appear later in this section. For questions about particular language orderings,http://unicode.org provides Common Locale Data Repository (CLDR) collation charts athttp://www.unicode.org/cldr/charts/30/collation/index.html.

A language name shown in the following table indicates a language-specific collation. Unicode character sets may include collations for one or more of these languages.

Table 10.3 Unicode Collation Language Specifiers

Language	Language Specifier
Classical Latin	roman
Croatian	croatian
Czech	czech
Danish	danish
Esperanto	esperanto
Estonian	estonian
German phone book order	german2
Hungarian	hungarian
Icelandic	icelandic
Latvian	latvian
Lithuanian	lithuanian
Persian	persian
Polish	polish
Romanian	romanian
Sinhala	sinhala
Slovak	slovak
Slovenian	slovenian
Modern Spanish	spanish
Traditional Spanish	spanish2
Swedish	swedish
Turkish	turkish
Vietnamese	vietnamese

Croatian collations are tailored for these Croatian letters:Č, Ć,Dž, Đ,Lj, Nj,Š, Ž.

Danish collations may also be used for Norwegian.

For Classical Latin collations, I andJ compare as equal, andU and V compare as equal.

Spanish collations are available for modern and traditional Spanish. For both, ñ (n-tilde) is a separate letter between n ando. In addition, for traditional Spanish,ch is a separate letter betweenc and d, andll is a separate letter betweenl and m.

Traditional Spanish collations may also be used for Asturian and Galician.

Swedish collations include Swedish rules. For example, in Swedish, the following relationship holds, which is not something expected by a German or French speaker:

Ü = Y < Ö

_general_ci Versus _unicode_ci Collations

For any Unicode character set, operations performed using the_`xxx`__general_ci collation are faster than those for the_`xxx`__unicode_ci collation. For example, comparisons for theutf8_general_ci collation are faster, but slightly less correct, than comparisons forutf8_unicode_ci. The reason is thatutf8_unicode_ci supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example,ß is equal to ss in German and some other languages.utf8_unicode_ci also supports contractions and ignorable characters. utf8_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.

To further illustrate, the following equalities hold in bothutf8_general_ci andutf8_unicode_ci (for the effect of this in comparisons or searches, seeSection 10.8.6, “Examples of the Effect of Collation”):

Ä = A
Ö = O
Ü = U

A difference between the collations is that this is true forutf8_general_ci:

ß = s

Whereas this is true for utf8_unicode_ci, which supports the German DIN-1 ordering (also known as dictionary order):

ß = ss

MySQL implements utf8 language-specific collations if the ordering withutf8_unicode_ci does not work well for a language. For example, utf8_unicode_ci works fine for German dictionary order and French, so there is no need to create special utf8 collations.

utf8_general_ci also is satisfactory for both German and French, except that ß is equal to s, and not toss. If this is acceptable for your application, you should use utf8_general_ci because it is faster. If this is not acceptable (for example, if you require German dictionary order), useutf8_unicode_ci because it is more accurate.

If you require German DIN-2 (phone book) ordering, use theutf8_german2_ci collation, which compares the following sets of characters equal:

Ä = Æ = AE
Ö = Œ = OE
Ü = UE
ß = ss

utf8_german2_ci is similar tolatin1_german2_ci, but the latter does not compare Æ equal to AE or Œ equal to OE. There is no utf8_german_ci corresponding tolatin1_german_ci for German dictionary order because utf8_general_ci suffices.

Character Collating Weights

A character's collating weight is determined as follows:

For all Unicode collations except the_bin (binary) collations, MySQL performs a table lookup to find a character's collating weight.
For _bin collations, the weight is based on the code point, possibly with leading zero bytes added.

Collating weights can be displayed using theWEIGHT_STRING() function. (SeeSection 12.8, “String Functions and Operators”.) If a collation uses a weight lookup table, but a character is not in the table (for example, because it is a “new” character), collating weight determination becomes more complex:

For BMP characters in general collations (_`xxx`__general_ci), the weight is the code point.
For BMP characters in UCA collations (for example,_`xxx`__unicode_ci and language-specific collations), the following algorithm applies:

if (code >= 0x3400 && code <= 0x4DB5)  
  base= 0xFB80; /* CJK Ideograph Extension */  
else if (code >= 0x4E00 && code <= 0x9FA5)  
  base= 0xFB40; /* CJK Ideograph */  
else  
  base= 0xFBC0; /* All other characters */  
aaaa= base +  (code >> 15);  
bbbb= (code & 0x7FFF) | 0x8000;

The result is a sequence of two collating elements,aaaa followed bybbbb. For example:

mysql> SELECT HEX(WEIGHT_STRING(_ucs2 0x04CF COLLATE ucs2_unicode_ci));  
+----------------------------------------------------------+  
| HEX(WEIGHT_STRING(_ucs2 0x04CF COLLATE ucs2_unicode_ci)) |  
+----------------------------------------------------------+  
| FBC084CF                                                 |  
+----------------------------------------------------------+

Thus, U+04cf CYRILLIC SMALL LETTER PALOCHKA is, with all UCA 4.0.0 collations, greater than U+04c0 CYRILLIC LETTER PALOCHKA. With UCA 5.2.0 collations, all palochkas sort together.

For supplementary characters in general collations, the weight is the weight for 0xfffd REPLACEMENT CHARACTER. For supplementary characters in UCA 4.0.0 collations, their collating weight is0xfffd. That is, to MySQL, all supplementary characters are equal to each other, and greater than almost all BMP characters.
An example with Deseret characters andCOUNT(DISTINCT):

CREATE TABLE t (s1 VARCHAR(5) CHARACTER SET utf32 COLLATE utf32_unicode_ci);  
INSERT INTO t VALUES (0xfffd);   /* REPLACEMENT CHARACTER */  
INSERT INTO t VALUES (0x010412); /* DESERET CAPITAL LETTER BEE */  
INSERT INTO t VALUES (0x010413); /* DESERET CAPITAL LETTER TEE */  
SELECT COUNT(DISTINCT s1) FROM t;

The result is 2 because in the MySQL_`xxx`__unicode_ci collations, the replacement character has a weight of0x0dc6, whereas Deseret Bee and Deseret Tee both have a weight of 0xfffd. (Were the utf32_general_ci collation used instead, the result is 1 because all three characters have a weight of 0xfffd in that collation.)
An example with cuneiform characters andWEIGHT_STRING():

/*  
The four characters in the INSERT string are  
00000041  # LATIN CAPITAL LETTER A  
0001218F  # CUNEIFORM SIGN KAB  
000121A7  # CUNEIFORM SIGN KISH  
00000042  # LATIN CAPITAL LETTER B  
*/  
CREATE TABLE t (s1 CHAR(4) CHARACTER SET utf32 COLLATE utf32_unicode_ci);  
INSERT INTO t VALUES (0x000000410001218f000121a700000042);  
SELECT HEX(WEIGHT_STRING(s1)) FROM t;

The result is:

0E33 FFFD FFFD 0E4A

0E33 and 0E4A are primary weights as inUCA 4.0.0. FFFD is the weight for KAB and also for KISH.
The rule that all supplementary characters are equal to each other is nonoptimal but is not expected to cause trouble. These characters are very rare, so it is very rare that a multi-character string consists entirely of supplementary characters. In Japan, since the supplementary characters are obscure Kanji ideographs, the typical user does not care what order they are in, anyway. If you really want rows sorted by the MySQL rule and secondarily by code point value, it is easy:

ORDER BY s1 COLLATE utf32_unicode_ci, s1 COLLATE utf32_bin

For supplementary characters based on UCA versions higher than 4.0.0 (for example,_`xxx`__unicode_520_ci), supplementary characters do not necessarily all have the same collating weight. Some have explicit weights from the UCA allkeys.txt file. Others have weights calculated from this algorithm:

aaaa= base +  (code >> 15);  
bbbb= (code & 0x7FFF) | 0x8000;

There is a difference between “ordering by the character's code value” and “ordering by the character's binary representation,” a difference that appears only with utf16_bin, because of surrogates.

Suppose that utf16_bin (the binary collation for utf16) was a binary comparison “byte by byte” rather than“character by character.” If that were so, the order of characters in utf16_bin would differ from the order in utf8_bin. For example, the following chart shows two rare characters. The first character is in the rangeE000-FFFF, so it is greater than a surrogate but less than a supplementary. The second character is a supplementary.

Code point  Character                    utf8         utf16
----------  ---------                    ----         -----
0FF9D       HALFWIDTH KATAKANA LETTER N  EF BE 9D     FF 9D
10384       UGARITIC LETTER DELTA        F0 90 8E 84  D8 00 DF 84

The two characters in the chart are in order by code point value because 0xff9d <0x10384. And they are in order byutf8 value because 0xef < 0xf0. But they are not in order byutf16 value, if we use byte-by-byte comparison, because 0xff >0xd8.

So MySQL's utf16_bin collation is not“byte by byte.” It is “by code point.” When MySQL sees a supplementary-character encoding in utf16, it converts to the character's code-point value, and then compares. Therefore,utf8_bin and utf16_bin are the same ordering. This is consistent with the SQL:2008 standard requirement for a UCS_BASIC collation:“UCS_BASIC is a collation in which the ordering is determined entirely by the Unicode scalar values of the characters in the strings being sorted. It is applicable to the UCS character repertoire. Since every character repertoire is a subset of the UCS repertoire, the UCS_BASIC collation is potentially applicable to every character set. NOTE 11: The Unicode scalar value of a character is its code point treated as an unsigned integer.”

If the character set is ucs2, comparison is byte-by-byte, but ucs2 strings should not contain surrogates, anyway.

Miscellaneous Information

The_`xxx`__general_mysql500_ci collations preserve the pre-5.1.24 ordering of the original_`xxx`__general_ci collations and permit upgrades for tables created before MySQL 5.1.24 (Bug #27877).