MySQL :: MySQL 8.4 Reference Manual :: 12.10.1 Unicode Character Sets (original) (raw)
12.10.1 Unicode Character Sets
This section describes the collations available for Unicode character sets and their differentiating properties. For general information about Unicode, seeSection 12.9, “Unicode Support”.
MySQL supports multiple Unicode character sets:
utf8mb4
: A UTF-8 encoding of the Unicode character set using one to four bytes per character.utf8mb3
: A UTF-8 encoding of the Unicode character set using one to three bytes per character. This character set is deprecated; please useutf8mb4
instead.utf8
: A deprecated alias forutf8mb3
. Useutf8mb4
instead.
Noteutf8
is expected in a future release to become an alias forutf8mb4
.ucs2
: The UCS-2 encoding of the Unicode character set using two bytes per character. Deprecated; expect support for this character set to be removed in a future version of MySQL.utf16
: The UTF-16 encoding for the Unicode character set using two or four bytes per character. Likeucs2
but with an extension for supplementary characters.utf16le
: The UTF-16LE encoding for the Unicode character set. Likeutf16
but little-endian rather than big-endian.utf32
: The UTF-32 encoding for the Unicode character set using four bytes per character.
Note
The utf8mb3
character set is deprecated and you should expect it to be removed in a future MySQL release. Please use utf8mb4
instead.utf8
is currently an alias forutf8mb3
, but it is now deprecated as such, and utf8
is expected subsequently to become a reference to utf8mb4
.utf8mb3
is also displayed in place ofutf8
in columns of Information Schema tables, and in the output of SQL SHOW
statements.
To avoid ambiguity about the meaning ofutf8
, consider specifyingutf8mb4
explicitly for character set references.
utf8mb4
, utf16
,utf16le
, and utf32
support Basic Multilingual Plane (BMP) characters and supplementary characters that lie outside the BMP. utf8mb3
and ucs2
support only BMP characters.
Most Unicode character sets have a general collation (indicated by _general
in the name or by the absence of a language specifier), a binary collation (indicated by_bin
in the name), and several language-specific collations (indicated by language specifiers). For example, for utf8mb4
,utf8mb4_general_ci
andutf8mb4_bin
are its general and binary collations, and utf8mb4_danish_ci
is one of its language-specific collations.
Most character sets have a single binary collation.utf8mb4
is an exception that has two:utf8mb4_bin
andutf8mb4_0900_bin
. These two binary collations have the same sort order but are distinguished by their pad attribute and collating weight characteristics. SeeCollation Pad Attributes, andCharacter Collating Weights.
Collation support for utf16le
is limited. The only collations available areutf16le_general_ci
andutf16le_bin
. These are similar toutf16_general_ci
andutf16_bin
.
- Unicode Collation Algorithm (UCA) Versions
- Collation Pad Attributes
- Language-Specific Collations
- _general_ci Versus _unicode_ci Collations
- Character Collating Weights
- Miscellaneous Information
Unicode Collation Algorithm (UCA) Versions
MySQL implements the_`xxx`__unicode_ci
collations according to the Unicode Collation Algorithm (UCA) described athttp://www.unicode.org/reports/tr10/. The collation uses the version-4.0.0 UCA weight keys:http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt. The_`xxx`__unicode_ci
collations have only partial support for the Unicode Collation Algorithm. Some characters are not supported, and combining marks are not fully supported. This affects languages such as Vietnamese, Yoruba, and Navajo. A combined character is considered different from the same character written with a single unicode character in string comparisons, and the two characters are considered to have a different length (for example, as returned by theCHAR_LENGTH() function or in result set metadata).
Unicode collations based on UCA versions higher than 4.0.0 include the version in the collation name. Examples:
utf8mb4_unicode_520_ci
is based on UCA 5.2.0 weight keys (http://www.unicode.org/Public/UCA/5.2.0/allkeys.txt),utf8mb4_0900_ai_ci
is based on UCA 9.0.0 weight keys (http://www.unicode.org/Public/UCA/9.0.0/allkeys.txt).
The LOWER() andUPPER() functions perform case folding according to the collation of their argument. A character that has uppercase and lowercase versions only in a Unicode version higher than 4.0.0 is converted by these functions only if the argument collation uses a high enough UCA version.
Collation Pad Attributes
Collations based on UCA 9.0.0 and higher are faster than collations based on UCA versions prior to 9.0.0. They also have a pad attribute of NO PAD
, in contrast to PAD SPACE
as used in collations based on UCA versions prior to 9.0.0. For comparison of nonbinary strings, NO PAD
collations treat spaces at the end of strings like any other character (seeTrailing Space Handling in Comparisons).
To determine the pad attribute for a collation, use theINFORMATION_SCHEMA
COLLATIONS table, which has aPAD_ATTRIBUTE
column. For example:
mysql> SELECT COLLATION_NAME, PAD_ATTRIBUTE
FROM INFORMATION_SCHEMA.COLLATIONS
WHERE CHARACTER_SET_NAME = 'utf8mb4';
+----------------------------+---------------+
| COLLATION_NAME | PAD_ATTRIBUTE |
+----------------------------+---------------+
| utf8mb4_general_ci | PAD SPACE |
| utf8mb4_bin | PAD SPACE |
| utf8mb4_unicode_ci | PAD SPACE |
| utf8mb4_icelandic_ci | PAD SPACE |
...
| utf8mb4_0900_ai_ci | NO PAD |
| utf8mb4_de_pb_0900_ai_ci | NO PAD |
| utf8mb4_is_0900_ai_ci | NO PAD |
...
| utf8mb4_ja_0900_as_cs | NO PAD |
| utf8mb4_ja_0900_as_cs_ks | NO PAD |
| utf8mb4_0900_as_ci | NO PAD |
| utf8mb4_ru_0900_ai_ci | NO PAD |
| utf8mb4_ru_0900_as_cs | NO PAD |
| utf8mb4_zh_0900_as_cs | NO PAD |
| utf8mb4_0900_bin | NO PAD |
+----------------------------+---------------+
Comparison of nonbinary string values (CHAR
, VARCHAR
, andTEXT
) that have a NO PAD
collation differ from other collations with respect to trailing spaces. For example, 'a'
and'a '
compare as different strings, not the same string. This can be seen using the binary collations for utf8mb4
. The pad attribute forutf8mb4_bin
is PAD SPACE
, whereas forutf8mb4_0900_bin
it is NO PAD
. Consequently, operations involvingutf8mb4_0900_bin
do not add trailing spaces, and comparisons involving strings with trailing spaces may differ for the two collations:
mysql> CREATE TABLE t1 (c CHAR(10) COLLATE utf8mb4_bin);
Query OK, 0 rows affected (0.03 sec)
mysql> INSERT INTO t1 VALUES('a');
Query OK, 1 row affected (0.01 sec)
mysql> SELECT * FROM t1 WHERE c = 'a ';
+------+
| c |
+------+
| a |
+------+
1 row in set (0.00 sec)
mysql> ALTER TABLE t1 MODIFY c CHAR(10) COLLATE utf8mb4_0900_bin;
Query OK, 0 rows affected (0.02 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> SELECT * FROM t1 WHERE c = 'a ';
Empty set (0.00 sec)
Language-Specific Collations
MySQL implements language-specific Unicode collations if the ordering based only on the Unicode Collation Algorithm (UCA) does not work well for a language. Language-specific collations are UCA-based, with additional language tailoring rules. Examples of such rules appear later in this section. For questions about particular language orderings,http://unicode.org provides Common Locale Data Repository (CLDR) collation charts athttp://www.unicode.org/cldr/charts/30/collation/index.html.
For example, the nonlanguage-specificutf8mb4_0900_ai_ci
and language-specificutf8mb4_ _`LOCALE`__0900_ai_ci
Unicode collations each have these characteristics:
- The collation is based on UCA 9.0.0 and CLDR v30, is accent-insensitive and case-insensitive. These characteristics are indicated by
_0900
,_ai
, and_ci
in the collation name. Exception:utf8mb4_la_0900_ai_ci
is not based on CLDR because Classical Latin is not defined in CLDR. - The collation works for all characters in the range [U+0, U+10FFFF].
- If the collation is not language specific, it sorts all characters, including supplementary characters, in default order (described following). If the collation is language specific, it sorts characters of the language correctly according to language-specific rules, and characters not in the language in default order.
- By default, the collation sorts characters having a code point listed in the DUCET table (Default Unicode Collation Element Table) according to the weight value assigned in the table. The collation sorts characters not having a code point listed in the DUCET table using their implicit weight value, which is constructed according to the UCA.
- For non-language-specific collations, characters in contraction sequences are treated as separate characters. For language-specific collations, contractions might change character sorting order.
A collation name that includes a locale code or language name shown in the following table is a language-specific collation. Unicode character sets may include collations for one or more of these languages.
Table 12.3 Unicode Collation Language Specifiers
Language | Language Specifier |
---|---|
Bosnian | bs |
Bulgarian | bg |
Chinese | zh |
Classical Latin | la or roman |
Croatian | hr or croatian |
Czech | cs or czech |
Danish | da or danish |
Esperanto | eo or esperanto |
Estonian | et or estonian |
Galician | gl |
German phone book order | de_pb or german2 |
Hungarian | hu or hungarian |
Icelandic | is or icelandic |
Japanese | ja |
Latvian | lv or latvian |
Lithuanian | lt or lithuanian |
Mongolian | mn |
Norwegian / Bokmål | nb |
Norwegian / Nynorsk | nn |
Persian | persian |
Polish | pl or polish |
Romanian | ro or romanian |
Russian | ru |
Serbian | sr |
Sinhala | sinhala |
Slovak | sk or slovak |
Slovenian | sl or slovenian |
Modern Spanish | es or spanish |
Traditional Spanish | es_trad or spanish2 |
Swedish | sv or swedish |
Turkish | tr or turkish |
Vietnamese | vi or vietnamese |
MySQL provides the Bulgarian collationsutf8mb4_bg_0900_ai_ci
andutf8mb4_bg_0900_as_cs
.
Croatian collations are tailored for these Croatian letters:Č
, Ć
,Dž
, Đ
,Lj
, Nj
,Š
, Ž
.
MySQL provides theutf8mb4_sr_latn_0900_ai_ci
andutf8mb4_sr_latn_0900_as_cs
collations for Serbian and the utf8mb4_bs_0900_ai_ci
andutf8mb4_bs_0900_as_cs
collations for Bosnian, when these languages are written with the Latin alphabet.
MySQL provides collations for both major varieties of Norwegian: for Bokmål, you can useutf8mb4_nb_0900_ai_ci
andutf8mb4_nb_0900_as_cs
; for Nynorsk, MySQL now provides utf8mb4_nn_0900_ai_ci
andutf8mb4_nn_0900_as_cs
.
For Japanese, the utf8mb4
character set includes utf8mb4_ja_0900_as_cs
andutf8mb4_ja_0900_as_cs_ks
collations. Both collations are accent-sensitive and case-sensitive.utf8mb4_ja_0900_as_cs_ks
is also kana-sensitive and distinguishes Katakana characters from Hiragana characters, whereasutf8mb4_ja_0900_as_cs
treats Katakana and Hiragana characters as equal for sorting. Applications that require a Japanese collation but not kana sensitivity may useutf8mb4_ja_0900_as_cs
for better sort performance. utf8mb4_ja_0900_as_cs
uses three weight levels for sorting;utf8mb4_ja_0900_as_cs_ks
uses four.
For Classical Latin collations that are accent-insensitive,I
and J
compare as equal, and U
and V
compare as equal. I
andJ
, and U
andV
compare as equal on the base letter level. In other words, J
is regarded as an accented I
, and U
is regarded as an accented V
.
MySQL provides collations for the Mongolian language when written with Cyrillic characters,utf8mb4_mn_cyrl_0900_ai_ci
andutf8mb4_mn_cyrl_0900_as_cs
.
Spanish collations are available for modern and traditional Spanish. For both, ñ
(n-tilde) is a separate letter between n
ando
. In addition, for traditional Spanish,ch
is a separate letter betweenc
and d
, andll
is a separate letter betweenl
and m
.
Traditional Spanish collations may also be used for Asturian and Galician. MySQL also providesutf8mb4_gl_0900_ai_ci
andutf8mb4_gl_0900_as_cs
collations for Galician. (These are the same collations asutf8mb4_es_0900_ai_ci
andutf8mb4_es_0900_as_cs
, respectively.)
Swedish collations include Swedish rules. For example, in Swedish, the following relationship holds, which is not something expected by a German or French speaker:
Ü = Y < Ö
_general_ci Versus _unicode_ci Collations
For any Unicode character set, operations performed using the_`xxx`__general_ci
collation are faster than those for the_`xxx`__unicode_ci
collation. For example, comparisons for theutf8mb4_general_ci
collation are faster, but slightly less correct, than comparisons forutf8mb4_unicode_ci
. The reason is thatutf8mb4_unicode_ci
supports mappings such as expansions; that is, when one character compares as equal to combinations of other characters. For example,ß
is equal to ss
in German and some other languages.utf8mb4_unicode_ci
also supports contractions and ignorable characters.utf8mb4_general_ci
is a legacy collation that does not support expansions, contractions, or ignorable characters. It can make only one-to-one comparisons between characters.
To further illustrate, the following equalities hold in bothutf8mb4_general_ci
andutf8mb4_unicode_ci
(for the effect of this in comparisons or searches, seeSection 12.8.6, “Examples of the Effect of Collation”):
Ä = A
Ö = O
Ü = U
A difference between the collations is that this is true forutf8mb4_general_ci
:
ß = s
Whereas this is true forutf8mb4_unicode_ci
, which supports the German DIN-1 ordering (also known as dictionary order):
ß = ss
MySQL implements language-specific Unicode collations if the ordering with utf8mb4_unicode_ci
does not work well for a language. For example,utf8mb4_unicode_ci
works fine for German dictionary order and French, so there is no need to create special utf8mb4
collations.
utf8mb4_general_ci
also is satisfactory for both German and French, except that ß
is equal to s
, and not toss
. If this is acceptable for your application, you should useutf8mb4_general_ci
because it is faster. If this is not acceptable (for example, if you require German dictionary order), use utf8mb4_unicode_ci
because it is more accurate.
If you require German DIN-2 (phone book) ordering, use theutf8mb4_german2_ci
collation, which compares the following sets of characters equal:
Ä = Æ = AE
Ö = Œ = OE
Ü = UE
ß = ss
utf8mb4_german2_ci
is similar tolatin1_german2_ci
, but the latter does not compare Æ
equal to AE
or Œ
equal to OE
. There is no utf8mb4_german_ci
corresponding tolatin1_german_ci
for German dictionary order because utf8mb4_general_ci
suffices.
Character Collating Weights
A character's collating weight is determined as follows:
- For all Unicode collations except the
_bin
(binary) collations, MySQL performs a table lookup to find a character's collating weight. - For
_bin
collations exceptutf8mb4_0900_bin
, the weight is based on the code point, possibly with leading zero bytes added. - For
utf8mb4_0900_bin
, the weight is theutf8mb4
encoding bytes. The sort order is the same as forutf8mb4_bin
, but much faster.
Collating weights can be displayed using theWEIGHT_STRING() function. (SeeSection 14.8, “String Functions and Operators”.) If a collation uses a weight lookup table, but a character is not in the table (for example, because it is a “new” character), collating weight determination becomes more complex:
- For BMP characters in general collations (
_`xxx`__general_ci
), the weight is the code point. - For BMP characters in UCA collations (for example,
_`xxx`__unicode_ci
and language-specific collations), the following algorithm applies:
if (code >= 0x3400 && code <= 0x4DB5)
base= 0xFB80; /* CJK Ideograph Extension */
else if (code >= 0x4E00 && code <= 0x9FA5)
base= 0xFB40; /* CJK Ideograph */
else
base= 0xFBC0; /* All other characters */
aaaa= base + (code >> 15);
bbbb= (code & 0x7FFF) | 0x8000;
The result is a sequence of two collating elements,aaaa
followed bybbbb
. For example:
mysql> SELECT HEX(WEIGHT_STRING(_ucs2 0x04CF COLLATE ucs2_unicode_ci));
+----------------------------------------------------------+
| HEX(WEIGHT_STRING(_ucs2 0x04CF COLLATE ucs2_unicode_ci)) |
+----------------------------------------------------------+
| FBC084CF |
+----------------------------------------------------------+
Thus, U+04cf CYRILLIC SMALL LETTER PALOCHKA
(ӏ
) is, with all UCA 4.0.0 collations, greater than U+04c0 CYRILLIC LETTER PALOCHKA
(Ӏ
). With UCA 5.2.0 collations, all palochkas sort together.
- For supplementary characters in general collations, the weight is the weight for
0xfffd REPLACEMENT CHARACTER
. For supplementary characters in UCA 4.0.0 collations, their collating weight is0xfffd
. That is, to MySQL, all supplementary characters are equal to each other, and greater than almost all BMP characters.
An example with Deseret characters andCOUNT(DISTINCT)
:
CREATE TABLE t (s1 VARCHAR(5) CHARACTER SET utf32 COLLATE utf32_unicode_ci);
INSERT INTO t VALUES (0xfffd); /* REPLACEMENT CHARACTER */
INSERT INTO t VALUES (0x010412); /* DESERET CAPITAL LETTER BEE */
INSERT INTO t VALUES (0x010413); /* DESERET CAPITAL LETTER TEE */
SELECT COUNT(DISTINCT s1) FROM t;
The result is 2 because in the MySQL_`xxx`__unicode_ci
collations, the replacement character has a weight of0x0dc6
, whereas Deseret Bee and Deseret Tee both have a weight of 0xfffd
. (Were the utf32_general_ci
collation used instead, the result is 1 because all three characters have a weight of 0xfffd
in that collation.)
An example with cuneiform characters andWEIGHT_STRING():
/*
The four characters in the INSERT string are
00000041 # LATIN CAPITAL LETTER A
0001218F # CUNEIFORM SIGN KAB
000121A7 # CUNEIFORM SIGN KISH
00000042 # LATIN CAPITAL LETTER B
*/
CREATE TABLE t (s1 CHAR(4) CHARACTER SET utf32 COLLATE utf32_unicode_ci);
INSERT INTO t VALUES (0x000000410001218f000121a700000042);
SELECT HEX(WEIGHT_STRING(s1)) FROM t;
The result is:
0E33 FFFD FFFD 0E4A
0E33
and 0E4A
are primary weights as inUCA 4.0.0. FFFD
is the weight for KAB and also for KISH.
The rule that all supplementary characters are equal to each other is nonoptimal but is not expected to cause trouble. These characters are very rare, so it is very rare that a multi-character string consists entirely of supplementary characters. In Japan, since the supplementary characters are obscure Kanji ideographs, the typical user does not care what order they are in, anyway. If you really want rows sorted by the MySQL rule and secondarily by code point value, it is easy:
ORDER BY s1 COLLATE utf32_unicode_ci, s1 COLLATE utf32_bin
- For supplementary characters based on UCA versions higher than 4.0.0 (for example,
_`xxx`__unicode_520_ci
), supplementary characters do not necessarily all have the same collating weight. Some have explicit weights from the UCAallkeys.txt
file. Others have weights calculated from this algorithm:
aaaa= base + (code >> 15);
bbbb= (code & 0x7FFF) | 0x8000;
There is a difference between “ordering by the character's code value” and “ordering by the character's binary representation,” a difference that appears only with utf16_bin
, because of surrogates.
Suppose that utf16_bin
(the binary collation for utf16
) was a binary comparison “byte by byte” rather than“character by character.” If that were so, the order of characters in utf16_bin
would differ from the order in utf8mb4_bin
. For example, the following chart shows two rare characters. The first character is in the rangeE000
-FFFF
, so it is greater than a surrogate but less than a supplementary. The second character is a supplementary.
Code point Character utf8mb4 utf16
---------- --------- ------- -----
0FF9D HALFWIDTH KATAKANA LETTER N EF BE 9D FF 9D
10384 UGARITIC LETTER DELTA F0 90 8E 84 D8 00 DF 84
The two characters in the chart are in order by code point value because 0xff9d
<0x10384
. And they are in order byutf8mb4
value because0xef
< 0xf0
. But they are not in order by utf16
value, if we use byte-by-byte comparison, because 0xff
>0xd8
.
So MySQL's utf16_bin
collation is not“byte by byte.” It is “by code point.” When MySQL sees a supplementary-character encoding in utf16
, it converts to the character's code-point value, and then compares. Therefore,utf8mb4_bin
andutf16_bin
are the same ordering. This is consistent with the SQL:2008 standard requirement for a UCS_BASIC collation: “UCS_BASIC is a collation in which the ordering is determined entirely by the Unicode scalar values of the characters in the strings being sorted. It is applicable to the UCS character repertoire. Since every character repertoire is a subset of the UCS repertoire, the UCS_BASIC collation is potentially applicable to every character set. NOTE 11: The Unicode scalar value of a character is its code point treated as an unsigned integer.”
If the character set is ucs2
, comparison is byte-by-byte, but ucs2
strings should not contain surrogates, anyway.
Miscellaneous Information
The_`xxx`__general_mysql500_ci
collations preserve the pre-5.1.24 ordering of the original_`xxx`__general_ci
collations and permit upgrades for tables created before MySQL 5.1.24 (Bug #27877).