12.13 Adding a Character Set (original) (raw)

This section discusses the procedure for adding a character set to MySQL. The proper procedure depends on whether the character set is simple or complex:

For example, greek and swe7 are simple character sets, whereas big5 andczech are complex character sets.

To use the following instructions, you must have a MySQL source distribution. In the instructions,MYSET represents the name of the character set that you want to add.

  1. Add a <charset> element for_MYSET_ to thesql/share/charsets/Index.xml file. Use the existing contents in the file as a guide to adding new contents. A partial listing for the latin1 <charset> element follows:
<charset name="latin1">  
  <family>Western</family>  
  <description>cp1252 West European</description>  
  ...  
  <collation name="latin1_swedish_ci" id="8" order="Finnish, Swedish">  
    <flag>primary</flag>  
    <flag>compiled</flag>  
  </collation>  
  <collation name="latin1_danish_ci" id="15" order="Danish"/>  
  ...  
  <collation name="latin1_bin" id="47" order="Binary">  
    <flag>binary</flag>  
    <flag>compiled</flag>  
  </collation>  
  ...  
</charset>  

The <charset> element must list all the collations for the character set. These must include at least a binary collation and a default (primary) collation. The default collation is often named using a suffix ofgeneral_ci (general, case-insensitive). It is possible for the binary collation to be the default collation, but usually they are different. The default collation should have a primary flag. The binary collation should have a binary flag.
You must assign a unique ID number to each collation. The range of IDs from 1024 to 2047 is reserved for user-defined collations. To find the maximum of the currently used collation IDs, use this query:

SELECT MAX(ID) FROM INFORMATION_SCHEMA.COLLATIONS;  
  1. This step depends on whether you are adding a simple or complex character set. A simple character set requires only a configuration file, whereas a complex character set requires C source file that defines collation functions, multibyte functions, or both.
    For a simple character set, create a configuration file,_`MYSET`_.xml, that describes the character set properties. Create this file in the sql/share/charsets directory. You can use a copy of latin1.xml as the basis for this file. The syntax for the file is very simple:
    • Comments are written as ordinary XML comments (<!-- _`text`_ -->).
    • Words within <map> array elements are separated by arbitrary amounts of whitespace.
    • Each word within <map> array elements must be a number in hexadecimal format.
    • The <map> array element for the<ctype> element has 257 words. The other <map> array elements after that have 256 words. SeeSection 12.13.1, “Character Definition Arrays”.
    • For each collation listed in the<charset> element for the character set in Index.xml,_`MYSET`_.xml must contain a <collation> element that defines the character ordering.
      For a complex character set, create a C source file that describes the character set properties and defines the support routines necessary to properly perform operations on the character set:
    • Create the filectype-_`MYSET`_.c in the strings directory. Look at one of the existing ctype-*.c files (such as ctype-big5.c) to see what needs to be defined. The arrays in your file must have names likectype_ _`MYSET`_,to_lower_ _`MYSET`_, and so on. These correspond to the arrays for a simple character set. See Section 12.13.1, “Character Definition Arrays”.
    • For each <collation> element listed in the <charset> element for the character set in Index.xml, thectype-_`MYSET`_.c file must provide an implementation of the collation.
    • If the character set requires string collating functions, see Section 12.13.2, “String Collating Support for Complex Character Sets”.
    • If the character set requires multibyte character support, see Section 12.13.3, “Multi-Byte Character Support for Complex Character Sets”.
  2. Modify the configuration information. Use the existing configuration information as a guide to adding information for_MYSYS_. The example here assumes that the character set has default and binary collations, but more lines are needed if MYSET has additional collations.
    1. Edit mysys/charset-def.c, and“register” the collations for the new character set.
      Add these lines to the “declaration” section:
    #ifdef HAVE_CHARSET_MYSET  
    extern CHARSET_INFO my_charset_MYSET_general_ci;  
    extern CHARSET_INFO my_charset_MYSET_bin;  
    #endif  

    Add these lines to the “registration” section:

    #ifdef HAVE_CHARSET_MYSET  
      add_compiled_collation(&my_charset_MYSET_general_ci);  
      add_compiled_collation(&my_charset_MYSET_bin);  
    #endif  
    1. If the character set usesctype-_`MYSET`_.c, edit strings/CMakeLists.txt and addctype-_`MYSET`_.c to the definition of theSTRINGS_SOURCES variable.
    2. Edit cmake/character_sets.cmake:
      1. Add MYSET to the value of with CHARSETS_AVAILABLE in alphabetic order.
      2. Add MYSET to the value ofCHARSETS_COMPLEX in alphabetic order. This is needed even for simple character sets, so that CMake can recognize-DDEFAULT_CHARSET=MYSET.
  3. Reconfigure, recompile, and test.