Unicode 6.1.0

Released: 2012 January 31 (Announcement)

Version 6.1.0 has been superseded by the latest version of the Unicode Standard.

Unicode 6.1.0 is a minor version of the Unicode Standard. This page summarizes the important changes for the Unicode Standard, Version 6.1.0. In the discussion below, Version 6.1.0 may be abbreviated as "Unicode 6.1" or "Version 6.1."

Contents of This Document

A. Summary
B. Version Information
C. Stability Policy Update
D. Textual Changes and Character Additions
E. Conformance Changes
F. Unicode Character Database Changes
G. Unicode Standard Annex Changes

A. Summary
Version 6.1 of the Unicode Standard continues the Unicode Consortium's long-term commitment to support the full diversity of languages around the world. This latest version adds characters to support additional languages of China, other Asian countries, and Africa. It also addresses educational needs in the Arabic-speaking world. A total of 732 new characters have been added.

This version of the Standard also brings technical improvements to support implementers. Changes to property values and their aliases mean that properties now have labels which are easier to use systematically in programs. The new labels, combined with a new Script_Extensions property, mean that regular expressions can be more straightforward and easier to validate. The Hangul algorithms were consolidated and restructured: previously one had to examine four separate documents, whereas the information is now consolidated in Chapter 3, Conformance, of the core specification.

Over 200 new Standardized Variants have been added for emoji characters, allowing implementations to distinguish preferred display styles between text and emoji styles. For example:

26FA FE0E TENT text style

26FA FE0F TENT emoji style

26FD FE0E FUEL PUMP text style

26FD FE0F FUEL PUMP emoji style
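
As a rough illustration of how such sequences are formed, the following Python sketch appends a variation selector to a base character. The helper names `styled` and `code_points` are hypothetical and not part of any Unicode data file; U+FE0E and U+FE0F are the selectors shown in the examples above. Whether a renderer honors the requested style depends on the font and platform.

```python
# Variation selectors used in the sequences listed above.
TEXT_STYLE = "\uFE0E"   # VARIATION SELECTOR-15
EMOJI_STYLE = "\uFE0F"  # VARIATION SELECTOR-16


def styled(base: str, emoji: bool) -> str:
    """Return base followed by the selector requesting text or emoji style."""
    if len(base) != 1:
        raise ValueError("expected a single base character")
    return base + (EMOJI_STYLE if emoji else TEXT_STYLE)


def code_points(s: str) -> str:
    """Format a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(ch):04X}" for ch in s)


if __name__ == "__main__":
    print(code_points(styled("\u26FA", emoji=False)))  # TENT, text style
    print(code_points(styled("\u26FD", emoji=True)))   # FUEL PUMP, emoji style
```

This sketch does not check that the base character actually has a standardized variation sequence; a fuller implementation would consult StandardizedVariants.txt in the UCD.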

Among the notable property changes and additions in Unicode 6.1 are two new line break property values, which improve the line-breaking behavior of Hebrew and Japanese text. Segmentation behavior was also improved for Thai, Lao, and similar languages. The processing of Chinese data has been augmented by more fully specified information on mapping between Simplified and Traditional Chinese characters, along with other improved Unihan data.

For detailed property changes see Section F. Unicode Character Database Changes.

Version 6.1 has minor conformance updates, affecting the determination of grapheme cluster boundaries and the handling of canonical combining class and decomposition mapping. There are documentation improvements throughout.

Two other important Unicode specifications are maintained in synchrony with the Unicode Standard, and have updates for Version 6.1:

UTS #10, Unicode Collation Algorithm

UTS #46, Unicode IDNA Compatibility Processing

This version of the Unicode Standard is synchronized in repertoire with the forthcoming third edition of 10646: ISO/IEC 10646:2012.

B. Version Information
Version 6.1 of the Unicode Standard consists of the core specification, the delta and archival code charts for this version, the Unicode Standard Annexes, and the Unicode Character Database (UCD).

The core specification gives the general principles, requirements for conformance, and guidelines for implementers. The code charts show representative glyphs for all the Unicode characters. The Unicode Standard Annexes supply detailed normative information about particular aspects of the standard. The Unicode Character Database supplies normative and informative data for implementers to allow them to implement the Unicode Standard.

Version 6.1.0 of the Unicode Standard should be referenced as:

The Unicode Consortium. The Unicode Standard, Version 6.1.0, (Mountain View, CA: The Unicode Consortium, 2012. ISBN 978-1-936213-02-3)
http://www.unicode.org/versions/Unicode6.1.0/

A complete specification of the contributory files for Unicode 6.1 is found on the page Components for 6.1.0. That page also provides the recommended reference format for Unicode Standard Annexes.

The navigation bar on the left of this page provides links to the core specification as a single file, to individual chapters, and to the appendices. Also provided are links to the code charts, the radical-stroke indices to CJK ideographs, the Unicode Standard Annexes, and the data files for Version 6.1 of the Unicode Character Database.

Code Charts
Several sets of code charts are available. They serve different purposes:

The latest set of code charts for the Unicode Standard is available online. Those charts are always the most current code charts available, and may be updated at any time. The charts are organized by scripts and blocks for easy reference. An online index by character name is also provided.

For Unicode 6.1.0 in particular two additional sets of code chart pages are provided:

A set of delta code charts showing only the new blocks for Unicode 6.1.0 and any existing blocks for which new characters were added in Unicode 6.1.0. All new characters are visually highlighted in those charts.

A set of archival code charts that represent the entire set of characters, names and representative glyphs at the time of publication of Unicode 6.1.0.

The delta and archival code charts are a stable part of this release of the Unicode Standard. They will never be updated.

Errata
Errata incorporated into Unicode 6.1 are listed by date in a separate table. For corrigenda and errata after the release of Unicode 6.1, see the list of current Updates and Errata.

C. Stability Policy Update
The stability policy which limits the range of possible Canonical_Combining_Class property values was narrowed to 0..254, from its former range of 0..255. This has the effect of permanently reserving the value 255, which can then be used by implementations for possible optimizations of table building.
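
One way an implementation might take advantage of the reserved value is to use 255 as an "unset" marker in a byte-sized lookup table. The Python sketch below is only an illustration under that assumption; the class name `CccCache` is hypothetical, and the values come from the standard library's `unicodedata.combining`, which reflects whatever Unicode version the running interpreter ships with.

```python
import unicodedata

UNSET = 255  # reserved by the stability policy, so never a real ccc value


class CccCache:
    """Lazily filled table of Canonical_Combining_Class values."""

    def __init__(self) -> None:
        # One byte per code point; every slot starts out marked as unset.
        self._table = bytearray([UNSET]) * 0x110000

    def lookup(self, cp: int) -> int:
        """Return the ccc of code point cp, computing it on first use."""
        value = self._table[cp]
        if value == UNSET:
            value = unicodedata.combining(chr(cp))
            self._table[cp] = value
        return value


if __name__ == "__main__":
    cache = CccCache()
    for cp in (0x0041, 0x0301, 0x0345):
        print(f"U+{cp:04X} ccc={cache.lookup(cp)}")
```

Because no character can ever be assigned ccc=255, the marker cannot collide with real data, and the table needs no separate "filled" bitmap.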

Note: The Unicode Character Encoding Stability Policy restricts possible future changes to the Unicode Standard, but is not formally a part of the standard itself.

D. Textual Changes and Character Additions
732 new character assignments were made to the Unicode Standard, Version 6.1. These additions bring the total number of characters assigned in the standard to 110,116. (That is the traditional count, which totals up graphic and format characters, but omits surrogate code points, ISO control codes, noncharacters, and private-use allocations.)

Character Assignment Overview
128 characters have been added to the BMP, while 604 characters have been added in the supplementary planes. Most character additions are in new blocks, but there are also character additions to a number of existing blocks.

New Blocks
The newly-defined blocks in Version 6.1 are:

08A0..08FF Arabic Extended-A

1CC0..1CCF Sundanese Supplement

AAE0..AAFF Meetei Mayek Extensions

10980..1099F Meroitic Hieroglyphs

109A0..109FF Meroitic Cursive

110D0..110FF Sora Sompeng

11100..1114F Chakma

11180..111DF Sharada

11680..116CF Takri

16F00..16F9F Miao

1EE00..1EEFF Arabic Mathematical Alphabetic Symbols
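
For implementations that keep their own block tables, the ranges above can be encoded directly. The following Python sketch covers only the blocks new in Version 6.1, not the full Blocks.txt data, and the function name `new_block_of` is a hypothetical helper chosen for illustration.

```python
from bisect import bisect_right
from typing import Optional

# Blocks newly defined in Version 6.1, sorted by starting code point.
NEW_BLOCKS = [
    (0x08A0, 0x08FF, "Arabic Extended-A"),
    (0x1CC0, 0x1CCF, "Sundanese Supplement"),
    (0xAAE0, 0xAAFF, "Meetei Mayek Extensions"),
    (0x10980, 0x1099F, "Meroitic Hieroglyphs"),
    (0x109A0, 0x109FF, "Meroitic Cursive"),
    (0x110D0, 0x110FF, "Sora Sompeng"),
    (0x11100, 0x1114F, "Chakma"),
    (0x11180, 0x111DF, "Sharada"),
    (0x11680, 0x116CF, "Takri"),
    (0x16F00, 0x16F9F, "Miao"),
    (0x1EE00, 0x1EEFF, "Arabic Mathematical Alphabetic Symbols"),
]
_STARTS = [start for start, _, _ in NEW_BLOCKS]


def new_block_of(cp: int) -> Optional[str]:
    """Return the name of the new 6.1 block containing cp, or None."""
    i = bisect_right(_STARTS, cp) - 1  # last block starting at or before cp
    if i >= 0:
        _, end, name = NEW_BLOCKS[i]
        if cp <= end:
            return name
    return None


if __name__ == "__main__":
    for cp in (0x08A0, 0x1112E, 0x0041):
        print(f"U+{cp:04X}: {new_block_of(cp)}")
```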

Text Changes and Additions
Numbers indicate the chapter or section in the Unicode 6.1 core specification where there are some significant changes or additions. This list is not exhaustive. Select changes to conformance requirements in Chapter 3, Conformance, that impact implementations are listed separately under E. Conformance Changes.

3.5: Updated the discussion of property values to clarify that some properties associate multiple values with each code point.

3.12: The discussion of Hangul syllable boundary determination was removed from this section. It now appears, instead, as a new section in UAX #29, "Unicode Text Segmentation".

3.12: The Java sample code exemplifying the Hangul-related algorithms was moved from UAX #15, "Unicode Normalization Forms" into this section, where it immediately follows the specifications of those algorithms.

3.12: The statements of the Hangul Decomposition, Hangul Composition, and Hangul Name Generation algorithms were cleaned up to give them a consistent presentation and better examples.

4.5: Additional text was added on the General_Category.

5.21: Rewrote text on Ignoring Characters in Processing.

6.2: Added text to the description of spaces.

8.2: Added new text on the Arabic Extended-A block, documented decomposition decisions involving Arabic letters with hamza-shaped diacritics, and updated the description of Arabic diacritic marks.

8.3: Clarified the text on Syriac shaping behavior.

9.1: Made various updates to Devanagari to document Vedic extensions, Vedic use of U+20F0 COMBINING ASTERISK ABOVE, and Devanagari short vowels.

9.2: Added information on the history of Assamese and Bengali.

10.9: Added new text for Sharada.

10.10: Added new text for Takri.

10.11: Added new text for Chakma.

10.12: Added new text for Meetei Mayek Extensions.

10.14: Added new text for Sora Sompeng.

12.2: Updated text to modify the syntax for ideograph variation sequences, removing length constraints and allowing the use of private use characters.

13.4: Clarified the function of ZWJ in the context of display of bi-consonants in Tifinagh.

13.13: Added new text for Miao.

14.19: Added new text for Meroitic Hieroglyphs and Meroitic Cursive.

15.2: Added new text on Arabic Mathematical Alphabetic Symbols.

15.3: Rewrote discussion of number forms to expand it to cover all aspects of the encoding of numerals in the standard.

15.4: Created a new section devoted to superscript and subscript symbols.

15.8: Made various updates to the text in Miscellaneous Symbols regarding playing cards and Phaistos Disc symbols.

16.4: Updated the text on Variation Selectors.

16.5: Extended the discussion of properties for private-use characters.

17.1: Updated information on the use and display of formal name aliases in the code charts.

Appendix E was updated with more details of the early history of Chinese character standardization.
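
For readers unfamiliar with the Hangul algorithms mentioned in the Section 3.12 entries above, the following Python sketch shows the arithmetic behind Hangul syllable decomposition and composition. It is not the Java sample code from the core specification; the function names are hypothetical, and the constants are the usual base values and counts for the conjoining jamo and precomposed syllable ranges.

```python
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT  # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT  # 11172 precomposed syllables


def decompose_hangul(syllable: str) -> str:
    """Fully decompose one precomposed Hangul syllable into jamo."""
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return syllable  # not a precomposed Hangul syllable
    lead = chr(L_BASE + s_index // N_COUNT)
    vowel = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    trail = chr(T_BASE + t_index) if t_index else ""
    return lead + vowel + trail


def compose_hangul(jamo: str) -> str:
    """Compose an L V or L V T jamo sequence into one syllable."""
    if len(jamo) not in (2, 3):
        raise ValueError("expected two or three jamo")
    l_index = ord(jamo[0]) - L_BASE
    v_index = ord(jamo[1]) - V_BASE
    t_index = ord(jamo[2]) - T_BASE if len(jamo) == 3 else 0
    if not (0 <= l_index < L_COUNT and 0 <= v_index < V_COUNT
            and 0 <= t_index < T_COUNT):
        raise ValueError("not a composable jamo sequence")
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


if __name__ == "__main__":
    parts = decompose_hangul("\uD4DB")
    print(" ".join(f"{ord(ch):04X}" for ch in parts))
    print(f"{ord(compose_hangul(parts)):04X}")
```

The sketch handles a single syllable at a time; normalizing arbitrary text also requires the general canonical decomposition and composition steps.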

E. Conformance Changes
There are several changes to conformance requirements in Unicode 6.1 that impact implementations. The most important of these are:

Several bullets were updated under definitions D58, D59, and D60, to further clarify the relationship between "grapheme base", the Grapheme_Base property, and the specification of grapheme cluster boundaries in UAX #29, "Unicode Text Segmentation".

Clarifications were added under D107 and D110 in Section 3.11 to make it clear that private agreements cannot override the Canonical_Combining_Class or Decomposition_Mapping of private use characters.

The text regarding tailored casing at the beginning of Section 3.13 was corrected, to properly indicate which kinds of tailorings are covered in SpecialCasing.txt and which by CLDR.

In Section 4.8 the description of formal name aliases has been updated to account for new types of aliases which are now formally defined in NameAliases.txt in the UCD.

F. Unicode Character Database Changes
The detailed listing of all changes to the contributory data files of the Unicode Character Database for Version 6.1 can be found in UAX #44, Unicode Character Database. The changes listed there include a number of important property revisions to existing characters that will affect implementations:

Five characters (U+00A7, U+00B6, U+0F14, U+1360, and U+10102) had their General_Category changed from So to Po, to assist in cleaner tailoring of the relative order of symbols and punctuation for the Unicode Collation Algorithm.

Eight relatively recently added numeric symbols (U+3248..U+324F) had their General_Category changed from So to No, to make them more consistent with similar symbols consisting of numbers surrounded by a circle or a square. Neither this change nor the change from gc=So to gc=Po affects the derivation of identifier-related properties, but either may impact assumptions about these characters in some implementations.

The default Bidi_Class for two ranges, U+08A0..U+08FF and U+1EE00..U+1EEFF, has been changed from bc=R to bc=AL, because the new blocks for those ranges now contain Arabic characters. Check that default Bidi_Class settings for those ranges are updated accordingly in property tables and in implementations of the Unicode Bidirectional Algorithm.

Two new Line_Break property values have been added. The first is for Hebrew letters: lb=HL. It is used in the definition of a new rule, LB21a, in UAX #14, for handling line breaking for Hebrew characters next to hyphens. The second, lb=CJ, allows for better customization of Japanese line breaking. Implementations of Unicode line breaking may need to be updated to correctly handle these additional line break property values.

The kTraditionalVariant and kSimplifiedVariant tags and their usage in the Unihan Database have been more fully specified. Implementations which use that data to do simplified/traditional mapping of CJK characters may need to be updated.

The meanings of the kMandarin and kTotalStrokes tags in the Unihan Database have been more fully specified, focusing on their use in collation and, for kMandarin, transliteration. The values of both properties have changed very significantly and now cover many more characters. Implementations which use that data may need to be updated.

Every value for enumerated and catalog properties now has both a short and a long alias. There are no more "n/a" placeholders indicating the absence of a short property value alias. In addition, the long aliases are all suitable for programmatic identifiers. This change affected the Age, Block, Canonical_Combining_Class, Indic_Matra_Category, Indic_Syllabic_Category, and Joining_Group properties.

The Name_Alias property has added over 300 new values, most of which are common aliases for control characters. For the first time, a character may have more than one character name alias. The existence of multiple character name aliases for a single character may affect implementations.

Script_Extensions has been added as a new, provisional property, providing finer-grained information for determining the script of runs of text. Implementations may need to be upgraded to take advantage of this information.

214 standardized variation sequences have been added for emoji characters, allowing implementations to distinguish between two preferred display styles: text style versus emoji style.

The Grapheme_Cluster_Break property values have been modified to produce better segmentation results for Thai, Lao, and similar scripts.
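
The General_Category revisions described at the top of this list are easy to spot-check against whatever character data an implementation uses. The Python sketch below does so with the standard library's `unicodedata` module, assuming the interpreter's bundled data is at Unicode 6.1 or later; the names `EXPECTED_CATEGORY` and `category_mismatches` are hypothetical.

```python
import unicodedata

# Expected General_Category values after the Version 6.1 changes.
EXPECTED_CATEGORY = {cp: "Po" for cp in (0x00A7, 0x00B6, 0x0F14, 0x1360, 0x10102)}
EXPECTED_CATEGORY.update({cp: "No" for cp in range(0x3248, 0x3250)})


def category_mismatches() -> dict:
    """Map code point -> (expected, actual) wherever the two differ."""
    mismatches = {}
    for cp, expected in EXPECTED_CATEGORY.items():
        actual = unicodedata.category(chr(cp))
        if actual != expected:
            mismatches[cp] = (expected, actual)
    return mismatches


if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    for cp, (expected, actual) in category_mismatches().items():
        print(f"U+{cp:04X}: expected {expected}, found {actual}")
```

An empty result suggests the underlying tables already reflect these changes; any reported mismatch points to data older than Version 6.1.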

Other significant changes resulting from the addition of new characters include:

An additional unified ideograph has been added to the main BMP block of CJK unified ideographs: U+9FCC. This extends the range of those CJK unified ideographs by one value. Check implementations for any hard-coded assumptions about the ranges of CJK unified ideographs.

Two new Chakma characters, U+1112E and U+1112F, have canonical decompositions. This is unusual for characters off the BMP, and may break certain assumptions used in optimization of implementations of Unicode Normalization. Check that any hard-coded assumptions about normalization take these characters into account, and that the characters correctly recompose for NFC.

The addition of the two Chakma characters with canonical decompositions may also impact implementations of the Unicode Collation Algorithm. These two characters introduce new weight contractions, and for the first time the second element of those contractions is a supplementary character. These are also the first instances where the representation of the contraction in UTF-16 is longer than three code units. These changes may impact optimization assumptions in UCA implementations.
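
The Chakma cases above can be checked with a few lines of Python. This is only a sketch that relies on the standard library's `unicodedata.normalize`, assuming the interpreter's bundled data is at Unicode 6.1 or later; the helper names `utf16_length` and `check_chakma` are hypothetical.

```python
import unicodedata

# The two Chakma vowel signs with canonical decompositions.
CHAKMA_COMPOSITES = ("\U0001112E", "\U0001112F")


def utf16_length(s: str) -> int:
    """Number of UTF-16 code units needed to represent s."""
    return len(s.encode("utf-16-le")) // 2


def check_chakma(ch: str) -> tuple:
    """Return (NFD form, whether NFC restores ch, UTF-16 length of NFD)."""
    nfd = unicodedata.normalize("NFD", ch)
    round_trips = unicodedata.normalize("NFC", nfd) == ch
    return nfd, round_trips, utf16_length(nfd)


if __name__ == "__main__":
    for ch in CHAKMA_COMPOSITES:
        nfd, ok, units = check_chakma(ch)
        hex_form = " ".join(f"{ord(c):04X}" for c in nfd)
        print(f"U+{ord(ch):04X} -> {hex_form}, recomposes={ok}, UTF-16 units={units}")
```

Each decomposition consists of two supplementary characters, so it occupies four UTF-16 code units, which is the situation the collation note above warns about.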

Other significant changes to the text of the core specification or annexes which may impact implementation include:

The Syriac shaping rules specified in Section 8.3, Syriac, of the core specification have been clarified, so that it is clear that the term "dalath or rish" refers to characters with Joining_Group=Dalath_Rish. Also "word breaking character" in the alaph joining rules has been corrected to "non-joining character". Implementers with Syriac shaping engines should check to ensure that their implementations are consistent with those clarifications.

The list of scripts recommended for inclusion in or exclusion from identifiers has been updated in UAX #31. That list is not available in machine-readable form in the UCD, so implementations which tailor their identifier usage according to the UAX #31 recommendations will need to refer specifically to that annex for updates.

G. Unicode Standard Annex Changes
In Version 6.1, many of the Unicode Standard Annexes have had significant revisions. The most important of these changes are listed below. For the full details of all changes, see the Modifications section of each UAX, linked directly from the following list of UAXes.

UAX #9, Unicode Bidirectional Algorithm: No significant changes in this version.

UAX #11, East Asian Width: No significant changes in this version.

UAX #14, Unicode Line Breaking Algorithm: Rule LB21a was added to prevent a break between a Hebrew letter and a following hyphen, and the character class HL (Hebrew Letter) was added for that rule. Small kana were moved from class NS to class ID, to align Japanese "kinsoku" more closely with CSS "normal" behavior.

UAX #15, Unicode Normalization Forms: An implementation note on the use of ccc=255 was added. The example code and description of Hangul decomposition and composition were moved into Section 3.12, Conjoining Jamo Behavior, in the core specification. Section 14.1, Optimization Strategies, was rewritten for clarity.

UAX #24, Unicode Script Property: The former Section 4.1 on Script Anomalies for East Asian Symbols was moved to become Section 3.6, and the examples were extended to cover additional unexpected script values for symbols. A description was added for the new property Script_Extensions.

UAX #29, Unicode Text Segmentation: The discussion of Hangul Syllable segmentation was moved from the core specification to this annex and its wording updated slightly. The handling of the Prepend and SpacingMark classes was adjusted so that for the Thai and Lao scripts extended grapheme clusters behave like legacy grapheme clusters, as preferred. Characters with gc=Cs and gc=Cn were added to Control in Table 2, so that they do not join with following Extend characters for defining grapheme cluster boundaries.

UAX #31, Unicode Identifier and Pattern Syntax: New scripts were added to the tables categorizing script usage. Material was added to draw the distinction between the format of identifiers for internal use and the format of identifiers for display. Better guidance was provided on the use of variation sequences.

UAX #34, Unicode Named Character Sequences: No significant changes in this version.

UAX #38, Unicode Han Database (Unihan): The kTotalStrokes and kMandarin fields were redefined. The use of the kTraditionalVariant and kSimplifiedVariant fields was clarified. A new section 4.4 was added, detailing the ranges of CJK ideographs covered by the Unihan database, with their associated Unicode age values. Each Unihan property that can have multiple values had a specification added to indicate whether the order of values matters, and if so, what the significance of that order is. The regex validity expressions were slightly simplified.

UAX #41, Common References for Unicode Standard Annexes: The references were updated as needed.

UAX #42, Unicode Character Database in XML: New values were added for the age, script, and jg attributes. The values for the ccc attribute were restricted to the 0..254 range, instead of 0..255. The patterns for kIRG_USource and kMandarin were updated to reflect changes in the Unihan database. A new element was added for the Name_Alias property, and new attributes were added for the Block and Script_Extensions properties. A clarification was added to distinguish attributes with empty string values from missing attributes. In particular, the absence of a numeric value is now represented by NaN. The value of the fc_nfkc attribute must now be either # or one-or-more-code-points.

UAX #44, Unicode Character Database: Text was added regarding the reserved value 255 for Canonical_Combining_Class. Grouped values for General_Category were added to the table of values for that property. The status and description of Grapheme_Base and Grapheme_Extend were updated. The tables of regular expressions for validation of property values were updated. An entry was added to the Property Table for the new Script_Extensions provisional property. The description of the Name_Alias property was updated. A new section describing multivalued properties was added. There are various other small editorial fixes to the text.
