Unicode 8.0.0
Released: 2015 June 17 (Announcement)
Version 8.0.0 has been superseded by the latest version of the Unicode Standard.
This page summarizes the important changes for the Unicode Standard, Version 8.0.0. This version supersedes all previous versions of the Unicode Standard.
A. Summary
B. Technical Overview
C. Stability Policy Update
D. Textual Changes and Character Additions
E. Conformance Changes
F. Changes in the Unicode Character Database
G. Changes in the Unicode Standard Annexes
H. Changes in Synchronized Unicode Technical Standards
M. Implications for Migration
A. Summary
Unicode 8.0 adds a total of 7,716 characters, encompassing six new scripts and many new symbols, as well as character additions to several existing scripts. Notable character additions include the following:
- A set of lowercase Cherokee syllables, forming case pairs with the existing Cherokee characters
- A large collection of CJK unified ideographs
- Emoji symbols and symbol modifiers for implementing skin tone diversity; see Unicode Emoji.
- Georgian lari currency symbol
- Letters to support the Ik language in Uganda, Kulango in the Côte d’Ivoire, and other languages of Africa
- The Ahom script for support of the Tai Ahom language in India
- Arabic letters to support Arwi—the Tamil language written in the Arabic script
Other important updates in Unicode Version 8.0 include:
- Change in encoding model of New Tai Lue to visual order
Synchronization
Two other important Unicode specifications are maintained in synchrony with the Unicode Standard, and include updates for the repertoire additions made in Version 8.0, as well as other modifications:
- UTS #10, Unicode Collation Algorithm
- UTS #46, Unicode IDNA Compatibility Processing
This version of the Unicode Standard is synchronized with ISO/IEC 10646:2014, plus Amendment 1. Additionally, it includes the accelerated publication of U+20BE LARI SIGN, nine CJK unified ideographs (U+9FCD..U+9FD5), and 41 emoji characters.
See Sections D through H below for additional details regarding the changes in this version of the Unicode Standard, its associated annexes, and the other synchronized Unicode specifications.
B. Technical Overview
Version 8.0 of the Unicode Standard consists of the core specification (download), the delta and archival code charts for this version, the Unicode Standard Annexes, and the Unicode Character Database (UCD).
The core specification gives the general principles, requirements for conformance, and guidelines for implementers. The code charts show representative glyphs for all the Unicode characters. The Unicode Standard Annexes supply detailed normative information about particular aspects of the standard. The Unicode Character Database supplies normative and informative data for implementers to allow them to implement the Unicode Standard.
A complete specification of the contributory files for Unicode 8.0 is found on the page Components for 8.0.0. That page also provides the recommended reference format for Unicode Standard Annexes. For examples of how to cite particular portions of the Unicode Standard, see also the Reference Examples.
The navigation bar on the left of this page provides links to the core specification as a single file, to individual chapters, and to the appendices. Also provided are links to the code charts, the radical-stroke indices to CJK ideographs, the Unicode Standard Annexes, and the data files for Version 8.0 of the Unicode Character Database.
Version Specification
Version 8.0.0 of the Unicode Standard should be referenced as:
The Unicode Consortium. The Unicode Standard, Version 8.0.0, (Mountain View, CA: The Unicode Consortium, 2015. ISBN 978-1-936213-10-8)
http://www.unicode.org/versions/Unicode8.0.0/
The terms “Version 8.0” or “Unicode 8.0” are abbreviations for the full version reference, Version 8.0.0.
The citation and permalink for the latest published version of the Unicode Standard is:
The Unicode Consortium. The Unicode Standard.
http://www.unicode.org/versions/latest/
Code Charts
Several sets of code charts are available. They serve different purposes:
- The latest set of code charts for the Unicode Standard is available online. Those charts are always the most current code charts available, and may be updated at any time. The charts are organized by scripts and blocks for easy reference. An online index by character name is also provided.
For Unicode 8.0.0 in particular two additional sets of code chart pages are provided:
- A set of delta code charts showing the blocks in which characters were added for Unicode 8.0.0. The new characters are visually highlighted in the charts.
- A set of archival code charts that represents the entire set of characters, names and representative glyphs at the time of publication of Unicode 8.0.0.
The delta and archival code charts are a stable part of this release of the Unicode Standard. They will never be updated.
Errata
Errata incorporated into Unicode 8.0 are listed by date in a separate table. For corrigenda and errata after the release of Unicode 8.0, see the list of current Updates and Errata.
C. Stability Policy Update
- The case folding stability policy has been clarified to account for the fact that the case folding target may be either lowercase or uppercase.
- A new constraint has been added to the property value stability policy to ensure that whenever a character has an explicit Script value (that is, a value other than Common, Inherited, or Unknown), then that same explicit Script value must also be contained among the set of values constituting the Script_Extensions property value for that character.
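The second constraint amounts to a set-membership check. The sketch below is illustrative only: the function name and the sample property values are hypothetical stand-ins for data that an implementation would load from the UCD, and are not part of the standard.

```python
# Script values that the policy treats as non-explicit.
IMPLICIT_SCRIPTS = {"Common", "Inherited", "Unknown"}


def satisfies_script_extensions_constraint(script, script_extensions):
    """Return True if an explicit Script value appears in Script_Extensions.

    `script` is the Script value of a character; `script_extensions` is the
    set of values making up its Script_Extensions property value.
    """
    if script in IMPLICIT_SCRIPTS:
        # The constraint only applies to explicit Script values.
        return True
    return script in script_extensions


# Hypothetical sample values, for illustration only.
print(satisfies_script_extensions_constraint("Arabic", {"Arabic", "Thaana"}))
print(satisfies_script_extensions_constraint("Arabic", {"Thaana"}))
print(satisfies_script_extensions_constraint("Common", {"Arabic", "Thaana"}))
```

Under these assumed inputs, the first and third calls satisfy the constraint and the second violates it.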
D. Textual Changes and Character Additions
Six new scripts were added with accompanying new block descriptions:
- Ahom
- Anatolian Hieroglyphs
- Hatran
- Multani
- Old Hungarian
- Sutton SignWriting
Letters used in Arabic and in a number of modern and historic writing systems of South Asia were added. Version 8.0 also has a new notational system, Sutton SignWriting, used for transcription of various sign languages.
A number of popular emoji and other pictographic symbols are now included, as well as a mechanism for supporting diversity in emoji representing faces or people. More user interface symbols were also added to the standard.
Changes in the Unicode Standard Annexes are listed in Section G.
Character Assignment Overview
7,716 characters have been added, including 5,771 CJK unified ideographs. Most character additions are in new blocks, but there are also character additions to a number of existing blocks. For details, see Delta Code Charts.
New Blocks
The newly-defined blocks in Version 8.0 are:
| Range | Block Name |
| --- | --- |
| AB70..ABBF | Cherokee Supplement |
| 108E0..108FF | Hatran |
| 10C80..10CFF | Old Hungarian |
| 11280..112AF | Multani |
| 11700..1173F | Ahom |
| 12480..1254F | Early Dynastic Cuneiform |
| 14400..1467F | Anatolian Hieroglyphs |
| 1D800..1DAAF | Sutton SignWriting |
| 1F900..1F9FF | Supplemental Symbols and Pictographs |
| 2B820..2CEAF | CJK Unified Ideographs Extension E |
E. Conformance Changes
There were no significant changes to the conformance clauses of the core specification for Unicode 8.0. However, there were minor changes to the rules in the algorithms specified in UAX #9, UAX #14, and UAX #29. Those rule changes will impact conformant implementations of the respective algorithms. See Section G. Changes in the Unicode Standard Annexes.
F. Changes in the Unicode Character Database
The detailed listing of all changes to the contributory data files of the Unicode Character Database for Version 8.0 can be found in UAX #44, Unicode Character Database. The changes listed there include character additions and property revisions to existing characters that will affect implementations. Some of the important impacts on implementations migrating from earlier versions of the standard are highlighted in Section M.
- The Script property of the Arabic-Indic digits (U+0660..U+0669) changed from "Common" to "Arabic".
- In the Unihan data files, over 2,800 values of the normative kIRG_JSource field were updated to more contemporary Japanese source references.
- The General_Category property of existing Cherokee syllables changed from Other_Letter to Uppercase_Letter, in accordance with the conversion of Cherokee to a bicameral script.
- The General_Category property of New Tai Lue vowel signs and tone marks changed from Spacing_Mark to Other_Letter, and the Logical_Order_Exception binary property was set for pre-base vowel signs, reflecting the change in encoding model.
- The properties Indic_Syllabic_Category and Indic_Positional_Category (renamed from Indic_Matra_Category) were promoted from provisional to informative status.
- The Indic_Syllabic_Category property was expanded with four new property values and multiple assignments.
- Numerous existing characters of Brahmi-derived scripts were assigned Indic_Positional_Category property values.
- The tag characters U+E0020..U+E007E were un-deprecated.
- The Line_Break property value of U+22EF MIDLINE HORIZONTAL ELLIPSIS was changed to prevent line breaks between ideographic characters and U+22EF.
- The Joining_Type property values of two Mandaic letters were corrected.
G. Changes in the Unicode Standard Annexes
In Version 8.0, some of the Unicode Standard Annexes have had significant revisions. The most important of these changes are listed below. For the full details of all changes, see the Modifications section of each UAX, linked directly from the following list of UAXes.
| Unicode Standard Annex | Changes |
| --- | --- |
| UAX #9, Unicode Bidirectional Algorithm | Introduced minor changes to handle identified edge cases in the algorithm. |
| UAX #11, East Asian Width | No significant changes in this version. |
| UAX #14, Unicode Line Breaking Algorithm | Added rule LB21b, to prevent a break between a solidus and Hebrew letters. Added a case to rule LB22, to prevent a break between exclamation marks and ellipses. |
| UAX #15, Unicode Normalization Forms | Updated Section 1.1 and associated figures regarding canonical and compatibility equivalence. |
| UAX #24, Unicode Script Property | Added clarifications to the principle used in determining the assignment of Script property values for characters predominantly used in one script, but also employed in other scripts. |
| UAX #29, Unicode Text Segmentation | Adjusted the Default Grapheme Cluster Boundary specification, to account for changes to the New Tai Lue model. Modified rule SB7 to prevent sentence breaks within a word segment such as "Mr.Hamster". |
| UAX #31, Unicode Identifier and Pattern Syntax | Added clarifications regarding case distinctions in identifiers in programming languages. Added text about middle dot in identifiers. Added new scripts to the list of characters for exclusions from identifiers. |
| UAX #34, Unicode Named Character Sequences | No significant changes in this version. |
| UAX #38, Unicode Han Database (Unihan) | Added documentation of the new kJa field and updated the description of the kIRG_JSource field. Updated the description of the kDefaultSortKey field to clarify its intended use. Updated the regular expression for the kIRG_GSource field. |
| UAX #41, Common References for Unicode Standard Annexes | No significant changes in this version. |
| UAX #42, Unicode Character Database in XML | Added new code point attributes, values, and patterns. |
| UAX #44, Unicode Character Database | Updated the documentation of the 7.0 provisional Indic data files to reflect their new status as informative in Unicode 8.0 and the renaming of the provisional Indic_Matra_Category to the informative Indic_Positional_Category. |
| UAX #45, U-Source Ideographs | UAX #45 did not change for this release; however, three characters were added to the entries in USourceData.txt and USourceGlyphs.pdf. |
H. Changes in Synchronized Unicode Technical Standards
There are also significant revisions in the Unicode Technical Standards whose versions are synchronized with the Unicode Standard. The most important of these changes are listed below. For the full details of all changes, see the Modifications section of each UTS, linked directly from the following list of UTSes.
| Unicode Technical Standard | Changes |
| --- | --- |
| UTS #10, Unicode Collation Algorithm | Contractions for most Cyrillic accented letters have been removed from the DUCET. |
| UTS #46, Unicode IDNA Compatibility Processing | Values in the IDNA Comparisons table were updated for Unicode 8.0. A note was added to clarify the implications of Cherokee case folding for mapping in IDNA. |
UTS #39, Security Mechanisms, has also been updated for Version 8.0.
M. Implications for Migration
A significant number of changes in Unicode 8.0 may affect implementations that are upgrading to Version 8.0 from earlier versions of the standard. The most important of these are listed and explained here, to help focus on the issues most likely to cause unexpected trouble during upgrades.
Casing and Case Folding of Cherokee
The character encoding model for the Cherokee script changed from unicameral to bicameral. The conversion was done by reclassifying all existing syllables as uppercase and adding a corresponding set of lowercase syllables. In terms of properties, the General_Category of the existing characters changed from Other_Letter to Uppercase_Letter, and the new characters were given the value Lowercase_Letter. A new case pair for the archaic syllable mv was also added.
The casing was chosen in order to reduce the migration cost for implementations, allowing them to preserve the font metrics for the existing characters and reduce the implications on layout. However, the formation of case pairs by adding lowercase characters is unusual. As a result, case folding of Cherokee maps to uppercase instead of lowercase. This mapping also has consequences on identifiers, as described in the changes to UAX #31, Unicode Identifier and Pattern Syntax.
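The effect of the new case pairs can be observed in any runtime whose character data is at Unicode 8.0 or later. The following sketch assumes a Python interpreter that meets this requirement (the choice of Python and of the letter A as the sample pair are illustrative, not part of the standard).

```python
import unicodedata

upper_a = "\u13A0"  # CHEROKEE LETTER A (existing character, now uppercase)
lower_a = "\uAB70"  # CHEROKEE SMALL LETTER A (added in Unicode 8.0)

# General_Category after the conversion to a bicameral script.
print(unicodedata.category(upper_a))
print(unicodedata.category(lower_a))

# Ordinary case mappings pair the two characters.
maps_to_lower = upper_a.lower() == lower_a
maps_to_upper = lower_a.upper() == upper_a
print(maps_to_lower, maps_to_upper)

# Case folding targets the uppercase form, unlike most bicameral scripts.
folded_from_lower = lower_a.casefold()
folded_from_upper = upper_a.casefold()
print(folded_from_lower == upper_a, folded_from_upper == upper_a)
```

Code that assumes a case-folded string is always lowercase, for example by comparing `s.casefold()` against `s.lower()`, is the kind of logic worth rechecking for Cherokee text.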
Change in Encoding Model for New Tai Lue to Visual Order
The character encoding model for New Tai Lue changed from logical order, in which pre-base vowels are stored after an initial consonant, to visual order, in which the pre-base vowels are stored before the initial consonant, as for Thai, Lao, and Tai Viet. The model was changed to better serve the primary user community in the Xishuangbanna region of China, who have been accumulating data that is input and stored in visual order, and have been using fonts with a visual order encoding to render it.
The encoding model change incurred a uniform General_Category reclassification of all New Tai Lue vowel signs and tone marks from Spacing_Mark to Other_Letter, the assignment of the property value Logical_Order_Exception=Yes to the pre-base vowels U+19B5..U+19B7 and U+19BA, and the addition of 176 pre-base vowel + initial consonant contractions to the Default Unicode Collation Element Table.
A visual order model complicates syllable identification and the processes for searching and sorting. Implementations switching to the visual order model can take advantage of techniques developed for processing Thai script data to address the issues associated with visual order encoding, and data stored in logical order should be carefully migrated.
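As a rough illustration of such a migration, the sketch below swaps each initial consonant and the pre-base vowel that immediately follows it. It is a naive approach under stated assumptions, not a complete converter: the consonant range U+1980..U+19AB is an assumption not given in this document, the function name is hypothetical, and the input is assumed to be entirely in logical order.

```python
import re

# Pre-base vowel signs with Logical_Order_Exception=Yes in Unicode 8.0.
PRE_BASE_VOWELS = "\u19B5-\u19B7\u19BA"
# Assumption: U+1980..U+19AB covers the New Tai Lue consonant letters.
CONSONANTS = "\u1980-\u19AB"

# A consonant immediately followed by a pre-base vowel (logical order).
_LOGICAL_PAIR = re.compile(f"([{CONSONANTS}])([{PRE_BASE_VOWELS}])")


def logical_to_visual(text):
    """Move each pre-base vowel in front of the consonant it follows."""
    return _LOGICAL_PAIR.sub(r"\2\1", text)


sample_logical = "\u1980\u19B5"  # consonant, then pre-base vowel
sample_visual = logical_to_visual(sample_logical)
print([f"U+{ord(ch):04X}" for ch in sample_visual])
```

Because the function cannot tell logical-order input from visual-order input, running it on data that is already in visual order could reorder characters incorrectly; it should only be applied to data known to predate the model change.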
Other Script-related Changes
Version 8.0 adds six new scripts, so implementations which process script data should be carefully checked.
Additionally, there was a significant Script property value change affecting the common Arabic-Indic digits (U+0660..U+0669). These were changed from having the value "Common" to the value "Arabic". Their use with scripts other than Arabic is now more consistently dealt with by the Script_Extensions property, instead. Implementations which may have had special treatment for the Script property value of the Arabic-Indic digits should be checked to ensure that the change in Script property value does not cause unexpected behavior.
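One way to check for this kind of change is to compare Script values parsed from two versions of the Scripts.txt data file. The sketch below assumes the usual UCD line layout (code point or range, semicolon, value, optional comment); the two one-line excerpts are hand-written for illustration rather than copied from the data files, and the function names are hypothetical.

```python
def parse_scripts(lines):
    """Parse Scripts.txt-style lines into (first, last, script) tuples."""
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop trailing comments
        if not data:
            continue
        code_points, script = (field.strip() for field in data.split(";"))
        first, _, last = code_points.partition("..")
        ranges.append((int(first, 16), int(last or first, 16), script))
    return ranges


def script_of(code_point, ranges):
    """Return the Script value for a code point, or "Unknown" if unlisted."""
    for first, last, script in ranges:
        if first <= code_point <= last:
            return script
    return "Unknown"


# Hand-written illustrative excerpts reflecting the change described above.
OLD_LINES = ["0660..0669    ; Common # Nd  [10] ARABIC-INDIC DIGIT ZERO..NINE"]
NEW_LINES = ["0660..0669    ; Arabic # Nd  [10] ARABIC-INDIC DIGIT ZERO..NINE"]

old_ranges = parse_scripts(OLD_LINES)
new_ranges = parse_scripts(NEW_LINES)
changed = [
    cp for cp in range(0x0660, 0x066A)
    if script_of(cp, old_ranges) != script_of(cp, new_ranges)
]
print(len(changed), script_of(0x0660, old_ranges), script_of(0x0660, new_ranges))
```

A linear scan is adequate for a spot check like this; an implementation that looks up many code points would more likely sort the ranges and use binary search.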
Changes for Deprecation of Language Tags
The range of tag characters (U+E0020..U+E007E) was changed from Deprecated=True to Deprecated=False in Version 8.0. This change was done to clear the way for the potential future use of tag characters for a purpose other than to represent language tags.
Note that two characters, U+E0001 LANGUAGE TAG and U+E007F CANCEL TAG, remain deprecated. Furthermore, the use of tag characters to represent language tags in a plain text stream is still a deprecated mechanism for conveying language information about text.
Implementations which deliberately remove or refuse to interpret deprecated characters may need updates to prepare them for the potential use of U+E0020..U+E007E in Unicode 8.0 data in the future.
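For a filter of that kind, the update can be as small as narrowing the set of tag characters it treats as deprecated. The following is a minimal sketch with hypothetical function names; it covers only the tag characters discussed here, not other deprecated characters in the standard.

```python
# Tag characters that remain deprecated in Unicode 8.0.
STILL_DEPRECATED_TAGS = {0xE0001, 0xE007F}  # LANGUAGE TAG, CANCEL TAG


def is_deprecated_tag(code_point):
    """Return True only for tag characters that are still deprecated."""
    return code_point in STILL_DEPRECATED_TAGS


def strip_deprecated_tags(text):
    """Remove deprecated tag characters, keeping U+E0020..U+E007E intact."""
    return "".join(ch for ch in text if not is_deprecated_tag(ord(ch)))


sample = "a\U000E0001\U000E0065\U000E007Fb"
cleaned = strip_deprecated_tags(sample)
print([f"U+{ord(ch):04X}" for ch in cleaned])
```

Here U+E0065 survives the filter, whereas a pre-8.0 filter keyed on Deprecated=True would have removed it along with U+E0001 and U+E007F.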
Glyph Changes
The representative glyph for U+301C WAVE DASH was updated, so that it now shows a tilde shape instead of a reversed tilde shape. The updated glyph now aligns with majority practice in fonts for this character.
The representative glyph for U+3127 BOPOMOFO LETTER I was changed from a vertical orientation to a horizontal orientation. The updated glyph now aligns with majority practice for both horizontal and vertical layout of Bopomofo text, but implementations should be checked to verify correct behavior for rendering of this character.
Segmentation-related Changes
Version 8.0 made small adjustments to line break and other segmentation rules. In particular:
- In UAX #14, the line break opportunities were adjusted for Hebrew letters adjacent to a solidus "/" and to prevent breaks between exclamation marks and ellipses.
- In UAX #29, the default grapheme cluster segmentation was adjusted for the New Tai Lue script. This is part of the overall model change for New Tai Lue.
- In UAX #29, the default sentence segmentation rules were adjusted to prevent spurious sentence breaks at full stops occurring interior to words. This is colloquially known as the "Mr.Hamster" problem.
CJK Changes
- A collection of 5,762 CJK unified ideographs was encoded in a new block, CJK Unified Ideographs Extension E, and nine other ideographs were added to the CJK Unified Ideographs block.
- Over 2,800 values of the normative kIRG_JSource field were updated to reflect the more contemporary JIS X 0213:2004 (J3, J3A, and J4) source references, replacing outdated JIS X 0212-1990 (J1) and "Unified Japanese IT Vendors Contemporary Ideographs" (JA) source references.
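The ideograph counts above can be cross-checked against the code point ranges with simple arithmetic. In the sketch below, the range U+9FCD..U+9FD5 comes from the summary in Section A, while the last assigned Extension E code point, U+2CEA1, is an assumption not stated in this document (the block itself extends to U+2CEAF).

```python
# Assigned Extension E ideographs; the end point U+2CEA1 is an assumption.
EXT_E_FIRST, EXT_E_LAST = 0x2B820, 0x2CEA1
# Additions to the CJK Unified Ideographs block, per the summary.
URO_FIRST, URO_LAST = 0x9FCD, 0x9FD5

ext_e_count = EXT_E_LAST - EXT_E_FIRST + 1
uro_count = URO_LAST - URO_FIRST + 1
total_cjk = ext_e_count + uro_count

print(ext_e_count, uro_count, total_cjk)  # 5762 9 5771
```

The total matches the 5,771 CJK unified ideographs cited in the Character Assignment Overview.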