Unicode 8.0.0

Released: 2015 June 17 (Announcement)

Version 8.0.0 has been superseded by the latest version of the Unicode Standard.

This page summarizes the important changes for the Unicode Standard, Version 8.0.0. This version supersedes all previous versions of the Unicode Standard.

A. Summary
B. Technical Overview
C. Stability Policy Update
D. Textual Changes and Character Additions
E. Conformance Changes
F. Changes in the Unicode Character Database
G. Changes in the Unicode Standard Annexes
H. Changes in Synchronized Unicode Technical Standards
M. Implications for Migration

A. Summary

Unicode 8.0 adds a total of 7,716 characters, encompassing six new scripts and many new symbols, as well as character additions to several existing scripts. Notable character additions include the following:

Other important updates in Unicode Version 8.0 include:

Synchronization

Two other important Unicode specifications are maintained in synchrony with the Unicode Standard, and include updates for the repertoire additions made in Version 8.0, as well as other modifications:

This version of the Unicode Standard is synchronized with ISO/IEC 10646:2014, plus Amendment 1. Additionally, it includes the accelerated publication of U+20BE LARI SIGN, nine CJK unified ideographs (U+9FCD..U+9FD5), and 41 emoji characters.

See Sections D through H below for additional details regarding the changes in this version of the Unicode Standard, its associated annexes, and the other synchronized Unicode specifications.

B. Technical Overview

Version 8.0 of the Unicode Standard consists of the core specification (download), the delta and archival code charts for this version, the Unicode Standard Annexes, and the Unicode Character Database (UCD).

The core specification gives the general principles, requirements for conformance, and guidelines for implementers. The code charts show representative glyphs for all the Unicode characters. The Unicode Standard Annexes supply detailed normative information about particular aspects of the standard. The Unicode Character Database supplies normative and informative data for implementers to allow them to implement the Unicode Standard.

A complete specification of the contributory files for Unicode 8.0 is found on the page Components for 8.0.0. That page also provides the recommended reference format for Unicode Standard Annexes. For examples of how to cite particular portions of the Unicode Standard, see also the Reference Examples.

The navigation bar on the left of this page provides links to the core specification as a single file, to its individual chapters, and to the appendices. Also provided are links to the code charts, the radical-stroke indices to CJK ideographs, the Unicode Standard Annexes, and the data files for Version 8.0 of the Unicode Character Database.

Version Specification

Version 8.0.0 of the Unicode Standard should be referenced as:

The Unicode Consortium. The Unicode Standard, Version 8.0.0, (Mountain View, CA: The Unicode Consortium, 2015. ISBN 978-1-936213-10-8)
http://www.unicode.org/versions/Unicode8.0.0/

The terms “Version 8.0” or “Unicode 8.0” are abbreviations for the full version reference, Version 8.0.0.

The citation and permalink for the latest published version of the Unicode Standard are:

The Unicode Consortium. The Unicode Standard.
http://www.unicode.org/versions/latest/

Code Charts

Several sets of code charts are available. They serve different purposes:

For Unicode 8.0.0 in particular two additional sets of code chart pages are provided:

The delta and archival code charts are a stable part of this release of the Unicode Standard. They will never be updated.

Errata

Errata incorporated into Unicode 8.0 are listed by date in a separate table. For corrigenda and errata after the release of Unicode 8.0, see the list of current Updates and Errata.

C. Stability Policy Update

D. Textual Changes and Character Additions

Six new scripts were added with accompanying new block descriptions:

Ahom
Anatolian Hieroglyphs
Hatran
Multani
Old Hungarian
Sutton SignWriting

Letters used in Arabic and in a number of modern and historic writing systems of South Asia were added. Version 8.0 also has a new notational system, Sutton SignWriting, used for transcription of various sign languages.

A number of popular emoji and other pictographic symbols are now included, as well as a mechanism for supporting diversity in emoji representing faces or people. More user interface symbols were also added to the standard.

Changes in the Unicode Standard Annexes are listed in Section G.

Character Assignment Overview

7,716 characters have been added, including 5,771 CJK unified ideographs. Most character additions are in new blocks, but there are also character additions to a number of existing blocks. For details, see Delta Code Charts.

New Blocks

The newly-defined blocks in Version 8.0 are:

Range           Block Name
AB70..ABBF      Cherokee Supplement
108E0..108FF    Hatran
10C80..10CFF    Old Hungarian
11280..112AF    Multani
11700..1173F    Ahom
12480..1254F    Early Dynastic Cuneiform
14400..1467F    Anatolian Hieroglyphs
1D800..1DAAF    Sutton SignWriting
1F900..1F9FF    Supplemental Symbols and Pictographs
2B820..2CEAF    CJK Unified Ideographs Extension E
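An implementation that needs to recognize the new blocks can encode the ranges above directly. The following Python fragment is a minimal sketch, not a normative lookup: the names NEW_BLOCKS_8_0 and new_block_of are hypothetical, and a production implementation would normally read Blocks.txt from the Unicode Character Database instead of hard-coding ranges.

```python
# Ranges copied from the list of newly defined blocks in Version 8.0.
NEW_BLOCKS_8_0 = [
    (0xAB70, 0xABBF, "Cherokee Supplement"),
    (0x108E0, 0x108FF, "Hatran"),
    (0x10C80, 0x10CFF, "Old Hungarian"),
    (0x11280, 0x112AF, "Multani"),
    (0x11700, 0x1173F, "Ahom"),
    (0x12480, 0x1254F, "Early Dynastic Cuneiform"),
    (0x14400, 0x1467F, "Anatolian Hieroglyphs"),
    (0x1D800, 0x1DAAF, "Sutton SignWriting"),
    (0x1F900, 0x1F9FF, "Supplemental Symbols and Pictographs"),
    (0x2B820, 0x2CEAF, "CJK Unified Ideographs Extension E"),
]


def new_block_of(ch):
    """Return the name of the Version 8.0 block containing ch, or None."""
    cp = ord(ch)
    for first, last, name in NEW_BLOCKS_8_0:
        if first <= cp <= last:
            return name
    return None
```

Note that a block range also covers code points that are unassigned in Version 8.0, so membership in a block does not imply that a character is assigned.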

E. Conformance Changes

There were no significant changes to the conformance clauses of the core specification for Unicode 8.0. However, there were minor changes to the rules in the algorithms specified in UAX #9, UAX #14, and UAX #29. Those rule changes will impact conformant implementations of the respective algorithms. See Section G. Changes in the Unicode Standard Annexes.

F. Changes in the Unicode Character Database

The detailed listing of all changes to the contributory data files of the Unicode Character Database for Version 8.0 can be found in UAX #44, Unicode Character Database. The changes listed there include character additions and property revisions to existing characters that will affect implementations. Some of the important impacts on implementations migrating from earlier versions of the standard are highlighted in Section M.

G. Changes in the Unicode Standard Annexes

In Version 8.0, some of the Unicode Standard Annexes have had significant revisions. The most important of these changes are listed below. For the full details of all changes, see the Modifications section of each UAX, linked directly from the following list of UAXes.

Unicode Standard Annex Changes
UAX #9, Unicode Bidirectional Algorithm: Introduced minor changes to handle identified edge cases in the algorithm.
UAX #11, East Asian Width: No significant changes in this version.
UAX #14, Unicode Line Breaking Algorithm: Added rule LB21b, to prevent a break between a solidus and Hebrew letters. Added a case to rule LB22, to prevent a break between exclamation marks and ellipses.
UAX #15, Unicode Normalization Forms: Updated Section 1.1 and associated figures regarding canonical and compatibility equivalence.
UAX #24, Unicode Script Property: Added clarifications to the principle used in determining the assignment of Script property values for characters predominantly used in one script, but also employed in other scripts.
UAX #29, Unicode Text Segmentation: Adjusted the Default Grapheme Cluster Boundary specification, to account for changes to the New Tai Lue model. Modified rule SB7 to prevent sentence breaks within a word segment such as "Mr.Hamster".
UAX #31, Unicode Identifier and Pattern Syntax: Added clarifications regarding case distinctions in identifiers in programming languages. Added text about middle dot in identifiers. Added new scripts to the list of characters for exclusions from identifiers.
UAX #34, Unicode Named Character Sequences: No significant changes in this version.
UAX #38, Unicode Han Database (Unihan): Added documentation of the new kJa field and updated the description of the kIRG_JSource field. Updated the description of the kDefaultSortKey field to clarify its intended use. Updated the regular expression for the kIRG_GSource field.
UAX #41, Common References for Unicode Standard Annexes: No significant changes in this version.
UAX #42, Unicode Character Database in XML: Added new code point attributes, values, and patterns.
UAX #44, Unicode Character Database: Updated the documentation of the 7.0 provisional Indic data files to reflect their new status as informative in Unicode 8.0 and the renaming of the provisional Indic_Matra_Category to the informative Indic_Positional_Category.
UAX #45, U-Source Ideographs: UAX #45 did not change for this release; however, three characters were added to the entries in USourceData.txt and USourceGlyphs.pdf.

H. Changes in Synchronized Unicode Technical Standards

There are also significant revisions in the Unicode Technical Standards whose versions are synchronized with the Unicode Standard. The most important of these changes are listed below. For the full details of all changes, see the Modifications section of each UTS, linked directly from the following list of UTSes.

Unicode Technical Standard Changes
UTS #10, Unicode Collation Algorithm: Contractions for most Cyrillic accented letters have been removed from the DUCET.
UTS #46, Unicode IDNA Compatibility Processing: Values in the IDNA Comparisons table were updated for Unicode 8.0. A note was added to clarify the implications of Cherokee case folding for mapping in IDNA.

UTS #39, Security Mechanisms, has also been updated for Version 8.0.

M. Implications for Migration

A significant number of changes in Unicode 8.0 may affect implementations upgrading to Version 8.0 from earlier versions of the standard. The most important of these are listed and explained here, to help focus on the issues most likely to cause unexpected trouble during upgrades.

Casing and Case Folding of Cherokee

The character encoding model for the Cherokee script changed from unicameral to bicameral. The conversion was done by reclassifying all existing syllables as uppercase and adding a corresponding set of lowercase syllables. In terms of properties, the General_Category of the existing characters changed from Other_Letter to Uppercase_Letter, and the new characters were given the value Lowercase_Letter. A new case pair for the archaic syllable mv was also added.

The casing was chosen in order to reduce the migration cost for implementations, allowing them to preserve the font metrics for the existing characters and limit the impact on layout. However, the formation of case pairs by adding lowercase characters is unusual: case folding normally maps to lowercase, but the case folding of the existing characters, which are now uppercase, could not change. As a result, case folding of Cherokee maps to uppercase instead of lowercase. This mapping also has consequences for identifiers, as described in the changes to UAX #31, Unicode Identifier and Pattern Syntax.
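The direction of Cherokee case folding can be illustrated with a short Python sketch. The function name cherokee_case_fold is hypothetical, and the code point offsets are assumptions based on a reading of CaseFolding.txt, not something stated on this page: lowercase U+AB70..U+ABBF is taken to fold to U+13A0..U+13EF, and lowercase U+13F8..U+13FD to U+13F0..U+13F5.

```python
def cherokee_case_fold(text):
    """Fold Cherokee lowercase syllables to their uppercase counterparts.

    Characters outside the assumed Cherokee lowercase ranges are left as-is;
    this is not a general case folding implementation.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xAB70 <= cp <= 0xABBF:
            # Cherokee Supplement lowercase -> U+13A0..U+13EF (assumed offset)
            cp = cp - 0xAB70 + 0x13A0
        elif 0x13F8 <= cp <= 0x13FD:
            # Remaining lowercase syllables -> U+13F0..U+13F5 (assumed offset)
            cp -= 8
        out.append(chr(cp))
    return "".join(out)


# Uppercase (pre-existing) syllables are unchanged by folding, so keys
# computed from data stored before Version 8.0 remain valid.
folded = cherokee_case_fold("\uAB70\u13A0")
```

Code that approximates case folding by lowercasing would produce different keys for Cherokee text than a true case folding operation, which is one place where upgrades may need review.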

Change in Encoding Model for New Tai Lue to Visual Order

The character encoding model for New Tai Lue changed from logical order, in which pre-base vowels are stored after an initial consonant, to visual order, in which the pre-base vowels are stored before the initial consonant, as for Thai, Lao, and Tai Viet. The model was changed to better serve the primary user community in the Xishuangbanna region of China, who have been accumulating data entered and stored in visual order, and who have been using fonts with a visual order encoding to render it.

The encoding model change incurred a uniform General_Category reclassification of all New Tai Lue vowel signs and tone marks from Spacing_Mark to Other_Letter, the assignment of the property value Logical_Order_Exception=Yes to the pre-base vowels U+19B5..U+19B7 and U+19BA, and the addition of 176 pre-base vowel + initial consonant contractions to the Default Unicode Collation Element Table.

A visual order model complicates syllable identification and the processes for searching and sorting. Implementations switching to the visual order model can take advantage of techniques developed for processing Thai script data to address the issues associated with visual order encoding, and data stored in logical order should be carefully migrated.
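As an illustration of what such a migration involves, the following Python sketch reorders data known to be in logical order into visual order. It is a simplified sketch under stated assumptions, not a complete migration tool: the function names are hypothetical, the initial consonants are assumed to occupy U+1980..U+19AB, and a pre-base vowel is assumed to follow its initial consonant immediately in logical-order data.

```python
# Pre-base vowels that carry Logical_Order_Exception=Yes in Version 8.0.
PREBASE_VOWELS = {0x19B5, 0x19B6, 0x19B7, 0x19BA}


def is_consonant(ch):
    # Assumption: New Tai Lue initial consonants are U+1980..U+19AB.
    return 0x1980 <= ord(ch) <= 0x19AB


def logical_to_visual(text):
    """Move each pre-base vowel in front of the consonant it follows."""
    chars = list(text)
    i = 0
    while i + 1 < len(chars):
        if is_consonant(chars[i]) and ord(chars[i + 1]) in PREBASE_VOWELS:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the pair that was just reordered
        else:
            i += 1
    return "".join(chars)
```

The conversion is not safe to apply to data that is already in visual order, because a consonant followed by the pre-base vowel of the next syllable would be reordered incorrectly. The ordering of each data source should therefore be established before conversion.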

Version 8.0 adds six new scripts, so implementations which process script data should be carefully checked.

Additionally, there was a significant Script property value change affecting the common Arabic-Indic digits (U+0660..U+0669). These were changed from having the value "Common" to the value "Arabic". Their use with scripts other than Arabic is now handled more consistently by the Script_Extensions property. Implementations which may have had special treatment for the Script property value of the Arabic-Indic digits should be checked to ensure that the change in Script property value does not cause unexpected behavior.
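The effect of the Arabic-Indic digit change on script run detection can be sketched as follows. This Python fragment uses a deliberately tiny, hand-written stand-in for the Script property (ASCII letters as Latin, U+0660..U+0669 switchable between the old and new values, everything else Common); the names script_of and script_runs and the digits_are_arabic flag are hypothetical. Real implementations would read Scripts.txt and ScriptExtensions.txt from the Unicode Character Database.

```python
def script_of(ch, digits_are_arabic=True):
    """Toy Script lookup; digits_are_arabic=False mimics pre-8.0 data."""
    cp = ord(ch)
    if 0x0660 <= cp <= 0x0669:
        return "Arabic" if digits_are_arabic else "Common"
    if "A" <= ch <= "Z" or "a" <= ch <= "z":
        return "Latin"
    return "Common"


def script_runs(text, digits_are_arabic=True):
    """Split text into (script, substring) runs.

    Characters with the value Common are attached to the run in progress,
    a common simplification in run detection.
    """
    runs = []
    for ch in text:
        script = script_of(ch, digits_are_arabic)
        if runs and (script == "Common" or script == runs[-1][0]):
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [(script, chunk) for script, chunk in runs]


sample = "abc \u0661\u0662\u0663"
old_runs = script_runs(sample, digits_are_arabic=False)
new_runs = script_runs(sample, digits_are_arabic=True)
```

With the pre-8.0 value the digits are absorbed into the preceding Latin run, while with the Version 8.0 value they start a separate Arabic run. Font selection or shaping logic keyed to script runs may therefore behave differently after an upgrade.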

Changes for Deprecation of Language Tags

The range of tag characters (U+E0020..U+E007F) was changed from Deprecated=True to Deprecated=False in Version 8.0. This change was done to clear the way for the potential future use of tag characters for a purpose other than to represent language tags.

Note that U+E0001 LANGUAGE TAG remains deprecated. Furthermore, the use of tag characters to represent language tags in a plain text stream is still a deprecated mechanism for conveying language information about text.

Implementations which deliberately remove or refuse to interpret deprecated characters may need updates to prepare them for the potential use of U+E0020..U+E007F in Unicode 8.0 data in the future.
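A filter of this kind might look like the following Python sketch. The function and set names are hypothetical, and both sets are illustrative subsets, not the full Deprecated property: the first models a filter written against earlier data that removes U+E0001 and the tag characters U+E0020..U+E007E, and the second keeps only U+E0001 LANGUAGE TAG. A real implementation would derive the set from PropList.txt for the version it targets.

```python
def strip_deprecated(text, deprecated):
    """Remove every character whose code point is in the deprecated set."""
    return "".join(ch for ch in text if ord(ch) not in deprecated)


# Illustrative subset for a filter built on pre-8.0 data:
# U+E0001 plus the tag characters U+E0020..U+E007E.
OLD_TAG_FILTER = {0xE0001} | set(range(0xE0020, 0xE007F))

# Illustrative subset for Version 8.0: only U+E0001 LANGUAGE TAG.
NEW_TAG_FILTER = {0xE0001}

sample = "a\U000E0001\U000E0061b"
old_result = strip_deprecated(sample, OLD_TAG_FILTER)
new_result = strip_deprecated(sample, NEW_TAG_FILTER)
```

With the older set the tag character U+E0061 is silently dropped, while with the updated set it passes through. Passing the set as a parameter keeps the filter tied to a specific version of the property data.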

Glyph Changes

The representative glyph for U+301C WAVE DASH was updated, so that it now shows a tilde shape instead of a reversed tilde shape. The updated glyph now aligns with majority practice in fonts for this character.

The representative glyph for U+3127 BOPOMOFO LETTER I was changed from a vertical orientation to a horizontal orientation. The updated glyph now aligns with majority practice for both horizontal and vertical layout of Bopomofo text, but implementations should be checked to verify correct behavior for rendering of this character.

Version 8.0 made small adjustments to line break and other segmentation rules. In particular:

CJK Changes

