Unicode 6.3.0

Released: 2013 September 30 (Announcement)

Version 6.3.0 has been superseded by the latest version of the Unicode Standard.

This page summarizes the important changes for the Unicode Standard, Version 6.3.0.

The core specification was not republished for Version 6.3. Thus the chapters of the core specification use the Version 6.2.0 PDF files.

A. Summary
B. Version Information
C. Stability Policy Update
D. Textual Changes and Character Additions
E. Conformance Changes
F. Changes in the Unicode Character Database
G. Changes in the Unicode Standard Annexes
H. Changes in Synchronized Unicode Technical Standards

A. Summary
Version 6.3 of the Unicode Standard is a special release focused on delivering significantly improved bidirectional behavior.

Bidirectional Behavior Improvements
This new version updates the Unicode Bidirectional Algorithm to ensure that pairs of parentheses and brackets have consistent layout and to provide a mechanism for isolating runs of text.

The updated Bidirectional Algorithm together with five newly introduced bidi format characters will improve the display of text for hundreds of millions of users of Arabic, Hebrew, Persian, Urdu, and many others. The display and positioning of parentheses will better match the normal behavior that users expect. By using the new methods for isolating runs of text, software will be able to construct messages from different sources without jumbling the order of characters. The new bidi format characters correspond to features in markup (such as in CSS). Overall, these improvements bring greater interoperability and an improved ability for inserting text and assembling user interface elements in these languages.
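As a minimal sketch of how software might use the isolate controls when assembling a message from several sources, the following wraps each inserted value in FIRST STRONG ISOLATE ... POP DIRECTIONAL ISOLATE. The helper names `isolate` and `build_message` and the sample template are hypothetical, chosen only for illustration. The example builds the string only; visual reordering is left to a renderer that implements the updated algorithm.

```python
# Isolate format characters added in Unicode 6.3.
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text: str) -> str:
    """Wrap text so that its direction is resolved independently of its surroundings.

    Hypothetical helper, not part of any standard library.
    """
    return f"{FSI}{text}{PDI}"


def build_message(template: str, **fields: str) -> str:
    """Fill a str.format template, isolating every inserted field (hypothetical helper)."""
    return template.format(**{key: isolate(value) for key, value in fields.items()})


# A Hebrew name (written with escapes) and a left-to-right topic inserted into English text.
message = build_message(
    "{user} replied to {topic}",
    user="\u05D3\u05E0\u05D4",
    topic="C++ (2013)",
)
```

Using FSI rather than LRI or RLI lets each inserted value take its direction from its own first strong character, which is convenient when the direction of the inserted text is not known in advance.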

The improvements come with new rigor: the Consortium now offers two reference implementations and greatly improved testing and test data.

Other Enhancements
In a major enhancement for CJK usage, this new version adds standardized variation sequences for all 1,002 CJK compatibility ideographs. These sequences address a well-known issue of the CJK compatibility ideographs—that they could change their appearance when any process normalized the text. Using the new standardized variation sequences allows authors to write text which will preserve the specific required shapes of these CJK ideographs, even under Unicode normalization.
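The normalization issue can be illustrated with Python's standard `unicodedata` module. This is a sketch under one assumption: the pairing of U+585A with VARIATION SELECTOR-1 as the sequence for U+FA10 is taken here for illustration and should be checked against StandardizedVariants.txt. The helper name `survives_normalization` is hypothetical.

```python
import unicodedata

COMPAT = "\uFA10"    # CJK COMPATIBILITY IDEOGRAPH-FA10
UNIFIED = "\u585A"   # the unified ideograph it canonically decomposes to
VS1 = "\uFE00"       # VARIATION SELECTOR-1
SEQUENCE = UNIFIED + VS1  # assumed variation sequence for U+FA10

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def survives_normalization(text: str) -> bool:
    """Return True if text is unchanged under all four normalization forms."""
    return all(unicodedata.normalize(form, text) == text for form in FORMS)


# The compatibility ideograph is replaced by the unified ideograph...
compat_is_stable = survives_normalization(COMPAT)
# ...while the unified ideograph plus variation selector is left intact.
sequence_is_stable = survives_normalization(SEQUENCE)
```

Here `compat_is_stable` is false and `sequence_is_stable` is true, because variation selectors have no decomposition and do not reorder.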

Version 6.3 includes other improvements as well:

Improved Unihan data to better align with ISO/IEC 10646

Better support for Hebrew word break behavior and for ideographic space in line breaking

This version also rolls in a change in Definition D136 (case-ignorable) of the core specification, various minor corrections for errata, and other small updates for the Unicode Character Database.

Synchronization
Two other important Unicode specifications are maintained in synchrony with the Unicode Standard, and have updates for Version 6.3:

UTS #10, Unicode Collation Algorithm

UTS #46, Unicode IDNA Compatibility Processing

This version of the Unicode Standard is synchronized with ISO/IEC 10646:2012, plus the accelerated publication of 5 bidirectional format control characters: U+061C ARABIC LETTER MARK and the isolate span controls U+2066..U+2069.

See Sections D through H below for additional details regarding the changes in this version of the Unicode Standard, its associated annexes, and the other synchronized Unicode specifications.

B. Version Information
Version 6.3 of the Unicode Standard consists of the core specification (unchanged from Version 6.2, except for Definition D136), the delta and archival code charts for this version, the Unicode Standard Annexes, and the Unicode Character Database (UCD).

The core specification gives the general principles, requirements for conformance, and guidelines for implementers. The code charts show representative glyphs for all the Unicode characters. The Unicode Standard Annexes supply detailed normative information about particular aspects of the standard. The Unicode Character Database supplies normative and informative data for implementers to allow them to implement the Unicode Standard.

Version 6.3.0 of the Unicode Standard should be referenced as:

The Unicode Consortium. The Unicode Standard, Version 6.3.0, (Mountain View, CA: The Unicode Consortium, 2013. ISBN 978-1-936213-08-5)
http://www.unicode.org/versions/Unicode6.3.0/

The terms “Version 6.3” or “Unicode 6.3” are abbreviations for the full version reference, Version 6.3.0.

The citation and permalink for the latest published version of the Unicode Standard are:

The Unicode Consortium. The Unicode Standard.
http://www.unicode.org/versions/latest/

A complete specification of the contributory files for Unicode 6.3 is found on the page Components for 6.3.0. That page also provides the recommended reference format for Unicode Standard Annexes.

The navigation bar on the left of this page provides links to the core specification as a single file, to its individual chapters, and to the appendices. Also provided are links to the code charts, the radical-stroke indices to CJK ideographs, the Unicode Standard Annexes, and the data files for Version 6.3 of the Unicode Character Database.

Code Charts
Several sets of code charts are available. They serve different purposes:

The latest set of code charts for the Unicode Standard is available online. Those charts are always the most current code charts available, and may be updated at any time. The charts are organized by scripts and blocks for easy reference. An online index by character name is also provided.

For Unicode 6.3.0 in particular, two additional sets of code chart pages are provided:

A set of delta code charts showing the blocks in which bidirectional format controls were added for Unicode 6.3.0. Those characters are visually highlighted in the relevant chart. These delta code charts also include blocks which contain significant glyph changes to fix errata.

A set of archival code charts that represent the entire set of characters, names and representative glyphs at the time of publication of Unicode 6.3.0.

The delta and archival code charts are a stable part of this release of the Unicode Standard. They will never be updated.

Errata
Errata incorporated into Unicode 6.3 are listed by date in a separate table. For corrigenda and errata after the release of Unicode 6.3, see the list of current Updates and Errata.

C. Stability Policy Update
The statement of the stability policy for the Bidi_Class property was slightly reworded to clarify the exact type of changes allowed for it. This update is related to the changes in Unicode 6.3.0 for the Unicode Bidirectional Algorithm.

A constraint was added for the new Bidi_Paired_Bracket_Type (bpt) property, to guarantee that characters given either bpt=Open or bpt=Close (intended to be limited to paired brackets) also have Bidi_Class=ON and Bidi_Mirrored=Yes, for consistency.

A new constraint was added to guarantee that characters with the General_Category property value Number also have a Numeric_Type property value distinct from None.

For details about each of these changes or additions, see Property Value Stability.

Note: The Unicode Character Encoding Stability Policy restricts possible future changes to the Unicode Standard, but is not formally a part of the standard itself.

D. Textual Changes and Character Additions
In Version 6.3 of the core specification, Section 3.13, Default Case Algorithms, Definition D136 has been updated as follows:

D136. A character C is defined to be case-ignorable if C has the value MidLetter (ML), MidNumLet (MB), or Single_Quote (SQ) for the Word_Break property or its General_Category is one of Nonspacing_Mark (Mn), Enclosing_Mark (Me), Format (Cf), Modifier_Letter (Lm), or Modifier_Symbol (Sk).
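The updated definition can be sketched as a predicate. Python's `unicodedata` module exposes General_Category but not Word_Break, so the sketch below assumes a small hand-written table covering only a few characters. The table `WORD_BREAK_SUBSET` and the function `is_case_ignorable` are hypothetical names, and a real implementation would read the full property from WordBreakProperty.txt.

```python
import unicodedata

# Illustrative subset of Word_Break values (as of Unicode 6.3); NOT the full property.
WORD_BREAK_SUBSET = {
    "\u0027": "Single_Quote",  # APOSTROPHE
    "\u002E": "MidNumLet",     # FULL STOP
    "\u00B7": "MidLetter",     # MIDDLE DOT
    "\u2019": "MidNumLet",     # RIGHT SINGLE QUOTATION MARK
}

CASE_IGNORABLE_WORD_BREAK = {"MidLetter", "MidNumLet", "Single_Quote"}
CASE_IGNORABLE_CATEGORIES = {"Mn", "Me", "Cf", "Lm", "Sk"}


def is_case_ignorable(ch: str) -> bool:
    """Apply D136 to one character, using the partial Word_Break table above."""
    if WORD_BREAK_SUBSET.get(ch) in CASE_IGNORABLE_WORD_BREAK:
        return True
    return unicodedata.category(ch) in CASE_IGNORABLE_CATEGORIES


results = {ch: is_case_ignorable(ch) for ch in ("'", "\u0301", "^", "a", "-")}
```

With this table, the apostrophe qualifies through Single_Quote, U+0301 through Nonspacing_Mark, and the circumflex accent through Modifier_Symbol, while "a" and "-" do not qualify.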

Changes in the Unicode Standard Annexes are listed in Section G.

Character Assignment Overview
Five new character assignments were made for the Unicode Standard, Version 6.3, as shown in the following table. This addition brings the total number of characters assigned in the standard to 110,122. (That is the traditional count, which totals up graphic and format characters, but omits surrogate code points, ISO control codes, noncharacters, and private-use allocations.)

U+061C ARABIC LETTER MARK
U+2066 LEFT-TO-RIGHT ISOLATE
U+2067 RIGHT-TO-LEFT ISOLATE
U+2068 FIRST STRONG ISOLATE
U+2069 POP DIRECTIONAL ISOLATE
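The names and Bidi_Class values of these five characters can be inspected with Python's standard `unicodedata` module. This sketch assumes an interpreter whose bundled character database is at Unicode 6.3 or later (Python 3.4 and newer); the helper name `describe` is hypothetical.

```python
import unicodedata

# The five characters added in Unicode 6.3.
NEW_IN_6_3 = ["\u061C", "\u2066", "\u2067", "\u2068", "\u2069"]


def describe(ch: str) -> tuple:
    """Return (code point label, character name, Bidi_Class short alias)."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.bidirectional(ch))


rows = [describe(ch) for ch in NEW_IN_6_3]
for label, name, bidi_class in rows:
    print(f"{label}  {bidi_class:<3}  {name}")
```

On older interpreters `unicodedata.name` raises `ValueError` for these code points, since they were unassigned before this version.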

No new blocks are defined in Version 6.3.

E. Conformance Changes
In Version 6.3 of the core specification, the derivation of the property Case_Ignorable in Definition D136 has been updated to account for the change in the Word_Break property value of U+0027 APOSTROPHE from MidNumLet to Single_Quote.

Except for the update to Definition D136, there are no significant conformance changes in the core specification. However, there are significant conformance changes to the Unicode Bidirectional Algorithm in UAX #9, which may also affect incidental discussion about the Unicode Bidirectional Algorithm in several sections of the core specification.

F. Changes in the Unicode Character Database
The detailed listing of all changes to the contributory data files of the Unicode Character Database for Version 6.3 can be found in UAX #44, Unicode Character Database. The changes listed there include character additions and property revisions to existing characters that will affect implementations. The most notable changes are summarized below.

Changes Related to the Unicode Bidirectional Algorithm

The five newly-encoded characters are all Bidi_Control characters. U+061C ARABIC LETTER MARK, abbreviated ALM, is similar to the bidirectional ordering control RLM except that its Bidi_Class property value is AL. The explicit directional isolates U+2066..U+2069 mark a span of text as directionally isolated from its surroundings.

The Bidi_Class property has been extended with four new values for directional isolates.

Two new normative properties, Bidi_Paired_Bracket and Bidi_Paired_Bracket_Type, have been introduced together with a new normative contributory file, BidiBrackets.txt, for the specification of bracket pairs in bidirectional text.

The General_Category property values of the floor and ceiling delimiters, U+2308..U+230B, have been changed from Sm to Ps or Pe, to form bidirectional bracket pairs.

A new conformance test data file has been added, BidiCharacterTest.txt, and the existing BidiTest.txt has been augmented with test cases containing new edge cases and the new Bidi_Class property values.
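As a rough sketch of how an implementation might load the bracket-pair data, the following parses a few lines written in the semicolon-separated layout assumed here for BidiBrackets.txt (code point; paired bracket; type, followed by a comment). The sample lines are illustrative rather than copied from the data file, and `parse_bidi_brackets` is a hypothetical helper.

```python
# Illustrative lines in the assumed BidiBrackets.txt layout.
SAMPLE = """\
# code point; Bidi_Paired_Bracket; Bidi_Paired_Bracket_Type
0028; 0029; o # LEFT PARENTHESIS
0029; 0028; c # RIGHT PARENTHESIS
005B; 005D; o # LEFT SQUARE BRACKET
005D; 005B; c # RIGHT SQUARE BRACKET
2308; 2309; o # LEFT CEILING
2309; 2308; c # RIGHT CEILING
"""


def parse_bidi_brackets(text: str) -> dict:
    """Map each bracket character to (paired bracket, type), where type is 'o' or 'c'."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not data:
            continue
        code_point, paired, bracket_type = (field.strip() for field in data.split(";"))
        table[chr(int(code_point, 16))] = (chr(int(paired, 16)), bracket_type)
    return table


brackets = parse_bidi_brackets(SAMPLE)
```

The ceiling delimiters appear in the sample because their General_Category change to Ps and Pe is what allows them to form bracket pairs.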

Changes Related to Line Breaking and Text Segmentation

The Line_Break property value of U+3000 IDEOGRAPHIC SPACE has been changed from ID to BA.

Hebrew letters and basic punctuation marks have been assigned the newly introduced Word_Break property values Hebrew_Letter, Single_Quote, and Double_Quote.

U+02D7 MODIFIER LETTER MINUS SIGN has been assigned the Word_Break property value MidLetter.

Changes Related to CJK Characters and the Unihan Database

A set of 245 new U-Source ideographs has been added.

A set of 1002 standardized variation sequences has been added, one sequence per CJK compatibility ideograph in Unicode 6.3. The sequences consist of CJK unified ideographs and variation selectors U+FE00..U+FE02, and have the intended visual appearance of the corresponding CJK compatibility ideographs.

The kHanyuPinlu fields have been revised systematically to use accents instead of numbers for tones.

Miscellaneous Changes

Mongolian and Phags-pa characters have been given a Joining_Type classification for contextual shaping. As a part of these additions, one Phags-pa character has the Joining_Type value of L (Left Joining), which no character had been assigned before. This change may impact the implementations of cursive rendering engines.

The General_Category property value of U+180E MONGOLIAN VOWEL SEPARATOR has been changed from Zs to Cf. The values of other related properties such as Bidi_Class, White_Space, and Other_Default_Ignorable_Code_Point have been updated accordingly.

The unassigned code points in the Currency Symbols block have been given the Bidi_Class property value ET and the Line_Break property value PR, to help implementations support new currency symbols, when they are encoded.

Nine named character sequences have been added for Uighur and Chagatai.
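One practical effect of the U+180E change listed above is that the character is no longer treated as white space. A minimal check, assuming a Python interpreter whose character database is at Unicode 6.3 or later (Python 3.4 and newer):

```python
import unicodedata

MVS = "\u180E"  # MONGOLIAN VOWEL SEPARATOR

category = unicodedata.category(MVS)  # General_Category as seen by this interpreter
is_space = MVS.isspace()              # whether str methods treat it as white space
words = f"a{MVS}b".split()            # whitespace splitting leaves the string whole
```

Code that relied on this character to separate tokens may therefore behave differently once its data is updated to this version.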

G. Changes in the Unicode Standard Annexes
In Version 6.3, many of the Unicode Standard Annexes have had significant revisions. The most important of these changes are listed below. For the full details of all changes, see the Modifications section of each UAX, linked directly from the following list of UAXes.

Unicode Standard Annex Changes

UAX #9, Unicode Bidirectional Algorithm: The Unicode Bidirectional Algorithm was substantially extended to support isolate runs and to resolve paired brackets as a unit. For the former extension, four new Bidi_Class property values were added. For the latter, two normative properties and an algorithm rule N0 were introduced. Additional definitions, rule revisions, notes, and examples were included, and a new test file was added.

UAX #11, East Asian Width: No significant changes in this version.

UAX #14, Unicode Line Breaking Algorithm: The description of the CM class was updated to reflect a refinement in line breaking for U+3035 VERTICAL KANA REPEAT MARK LOWER HALF, and the description of the BA class was updated to reflect a change for U+3000 IDEOGRAPHIC SPACE.

UAX #15, Unicode Normalization Forms: No significant changes in this version.

UAX #24, Unicode Script Property: No significant changes in this version.

UAX #29, Unicode Text Segmentation: There were some minor updates made for word segmentation. Apostrophe and double quote are now allowed within a strictly Hebrew word context, to reflect their common use in place of geresh and gershayim.

UAX #31, Unicode Identifier and Pattern Syntax: No significant changes in this version.

UAX #34, Unicode Named Character Sequences: No significant changes in this version.

UAX #38, Unicode Han Database (Unihan): The status of kCompatibilityVariant was clarified. kHanyuPinlu was changed to use accents instead of numbers for tones, and the regular expression for it was modified accordingly. Many other minor documentation updates were made.

UAX #41, Common References for Unicode Standard Annexes: Minor updates were made to the references.

UAX #42, Unicode Character Database in XML: Changes were made to track additional properties and property values for the Unicode Bidirectional Algorithm.

UAX #44, Unicode Character Database: The status of default values was clarified. Numerous changes were made to reflect changes to the Unicode Bidirectional Algorithm and its associated character properties and data files. A clarification was added about Numeric_Type=Digit.

UAX #45, U-Source Ideographs: 245 characters were added to the list of U-Source ideographs. A new status of UNC-2013 was added and documented.

H. Changes in Synchronized Unicode Technical Standards
There are also significant revisions in the Unicode Technical Standards whose versions are synchronized with the Unicode Standard. The most important of these changes are listed below. For the full details of all changes, see the Modifications section of each UTS, linked directly from the following list of UTSes.

Unicode Technical Standard Changes

UTS #10, Unicode Collation Algorithm: The CLDR root collation data files contained in CollationAuxiliary.zip, along with the related documentation, have been moved from the UCA release directory to the root collation data files in the CLDR repository. Trailing collation elements are now given regular tertiary weights in DUCET, which allows for full case differences among compatibility characters. Digits from all scripts are now given the same weights as ASCII digits in DUCET, rather than being distinguished by secondary weights. The IgnoreSP option for handling variables (intended for ignoring punctuation but not symbols) has been removed. The weights 0xFFFD..0xFFFF are now reserved for special collation elements. In addition, the text of UTS #10 has been reorganized for better flow.

UTS #46, Unicode IDNA Compatibility Processing: The five new bidirectional format controls were added. They are given the value ignored in IdnaMappingTable.txt. They have the status disallowed in IDNA2008.
