Unicode 4.1.0
Version 4.1.0 has been superseded by the latest version of the Unicode Standard.
Version 4.1.0 of the Unicode Standard consists of the core specification, The Unicode Standard, Version 4.0, as amended by Unicode 4.0.1 and further amended by this specification, the delta and archival code charts for this version, the Unicode Standard Annexes, and the Unicode Character Database (UCD). The core specification gives the general principles, requirements for conformance, and guidelines for implementers. The code charts show representative glyphs for all the Unicode characters. The Unicode Standard Annexes supply detailed normative information about particular aspects of the standard. The Unicode Character Database supplies normative and informative data for implementers to allow them to implement the Unicode Standard. Version 4.1.0 of the Unicode Standard should be referenced as:
The Unicode Consortium. The Unicode Standard, Version 4.1.0, defined by: The Unicode Standard, Version 4.0 (Boston, MA, Addison-Wesley, 2003. ISBN 0-321-18578-1), as amended by Unicode 4.0.1 (http://www.unicode.org/versions/Unicode4.0.1/)
and by Unicode 4.1.0 (http://www.unicode.org/versions/Unicode4.1.0/).
A complete specification of the contributory files for Unicode 4.1.0 is found on the page Components for Version 4.1.0.
Contents of This Document
Online Edition
Overview
Notable Changes From Unicode 4.0.1 to Unicode 4.1.0
Conformance Changes to the Standard
Other Changes to the Standard
Superseded Sections
Unicode Character Database
Errata Corrected in This Version
Script Additions
Significant Character Additions
Online Edition
The text of The Unicode Standard, Version 4.0, as well as the delta and archival code charts, is available online via the navigation links on this page. These files may not be printed. The Unicode 4.0 Web Bookmarks page has links to all sections of the online text.
Overview
Unicode 4.1.0 is a minor version of the Unicode Standard. 1273 new characters have been added. This document provides information about those additional characters, as well as further clarifications of the text of the standard. In addition, it covers accumulated corrigenda and errata to the text.
There are significant changes to many of the Unicode Standard Annexes which are part of Unicode 4.1.0. Each annex has a modification section listing the changes in that annex.
Notable Changes From Unicode 4.0.1 to Unicode 4.1.0
- Addition of 1273 new characters to the standard, including those to complete roundtrip mapping of the HKSCS and GB 18030 standards, five new currency signs, some characters for Indic and Korean, and eight new scripts. (The exact list of additions can be seen in DerivedAge.txt, in the age=4.1 section.)
- Change in the end of the CJK Unified Ideographs range from U+9FA5 to U+9FBB, with the addition of some Han characters. The boundaries of such ranges are sometimes hardcoded in software, in which case the hardcoded value needs to be changed.
- New Unicode Standard Annexes: UAX #31, Identifier and Pattern Syntax and UAX #34, Unicode Named Character Sequences, and significant changes to other Unicode Standard Annexes.
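The note above about hardcoded range boundaries can be illustrated with a minimal Python sketch. The two end points come from the text; the start point U+4E00, the table layout, and the function name `is_cjk_unified` are assumptions made for illustration, not part of the standard.

```python
# End of the CJK Unified Ideographs range, keyed by Unicode version.
# The end points come from the release notes; the start point U+4E00
# and this table layout are assumptions made for the sketch.
CJK_UNIFIED_START = 0x4E00
CJK_UNIFIED_END = {"4.0.1": 0x9FA5, "4.1.0": 0x9FBB}


def is_cjk_unified(cp: int, version: str = "4.1.0") -> bool:
    """Return True if cp lies in the CJK Unified Ideographs range for `version`."""
    return CJK_UNIFIED_START <= cp <= CJK_UNIFIED_END[version]


# U+9FA6 is one of the newly added Han characters: inside the range in
# 4.1.0, outside it for software still using the 4.0.1 boundary.
print(is_cjk_unified(0x9FA6, "4.1.0"), is_cjk_unified(0x9FA6, "4.0.1"))
```

Keeping the boundary in one versioned table rather than scattered literals makes this kind of update a one-line change.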
In addition to the repertoire additions, there have been a number of significant changes to the Unicode Character Database files and the properties in them. In particular:
- Three new properties, Grapheme_Cluster_Break, Sentence_Break, and Word_Break, have been added in support of UAX #29, Text Boundaries. Their enumeration can be found in new data files, located in the new "auxiliary" subdirectory of the UCD. Also in the "auxiliary" subdirectory are the test data files and HTML break charts associated with UAX #29.
- The new property Other_ID_Continue has been added to support identifier stability. It is enumerated in PropList.txt and is used in the derivation of other identifier-related properties.
- Two new properties, Pattern_Syntax and Pattern_White_Space, have been added in support of UAX #31, Identifier and Pattern Syntax. Their enumeration can be found in PropList.txt.
- The bidi properties of a few compatibility equivalents of characters whose bidi classes changed for Unicode 4.0.1 have been harmonized.
- The case mapping contexts defined in SpecialCasing.txt have been updated and now override Table 3-13. Context Specification for Casing on p. 89 of The Unicode Standard, Version 4.0. These changes are described below in the section Modifications to Default Case Operations.
- Alphabetic is now a superset of Lowercase and Uppercase for compatibility with POSIX-style character classes.
- A new data file NamedSequences.txt has been added in conjunction with UAX #34, Named Character Sequences. This data file defines specific names for some significant Unicode character sequences, giving their USI (Unicode Sequence Identifiers) values.
- The linebreak properties of Runic, Indic, Mongolian, Tibetan punctuation, and Hangul have been revised to better match their expected behavior. (See UAX #14: Line Breaking Properties.)
The following complete scripts have been added in Unicode 4.1.0:
- New Tai Lue (U+1980..U+19DF)
- Buginese (U+1A00..U+1A1F)
- Glagolitic (U+2C00..U+2C5F)
- Coptic (U+2C80..U+2CFF)
- Tifinagh (U+2D30..U+2D7F)
- Syloti Nagri (U+A800..U+A82F)
- Old Persian (U+103A0..U+103DF)
- Kharoshthi (U+10A00..U+10A5F)
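The block ranges listed above can be held in a small lookup table. The following Python sketch uses exactly those ranges; the names `NEW_SCRIPT_BLOCKS` and `new_script_block` are hypothetical, and a real implementation would more likely read the ranges from the UCD data files.

```python
from typing import Optional

# Block ranges for the scripts added in Unicode 4.1.0, as listed above.
NEW_SCRIPT_BLOCKS = [
    (0x1980, 0x19DF, "New Tai Lue"),
    (0x1A00, 0x1A1F, "Buginese"),
    (0x2C00, 0x2C5F, "Glagolitic"),
    (0x2C80, 0x2CFF, "Coptic"),
    (0x2D30, 0x2D7F, "Tifinagh"),
    (0xA800, 0xA82F, "Syloti Nagri"),
    (0x103A0, 0x103DF, "Old Persian"),
    (0x10A00, 0x10A5F, "Kharoshthi"),
]


def new_script_block(cp: int) -> Optional[str]:
    """Return the name of the 4.1.0 script block containing cp, or None."""
    for first, last, name in NEW_SCRIPT_BLOCKS:
        if first <= cp <= last:
            return name
    return None


print(new_script_block(0x2C00))   # Glagolitic
print(new_script_block(0x0041))   # None
```

Note that a block range is not the same as script membership: not every code point in a block is assigned, and a script can have characters outside its main block.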
Two scripts have been disunified or reorganized:
- Coptic is now considered a separate script from Greek. This differs from prior documentation in the standard. A new Coptic block has been added, including characters for Old Coptic. It should be noted, however, that the 14 Coptic letters derived from Demotic, which had already been encoded in the Greek and Coptic block, are unchanged, and need to be included in any complete implementation of Coptic.
- The Nuskhuri forms of Khutsuri Georgian have been added in a new Georgian Supplement block (U+2D00..U+2D2F). Those characters are now to be taken as the lowercase pairs of the Asomtavruli Georgian encoded at U+10A0..U+10C5. This introduction of case pairs for Khutsuri is a change from the previous documentation about Georgian in the standard.
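As a rough illustration of the new Khutsuri case pairs, the sketch below maps Asomtavruli letters in U+10A0..U+10C5 to Nuskhuri letters in the Georgian Supplement block. It assumes the pairs line up at the same offset in both ranges; the authoritative mappings are those in the Unicode Character Database, and the function name `khutsuri_lower` is hypothetical.

```python
ASOMTAVRULI_FIRST, ASOMTAVRULI_LAST = 0x10A0, 0x10C5
NUSKHURI_FIRST = 0x2D00  # start of the Georgian Supplement block


def khutsuri_lower(ch: str) -> str:
    """Map an Asomtavruli letter to its Nuskhuri pair; leave other characters alone.

    Assumes the two ranges are laid out in parallel (same offset for each pair).
    """
    cp = ord(ch)
    if ASOMTAVRULI_FIRST <= cp <= ASOMTAVRULI_LAST:
        return chr(NUSKHURI_FIRST + (cp - ASOMTAVRULI_FIRST))
    return ch


print(hex(ord(khutsuri_lower("\u10A0"))))  # 0x2d00
```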
Beyond the addition of entire scripts, there have been very significant extensions to the repertoire for the Arabic script and the Ethiopic script. A large number of additional Latin characters have been added as phonetic extensions to support various orthographic conventions for minority languages. There are also significant additions of Greek symbols and punctuation to support specialist representation of ancient Greek materials. Several small sets have been added to CJK Unified Ideographs and to associated blocks.
A few characters have been added to supplement Hebrew, in particular for support of Biblical Hebrew text representation.
U+060B AFGHANI SIGN has been added. While some glyph variants of this character do occur, the form shown in the code charts is that approved by the Ministry of Finance of the Afghanistan government.
U+09CE BENGALI LETTER KHANDA TA has been added. This will necessitate adjustment of Bengali script implementations. In Unicode 4.1, recommendations for the representation of Khanda-Ta in Bengali differ from those documented in Version 4.0.1 and earlier.
Conformance Changes to the Standard
Modifications to Default Case Operations
The following amends Section 3.13, Default Case Operations, on p. 89-90 of The Unicode Standard, Version 4.0.
Add after D47:
D47a A character C is defined to be case-ignorable if C has the Unicode Property Word_Break=MidLetter as defined in Unicode Standard Annex #29, "Text Boundaries;" or the General Category of C is Nonspacing Mark (Mn), Enclosing Mark (Me), Format Control (Cf), Letter Modifier (Lm), or Symbol Modifier (Sk).
D47b A case-ignorable sequence is a sequence of zero or more case-ignorable characters.
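Definitions D47a and D47b, together with the Final_Sigma context from Table 3-13, can be sketched in Python as follows. This is an approximation under stated assumptions: Python's `unicodedata` module reflects a later Unicode version than 4.1.0, the `MID_LETTER` set is only an assumed subset of Word_Break=MidLetter (the full set is in the UAX #29 data files), and "cased" is approximated with `str.isupper`, `str.islower`, and gc=Lt. All function names are hypothetical.

```python
import unicodedata

# Assumed, illustrative subset of Word_Break=MidLetter.
MID_LETTER = {"\u0027", "\u00B7", "\u05F4", "\u2019", "\u2027"}
# General categories named in D47a: Mn, Me, Cf, Lm, Sk.
CASE_IGNORABLE_GC = {"Mn", "Me", "Cf", "Lm", "Sk"}


def is_case_ignorable(ch: str) -> bool:
    """D47a: Word_Break=MidLetter, or one of the listed general categories."""
    return ch in MID_LETTER or unicodedata.category(ch) in CASE_IGNORABLE_GC


def is_cased(ch: str) -> bool:
    """Approximation of 'cased': uppercase, lowercase, or titlecase letter."""
    return ch.isupper() or ch.islower() or unicodedata.category(ch) == "Lt"


def is_final_sigma_context(text: str, i: int) -> bool:
    """Check the Final_Sigma context for the character at text[i]."""
    # Before C: \p{cased} (\p{case-ignorable})*
    preceded = False
    for ch in reversed(text[:i]):
        if is_cased(ch):
            preceded = True
            break
        if not is_case_ignorable(ch):
            break
    if not preceded:
        return False
    # After C: ! ( (\p{case-ignorable})* \p{cased} )
    for ch in text[i + 1:]:
        if is_cased(ch):
            return False
        if not is_case_ignorable(ch):
            break
    return True


word = "\u039F\u0394\u039F\u03A3"          # Greek capital letters ending in sigma
print(is_final_sigma_context(word, 3))      # True: last letter of the word
print(is_final_sigma_context("\u03A3\u039F\u03A3", 0))  # False: nothing cased before it
```

Checking `is_cased` before `is_case_ignorable` in each scan matters because a character can satisfy both (for example a modifier letter with the Lowercase property), and the regular expressions allow such a character to play either role.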
Replace Table 3-13, Context Specification by the following:
A description of each context is followed by the equivalent regular expression(s) describing the context before C and the context after C, or both. The regular expression uses the syntax of Unicode Technical Standard #18, "Unicode Regular Expressions," with one addition: "!" means that the expression does not match. All regular expressions below are case-sensitive.
Table 3-13. Context Specification for Casing
| Context | Description | Regular Expressions |
| --- | --- | --- |
| Final_Sigma | C is preceded by a sequence consisting of a cased letter and a case-ignorable sequence, and C is not followed by a sequence consisting of an ignorable sequence and then a cased letter. | Before C: `\p{cased} (\p{case-ignorable})*`; After C: `! ( (\p{case-ignorable})* \p{cased} )` |
| After_Soft_Dotted | There is a Soft_Dotted character before C, with no intervening character of combining class 0 or 230 (ABOVE). | Before C: `[\p{Soft_Dotted}] ([^\p{cc=230} \p{cc=0}])*` |
| More_Above | C is followed by a character of combining class 230 (ABOVE), with no intervening character of type 0. | After C: `[^\p{cc=0}]* [\p{cc=230}]` |
| Before_Dot | C is followed by combining dot above (U+0307). Any sequence of characters with a combining class that is neither 0 nor 230 may intervene between the current character and the combining dot above. | After C: `([^\p{cc=230} \p{cc=0}])* [\u0307]` |
| After_I | There is an uppercase I before C, and there is no intervening combining character class 230 (ABOVE) or 0. | Before C: `[I] ([^\p{cc=230} \p{cc=0}])*` |

Clarification of Decomposition Mappings
In order to ensure, as intended, that decomposition mappings for each version of the standard derive from the Unicode Character Database for that version of the standard, the phrases in D18, D20, and D23 reading "according to the decomposition mappings found in the names list of Section 16.1, Character Names List" are changed to "according to the decomposition mappings found in the Unicode Character Database".
Other Changes to the Standard
Change in status of recommendation of SPACE as a base for display of nonspacing marks.
The UTC has decided that U+0020 SPACE is no longer recommended as a suitable base character for display of isolated nonspacing marks. Instead, U+00A0 NO-BREAK SPACE is the preferred base character for this function.
The explanatory text of The Unicode Standard Version 4.0, page 46, "Spacing Clones of European Diacritical Marks" is updated to read as follows:
Nonspacing combining marks used by the Unicode Standard may be exhibited in apparent isolation by applying them to U+00A0 NO-BREAK SPACE. This convention might be employed, for example, when talking about the combining mark itself as a mark, rather than using it in its normal way in text applied as an accent to a base letter or in other combinations.
Prior to Version 4.1 of the Unicode Standard, the standard also recommended the use of U+0020 SPACE for display of isolated combining marks. This is no longer recommended because of potential conflicts with the handling of sequences of U+0020 space characters in such contexts as XML.
The Unicode Standard separately encodes clones of many common European diacritical marks, primarily for compatibility with existing character set standards. These cloned accents and diacritics are spacing characters, and can be used to display the mark in isolation, without application to a no-break space. They are cross-referenced to the corresponding combining mark in the names list in Chapter 16, Code Charts. For example, U+02D8 BREVE is cross-referenced to U+0306 COMBINING BREVE. Most of these spacing clones also have compatibility decomposition mappings involving U+0020 SPACE, but implementers should be cautious in making use of those decomposition mappings because of the complications that can result from replacing a spacing character with a space + combining mark sequence.
See UAX #14: Line Breaking Properties for corresponding changes.
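The recommendation above can be sketched in Python. The helper `isolated_mark` is a hypothetical name; it builds the NO-BREAK SPACE + combining mark sequence, and the final lines show the compatibility decomposition of the spacing clone U+02D8 BREVE that the text advises caution about. Python's `unicodedata` module reflects a later Unicode version than 4.1.0, which is assumed not to matter for these characters.

```python
import unicodedata

NBSP = "\u00A0"  # U+00A0 NO-BREAK SPACE, the preferred base for isolated marks


def isolated_mark(mark: str) -> str:
    """Return a sequence that displays a nonspacing mark in apparent isolation."""
    if unicodedata.category(mark) != "Mn":
        raise ValueError("expected a nonspacing combining mark")
    return NBSP + mark


breve = isolated_mark("\u0306")  # U+0306 COMBINING BREVE on a no-break space
print([hex(ord(c)) for c in breve])  # ['0xa0', '0x306']

# The spacing clone U+02D8 BREVE has a compatibility decomposition to
# U+0020 SPACE + U+0306, which is why NFKD output needs care in contexts
# such as XML where runs of U+0020 may be collapsed.
print(unicodedata.normalize("NFKD", "\u02D8") == "\u0020\u0306")  # True
```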
Change in equivalence for NO-BREAK SPACE
The Unicode Standard, Version 4.0, p. 387 states:
U+00A0 NO-BREAK SPACE behaves like the following coded character sequence: U+FEFF ZERO WIDTH NO-BREAK SPACE + U+0020 SPACE + U+FEFF ZERO WIDTH NO-BREAK SPACE.
That sentence is stricken from the text of the Unicode Standard, Version 4.1.0, because it is incorrect. The behavior in bidirectional text layout is not identical for these sequences (see UAX #9: The Bidirectional Algorithm). For linebreaking, there are differences with respect to a following SPACE character (see UAX #14: Line Breaking Properties). In addition, the use of U+FEFF for word-joining has been deprecated in favor of U+2060 WORD JOINER.
Use of CGJ to Prevent Reordering
The following modifies the section headed Combining Grapheme Joiner in Section 15.2, Layout Controls on page 392 of The Unicode Standard, Version 4.0.
Replace this text on page 392:
U+034F COMBINING GRAPHEME JOINER is used to indicate that adjacent characters are to be treated as a unit for the purposes of language-sensitive collation and searching. In language-sensitive collation and searching, the combining grapheme joiner should be ignored unless it specifically occurs within a tailored collation element mapping. Thus it is given a completely ignorable collation element in the default collation table, like NULL (see Unicode Technical Standard #10, "Unicode Collation Algorithm," and also ISO/IEC 14651). However, it can be entered into the tailoring rules for any given language, using the tailoring capabilities of the collation standards.
by the following text:
U+034F COMBINING GRAPHEME JOINER is used to affect the collation of adjacent characters for purposes of language-sensitive collation and searching, and to distinguish sequences that would otherwise be canonically equivalent.
Formally, the combining grapheme joiner is not a format control character, but rather a combining mark. It has the General Category value gc=Mn and the canonical combining class value ccc=0. These property assignments result in the following behavior, which can be useful in certain circumstances.
The presence of a combining grapheme joiner in the midst of a combining character sequence does not interrupt the combining character sequence; a process which is accumulating and processing all the characters of a combining character sequence would include a combining grapheme joiner as part of that sequence. (This differs from the behavior for most format control characters, whose presence would interrupt a combining character sequence.) However, because the combining grapheme joiner has a combining class of 0, canonical reordering will not reorder any adjacent combining marks around a combining grapheme joiner. (See the definition of canonical reordering in Section 3.11, Canonical Reordering Behavior in Unicode 4.0.) In turn, this means that insertion of a combining grapheme joiner _between_ two combining marks will prevent normalization from switching the position of those two combining marks, regardless of their own combining classes.
This side-effect of the character properties of the combining grapheme joiner, together with the fact that the combining grapheme joiner has no visible glyph and no other format effect on neighboring characters, can be taken advantage of in those exceptional circumstances where two alternative orderings of a sequence of combining marks must be distinguished for some processing or rendering purpose and where normalization would otherwise eliminate the distinction between the two sequences.
For example, this is one way to avoid the less-than-optimal assignment of fixed-position combining classes to certain Hebrew accents and marks which do in fact interact typographically and for which accent order distinctions need to be maintained for analytic and text representational purposes. In particular:
<lamed, patah, hiriq, finalmem>
is canonically equivalent to:
<lamed, hiriq, patah, finalmem>
because the canonical combining classes of U+05B4 HEBREW POINT HIRIQ and U+05B7 HEBREW POINT PATAH are distinct. However, if an application wishes to make a distinction between a patah following hiriq and a patah preceding a hiriq, the following sequence would _not_ be canonically equivalent to the first two sequences cited:
<lamed, patah, CGJ, hiriq, finalmem>
The presence of the ccc=0 combining grapheme joiner blocks the reordering of hiriq before patah by canonical reordering. That allows the two sequences to be reliably distinguished, whether for display or for other processing.
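The Hebrew example above can be checked with Python's standard `unicodedata` module, as in the sketch below. It assumes the combining classes of these characters in Python's (later) Unicode data match those of Version 4.1.0; the variable names are illustrative.

```python
import unicodedata

LAMED, FINAL_MEM = "\u05DC", "\u05DD"
HIRIQ, PATAH = "\u05B4", "\u05B7"   # canonical combining classes 14 and 17
CGJ = "\u034F"                      # COMBINING GRAPHEME JOINER, ccc=0


def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)


patah_first = LAMED + PATAH + HIRIQ + FINAL_MEM
hiriq_first = LAMED + HIRIQ + PATAH + FINAL_MEM
with_cgj = LAMED + PATAH + CGJ + HIRIQ + FINAL_MEM

# Without CGJ, canonical reordering puts hiriq before patah, so the two
# orderings normalize to the same string.
print(nfd(patah_first) == nfd(hiriq_first))   # True

# With CGJ between the marks, normalization leaves the order intact.
print(nfd(with_cgj) == with_cgj)              # True
print(nfd(with_cgj) == nfd(patah_first))      # False
```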
The Unicode Collation Algorithm involves the normalization of Unicode text strings before collation weighting. The combining grapheme joiner is ordinarily ignored in collation key weighting in the UCA, but if, as in this case, it is used to block the reordering of combining marks in a string, its effect can be to invert the order of secondary key weights associated with those combining marks. Because of this, the two strings would have distinct keys, making it possible to treat them distinctly in searching and sorting without having to further tailor either the combining grapheme joiner or the combining marks themselves.
The CGJ can also be used to prevent the formation of contractions in the Unicode Collation Algorithm. Thus, for example, while "ch" is sorted as a single unit in a tailored Slovak collation, the sequence <c, CGJ, h> will sort as a 'c' followed by an 'h'. This can also be used in German, for example, to force 'ü' to be sorted as 'u' + umlaut (using <u, CGJ, umlaut>), even where a dictionary sort is being used. This also happens without having to further tailor either the combining grapheme joiner or the sequence.
Of course, sequences of characters which include the combining grapheme joiner may also be given tailored weights. Thus the sequence <c, CGJ, h> could be weighted completely differently from either the contraction "ch" or how "c" and "h" would have sorted without the contraction. However, this application of CGJ is not recommended. For more information on the use of CGJ with sorting, matching, and searching, see UTS #10: Unicode Collation Algorithm, Version 4.1.0.
Meteg
The following clarifying text regarding the control of positioning of the meteg in Hebrew, U+05BD HEBREW POINT METEG, should be added to Section 8.1, Hebrew, p. 194 of The Unicode Standard, Version 4.0.
The basic recommendations for the control of positioning of _meteg_ established in Version 4.1 are as follows:
U+034F COMBINING GRAPHEME JOINER can be used within a vowel-meteg sequence to preserve an ordering distinction under normalization.
So, for instance, to display meteg to the left (after, for a right-to-left script) of the vowel point sheva, U+05B0 HEBREW POINT SHEVA, the following sequence can be used:
<sheva, meteg>
Because these marks are canonically ordered, this sequence is preserved under normalization. Then, to display meteg to the right of the sheva, the following sequence can be used:
<meteg, CGJ, sheva>
A further complication arises for combinations of meteg with hataf vowels. Authors who want to ensure left-position versus medial-position display of meteg with hataf vowels across all font implementations may use joiner characters to distinguish these cases.
Thus, the following encoded representations can be used for different positioning of meteg with a hataf vowel, such as hataf patah:
- left-positioned meteg: <hataf patah, ZWNJ, meteg>
- medially-positioned meteg: <hataf patah, ZWJ, meteg>
- right-positioned meteg: <meteg, CGJ, hataf patah>

In no case is use of ZWNJ, ZWJ, or CGJ required for representation of meteg. These recommendations are simply provided for interoperability in those instances where authors wish to preserve specific positional information regarding the layout of a meteg in text.
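A short Python sketch can confirm that the meteg sequences recommended above survive normalization. The base letter lamed is an arbitrary choice for the example, the dictionary keys are hypothetical labels, and Python's `unicodedata` module reflects a later Unicode version than 4.1.0, which is assumed to assign these characters the same combining classes.

```python
import unicodedata

LAMED = "\u05DC"                       # arbitrary base letter for the example
SHEVA, HATAF_PATAH, METEG = "\u05B0", "\u05B2", "\u05BD"
ZWNJ, ZWJ, CGJ = "\u200C", "\u200D", "\u034F"

SEQUENCES = {
    "sheva, meteg on the left": SHEVA + METEG,
    "sheva, meteg on the right": METEG + CGJ + SHEVA,
    "hataf patah, left meteg": HATAF_PATAH + ZWNJ + METEG,
    "hataf patah, medial meteg": HATAF_PATAH + ZWJ + METEG,
    "hataf patah, right meteg": METEG + CGJ + HATAF_PATAH,
}


def is_stable(marks: str) -> bool:
    """True if base + marks is unchanged by both NFC and NFD."""
    s = LAMED + marks
    return (unicodedata.normalize("NFC", s) == s
            and unicodedata.normalize("NFD", s) == s)


for label, marks in SEQUENCES.items():
    print(label, is_stable(marks))

# Without CGJ, <meteg, sheva> is reordered to <sheva, meteg>, losing the distinction.
print(unicodedata.normalize("NFC", LAMED + METEG + SHEVA) == LAMED + SHEVA + METEG)
```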
Rendering of Thai Combining Marks
Thai tone marks are a type of combining mark displayed above an associated base character; they have a combining class of 107. Other Thai combining marks displayed above — in particular vowels — have a combining class of 0. This assignment of combining classes is insufficient to fully characterize the typographic interaction between those marks.
For the purpose of rendering, the Thai combining marks above (U+0E31, U+0E34..U+0E37, U+0E47..U+0E4E) should be displayed outward from the base character they modify, in the order in which they appear in the text. In particular, a sequence containing <U+0E48 THAI CHARACTER MAI EK, U+0E4D THAI CHARACTER NIKHAHIT> should be displayed with the nikhahit above the mai ek, and a sequence containing <U+0E4D THAI CHARACTER NIKHAHIT, U+0E48 THAI CHARACTER MAI EK> should be displayed with the mai ek above the nikhahit.
This does not preclude input processors from helping the user by pointing out or correcting typing mistakes, perhaps taking into account the language. For example, because the string <mai ek, nikhahit> is not useful for the Thai language and is likely a typing mistake, an input processor could reject it or correct it to <nikhahit, mai ek>.
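The correction described above could look like the following minimal Python sketch. The function name `correct_thai_mark_order` is hypothetical, and a real input processor would likely take the language into account and handle more cases than this single swap.

```python
MAI_EK = "\u0E48"     # THAI CHARACTER MAI EK (tone mark)
NIKHAHIT = "\u0E4D"   # THAI CHARACTER NIKHAHIT


def correct_thai_mark_order(text: str) -> str:
    """Rewrite <mai ek, nikhahit> as <nikhahit, mai ek>, as an input aid."""
    return text.replace(MAI_EK + NIKHAHIT, NIKHAHIT + MAI_EK)


typed = "\u0E01" + MAI_EK + NIKHAHIT           # KO KAI + mai ek + nikhahit
fixed = correct_thai_mark_order(typed)
print([hex(ord(c)) for c in fixed])            # ['0xe01', '0xe4d', '0xe48']
```

This is an input-side convenience only; a renderer is still expected to display the marks in the order in which they appear in the text.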
When the character U+0E33 THAI CHARACTER SARA AM follows one or more tone marks (U+0E48 .. U+0E4B), the nikhahit that is part of the sara am should be displayed below those tone marks. In particular, a sequence containing <U+0E48 THAI CHARACTER MAI EK, U+0E33 THAI CHARACTER SARA AM> should be displayed with the mai ek above the nikhahit.
Superseded Sections
- Section 5.15 Identifiers of The Unicode Standard, Version 4.0 is superseded by Section 2 of UAX #31: Identifier and Pattern Syntax.
- Annex 7 in UAX #15: Unicode Normalization Forms has been expanded and moved to Section 5 of UAX #31: Identifier and Pattern Syntax.
Unicode Character Database
The complete Unicode Character Database files for this version are available in the 4.1.0 directory. For more detailed information about the changes in the Unicode Character Database, see the file UCD.html in the Unicode Character Database.
Errata Corrected in This Version
Errata corrected in this version are listed by date in a separate table. For corrigenda and errata after the release of Unicode 4.1.0, see the list of current Updates and Errata.
Script Additions
New Tai Lue: U+1980 - U+19DF
The New Tai Lue script, also known as Xishuang Banna Dai, is used mainly in southern China. The script was developed in the twentieth century as an orthographic simplification of the historic Lanna script used to write the Tai Lue language.
New Tai Lue differs from Lanna in that it regularizes the consonant repertoire, simplifies the writing of consonant clusters and syllable-final consonants, and uses only spacing vowel signs, which appear before or after the consonants they modify. By contrast, Lanna uses both spacing vowel signs and nonspacing vowel signs which appear above or below the consonants they modify. All vowel signs in New Tai Lue are considered combining characters and follow their base consonants in the text stream. Where a syllable is composed of a vowel sign to the left and a vowel sign or tone mark on the right of the consonant, a sequence of characters is used, in the order consonant + vowel + tone mark.
A virama or killer character is not used to create conjunct consonants in New Tai Lue, because clusters of consonants do not regularly occur. New Tai Lue has a limited set of final consonants, which are modified with a hook showing that the inherent vowel is killed.
Similar to the Thai and Lao scripts, New Tai Lue consonant letters come in pairs that denote two tonal registers. The tone of a syllable is indicated by the combination of the tonal register of the consonant letter plus a tone mark written at the end of the syllable.
Buginese: U+1A00 - U+1A1F
The Buginese script is used on the island of Sulawesi, mainly in the southwest. A variety of traditional literature has been printed in it. The script is one of the easternmost of the Brahmi scripts and is perhaps related to Javanese. It bears some affinity to Tagalog, and it does not traditionally record final consonants. The Buginese language, an Austronesian language with a rich traditional literature, is one of the foremost languages of Indonesia. The script was previously also used to write the Makassar, Bimanese, and Madurese languages.
Glagolitic: U+2C00 - U+2C5F
Glagolitic, from the Slavic root "glagol" meaning "word", is an alphabet considered to have been devised by St. Cyril in the ninth century CE, for his translation of the Scriptures and liturgical books into Slavonic. Glagolitic was eventually supplanted by the alphabet now known as Cyrillic, which probably arose in late ninth-century Bulgaria. In parts of Croatia where a vernacular liturgy was used, Glagolitic continued in use until modern times; in these areas Glagolitic is still occasionally used as a decorative alphabet. Like Cyrillic, the Glagolitic script is written in linear sequence from left to right with no contextual modification of the letterforms.
Glagolitic is treated as a separate alphabet from Cyrillic because of its historical primacy, and because the letter shapes in the two alphabets are completely dissimilar: the one can in no sense be regarded as a variant of the other. Glagolitic itself exists in two styles, known as round and square. Round Glagolitic is the original style and more geographically widespread; square Glagolitic was used in Croatia from the thirteenth century. The letterforms used in the charts are round Glagolitic.
Coptic: U+2C80 - U+2CFF
Coptic is now considered a separate script from Greek in the Unicode Standard. This differs from prior documentation in the standard, for which Coptic was considered to be a stylistic variant of Greek, to be implemented by a font shift.
Starting with Unicode Version 4.1.0, a separate Coptic script block has been added at U+2C80..U+2CFF. The block contains the common Coptic alphabet, but also contains extensions needed for Old Coptic and dialectal usage of the Coptic script. It also contains Coptic-specific symbols and punctuation.
The long-encoded 14 Coptic letters derived from Demotic, encoded in the range U+03E2..U+03EF in the Greek and Coptic block, are also considered part of the Coptic script, and should be included in any complete implementation of the script.
Any implementations of Coptic predating Unicode Version 4.1.0 should be carefully checked, since use of Greek characters with Coptic-style fonts is no longer recommended for Coptic data.
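Because the Coptic script now spans two ranges, a membership test needs to cover both. The Python sketch below uses only the ranges given above; the name `is_coptic_letter_range` is hypothetical, and the test is range-based, so it does not check whether a given code point in the Coptic block is actually assigned.

```python
COPTIC_BLOCK = (0x2C80, 0x2CFF)        # new Coptic block in Unicode 4.1.0
COPTIC_IN_GREEK = (0x03E2, 0x03EF)     # 14 letters derived from Demotic


def is_coptic_letter_range(cp: int) -> bool:
    """Return True if cp falls in either range used for the Coptic script."""
    return (COPTIC_BLOCK[0] <= cp <= COPTIC_BLOCK[1]
            or COPTIC_IN_GREEK[0] <= cp <= COPTIC_IN_GREEK[1])


print(is_coptic_letter_range(0x2C80))  # True: new Coptic block
print(is_coptic_letter_range(0x03E2))  # True: Greek and Coptic block
print(is_coptic_letter_range(0x03B1))  # False: GREEK SMALL LETTER ALPHA
```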
Tifinagh: U+2D30 - U+2D7F
The Tifinagh script is used by around 20 million people in Morocco for writing Berber languages including Tarifite, Tamazighe, and Tachelhite. The teaching of Berber, written in Tifinagh, will be generalized and compulsory in Morocco. It is scheduled to be taught in all public schools by 2008. Historically the script has been used in several variant traditions along the Mediterranean coast from Kabylia to Morocco and the Canary Islands, the Constantinois and Aurès regions, as well as in Tunisia.
Syloti Nagri: U+A800 - U+A82F
Syloti Nagri is a lesser-known Brahmi-derived script used for writing the Sylheti language. Sylheti is an Indo-European language spoken by some 5 million speakers in the Barak Valley region of northeast Bangladesh and southeast Assam (India). Sylheti has commonly been regarded as a dialect of Bengali, with which it shares a high proportion of vocabulary. The Syloti Nagri script has 27 consonant letters with an inherent vowel of /o/, and 5 independent vowel letters. There are five dependent vowel signs which are attached to a consonant letter. Included in the encoding are several script-specific punctuation marks.
Old Persian: U+103A0 - U+103DF
Old Persian is found in a number of inscriptions in the Old Persian language dating from the Achaemenid Empire. It is an alphabetic writing system with some syllabic aspects. While the shapes of some Old Persian letters may look similar to signs in Sumero-Akkadian Cuneiform, it is clear that only one of them was borrowed from Sumero-Akkadian Cuneiform. Scholars today agree that the character inventory of Old Persian was newly-invented for the purpose of providing monumental inscriptions of the Achaemenid king, Darius I, by about 525 BCE.
Old Persian is written from left to right. The repertoire contains 36 signs which represent consonants, vowels or sequences of single consonants plus vowels, a set of five numbers, one word divider, and eight ideograms.
Kharoshthi: U+10A00 - U+10A5F
The Kharoshthi script was used historically to write Gāndhārī and Sanskrit as well as various mixed dialects. Kharoshthi is an Indic script of the abugida type. However, unlike other Indic scripts, it is written from right to left. The Kharoshthi script was initially deciphered around the middle of the nineteenth century by James Prinsep and others who worked from short Greek and Kharoshthi inscriptions on the coins of the Indo-Greek and Indo-Scythian kings. The decipherment has been refined over the last 150 years as more material has come to light. Representation of Kharoshthi in the Unicode code charts uses forms based on manuscripts of the first century CE.
Kharoshthi can be implemented using the rules of the Unicode bidirectional algorithm. In Kharoshthi both letters and digits are written from right to left. Rendering requirements for Kharoshthi are similar to those for Devanagari.
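The right-to-left behavior of Kharoshthi letters and digits follows from their Bidi_Class values, which can be inspected with Python's `unicodedata` module. This is a sketch only: the module reflects a later Unicode version than 4.1.0, and the two code points chosen (the first letter and the first digit of the block) are illustrative picks.

```python
import unicodedata

KHAROSHTHI_LETTER_A = chr(0x10A00)
KHAROSHTHI_DIGIT_ONE = chr(0x10A40)

for ch in (KHAROSHTHI_LETTER_A, KHAROSHTHI_DIGIT_ONE):
    # bidirectional() returns the Bidi_Class; "R" means strong right-to-left.
    print(f"U+{ord(ch):05X}", unicodedata.name(ch), unicodedata.bidirectional(ch))
```

Because the digits carry a strong right-to-left class rather than a European or Arabic number class, the bidirectional algorithm lays them out in the same direction as the letters.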
Significant Character Additions
In addition to encodings of entirely new scripts in Unicode Version 4.1.0, there have been other significant additions to the character repertoire. In some instances, these consist of major or minor extensions of existing scripts, and in other instances consist of specialized sets of punctuation, modifier letters or other symbols. These additions are sorted by category and explained in the sections below.
Arabic Supplement: U+0750 - U+077F
Unicode 4.1 adds 30 additional extended Arabic letters mainly for the languages used in Northern and Western Africa, such as Fulfulde, Hausa, Songhoy and Wolof. In the second half of the twentieth century, the use of the Arabic script was actively promoted for these languages. Characters used for other languages are annotated in the character names list. Additional vowel marks used with these languages are found in the main Arabic block.
Ethiopic Extensions: U+1380 - U+139F, U+2D80 - U+2DDF
The Ethiopic script is used for a large number of languages and dialects in Ethiopia, and in some instances has been extended significantly beyond the set of characters used for major languages such as Amharic and Tigré. Unicode Version 4.1.0 adds two blocks of extensions to the Ethiopic script: Ethiopic Supplement U+1380..U+139F and Ethiopic Extended U+2D80..U+2DDF. Those extensions cover such languages as Me'en, Blin, and Sebatbeit, which use many additional characters. Several other characters have been added to the main Ethiopic script block in the range U+1200..U+137F, including one additional Ethiopic punctuation mark, and a combining mark used to indicate gemination.
In the Ethiopic Supplement block there is also a new set of tonal marks. These are used in multiline scored layout, and as for other musical (an)notational systems of this type, require a higher-level protocol to enable proper rendering.
Additions for Biblical Hebrew
Five new Hebrew characters have been added in Unicode 4.1 for special usage in Biblical Hebrew text:
U+05A2 HEBREW ACCENT ATNAH HAFUKH
U+05BA HEBREW POINT HOLAM HASER FOR VAV
U+05C5 HEBREW MARK LOWER DOT
U+05C6 HEBREW PUNCTUATION NUN HAFUKHA
U+05C7 HEBREW POINT QAMATS QATAN
In some older versions of Biblical text, a distinction is made between the accents U+05A2 HEBREW ACCENT ATNAH HAFUKH and U+05AA HEBREW ACCENT YERAH BEN YOMO. Many editions from the last few centuries do not retain this distinction, using only yerah ben yomo, but some users in recent decades have begun to re-introduce this distinction. Similarly, a number of publishers of Biblical or other religious texts have introduced a typographic distinction for the vowel point qamats corresponding to two different readings. The original letterform used for one reading is referred to as qamats or qamats gadol; the new letterform for the other reading is qamats qatan. It is important to note that not all users of Biblical Hebrew use atnah hafukh and qamats qatan. If the distinction between accents atnah hafukh and yerah ben yomo is not made, then only U+05AA HEBREW ACCENT YERAH BEN YOMO is used. If the distinction between vowels qamats gadol and qamats qatan is not made, then only U+05B8 HEBREW POINT QAMATS is used. Implementations that support Hebrew accents and vowel points may not necessarily support the special-usage characters U+05A2 HEBREW ACCENT ATNAH HAFUKH and U+05C7 HEBREW POINT QAMATS QATAN.
The vowel point holam represents the vowel phoneme /o/. The consonant letter vav represents the consonant phoneme /w/, but in some words is used to represent a vowel, /o/. When the point holam is used on vav, the combination usually represents the vowel /o/, but in a very small number of cases represents the consonant-vowel combination /wo/. A typographic distinction is made between these two in many versions of Biblical text. In most cases, in which vav + holam together represents the vowel /o/, the point holam is centered above the vav and referred to as holam male. In the less frequent cases, in which the vav represents the consonant /w/, some versions show the point holam positioned above left. This is referred to as holam haser. The character U+05BA HEBREW POINT HOLAM HASER FOR VAV is intended for use as holam haser only in those cases where a distinction is needed. When the distinction is made, the character U+05B9 HEBREW POINT HOLAM is used to represent the point holam male on vav. U+05BA HEBREW POINT HOLAM HASER FOR VAV is intended for use only on vav; results of combining this character with other base characters are not defined. Not all users distinguish between the two forms of holam, and not all implementations can be assumed to support U+05BA HEBREW POINT HOLAM HASER FOR VAV.
In the Hebrew Bible, dots are written in various places above or below the base letters that are distinct from the vowel points and accents. These are referred to by scholars as puncta extraordinaria, and there are two kinds. The upper punctum is the more common of the two, and has been encoded since Unicode 2.0 as U+05C4 HEBREW MARK UPPER DOT. The lower punctum is used only in one verse of the Bible, Psalm 27:13, and has been added in Unicode 4.1 as U+05C5 HEBREW MARK LOWER DOT. The puncta generally differ in appearance from dots that occur above letters used to represent numbers; the number dots should be represented using U+0307 COMBINING DOT ABOVE and U+0308 COMBINING DIAERESIS.
The nun hafukha is a special symbol that appears to have been used for scribal annotations, though its exact functions are uncertain. It is used a total of nine times in the Hebrew Bible, although not all versions include it, and there are variations in the exact locations in which it is used. There is also variation in the glyph used: it often has the appearance of a rotated or reversed nun, and is very often called inverted nun; it may also appear similar to a half tet or have some other form.
Bengali Khanda Ta
In Bengali a dead consonant TA shows up as U+09CE BENGALI LETTER KHANDA TA in all contexts except where it is immediately followed by one of the consonants TA, THA, NA, BA, MA, YA, or RA. Khanda-Ta cannot bear a vowel matra or combine with a following consonant to form a conjunct aksara. It can form a conjunct aksara only with a preceding dead consonant RA, with the latter showing up as a REPH placed on the Khanda Ta.
Previous versions of the Unicode Standard recommended that Khanda-Ta be encoded as TA + VIRAMA + ZWJ. Instead, the Khanda-Ta should be used explicitly in new text, but users are cautioned that instances of the old encoding may exist.
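Because instances of the old encoding may exist, a process that consumes Bengali text may wish to convert the older sequence to the explicit character. The following Python sketch shows one way to do that, assuming a plain substring replacement is acceptable for the data at hand. The function and constant names are illustrative only.

```python
# Code points for the older sequence and the explicit character.
TA = "\u09A4"         # BENGALI LETTER TA
VIRAMA = "\u09CD"     # BENGALI SIGN VIRAMA
ZWJ = "\u200D"        # ZERO WIDTH JOINER
KHANDA_TA = "\u09CE"  # BENGALI LETTER KHANDA TA

LEGACY_KHANDA_TA = TA + VIRAMA + ZWJ


def modernize_khanda_ta(text: str) -> str:
    """Replace each TA + VIRAMA + ZWJ sequence with U+09CE (hypothetical helper)."""
    return text.replace(LEGACY_KHANDA_TA, KHANDA_TA)
```

A bare TA + VIRAMA sequence without the ZWJ is left untouched, since it does not match the older Khanda-Ta encoding.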
Phonetic Extensions: U+1D6C - U+1DBF
Unicode 4.1 adds a significant number of characters used for phonetic transcription and phonetically-based orthographies. The characters in the range U+1D6C - U+1D7F complete the previously existing Phonetic Extensions block. A new Phonetic Extensions Supplement block has also been added, with the range U+1D80 - U+1DBF.
The phonetic extensions for Unicode 4.1 are derived from a wide variety of sources, including many technical orthographies developed by SIL linguists, as well as older historic sources.
Of particular note, all attested phonetic characters showing struckthrough tildes, struckthrough bars, and retroflex or palatal hooks attached to the basic letter have been separately encoded in the blocks for phonetic extensions. Although separate combining marks exist in the Unicode Standard for overstruck diacritics and attached retroflex or palatal hooks, earlier encoded IPA letters such as U+0268 LATIN SMALL LETTER I WITH STROKE or U+026D LATIN SMALL LETTER L WITH RETROFLEX HOOK have never been given decomposition mappings in the standard. For consistency, all newly encoded characters are handled analogously to the existing, more common characters of this type, and are not given decomposition mappings.
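The absence of decomposition mappings can be observed with Python's standard `unicodedata` module, as in the sketch below. The module reflects whichever version of the Unicode Character Database ships with the Python build, which is assumed here to be Version 4.1 or later. The helper name is illustrative, and U+00E9 is included only as a contrasting character that does have a decomposition.

```python
import unicodedata


def has_decomposition(ch: str) -> bool:
    """Return True if the UCD lists a decomposition mapping for ch."""
    return unicodedata.decomposition(ch) != ""


# Letters with overstruck or attached modifications: no mapping is expected.
for ch in ("\u0268", "\u026D"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), has_decomposition(ch))

# Contrast: U+00E9 decomposes to U+0065 plus U+0301.
print("U+00E9", has_decomposition("\u00E9"))
```

One consequence is that normalization leaves such letters unchanged, so a sequence of a base letter plus a combining overlay mark is not canonically equivalent to the corresponding precomposed letter.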
The Phonetic Extensions Supplement block also contains 37 superscript modifier letters. These complement the much more commonly used superscript modifier letters found in the Spacing Modifier Letters block.
U+1D77 LATIN SMALL LETTER TURNED G and U+1D78 MODIFIER LETTER CYRILLIC EN are used in Caucasian linguistics. U+1D79 LATIN SMALL LETTER INSULAR G is used in older Irish phonetic notation. It is to be distinguished from a mere Gaelic-style glyph for U+0067 LATIN SMALL LETTER G.
U+1D7A LATIN SMALL LETTER TH WITH STRIKETHROUGH is a digraphic notation commonly found in some English-language dictionaries, representing the voiceless (inter)dental fricative, as in **th**in. While this character is clearly a digraph, the obligatory strikethrough across two letters distinguishes it from a "th" digraph per se, and there is no mechanism involving combining marks which can easily be used to represent it. A common alternative glyphic form for U+1D7A uses a horizontal bar to strike through the two letters, instead of a diagonal stroke.
Modifier Tone Letters: U+A700 - U+A71F
The Modifier Tone Letters block contains modifier letters used in various schemes for marking tones. These supplement the more commonly used tone marks and tone letters found in the Spacing Modifier Letters block (U+02B0 - U+02FF).
The characters in the range U+A700 - U+A707 are corner tone marks used in the transcription of Chinese. They were invented by Bridgman and Wells Williams in the 1830s. They have little current use, but are seen in a number of old Chinese sources.
The tone letters in the range U+A708 - U+A716 complement the basic set of IPA tone letters (U+02E5 - U+02E9), and are also used in the representation of Chinese tones, for the most part. The dotted tone letters are used to represent short ("stopped") tones. The left-stem tone letters are mirror images of the IPA tone letters, and like those tone letters, can be ligated in sequences of two or three tone letters to represent contour tones. Left-stem versus right-stem tone letters are sometimes used contrastively to distinguish between tonemic and tonetic transcription, or to show the effects of tonal sandhi.
Combining Diacritical Marks Supplement: U+1DC0 - U+1DFF
This block is the supplement to the Combining Diacritical Marks block in the range U+0300 - U+036F. It contains lesser-used combining diacritical marks.
U+1DC0 COMBINING DOTTED GRAVE ACCENT and U+1DC1 COMBINING DOTTED ACUTE ACCENT are marks occasionally seen in some Greek texts. They are variant representations of the accent combinations, dialytika varia and dialytika oxia, respectively. They are, however, encoded separately because they cannot be reliably formed by regular stacking rules involving U+0308 COMBINING DIAERESIS and U+0300 COMBINING GRAVE ACCENT or U+0301 COMBINING ACUTE ACCENT.
U+1DC3 COMBINING SUSPENSION MARK is a combining mark specifically used in Glagolitic. It is not to be confused with a combining breve.
Editorial Marks for Biblical Text Annotation
The Greek text of the New Testament exists in a large number of manuscripts with many textual variants. The most widely used critical edition of the New Testament, the Nestle-Aland edition published by the United Bible Societies (UBS), introduced a set of editorial characters which are regularly used in a number of journals and other publications. As a result, these editorial marks have become the recognized method of annotating the New Testament, and have been encoded in Unicode 4.1 in the range U+2E00..U+2E0D.
CJK Additions
Characters have been added to complete roundtrip mapping support for HKSCS and GB 18030. Some of these characters can be found in a new CJK Basic Strokes block (U+31C0..U+31EF), in a new Vertical Forms block (U+FE10..U+FE1F), and as a range extension to CJK Unified Ideographs (U+9FA6..U+9FBB). Other new characters are found in symbol blocks (U+23DA..U+23DB). Parsers and other code may need to adjust for the change of the end of the CJK Unified Ideographs range from U+9FA5 to U+9FBB.
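The kind of adjustment mentioned above might look like the following Python sketch, which parameterizes the end of the range by version. The start of the block, U+4E00, is not stated in this document and is supplied here as a known property of the CJK Unified Ideographs block. The table and function names are hypothetical, and the check covers only the main block, not the extension blocks.

```python
# First code point of the CJK Unified Ideographs block (not stated above).
CJK_UNIFIED_START = 0x4E00

# Last assigned ideograph in the block, per version of the standard.
CJK_UNIFIED_END_BY_VERSION = {
    "4.0": 0x9FA5,
    "4.1": 0x9FBB,
}


def in_cjk_unified_ideographs(ch: str, version: str = "4.1") -> bool:
    """Return True if ch is an assigned ideograph of the main block in that version."""
    return CJK_UNIFIED_START <= ord(ch) <= CJK_UNIFIED_END_BY_VERSION[version]
```

Code that hard-codes U+9FA5 as the upper bound would treat U+9FA6..U+9FBB as falling outside the range; keeping the bound in a table makes the change for a new version a one-line edit.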
Characters in the CJK Basic Strokes block are single-stroke components of CJK ideographs. The first characters assigned to this block are 16 HKSCS-2001 characters.
A new collection of 106 CJK compatibility ideographs has been added to support roundtrip mapping to the DPRK standard.
Ancient Greek Additions
Ancient Greek Numbers: U+10140 - U+1018F
Many symbols have been added to Unicode 4.1 to enable the complete coverage of Ancient Greek acrophonic numeric representation. This includes all known dialectal variants. In addition, a set of Ancient Greek papyrological numbers has been added.
Ancient Greek Editorial Marks
Ancient Greek scribes generally wrote in continuous uppercase letters without separating letters into words. On occasion the scribe added punctuation to indicate the end of a sentence or a change of speaker, or to separate words. Editorial and punctuation characters appear abundantly in surviving papyri and have been rendered in modern typography when possible, often exhibiting considerable glyphic variation. A number of these editorial marks are encoded in the range U+2E0E..U+2E16.
Ancient Greek Musical Notation: U+1D200 - U+1D24F
Ancient Greek had complete sets of vocal and instrumental notation symbols. These were based on Greek letters, comparable to the modern usage of the Latin letters A through G to refer to notes of the Western musical scale. However, rather than using a sharp and flat notation to indicate semitones, or casing and other diacritics to indicate distinct octaves, the Ancient Greek system extended the basic Greek alphabet by rotating and flipping letterforms in various ways, and by adding a few more symbols not directly based on a letter.
Ancient Greek musical notation had a separate system for vocal notation and for instrumental notation; each has a traditional catalog numbering system used by modern scholars of Ancient Greek. In the Unicode Standard, the two systems are unified against each other and against the basic Greek alphabet, based on shape. Thus, if a note is to be represented for the vocal notation system by a Greek letterform, not rotated or flipped, then the corresponding letter from the Greek alphabet in the Greek and Coptic block should be used instead, using an appropriate font to match the archaic letterforms used in the notational system.
If a symbol is used in both the vocal notation system and the instrumental notation system, its Unicode character name is based on the vocal notation system catalog number. Thus U+1D20D GREEK VOCAL NOTATION SYMBOL-14 has a glyph based on an inverted capital lambda. In the vocal notation system, it represents the first sharp of B, and in the instrumental notation system, it represents the first sharp of d'. Since it is used in both systems, its name is based on its sequence in the vocal notation system, rather than its sequence in the instrumental notation system. The character names list in the Unicode Character Database is fully annotated with the functions of the symbols for each system.
The combining marks encoded in the range U+1D242 - U+1D244 are placed over the vocal or instrumental notation symbols and are used to indicate metrical qualities.
Georgian Nuskhuri: U+2D00 - U+2D2F
The Georgian script form Nuskhuri was added in Unicode 4.1. The Georgian script has two related forms. The ecclesiastical form, Khutsuri, has an uppercase, inscriptional form, called Asomtavruli, and a lowercase, cursive, manuscript form called Nuskhuri. The modern, ordinary form, Mkhedruli, is caseless. Prior to Unicode 4.1, secular (Mkhedruli) and ecclesiastical (Khutsuri) styles of Georgian were considered font styles. Both Mkhedruli text and Nuskhuri text were represented using the character range U+10D0..U+10F8. Beginning with Unicode 4.1, Nuskhuri is separately represented using the new Georgian Supplement block, U+2D00..U+2D2F, and the characters in the range U+10D0..U+10F8 should be restricted to use for Mkhedruli text. Case mappings are now provided between the two Khutsuri forms: Asomtavruli and Nuskhuri.
In addition, three Mkhedruli characters which are used in the transcription of some East Caucasian languages were added.
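The case mappings between Asomtavruli and Nuskhuri can be exercised through ordinary case conversion, as in the Python sketch below. It assumes a Python build whose character database is Version 4.1 or later. The Asomtavruli range U+10A0..U+10C5 is not stated in this document and is supplied here from the Georgian block, and the helper name is illustrative only.

```python
# Asomtavruli capital letters occupy U+10A0..U+10C5 in the Georgian block
# (range supplied here; it is not stated in the text above).
ASOMTAVRULI_FIRST = "\u10A0"
ASOMTAVRULI_LAST = "\u10C5"


def to_nuskhuri(text: str) -> str:
    """Lowercase only Asomtavruli letters, leaving all other characters as they are."""
    return "".join(
        ch.lower() if ASOMTAVRULI_FIRST <= ch <= ASOMTAVRULI_LAST else ch
        for ch in text
    )
```

For example, `to_nuskhuri("\u10A0")` returns U+2D00, and `"\u2D00".upper()` maps back to U+10A0. The helper restricts itself to the Asomtavruli range so that other characters in the input are not affected by case conversion.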