UTR #11: East Asian Width
Unicode Technical Report #11
Revision | 5.0 |
---|---|
Authors | Asmus Freytag (asmus@unicode.org) |
Date | 1999-11-09 |
This Version | http://www.unicode.org/unicode/reports/tr11-5 |
Previous Version | http://www.unicode.org/unicode/reports/tr11-4 |
Latest Version | http://www.unicode.org/unicode/reports/tr11 |
Summary
This report presents the specification of an informative property for Unicode characters that is useful when interoperating with East Asian legacy character sets.
Status
This document contains informative material which has been considered and approved by the Unicode Technical Committee for publication as a Technical Report and as part of the Unicode Standard, Version 3.0. Any reference to version 3.0 of the Unicode Standard automatically includes this technical report. Please mail corrigenda and other comments to the author.
The content of all technical reports must be understood in the context of the appropriate version of the Unicode Standard. References in this technical report to sections of the Unicode Standard refer to the Unicode Standard, Version 3.0. See http://www.unicode.org/unicode/standard/versions for more information.
Contents
- 1 Overview
- 2 Scope
- 3 Description
- 4 Definitions
- 5 Conformance
- 6 Recommendation
- 7 Classifications
- 8 Acknowledgments
- 9 Changes from previous revisions
1 Overview
In mixed-width East Asian legacy encodings there is a concept of an inherent width of a character. For a fixed pitch font, this width translates to a display width of either one half or a whole unit width. A common name for this unit width is "Em". It is customarily the _height_ of the letter 'M', but since in East Asian fonts the standard character cell is square, it is the same as the unit width.
Note: the character width for a fixed pitch Latin font like Courier is generally 3/5 of an em.
Layout and line breaking (to cite only two examples) in an East Asian context show systematic variations depending on the value of the East Asian Width property (even for non-fixed pitch fonts). Further, the same information is useful in creating correct transcoding tables for East Asian character sets.
2 Scope
The East Asian Width property provides a useful concept for implementations that
- have to interwork with East Asian legacy character encodings
- support both East Asian and Western typography and line layout
- need to associate fonts with unmarked text runs containing East Asian characters
This Unicode Technical Report does not provide rules or specifications of how this property might be used in font design or line layout, since, while a useful property for this purpose, it is only one of several character properties that would need to be considered.
3 Description
By convention, 1/2 Em wide characters of East Asian legacy encodings are called "half-width" (or hankaku characters in Japanese), the others are called correspondingly "full-width" (or zenkaku) characters. Legacy encodings often use a single byte for the half-width characters and two bytes for the full-width characters. In the Unicode Standard, no such distinction is made, but understanding the distinction is often necessary when interchanging data with legacy systems, especially when fixed size buffers are involved.
Some character blocks in the compatibility zone contain characters that are explicitly marked "half-width" and "full-width" in their character name, but for all other characters the width property must be implicitly derived. Some characters behave differently in an East Asian context than in a non-East Asian context. Their default width property is considered ambiguous and needs to be resolved into an actual width property based on context.
This technical report assigns to each Unicode character one of the six values Ambiguous, Full-width, Half-width, Narrow, Wide, or Not East Asian (Neutral), all defined below, as its default width property. For any given operation, these six default properties resolve into only two property values, narrow and wide, depending on context.
4 Definitions
All terms not defined here shall be as defined in the Unicode Standard.
East Asian Width - in the context of interoperating with East Asian legacy character encodings and implementing East Asian typography, the East Asian Width is a categorization of characters. It can take on two abstract values, narrow and wide. In legacy implementations, there is often a corresponding difference in encoding length (one or two bytes) as well as a difference in displayed width. However, the _actual_ display width of a glyph is given by the font and may be further adjusted by layout. An important class of fixed width legacy fonts contains glyphs of just two widths, with the wider glyphs twice as wide as the narrower glyphs.
Note: For convenience, the classification further distinguishes among explicitly or implicitly wide and narrow characters. Where characters are defined to have an ambiguous East Asian Width, their East Asian Width can be resolved to narrow or wide depending on additional information not contained in the character code.
East Asian Full-width (F) - characters that are explicitly defined as FULL WIDTH in the Unicode Standard and therefore are compatibility equivalents of implicitly narrow but unmarked characters elsewhere in the Unicode Standard.
East Asian Half-width (H) - characters that are explicitly defined as HALF WIDTH in the Unicode Standard and therefore are compatibility characters of implicitly wide, but unmarked characters elsewhere in the Unicode Standard.
East Asian Wide (W) - characters that are implicitly wide (such as the Unified Han Ideographs or Squared Katakana Symbols) because they occur only in the context of East Asian typography where they are wide characters.
East Asian Ambiguous (A) - characters that occur in East Asian legacy character sets as wide characters, but are displayed as narrow (i.e. normal-width) characters in their own local or non-East Asian usage. (Examples are the Greek and Cyrillic alphabets found in East Asian character sets, but also some of the mathematical symbols.) Ambiguous characters require context to resolve their width. Private Use characters are considered ambiguous, since additional information is required to know whether they should be treated as wide or narrow.
Note: Because East Asian legacy character sets do not always include complete case pairs of Latin characters, two members of a pair may have different East Asian Width properties:
Ambiguous: 01D4 LATIN SMALL LETTER U WITH CARON
Not East Asian (Neutral): 01D3 LATIN CAPITAL LETTER U WITH CARON
East Asian Narrow (Na) - characters that are implicitly narrow, since they have explicit full-width clones (all of ASCII is an example).
Note: These characters are marked with (Na) and not (H) in the data table, because it is useful to distinguish characters explicitly defined as half-width from other characters that have a full-width equivalent. In particular, half-width punctuation behaves in some important ways like ideographic punctuation.
Not East Asian (Neutral) - all other characters. Neutral characters do not occur in legacy East Asian character sets. By extension, they also do not occur in East Asian typography. For example, there is no traditional Japanese way of typesetting Devanagari.
Note: Strictly speaking, it makes no sense to talk of narrow and wide for neutral characters, but since for all practical purposes they behave like Na, they are treated as narrow characters (the same as Na) under the recommendations below.
Figure 1: Venn diagram showing the set relations for five of the six categories.
In a broad sense, wide characters include W, F, and A (when in EA context), while narrow characters include N, Na, H, and A (when not in EA context).
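The resolution from six default categories to two resolved widths can be sketched in Python as follows. This is a minimal illustration, not part of the report: the function name `resolve_width` and the boolean parameter `east_asian_context` are assumptions, and how an implementation decides whether it is in an East Asian context is left open.

```python
# Category abbreviations follow the definitions above (N = neutral).
WIDE_ALWAYS = {"W", "F"}
NARROW_ALWAYS = {"N", "Na", "H"}


def resolve_width(category: str, east_asian_context: bool) -> str:
    """Resolve a default East Asian Width category to 'narrow' or 'wide'.

    `east_asian_context` is a hypothetical flag; the report leaves the
    source of this context (markup, font, code page, language) to the
    implementation.
    """
    if category in WIDE_ALWAYS:
        return "wide"
    if category in NARROW_ALWAYS:
        return "narrow"
    if category == "A":
        # Ambiguous characters are the only ones that depend on context.
        return "wide" if east_asian_context else "narrow"
    raise ValueError(f"unknown East Asian Width category: {category!r}")
```

Only the A category consults the context flag; every other category resolves the same way regardless of context.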
4.1 Relation to the terms "full-width" and "half-width"
When converting a DBCS mixed-width encoding to and from Unicode, the full-width characters in such a mixed-width encoding are mapped to the full-width compatibility characters in the FFxx block, whereas the corresponding half-width characters are mapped to ordinary Unicode characters (e.g. ASCII in U+0021..U+007E, plus a few other scattered characters).
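The pairing between U+0021..U+007E and the full-width compatibility forms U+FF01..U+FF5E can be illustrated with a small Python sketch. The function names `to_fullwidth` and `to_ordinary` are illustrative assumptions. The sketch covers only the contiguous ASCII range and leaves the "few other scattered characters" (and the space character) untouched.

```python
# U+FF01..U+FF5E mirror U+0021..U+007E in the same order,
# so the two ranges differ by a constant offset.
FULLWIDTH_OFFSET = 0xFF01 - 0x0021


def to_fullwidth(text: str) -> str:
    """Replace printable ASCII (U+0021..U+007E) with its full-width form."""
    return "".join(
        chr(ord(ch) + FULLWIDTH_OFFSET) if 0x21 <= ord(ch) <= 0x7E else ch
        for ch in text
    )


def to_ordinary(text: str) -> str:
    """Replace full-width forms (U+FF01..U+FF5E) with ordinary ASCII."""
    return "".join(
        chr(ord(ch) - FULLWIDTH_OFFSET) if 0xFF01 <= ord(ch) <= 0xFF5E else ch
        for ch in text
    )
```

A real transcoder would use the mapping table of the specific legacy code page rather than a fixed offset; the offset only demonstrates the pairwise relationship between the two ranges.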
In the context of interoperability with DBCS character encodings, that restricted set of Unicode characters in the General Scripts area can be construed as half-width, rather than full-width. (This applies only to the restricted set of characters which can be paired with the full-width compatibility characters.)
In the context of interoperability with DBCS character encodings, all other Unicode characters which are not explicitly marked as half-width can be construed as full-width.
In any other context, Unicode characters not explicitly marked as being either full-width or half-width compatibility forms are neither half-width nor full-width.
Seen in this light, the "half-width" and "full-width" properties are not unitary character properties in the same sense as "space" or "combining" or "alphabetic". They are, instead, relational properties of a pair of characters, one of which is explicitly encoded as a half-width or full-width form for compatibility in mapping to DBCS mixed-width character encodings.
What is "full-width" by default today could in theory become "half-width" tomorrow by the introduction of another character on the SBCS part of a mixed-width code page somewhere, requiring the introduction of another full-width compatibility character to complete the mapping. However, since the single byte part of mixed-width character sets is limited, there are not going to be many candidates, and neither UTC nor WG2 has any intention to add additional compatibility characters for this purpose.
Ambiguous width characters are all those characters that can occur as full-width characters in any of a number of East Asian legacy character encodings. They have a 'resolved' width of either narrow or wide depending on the context of their use. If they are not used in the context of the specific legacy encoding they belong to, their width resolves to narrow. Otherwise it resolves to full-width or half-width. The term context as used here includes extra information such as explicit markup, knowledge of the source codepage, font information, or language identification. For example:
- Greek characters resolve to narrow when used with a standard Greek font, since there is no East Asian legacy context.
- Private use character codes and the replacement character have ambiguous width, since they may stand in for characters of any width.
- Ambiguous quotation marks are generally resolved to wide when they enclose and are adjacent to a wide character, and to narrow otherwise.
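The adjacency rule in the last example can be sketched in Python. This is a deliberately simplified illustration: the function name `resolve_by_neighbors` is an assumption, the input is a list of already-known category abbreviations (one per character), and the sketch checks only immediate neighbors rather than verifying that a pair of quotation marks actually encloses the wide text.

```python
def resolve_by_neighbors(categories: list[str]) -> list[str]:
    """Resolve each category to 'narrow' or 'wide'.

    Simplifying assumption: an ambiguous (A) character resolves to wide
    when the character immediately before or after it is W or F,
    and to narrow otherwise.
    """
    resolved = []
    for i, cat in enumerate(categories):
        if cat in ("W", "F"):
            resolved.append("wide")
        elif cat == "A":
            # Collect the categories of the left and right neighbors, if any.
            neighbors = categories[max(i - 1, 0):i] + categories[i + 1:i + 2]
            is_wide = any(n in ("W", "F") for n in neighbors)
            resolved.append("wide" if is_wide else "narrow")
        else:
            resolved.append("narrow")
    return resolved
```

For a quoted ideograph, the category sequence `["A", "W", "A"]` resolves both quotation marks to wide, while `["A", "Na", "A"]` resolves them to narrow. Other sources of context (markup, font, code page) would take precedence in a fuller implementation.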
5 Conformance
East Asian Width is an informative character property and implies no conformance requirements.
6 Recommendation (informative)
When mapping Unicode to legacy character encodings
- Wide Unicode characters always map to full-width characters
- Narrow (and neutral) Unicode characters always map to half-width characters
- Half-width Unicode characters always map to half-width characters
- Ambiguous Unicode characters always map to full-width characters
- Wide Unicode characters never map to non-East Asian legacy character encodings
- Ambiguous Unicode characters always map to regular (narrow) characters in **non-**East Asian legacy character encodings
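For an East Asian target encoding, the mapping rules above can be combined with the common one-byte/two-byte convention described in section 3 to estimate buffer sizes. The following Python sketch is illustrative only: the names `LEGACY_FORM`, `estimated_legacy_bytes`, and `fits_in_buffer` are assumptions, the treatment of F as full-width follows from the broad sense of "wide" rather than from an explicit rule in the list, and real encodings can deviate from the one-byte/two-byte pattern.

```python
# Legacy form for each category when the target is an East Asian encoding.
LEGACY_FORM = {
    "W": "full-width",
    "F": "full-width",   # assumption: F is treated as wide in the broad sense
    "A": "full-width",
    "H": "half-width",
    "Na": "half-width",
    "N": "half-width",   # neutral characters are treated like narrow
}


def estimated_legacy_bytes(categories: list[str]) -> int:
    """Estimate the encoded length, assuming 2 bytes per full-width
    character and 1 byte per half-width character."""
    return sum(2 if LEGACY_FORM[c] == "full-width" else 1 for c in categories)


def fits_in_buffer(categories: list[str], buffer_size: int) -> bool:
    """Check the estimate against a fixed-size legacy buffer."""
    return estimated_legacy_bytes(categories) <= buffer_size
```

The estimate is useful for a quick check against fixed-size fields; the authoritative length is whatever the actual converter for the target code page produces.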
When processing or displaying data
- Wide characters behave like ideographs in important ways, such as layout. In fixed pitch fonts, they take up one Em of space.
- Half-width characters behave like ideographs in some ways. In fixed pitch fonts, they take up 1/2 Em of space.
- Narrow characters behave like Western characters, for example in line breaking. In fixed pitch East Asian fonts, they take up 1/2 Em of space.
- Ambiguous characters behave like wide or narrow characters depending on context (language tag, associated font, source of data, or explicit markup; all can provide the context)
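The display recommendations above can be turned into a rough column count for fixed pitch output, measured in 1/2 Em cells. The Python sketch below uses the standard library's `unicodedata.east_asian_width` and `unicodedata.category`. Note the assumptions: the function name `display_cells` and the `east_asian_context` flag are illustrative, the property data comes from whatever Unicode version the Python build ships with (so individual assignments may differ from the Version 3.0 data this report describes), and non-spacing marks are counted as zero cells.

```python
import unicodedata


def display_cells(text: str, east_asian_context: bool = False) -> int:
    """Count 1/2 Em cells for `text` in a fixed pitch East Asian font.

    Wide and full-width characters take two cells, everything else one.
    Ambiguous characters take two cells only in an East Asian context.
    """
    cells = 0
    for ch in text:
        # Assumption: non-spacing and enclosing marks add no advance width.
        if unicodedata.category(ch) in ("Mn", "Me"):
            continue
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F") or (eaw == "A" and east_asian_context):
            cells += 2
        else:
            cells += 1
    return cells
```

The sketch ignores control characters, tabs, and font-specific adjustments, all of which a terminal or layout engine would need to handle separately.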
7 Classifications (informative)
The classifications presented here are based on the most widely used mixed-width legacy character sets in use in East Asia as of this writing. In particular, the assignment of the neutral or ambiguous categories depends on the contents of these character sets. For example, an implementation that knows a priori that it only needs to interchange data with the Japanese Shift-JIS character set, but not other East Asian character sets, could reduce the number of characters in the ambiguous classification to those actually encoded in Shift-JIS. Or such a reduction could be done implicitly at runtime in the context of interoperating with Shift-JIS fonts or data sources. Conversely, if additional character sets are created and widely adopted for legacy purposes, more characters would need to be classified as ambiguous.
7.1 Unassigned and Private Use characters
All unassigned characters are by default classified as non-East Asian neutral, except for the range U-00020000 to U-0002FFFD, since all code positions from U-00020000 to U-0002FFFD are intended for CJK ideographs (W). All Private use characters are by default classified as ambiguous, since their definition depends on context. This includes surrogate pairs.
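These defaults can be expressed as a fallback for code points that have no explicit entry in the data. The Python sketch below is an illustration under stated assumptions: the function name `default_width` is hypothetical, and the private use ranges (U+E000..U+F8FF in the BMP, plus planes 15 and 16, which are reached through surrogate pairs in UTF-16) are supplied from the Unicode Standard rather than from this report.

```python
def default_width(code_point: int) -> str:
    """Default East Asian Width for a code point with no explicit entry."""
    # Code positions reserved for CJK ideographs default to Wide.
    if 0x20000 <= code_point <= 0x2FFFD:
        return "W"
    # Private use characters default to Ambiguous.
    if (
        0xE000 <= code_point <= 0xF8FF
        or 0xF0000 <= code_point <= 0xFFFFD
        or 0x100000 <= code_point <= 0x10FFFD
    ):
        return "A"
    # Everything else that is unassigned defaults to Neutral.
    return "N"
```

The function is meant to be consulted only after a lookup in the explicit classification has failed; it does not replace that lookup.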
7.2 Combining Marks
Combining marks have been classified and are given a property assignment based on their typical applicability. For example, combining marks typically applied to characters of class N, Na or W are classified as A. Combining marks for purely non-East Asian scripts are marked as N, and non-spacing marks used only with wide characters are given a W. Even more so than for other characters, the East Asian Width property for combining marks is not the same as their display width.
In particular, non-spacing marks do not possess actual advance width. Therefore, even when displaying combining marks, the East Asian Width property cannot be related to the advance width of these characters. However, it can be useful in determining the encoding length in a legacy encoding, or the choice of font for the range of characters including that non-spacing mark. The width of the glyph image of a non-spacing mark should always be chosen as the appropriate one for the width of the base character.
7.3 Data File
The East Asian Width classification of characters of the Unicode Standard, Version 3.0 is available as the file EastAsianWidth.txt in the Unicode Character Database. This is a tab-delimited three column plain text file, with code position, East Asian Width designator and character name (for reference purposes only). The abbreviated way of listing the Ideographic, Hangul, Surrogate and Private use ranges is the same as in UnicodeData.txt.
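A reader for a file in the layout described above might look like the following Python sketch. It is written against the description in this section only (three tab-separated columns, with ranges abbreviated as `<..., First>` and `<..., Last>` name pairs as in UnicodeData.txt) and has not been checked against the actual file. The function name `parse_east_asian_width`, the skipping of blank lines and `#` comment lines, and the `SAMPLE` lines are all assumptions made for illustration.

```python
def parse_east_asian_width(lines):
    """Return a list of (first, last, designator) tuples."""
    ranges = []
    pending_first = None
    for raw in lines:
        line = raw.rstrip("\n")
        # Assumption: blank lines and '#' comment lines may be present.
        if not line.strip() or line.startswith("#"):
            continue
        code, designator, name = line.split("\t", 2)
        code_point = int(code, 16)
        if name.endswith(", First>"):
            # Start of an abbreviated range; wait for the matching Last.
            pending_first = code_point
        elif name.endswith(", Last>"):
            if pending_first is None:
                raise ValueError(f"range end without start: {line!r}")
            ranges.append((pending_first, code_point, designator))
            pending_first = None
        else:
            ranges.append((code_point, code_point, designator))
    return ranges


# Hand-written sample lines in the described layout, not copied from the file.
SAMPLE = [
    "0041\tNa\tLATIN CAPITAL LETTER A",
    "4E00\tW\t<CJK Ideograph, First>",
    "9FA5\tW\t<CJK Ideograph, Last>",
]
```

Calling `parse_east_asian_width(SAMPLE)` yields one single-character entry for U+0041 and one range entry covering U+4E00 through U+9FA5. The resulting list is sorted by code point if the input is, so it can be searched with a binary search.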
8 Acknowledgments
Michel Suignard provided extensive input into the analysis and source material for the detail assignments of these properties. Mark Davis and Ken Whistler performed consistency checks on the data files.
9 Changes from previous revisions
Fifth Technical Report Version: Changed the spelling of the title and made minor clarifying changes to the definitions and the description of ambiguous characters and combining marks. As a result of the Unicode 3.0 beta process, changed some CJK punctuation characters from W to A since they are also used in Western mathematical notation. Removed some historic information and made other edits to prepare the TR for publication as part of Unicode 3.0.
Copyright © 1998-1999 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.