Origin of the U+nnnn notation

From: Hohberger, Clive (CHohberger@zebra.com)
Date: Tue Nov 08 2005 - 09:00:45 CST

Adding to Philippe's excellent description, I think of the set {U+nnnn}
as a set of ordinal numbers, since they represent positions in a table.
The construct U-nnnn is therefore meaningless as an ordinal number.
Clive

-----Original Message-----
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
Behalf Of Philippe Verdy
Sent: Tuesday, November 08, 2005 8:05 AM
To: Dominikus Scherkl; 'Jukka K. Korpela'; unicode@unicode.org
Subject: Re: Origin of the U+nnnn notation

From: "Dominikus Scherkl" <lyratelle@gmx.de>
>> I have been unable to hunt down the historical origin of the
>> notation U+nnnn (where nnnn are hexadecimal digits) that we
>> use to refer to characters (and code points).
>> Presumably "U" stands for "UCS" or for "Unicode", but where
>> does the plus sign come from?
> Maybe it was thought of as an offset from the origin (the null
> character), like in ETA+5 minutes (the expected time of arrival passed
> five minutes ago - a euphemism for being 5 minutes late).

U-nnnn already exists (or I should say, it has existed). It referred to
16-bit code units, not really to characters, and was a fixed-width
notation (always 4 hexadecimal digits). The "U" meant "Unicode" (1.0 and
before).

U+[n...n]nnnn was created to avoid confusion with the older 16-bit-only
Unicode 1.0 standard (which was not fully compatible with ISO/IEC 10646
code points). It is a variable-width notation that refers to ISO/IEC
10646 code points. The "U" means "UCS" or "Universal Character Set". At
that time, the UCS code point range was up to 31 bits wide.
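
To make the contrast concrete, here is a minimal sketch in Python (the
helper names u_minus and u_plus are mine, not part of either standard):

    def u_minus(code_unit):
        # Old Unicode 1.0 style: fixed width, exactly 4 hex digits,
        # naming a single 16-bit code unit.
        assert 0 <= code_unit <= 0xFFFF
        return "U-%04X" % code_unit

    def u_plus(code_point):
        # ISO/IEC 10646 style: variable width, minimum 4 hex digits,
        # naming a code point.
        assert code_point >= 0
        return "U+%04X" % code_point

    print(u_plus(0x41))     # U+0041
    print(u_plus(0x1D11E))  # U+1D11E (5 digits; width grows as needed)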

The U-nnnn notation is abandoned now, except for references to Unicode
1.0. If one uses it, it refers to one or more of the 16-bit code units
needed to encode each code point (possibly as a surrogate pair). It does
not designate abstract characters or code points unambiguously.
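
For example, a code point outside the BMP corresponds to two 16-bit code
units, so a code-unit notation names it ambiguously. A sketch using
Python's UTF-16 codec:

    cp = 0x1D11E  # MUSICAL SYMBOL G CLEF, a supplementary code point
    units = chr(cp).encode("utf-16-be")
    pair = [int.from_bytes(units[i:i+2], "big") for i in (0, 2)]
    print(["U-%04X" % u for u in pair])  # ['U-D834', 'U-DD1E']

One code point, but two U-nnnn values: the surrogate pair.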

Later, the variable-width U+[n...n]nnnn notation was restricted to allow
only code points in the first 17 planes of the joined ISO/IEC 10646-1
and Unicode standards (so the only standard code points are between
U+0000 and U+10FFFF, some of them permanently designated as
noncharacters).
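
A minimal range check for the restricted notation (a Python sketch; the
noncharacter rule is the standard one: U+FDD0..U+FDEF plus the last two
code points of every plane):

    def is_standard_code_point(cp):
        # Only the 17 planes U+0000..U+10FFFF hold standard code points.
        return 0 <= cp <= 0x10FFFF

    def is_noncharacter(cp):
        # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in each plane.
        return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

    print(is_standard_code_point(0x110000))  # False: beyond plane 16
    print(is_noncharacter(0xFFFF))           # True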

References to larger code points with U+[n...n]nnnn are discouraged, as
they no longer designate valid code points in either standard. Their
definition and use are then application-specific.

There are *no* negative code points in either standard (U-0001 does not
designate the 32-bit code unit that you could store in a signed
wide-char datatype; in the past standard it designated the same code
point as U+0001 does now). Using "+" makes the statement about signs
clear: standard code points all have non-negative values.
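
The sign distinction can be made concrete with a sketch (struct stands
in here for a signed 32-bit wide-char slot):

    import struct

    # The bit pattern of -1 in a signed 32-bit unit is FFFFFFFF, which
    # is not U+0001 and not any standard code point at all.
    bits = struct.pack("<i", -1)
    as_unsigned = struct.unpack("<I", bits)[0]
    print("%08X" % as_unsigned)     # FFFFFFFF
    print(as_unsigned <= 0x10FFFF)  # False: outside the code point range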

So if you want a representation for negative code units, you need
another notation (for example, N-0001 to represent the code unit with
negative value -1); such a notation is application-specific.
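
Such a notation might look like this (a sketch; the N- prefix is only
the example above, and the helper name is hypothetical):

    def n_minus(value):
        # Application-specific: format a negative code unit as N-nnnn.
        assert value < 0
        return "N-%04X" % -value

    print(n_minus(-1))  # N-0001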
