[Python-Dev] UCS2/UCS4 default

Joe Smith unknown_kev_cat at hotmail.com
Sat Jul 5 01:20:34 CEST 2008


Martin v. Löwis <martin at v.loewis.de> writes:

>> Wrong term - code units and code points are equivalent in UTF-16 and
>> UTF-32. What you're looking for is unicode scalar values.
>
> How so? Section 2.5, UTF-16 says
>
> "code points in the supplementary planes, in the range
> U+10000..U+10FFFF, are represented as pairs of 16-bit code units."
>
> So clearly, code points in Unicode range from U+0000..U+10FFFF,
> independent of encoding form. In UTF-16, code units range from 0..65535.
>
> OTOH, "unicode scalar value" is nearly synonymous to "code point":
>
> D76 Unicode Scalar Value. Any Unicode code point except high-surrogate
> and low-surrogate code points.
>
> So codepoint in Terry's message was the right term.

No, Terry definitely did mean Unicode scalar values. He was describing the "pure" but impractical len() that would count a surrogate pair as 1, not 2, even in the 32-bit builds.

For what it is worth:

Code point: a number between 0 and 1114111.

Scalar value: a code point, except the surrogate code points.

Code unit: the basic unit of the encoding. One code unit is always sufficient to encode some Unicode scalar values. However, other Unicode scalar values may require multiple code units.

Note that a scalar value is a code point. A code point may or may not be a scalar value.
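The three definitions above can be written down as small predicates. This is only an illustrative sketch: the function names are made up for this example, and the surrogate range 0xD800..0xDFFF is taken from the Unicode standard rather than from anything stated in this thread.

```python
MAX_CODE_POINT = 0x10FFFF  # 1114111, the upper bound given above


def is_code_point(n):
    """A code point is any integer from 0 to 1114111 inclusive."""
    return 0 <= n <= MAX_CODE_POINT


def is_surrogate(n):
    """High and low surrogate code points occupy 0xD800..0xDFFF."""
    return 0xD800 <= n <= 0xDFFF


def is_scalar_value(n):
    """A scalar value is a code point that is not a surrogate."""
    return is_code_point(n) and not is_surrogate(n)


def utf16_code_units(n):
    """Number of 16-bit code units UTF-16 needs for scalar value n."""
    if not is_scalar_value(n):
        raise ValueError("UTF-16 only encodes Unicode scalar values")
    return 1 if n < 0x10000 else 2
```

For instance, 0xD800 passes is_code_point() but fails is_scalar_value(), and utf16_code_units(0x1D11E) reports two code units where utf16_code_units(0x41) reports one.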

Practical len() returns the number of code units in the internal storage format. Pure len() would supposedly return the number of Unicode scalar values (a surrogate pair encodes a single Unicode scalar value, so it would be counted once).
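To make the difference concrete, here is a rough sketch that models a 16-bit internal storage format as a list of UTF-16 code units and computes both lengths over it. The helper names utf16_units(), practical_len() and pure_len() are hypothetical, and it assumes a Python where a str can be encoded with the "utf-16-le" codec. A lone surrogate is simply counted as one item here, which is a choice made for the example and not something the thread settles.

```python
import struct


def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def practical_len(units):
    """Length as the number of code units in the storage format."""
    return len(units)


def pure_len(units):
    """Length where a well-formed surrogate pair counts as one item."""
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        # Skip both halves of a pair, otherwise advance one code unit.
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count


units = utf16_units("a\U0001D11E")  # 'a' followed by U+1D11E
```

With this input, practical_len(units) is 3 and pure_len(units) is 2, which is the 2-versus-1 discrepancy for the surrogate pair described above.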

Please keep in mind that encodings encode Unicode scalar values. Thus a UTF-8 code unit sequence (or a UTF-32 code unit) that would decode to a code point in the surrogate range is technically in error. (Python would nevertheless do well to ignore this restriction, as there may be valid reasons to have a UTF-8 sequence that is not a valid encoding of Unicode text.)
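As an illustration of that restriction, the sketch below shows how Python 3 treats a lone surrogate. Note the assumption: this is the behaviour of the Python 3 "utf-8" codec and its "surrogatepass" error handler, both of which postdate this message, so it shows one way the trade-off can be handled and not what the interpreters under discussion did. The helper names are invented for the example.

```python
lone = "\ud800"  # a str holding a single high-surrogate code point


def encodes_strictly(s):
    """True if the default (strict) UTF-8 codec accepts s."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def decodes_strictly(b):
    """True if the default (strict) UTF-8 codec accepts the bytes b."""
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Opting in to "surrogatepass" produces the three-byte sequence that a
# strict decoder rejects as an error.
raw = lone.encode("utf-8", "surrogatepass")
round_tripped = raw.decode("utf-8", "surrogatepass")
```

Here the strict codec refuses the surrogate in both directions, while the opt-in handler round-trips it through the bytes ED A0 80.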


