[Python-Dev] PEP 393 Summer of Code Project
Ezio Melotti ezio.melotti at gmail.com
Fri Aug 26 03:40:33 CEST 2011
On Fri, Aug 26, 2011 at 1:54 AM, Guido van Rossum <guido at python.org> wrote:
> On Wed, Aug 24, 2011 at 3:06 AM, Terry Reedy <tjreedy at udel.edu> wrote:
>> Excuse me for believing the fine 3.2 manual that says
>> "Strings contain Unicode characters." (And to a naive reader, that implies
>> that string iteration and indexing should produce Unicode characters.)
>
> The naive reader also doesn't know the difference between characters, code points and code units. It's the advanced, Unicode-aware reader who is confused by this phrase in the docs. It should say code units; or perhaps code units for narrow builds and code points for wide builds.
For UTF-16/32 (i.e. narrow/wide builds), talking about "code units"[0] should be correct. Also note that:
- in both, every "code unit" has a specific "code point" (including lone surrogates), so it might be OK to talk about "code points" too, but
- only in wide builds is every "code point" represented by a single 32-bit "code unit". In narrow builds, non-BMP chars are represented by a "code unit sequence" of two elements (i.e. a "surrogate pair").
Since "code unit" refers to the minimal bit combination, in UTF-8 a character that needs 2/3/4 bytes is represented by a "code unit sequence" made of 2/3/4 "code units" (so in UTF-8 "code units" and "code points" coincide only for the ASCII range).
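To make the code unit / code point distinction concrete, here is a minimal sketch that counts how many code units one code point needs in each encoding form. It assumes Python 3.3 or later (where a str is a sequence of code points); the helper name code_unit_counts is made up for this illustration, and the "-le" codec variants are used only to avoid counting a BOM.

```python
def code_unit_counts(ch):
    """Return the number of code units one code point needs per encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit code units
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit code units
    }


if __name__ == "__main__":
    # ASCII, a 2-byte and a 3-byte UTF-8 character, and a non-BMP character.
    for ch in ("a", "\u00e9", "\u20ac", "\U0001F40D"):
        print("U+%04X" % ord(ch), code_unit_counts(ch))
```

Only the ASCII character needs a single code unit in all three forms; the non-BMP character needs four UTF-8 code units, two UTF-16 code units (a surrogate pair) and one UTF-32 code unit.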
> With PEP 393 we can unconditionally say code points, which is much better. We should try to remove our use of "characters" -- or else we should define our use of the term "characters" as "what the Unicode standard calls code points".
"Character" usually works fine, especially for naive readers. Even Unicode-aware readers often confuse the several terms, so using a simple term and pointing to a more accurate description sounds like a better idea to me.
Note that there's also another important term[1]:
"""
Unicode Scalar Value. Any Unicode code point <http://unicode.org/glossary/#code_point> except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆ inclusive.
"""
For example the UTF codecs produce sequences of "code units" (of 8, 16, 32 bits) that represent "scalar values"[2][3]:
Chapter 3 [4] says:
"""
3.9 Unicode Encoding Forms
The Unicode Standard supports three character encoding forms: UTF-32, UTF-16, and UTF-8. Each encoding form maps the Unicode code points U+0000..U+D7FF and U+E000..U+10FFFF to unique code unit sequences.
[...]
D76 Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points.
  • As a result of this definition, the set of Unicode scalar values consists of the ranges 0 to D7FF and E000 to 10FFFF, inclusive.
D77 Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange.
[...]
D79 A Unicode encoding form assigns each Unicode scalar value to a unique code unit sequence.
"""
On the other hand, Python Unicode strings are not limited to scalar values, because they can also contain lone surrogates.
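As a rough illustration of that difference, the sketch below checks code points against the D76 ranges and shows a str holding a lone surrogate. The helper names is_scalar_value and only_scalar_values are invented for this example, and it assumes Python 3.3 or later so that iterating a str yields code points.

```python
def is_scalar_value(cp):
    """True if the integer cp is a Unicode scalar value (D76)."""
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def only_scalar_values(s):
    """True if every code point in the str s is a scalar value."""
    return all(is_scalar_value(ord(c)) for c in s)


if __name__ == "__main__":
    s = "ab\ud800"  # a str that ends with a lone high surrogate
    print(len(s), only_scalar_values(s))
    try:
        s.encode("utf-8")
    except UnicodeEncodeError as exc:
        # The UTF-8 codec refuses to encode the surrogate code point.
        print("UTF-8 codec refused:", exc.reason)
```

The string itself is perfectly legal in Python; it is only the UTF-8 codec that rejects the surrogate, which matches the behaviour described in footnote [2].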
I hope this helps clarify the terminology a bit and doesn't add more confusion, but if we want to use the Unicode terms we should get them right. (Also note that I might have misunderstood something, even if I've been careful with the terms and I double-checked and quoted the relevant parts of the Unicode standard.)
Best Regards, Ezio Melotti
[0]: From the chapter 3 [4], D77 Code unit: The minimal bit combination that can represent a unit of encoded text for processing or interchange.
  • Code units are particular units of computer storage. Other character encoding standards typically use code units defined as 8-bit units—that is, octets. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form.
[1]: http://unicode.org/glossary/#unicode_scalar_value
[2]: Apparently Python 3 raises an error while encoding lone surrogates in UTF-8, but it doesn't for UTF-16 and UTF-32. From the chapter 3 [4], D91: "Because surrogate code points are not Unicode scalar values, isolated UTF-16 code units in the range 0xD800..0xDFFF are ill-formed." D92: "Because surrogate code points are not included in the set of Unicode scalar values, UTF-32 code units in the range 0x0000D800..0x0000DFFF are ill-formed." I think this should be fixed.
[3]: Note that I'm talking about codecs used to encode/decode Unicode strings to/from bytes here, it's perfectly fine for Python itself to represent lone surrogates in its internal representations, regardless of what encoding it's using.
[4]: Chapter 3: http://www.unicode.org/versions/Unicode6.0.0/ch03.pdf