[Python-Dev] UCS2/UCS4 default

"Martin v. Löwis" martin at v.loewis.de
Sat Jul 5 07:35:18 CEST 2008


The premise is the OP's idea that Python should switch to all UCS4 to create a more pure ('ideal') situation or the idea that len(s) should count codepoints (correct term?) for all builds as a matter of purity even though it would be time-costly on 16-bit builds as a matter of practicality.

No, Terry definitely did mean Unicode scalar values.

True. However, using the word "code point" to refer to "Unicode scalar values" is also correct. He (rather, the OP) wanted to count code points (i.e. not count code units).

Practical len() returns the number of code units of the internal storage format.

No, it returns the number of code units.

Pure len() allegedly would return the number of Unicode scalar values (obviously a surrogate pair would be considered a single Unicode scalar value).

Perhaps-not-so-obviously-but-still-intended, a pure len counting surrogate pairs as one would also count code points.
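
To make the two notions of length concrete, here is a minimal sketch in Python. It assumes the string is available as a sequence of 16-bit code units (plain integers), as on a UCS2 build; the name count_code_points is hypothetical and is not an existing Python API. The number of code units is simply the length of the sequence, while the "pure" count treats a high surrogate followed by a low surrogate as one code point.

```python
def count_code_points(units):
    """Count code points in a sequence of UTF-16 code units (ints).

    A high surrogate (0xD800-0xDBFF) immediately followed by a low
    surrogate (0xDC00-0xDFFF) is counted as one code point. Any other
    code unit, including an unpaired surrogate, is counted by itself.
    """
    count = 0
    i = 0
    n = len(units)
    while i < n:
        is_pair = (
            0xD800 <= units[i] <= 0xDBFF
            and i + 1 < n
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        i += 2 if is_pair else 1
        count += 1
    return count


# "A" followed by U+1D538, which needs the surrogate pair D835 DD38.
units = [0x0041, 0xD835, 0xDD38]
print(len(units))                # number of code units
print(count_code_points(units))  # number of code points
```

The scan is linear in the number of code units, which is the practical cost the thread refers to: on a 16-bit build the count cannot be read off the stored length.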

Please keep in mind that encodings encode Unicode scalar values.

A "coded character set" is "a character set in which each character is assigned a numeric code point". So clearly, a character encoding form encodes code points.

Thus a UTF-8 code unit sequence (or UTF-32 code unit) that would give a code point in the surrogate sections is technically in error.

Sure, but this has nothing to do with Terry's terminology use.
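
The distinction between the two terms can be sketched in a few lines of Python. This assumes a current Python 3 interpreter, whose strict UTF-8 codec rejects surrogates; older interpreters may behave differently. The helper names is_scalar_value and encodes_to_utf8 are hypothetical, introduced only for illustration.

```python
def is_scalar_value(cp):
    """True if cp is a Unicode scalar value: a code point in
    0..0x10FFFF that is not in the surrogate range D800..DFFF."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def encodes_to_utf8(cp):
    """True if the strict UTF-8 codec accepts the code point cp."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# Every surrogate is a code point, but none is a scalar value,
# and the strict codec refuses to encode any of them.
for cp in (0x0041, 0xD800, 0xDFFF, 0x1D538):
    print(hex(cp), is_scalar_value(cp), encodes_to_utf8(cp))
```

Under these assumptions the two predicates agree for every code point: the codec accepts exactly the scalar values, which is the sense in which a surrogate appearing in UTF-8 is "technically in error".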

Regards, Martin
