[Python-Dev] PEP 393 Summer of Code Project
Stephen J. Turnbull turnbull at sk.tsukuba.ac.jp
Thu Aug 25 02:31:30 CEST 2011
Terry Reedy writes:
> Please suggest a re-wording then, as it is a bug for doc and behavior to disagree.
    Strings contain Unicode code units, which for most purposes can be
    treated as Unicode characters.  However, even as "simple" an
    operation as "s1[0] == s2[0]" cannot be relied upon to give
    Unicode-conforming results.
The second sentence remains true under PEP 393.
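To make the second sentence concrete (a made-up example, not from the docs: s1 and s2 are two canonically equivalent spellings of 'é'):

    import unicodedata

    s1 = "\u00E9"            # 'é' as a single precomposed code point
    s2 = "e\u0301"           # 'e' followed by COMBINING ACUTE ACCENT

    print(s1 == s2)          # False: plain code-unit comparison
    print(s1[0] == s2[0])    # False, even though both strings spell 'é'
    print(unicodedata.normalize("NFC", s1) ==
          unicodedata.normalize("NFC", s2))    # True: canonical equivalence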
> For the purpose of my sentence, they are the same thing, in that code points correspond to characters.
Not in Unicode, they do not. By definition, a small number of code points (eg, U+FFFF) never did and never will correspond to characters.
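For instance (a quick check anyone can run; U+FFFF is one of the 66 noncharacters):

    import unicodedata

    s = "\uFFFF"                              # a noncharacter code point
    print(len(s))                             # 1 -- Python stores it without complaint
    print(unicodedata.category(s))            # 'Cn': it is not an assigned character
    print(unicodedata.name(s, "<no name>"))   # and it has no character name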
> On computers, characters are represented by code points. What about the other way around? http://www.unicode.org/glossary/#C says, for "code point":
> 1) i in range(0x110000)
> 2) "A value, or position, for a character"
> (To muddy the waters more, 'character' has multiple definitions also.) You are using 1), I am using 2) ;-(.
No, you're not. You are claiming an isomorphism, which Unicode goes to great trouble to avoid.
> I think you have it backwards. I see the current situation as the purity of the C code beating the practicality for the user of getting right answers.
Sophistry. "Always getting the right answer" is purity.
> > The thing is, that 90% of applications are not really going to care about full conformance to the Unicode standard.
> I remember when Intel argued that 99% of applications were not going to be affected when the math coprocessor in its then-new chips occasionally gave 'non-standard' answers with certain divisors.
In the case of Intel, the people who demanded standard answers did so for efficiency reasons -- they needed the FPU to DTRT because implementing FP in software was always going to be too slow. CPython, IMO, can afford to trade off because the implementation will necessarily be in software, and can be added later as a Python or C module.
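Something like the following is all I mean by a software add-on (the name canonical_equal is mine, and NFC is only one possible choice of normalization form):

    import unicodedata

    def canonical_equal(s1, s2):
        """Compare by canonical equivalence rather than by raw code units."""
        return (unicodedata.normalize("NFC", s1) ==
                unicodedata.normalize("NFC", s2))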
> I believe my scheme could be extended to solve [conformance for composing characters] also. It would require more pre-processing and more knowledge than I currently have of normalization. I have the impression that the grapheme problem goes further than just normalization.
Yes and yes. But now you're talking about database lookups for every character (to determine if it's a composing character). Efficiency of a generic implementation isn't going to happen.
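Just to show where the lookups come in, here is a deliberately naive sketch (mine, not Terry's scheme) that merely glues combining marks onto their base character; it already consults the character database once per code point, and real grapheme segmentation (UAX #29) needs far more rules than this:

    import unicodedata

    def naive_clusters(s):
        cluster = ""
        for ch in s:
            if cluster and unicodedata.combining(ch):   # one database lookup per character
                cluster += ch
            else:
                if cluster:
                    yield cluster
                cluster = ch
        if cluster:
            yield cluster

    print(list(naive_clusters("e\u0301le\u0300ve")))    # 5 clusters from 7 code points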
Anyway, in Martin's rephrasing of my (imperfect) memory of Guido's pronouncement, "indexing is going to be O(1)". And Nick's point about non-uniform arrays is telling. I have 20 years of experience with an implementation of text as a non-uniform array which presents an array API, and everything needs to be special-cased for efficiency, and any small change can have show-stopping performance implications.
Python can probably do better than Emacs has done due to much better leadership in this area, but I still think it's better to make full conformance optional.
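For contrast, this is what indexing looks like in a non-uniform representation (a rough UTF-8 sketch of my own, assuming valid input, and not any proposal from this thread): every lookup scans from the start, which is exactly what an O(1) indexing guarantee rules out.

    def utf8_codepoint_at(data, n):
        """Return code point n of valid UTF-8 `data` by scanning lead bytes."""
        pos = 0
        for _ in range(n + 1):
            start = pos
            lead = data[pos]
            if lead < 0x80:        # 1-byte sequence (ASCII)
                pos += 1
            elif lead < 0xE0:      # 2-byte sequence
                pos += 2
            elif lead < 0xF0:      # 3-byte sequence
                pos += 3
            else:                  # 4-byte sequence
                pos += 4
        return data[start:pos].decode("utf-8")

    print(utf8_codepoint_at("naïve €".encode("utf-8"), 6))   # '€'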