[Python-3000] Support for PEP 3131

Jim Jewett jimjjewett at gmail.com
Tue May 22 22:29:02 CEST 2007


On 5/22/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:

> That's why Java and C++ use \u, so you would write L\u00F6wis as an identifier. ... I think you are really arguing for \u escapes in identifiers here.

Yes, that is effectively what I was suggesting.
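As a rough illustration (hypothetical helper names, not part of any proposal text), a tool could translate mechanically between the escaped and the raw spelling:

```python
import re

def escape_identifier(name):
    # Hypothetical helper: render non-ASCII characters with
    # Java/C++-style \uXXXX escapes so the source file stays ASCII.
    # (Only handles the BMP; astral characters would need \U escapes.)
    return "".join(
        c if ord(c) < 128 else "\\u%04X" % ord(c) for c in name
    )

def unescape_identifier(name):
    # The inverse mapping an editor or preprocessor might apply.
    return re.sub(
        r"\\u([0-9A-Fa-f]{4})",
        lambda m: chr(int(m.group(1), 16)),
        name,
    )

print(escape_identifier("L\u00f6wis"))      # L\u00F6wis
print(unescape_identifier("L\\u00F6wis"))   # Löwis
```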

This is truly unambiguous. I claim that it is also useless.

It means users could get the usability benefits of PEP 3131, but the Python internals could still work with ASCII only.

It simplifies checking for identifiers that don't stick to ASCII, which reduces some of the concerns about confusable characters and about which ones to allow.
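For example (a sketch using the stdlib tokenizer; under an escape-only rule this whole check would collapse to asking whether the file is pure ASCII):

```python
import io
import tokenize

def non_ascii_names(source):
    # Collect NAME tokens that contain any character outside ASCII.
    found = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and any(ord(c) > 127 for c in tok.string):
            found.append(tok.string)
    return found

print(non_ascii_names("l\u00f6wis = 1\nplain = 2\n"))   # ['löwis']
```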

Short list of judgment calls that we need to resolve if we go with non-ASCII identifiers, but can largely ignore if we just use escaping:

Based only on UAX 31:

ID vs XID  (Unicode changed its mind on the recommendations)

include stability extensions?  (*Python* didn't allow those letters previously.)

which of ID_CONTINUE should be left out.  (We don't want "-", and some of the punctuation and other marks may be closer to "-" than to "_". Or they might not be, and I don't know how to judge that.)

layout and control characters  (At the top of section 2, TR31 recommends acting as though they weren't there ... but if we use a normal (unicode) string, then they will still affect the hash. Down in 2.2, they say not to permit them, except sometimes...)

Canonicalization

Combining Marks should be accepted (only as continuation chars), but not if they're enclosing marks, because ... well, I'm not sure, but I'll have to trust them.

Specific character Adjustments (sec 2.3) -- The example suggests that we might have to tailor for our use of "_", though I didn't get that from the table. They do suggest tailoring out certain Decomposition Types.

Additional (non-letter?) characters which may occur in words (see UAX29, but I don't claim to fully understand it)

Undefined code points, particularly those which might be defined later?

Should we exclude the letters that look like punctuation?  A proposed update (http://www.unicode.org/reports/tr31/tr31-8.html) mentions U+02B9 (modifier letter prime) only because the visually equivalent U+0374 (Greek Numeral Sign) shouldn't be an identifier, but does fold to it under (some?) canonicalization. (They suggest allowing both, instead of neither.)
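A few of these judgment calls can be made concrete with the stdlib unicodedata module (a sketch; which normalization form Python would use was still open at this point):

```python
import unicodedata

# Canonicalization: 'o' + COMBINING DIAERESIS vs. the precomposed letter.
decomposed = "Lo\u0308wis"
composed = "L\u00F6wis"
assert decomposed != composed                                # raw strings differ
assert unicodedata.normalize("NFC", decomposed) == composed  # equal after NFC

# Layout/format characters: an invisible ZERO WIDTH JOINER makes two
# identically displayed names distinct strings with distinct hashes.
assert "ab" != "a\u200Db"
assert hash("ab") != hash("a\u200Db")

# The U+02B9 / U+0374 pair: GREEK NUMERAL SIGN canonically decomposes
# to MODIFIER LETTER PRIME, so the two fold together under NFC/NFKC.
assert unicodedata.normalize("NFC", "\u0374") == "\u02B9"
```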

Then TR 39 http://www.unicode.org/reports/tr39/ recommends excluding (most, but not all of):

characters not in modern use;

characters only used in specialized fields, such as liturgical characters, mathematical letter-like symbols, and certain phonetic alphabetics;

and ideographic characters that are not part of a set of core CJK ideographs consisting of the CJK Unified Ideographs block plus IICore (the set of characters defined by the IRG as the minimal set of required ideographs for East Asian use).

They summarize this in http://www.unicode.org/reports/tr39/data/xidmodifications.txt; I wouldn't add the hyphen-minus back in, but I don't know whether katakana middle dot should be allowed.
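The shape of such a restriction table is easy to model (a hypothetical miniature, not the real data file; str.isidentifier() stands in for the base identifier rules Python 3 eventually adopted):

```python
# Hypothetical miniature of a TR39-style restriction table.
RESTRICTED = {
    0x002D,  # HYPHEN-MINUS: stays out
    0x30FB,  # KATAKANA MIDDLE DOT: the doubtful case
}

def allowed_identifier(name):
    # Layer the restriction table on top of the base identifier syntax.
    if any(ord(c) in RESTRICTED for c in name):
        return False
    return name.isidentifier()

print(allowed_identifier("\u540d\u524d"))        # True  (名前)
print(allowed_identifier("\u4e2d\u30fb\u70b9"))  # False (contains U+30FB)
```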

Should mixed-script identifiers be allowed? According to TR 36 (http://www.unicode.org/reports/tr36/) ASCII only is the safest, and that is followed by limits on mixed-script identifiers. Those limits sound reasonable to me, but ... I'm not the one who would be mixing them.

Note that even "highly restrictive" allows ASCII + Han + Hiragana + Katakana, ASCII + Han + Bopomofo, and ASCII + Han + Hangul. (I think we wanted at least the ASCII numbers with anything.)
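A script-mixing check could be sketched like this (rough code-point ranges standing in for the real Unicode Script property, which the stdlib doesn't expose):

```python
def rough_script(ch):
    # Crude stand-in for the Unicode Script property, covering only the
    # scripts discussed above; a real checker would use the full data.
    cp = ord(ch)
    if cp < 0x80:
        return "Latin" if ch.isalpha() else "Common"
    if 0x0400 <= cp <= 0x04FF:
        return "Cyrillic"
    if 0x3040 <= cp <= 0x309F:
        return "Hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "Katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "Han"
    return "Other"

def scripts(name):
    return {rough_script(c) for c in name} - {"Common"}

def highly_restrictive_ok(name):
    # Models only the Japanese combination mentioned above; "highly
    # restrictive" also permits Han + Bopomofo and Han + Hangul.
    return scripts(name) <= {"Latin", "Han", "Hiragana", "Katakana"}

print(highly_restrictive_ok("data\u30d9\u30fc\u30b9\u540d2"))  # True
print(highly_restrictive_ok("p\u0430ypal"))  # False: Latin + Cyrillic
```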

-jJ


