[Python-3000] Support for PEP 3131

Jim Jewett jimjjewett at gmail.com
Fri May 25 19:47:44 CEST 2007


On 5/25/07, Stephen J. Turnbull <stephen at xemacs.org> wrote:

Jim Jewett writes:

> Ideally, it would even be explicit per extra character allowed, though there should obviously be shortcuts to accept entire scripts.

How about a regexp character class as starting point?

I'm not sure I understand. Do you mean that part of localization should be defining what certain regular expressions should match? That sounds great from a consistency standpoint, but it would certainly limit who could create their own reliable tailorings.

> So how about

> [ ASCII, plus chars in a named table]

You can specify any character you want, but if it's ASCII, or not in the classes PEP 3131 ends up using to define the maximal set, it gets deleted from the extension table (ASCII has its own table, conceptually). This permits whole scripts, blocks, or ranges to be included.
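A rough sketch of that filtering rule, for concreteness. Nothing here is specified by PEP 3131: the names in_maximal_set, build_extension_table, and is_allowed_identifier are hypothetical, and str.isidentifier() is only a stand-in for whatever classes the PEP ends up using to define the maximal set.

```python
def in_maximal_set(ch):
    # Assumption: stand-in for the PEP's maximal set. Prefixing "a" makes the
    # check accept characters that may continue an identifier, not only start one.
    return ("a" + ch).isidentifier()


def build_extension_table(requested):
    # Drop ASCII (it has its own table, conceptually) and anything that falls
    # outside the maximal set; whatever is left is the tailoring.
    return {ch for ch in requested if ord(ch) > 127 and in_maximal_set(ch)}


def is_allowed_identifier(name, table):
    # Valid under the maximal set, and every non-ASCII character was opted in.
    return name.isidentifier() and all(ord(ch) < 128 or ch in table for ch in name)


# Whole ranges can be requested at once, e.g. the basic Cyrillic letters.
cyrillic = build_extension_table(chr(cp) for cp in range(0x0410, 0x0450))
```

With that table, is_allowed_identifier("привет", cyrillic) accepts the name, while a name containing a non-ASCII character that was never opted in is rejected even though the maximal set would permit it.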

So long as we allow tailoring, I think the maximal set should be generous -- and I don't see any reason to pre-exclude anything outside ASCII.

There are people who like to use names like "Program Files" or "Summary of Results.Apr-3-2007 version 2.xls"; I expect the same will be true of identifiers. So long as the punctuation is not ASCII, we might as well let them. (Internally, I expect some communities to say "that is a bad idea" about certain characters, but I don't want to prejudge which characters those will be.)

> If you want to include punctuation or

Why waste the effort of the Unicode technical committees?

The other committees say to exclude certain scripts, like Linear B and Ogham. And not to allow mixed scripts, at least if they're confusable. But I really don't want to explain why someone using Cyrillic can't use certain (to him, apparently random) identifiers just because they could be confused with ASCII (or Armenian).

The only set the committees always recommend allowing is ASCII; beyond that a nest of decisions (and exceptions) is almost unavoidable, because the committees disagree among themselves. Since we can't be completely safe, I would rather err on the side of leniency towards those concerned enough to make explicit decisions.

> undefined characters, so be it.

-1

Assuming undefined == reserved for future standardization, that violates the Unicode standard.

If Unicode comes out with a new revision, the new characters should probably be allowed; I don't want a situation where users of Cham or Lepcha[1] are told they have to wait another year because their scripts weren't formally adopted into Unicode until after Python 3.4.0 was already released.
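To make the lag concrete, here is a minimal sketch of how an interpreter-side check would see such characters. It assumes the check relies on the Unicode tables bundled with the interpreter; is_unassigned and unassigned_in are hypothetical helper names, while unicodedata.category and unicodedata.unidata_version are the standard-library calls.

```python
import unicodedata


def is_unassigned(ch):
    # "Cn" is the general category for code points with no assigned character
    # in the Unicode database bundled with this interpreter.
    return unicodedata.category(ch) == "Cn"


def unassigned_in(name):
    # Hypothetical helper: list the characters of a proposed identifier that
    # this interpreter's Unicode tables know nothing about.
    return [ch for ch in name if is_unassigned(ch)]


# The bundled Unicode version is fixed at build time, which is the lag at issue.
print("Unicode data version:", unicodedata.unidata_version)
```

A script encoded after the interpreter was built would show up here as a run of "Cn" characters, so a rule that rejects undefined characters outright would also reject that script until the next release.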

[1] http://www.unicode.org/onlinedat/languages-scripts.html says that these languages have their own scripts (and no alternate script), and that these scripts have not yet been encoded in Unicode. I won't be surprised to see Klingon identifiers before we see either of those, but ... I don't want to contribute to their exclusion.

-jJ


