[Python-3000] Support for PEP 3131

Jim Jewett jimjjewett at gmail.com
Fri May 25 16:56:52 CEST 2007


On 5/24/07, Guillaume Proux <gproux+py3000 at gmail.com> wrote:

Hi Jim,

On 5/25/07, Jim Jewett <jimjjewett at gmail.com> wrote:
> It isn't strictly security; when I've been burned by cut-and-paste
> that turned out to be an unexpected character, it didn't cause damage,
> but it did take me a long time to debug.

Can you give a longer explanation because I don't understand what is the issue. Is it like the issue with confusing 0 and O ? You seemingly already have an experience with using something that is now not legal in Python. Was it in Java or .NET world?

The really hard-to-debug ones were usually in C. It happened more when I was less experienced, or the available tools were limited.

They usually involved something that looked like a quote mark, but wasn't. (I worry about the characters that look like a less-than sign, but I've never had trouble with them in practice. Problems with other punctuation were rare enough that I can't say they were worse than "." vs "," or ":" vs ";".)

This would be less of a problem in Python because it takes triple-quotes to continue a string across multiple lines -- but it would still be an occasional problem.

This would be less of a problem if I had started out smarter, or if I never worked with people who used presentation-focused editors (like MS Word) when discussing code, but those are only theoretical possibilities.
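The quote-mark look-alikes described above could be caught mechanically. The following is only a minimal sketch, assuming that the Unicode general categories Pi and Pf (initial and final quote punctuation) are a good enough proxy for "looks like a quote but isn't"; the function name find_lookalike_quotes is hypothetical and not part of any proposal in this thread.

```python
import unicodedata

# Unicode general categories for "initial" and "final" quote punctuation.
# The curly quotes that word processors substitute fall into these two
# categories; the ASCII quotes ' and " do not (they are category Po).
QUOTE_LIKE = ("Pi", "Pf")


def find_lookalike_quotes(source):
    """Return (line, column, character name) for each quote look-alike.

    Hypothetical helper: line and column are 1-based.
    """
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if unicodedata.category(ch) in QUOTE_LIKE:
                hits.append((lineno, col, unicodedata.name(ch, "UNKNOWN")))
    return hits
```

Run over text pasted from a presentation-focused editor, this reports where each curly quote sits, which is the information that was missing in the slow debugging sessions described above.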

> For most people, the appearance of a Greek or Japanese (let alone
> both) character would be more likely to indicate a typo. If you know
> that your project is using both languages, then just allow both; the
> point is that you have made an explicit decision to do so.

* Python is dynamic (you can have a e.g. pygtk user interface which enables you to load at runtime a new .py file even to use a text view to type in a mini-script that will do something specific in your application domain): you never know what will get loaded next

I am not missing that -- that is the situation I worry about most. If I'm running something that new, and I've only inspected it visually, I want a great big warning about unexpected characters that merely look like what I thought they were.

No, this won't happen often -- but like threading race conditions, that almost makes it worse. Because it is rare, people won't remember to check for it unless the check is an automated default.

If I were in a Japanese environment, regularly getting code written in Japanese, then Japanese code would be fine, so I would set my environment to accept Japanese -- but I would still get that warning for something that appears Latin but actually contains Cyrillic.
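The kind of check described here can be sketched roughly as follows. Python's unicodedata module does not expose the Unicode Script property, so this sketch assumes the first word of a character's Unicode name ("LATIN", "CYRILLIC", "HIRAGANA", "CJK", ...) is an acceptable stand-in; that is a heuristic, and the function names are hypothetical.

```python
import unicodedata


def scripts_in(identifier):
    """Approximate the set of scripts used by the letters of an identifier.

    Heuristic only: takes the first word of each letter's Unicode name.
    """
    scripts = set()
    for ch in identifier:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            scripts.add(name.split(" ")[0] if name else "UNKNOWN")
    return scripts


def unexpected_scripts(identifier, allowed):
    """Return the scripts in the identifier that the environment did not allow."""
    return scripts_in(identifier) - set(allowed)
```

An environment set up for Japanese might pass allowed={"LATIN", "CJK", "HIRAGANA", "KATAKANA"}; an identifier that looks like "payload" but contains U+0430 (CYRILLIC SMALL LETTER A) would still come back with {"CYRILLIC"} and could trigger the warning.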

* Python is embeddable: and often it is to bring the power of python to less sophisticated users. You can imagine having a global system deployed all around the world by a global company enabling each user in each subsidiary to create their own extension scripts.

If they can supply their own scripts, they can supply their own data files -- including an acceptable characters table. But they wouldn't really need to -- realistically, the acceptable characters would be a corporate (or at least site-wide) policy decision that could be set at install time.
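As an illustration of what a site-wide acceptable characters table might look like, here is a small sketch. The file format (one hexadecimal code point or "lo..hi" range per line, with "#" comments) and both function names are assumptions made for this example; nothing in the thread specifies them.

```python
def parse_table(text):
    """Parse a hypothetical policy table into a list of (lo, hi) ranges.

    Each non-empty line is either a single hex code point ("005F")
    or an inclusive range ("0041..005A"); "#" starts a comment.
    """
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        lo, _, hi = line.partition("..")
        ranges.append((int(lo, 16), int(hi or lo, 16)))
    return ranges


def disallowed_chars(identifier, ranges):
    """Return the characters of the identifier that fall outside the table."""
    return [
        ch for ch in identifier
        if not any(lo <= ord(ch) <= hi for lo, hi in ranges)
    ]


# Example policy a site might install: ASCII identifiers plus Hiragana.
SITE_POLICY = """
0030..0039  # digits
0041..005A  # A-Z
005F        # underscore
0061..007A  # a-z
3040..309F  # Hiragana
"""
```

With this policy, disallowed_chars("café", parse_table(SITE_POLICY)) returns the single character "é", which the embedding application could report or reject as it sees fit.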

* There is a runtime cost for checking: the speed vs. security tradeoff

True, but if speed is that important, then ASCII-only is better; the initial file reading will happen faster, as will the parsing to characters, and the deciding whether characters can be part of an identifier. Even a blind "Any code point greater than 127 is always allowed" is still slower than not having to consider those code points.

Once you start saying "letters and digits only", you need a per-character lookup, and the difference between "in this set of 4000 out of several million" vs "in this set of several million out of several more million" doesn't need to slow things down.
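The point about table size can be illustrated with a hash-based lookup, whose cost per character does not depend on how many code points the table holds. This is only a sketch under assumed sizes (4000 entries versus about a million); the names small_table, large_table and identifier_ok are hypothetical, and a real tokenizer would more likely use a compact range table.

```python
# A restrictive table: 4000 allowed code points.
small_table = frozenset(range(0x41, 0x41 + 4000))

# A permissive table: about a million allowed code points.
large_table = frozenset(range(0x80, 0x80 + 1_000_000))


def identifier_ok(identifier, table):
    """Check every character with one hash lookup, whatever the table size."""
    return all(ord(ch) in table for ch in identifier)
```

Timing identifier_ok against both tables (for example with the timeit module) is one way to check on a given machine that the restrictive table is not the slower one.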

(for a security benefit that is still very much hypothetical in the face of the experience of Java and .NET people)

(a) Aren't those compiled languages, rather than interpreted? So a misleadingly-named identifier doesn't matter as much, because people aren't looking at the source anyhow. (b) How do you know there haven't been problems that just weren't caught? (Perhaps more of the "wonder why that errored out" variety than security breaches.)

* In real life, you won't see much python programs that are not written in your script.

Exactly. So when you do, they should be flagged.

-jJ


