[Python-3000] Invalid \U escape in source code give hard-to-trace error
Guido van Rossum guido at python.org
Wed Jul 18 19:31:53 CEST 2007
On 7/17/07, "Martin v. Löwis" <martin at v.loewis.de> wrote:
> > When a source file contains a string literal with an out-of-range \U
> > escape (e.g. "\U12345678"), instead of a syntax error pointing to the
> > offending literal, I get this, without any indication of the file or
> > line:
> >
> > UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in
> > position 0-9: illegal Unicode character
> >
> > This is quite hard to track down.
> I think the fundamental flaw is that a codec is used to implement
> the Python syntax (or, rather, lexical rules).
>
> Not quite sure what the rationale for this design was; doing it on
> the lexical level is (was) tricky because \u escapes were allowed
> only for Unicode literals, and the lexer had no knowledge of the
> prefix preceding a literal. (In 3k, it's still similar, because
> \U escapes have no effect in bytes and raw literals).
>
> Still, even if it is "only" handled at the parsing level, I don't
> see why it needs to be a codec. Instead, implementing escapes in
> the compiler would still allow for proper diagnostics (notice that
> in the AST the original lexical form of the string literal is gone).
I guess because it was deemed useful to have a codec for this purpose too, thereby exposing the algorithm to Python code that needs the same functionality (e.g. the compiler package, RIP).
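For readers following along, the codec under discussion is reachable from Python code under the name `unicode_escape`. The following is a minimal sketch, assuming a current CPython 3 interpreter; the helper name `decode_escapes` is made up for illustration. It shows that the codec's error knows only byte positions inside the literal, not a file name or line number:

```python
import codecs


def decode_escapes(raw: bytes) -> str:
    """Decode backslash escapes with the codec discussed in this thread."""
    return codecs.decode(raw, "unicode_escape")


# A well-formed escape decodes to the expected character.
ok = decode_escapes(rb"\u00e9")

# An out-of-range \U escape raises UnicodeDecodeError. The exception
# describes positions within the byte string it was given; it has no
# notion of the source file or line the literal came from.
try:
    decode_escapes(rb"\U12345678")
    error = None
except UnicodeDecodeError as exc:
    error = exc
```

Any caller that wants a usable diagnostic has to catch this exception and attach the source location itself.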
> > (Both the location of the bad
> > literal in the source file, and the origin of the error in the parser.
> > :-) Can someone come up with a fix?
> The language definition makes it difficult to fix it where I would
> consider the "proper" place, i.e. in the tokenization:
> http://docs.python.org/ref/strings.html says that escapeseq is
> "\" <any ASCII character>. So "\x" is a valid shortstring.
>
> Then it becomes fuzzy: It says that any unrecognized escape sequences
> are left in the string. While that appears like a clear specification,
> it is not implemented (and has not been since Python 2.0). According
> to the spec, '\U12345678' is well-formed, and denotes the same string
> as '\\U12345678'.
>
> I now see the following choices:
>
> 1. Go back to implementing the spec. Stop complaining about
>    invalid escapes for \x and \U, and just interpret the \ as '\\'.
>    In this case, the current design could be left in place, and the
>    codecs would just stop raising these errors.
Sounds like a bad idea. I think \xNN (where N is not a hex digit) once behaved this way, and it was changed to explicitly complain instead as a service to users.
> 2. Change the spec to make it an error if \x is not followed by two
>    hex digits, \u not by four hex digits, \U not by 8, or the
>    value denoted by the \U digits is out of range. In this case,
>    I would propose to move the lexical analysis back into the parser,
>    or just make an internal API that will raise a proper SyntaxError
>    (it will be tricky to compute the column in the original source
>    line, though).
I'm all in favor of this spec change. Eventually we should change the lexer to do this right; for now, Kurt's patch is good enough.
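The rule in option 2 can be sketched in a few lines of Python. This is a hypothetical illustration, not Kurt's patch and not CPython's implementation: the function `check_escapes` and its parameters are invented here, and it assumes it is handed the body of a single-line, non-raw string literal together with the column where that body starts.

```python
import string

HEX_DIGITS = set(string.hexdigits)
# Number of hex digits each escape must carry under the proposed rule.
REQUIRED_DIGITS = {"x": 2, "u": 4, "U": 8}
MAX_CODE_POINT = 0x10FFFF


def check_escapes(body, filename="<string>", lineno=1, col_offset=0):
    """Raise SyntaxError for malformed \\x, \\u or \\U escapes in `body`.

    `body` is the text between the quotes of a non-raw str literal and
    `col_offset` is the 0-based column at which `body` starts
    (both are assumptions of this sketch).
    """
    i = 0
    while i < len(body):
        if body[i] != "\\":
            i += 1
            continue
        kind = body[i + 1 : i + 2]
        width = REQUIRED_DIGITS.get(kind)
        if width is None:
            i += 2  # some other escape; not checked in this sketch
            continue
        digits = body[i + 2 : i + 2 + width]
        if len(digits) != width or not set(digits) <= HEX_DIGITS:
            problem = f"\\{kind} must be followed by {width} hex digits"
        elif int(digits, 16) > MAX_CODE_POINT:
            problem = f"\\{kind}{digits} is out of range"
        else:
            i += 2 + width
            continue
        # SyntaxError accepts (filename, lineno, 1-based offset, text).
        raise SyntaxError(problem, (filename, lineno, col_offset + i + 1, body))
```

Called as `check_escapes(r"ab\U12345678", filename="demo.py", lineno=3, col_offset=5)`, this raises a SyntaxError whose `filename`, `lineno` and `offset` point at the offending backslash. Computing `col_offset` for a real source line is the tricky part mentioned above, and the sketch simply takes it as given.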
> 3. Change the spec to constrain escapeseq, giving up the rule
>    that uninterpreted escapes silently become two characters.
>    That's difficult to write down in EBNF, so should be formulated
>    through constraints in natural language. The lexer would have
>    to keep track of what kind of literal it is processing, and
>    reject invalid escapes directly on source level.
-1
> There are probably other options as well.
-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)