[Python-3000] Invalid \U escape in source code give hard-to-trace error

"Martin v. Löwis" martin at v.loewis.de
Wed Jul 18 05:36:05 CEST 2007

> When a source file contains a string literal with an out-of-range \U escape (e.g. "\U12345678"), instead of a syntax error pointing to the offending literal, I get this, without any indication of the file or line:
>
> UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position 0-9: illegal Unicode character
>
> This is quite hard to track down.

I think the fundamental flaw is that a codec is used to implement the Python syntax (or, rather, lexical rules).
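
To illustrate why a codec cannot give a useful diagnostic, here is a minimal sketch that calls the 'unicode_escape' codec directly. It assumes a current CPython 3 interpreter, and the helper names `decode_escapes` and `try_decode` are made up for this example. The exception the codec raises only knows byte positions inside the data it was handed, not a file name or line number.

```python
import codecs


def decode_escapes(raw: bytes) -> str:
    """Decode backslash escapes with the 'unicode_escape' codec."""
    return codecs.decode(raw, "unicode_escape")


def try_decode(raw: bytes):
    """Return the decoded text, or the UnicodeDecodeError that was raised."""
    try:
        return decode_escapes(raw)
    except UnicodeDecodeError as exc:
        return exc


ok = try_decode(b"\\u00e9")       # a valid four-digit escape
bad = try_decode(b"\\U12345678")  # value above 0x10FFFF

print(repr(ok))
print(type(bad).__name__, bad.reason)
# The error carries offsets into the bytes, but no source location.
print(bad.start, hasattr(bad, "lineno"))
```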

Not quite sure what the rationale for this design was; doing it on the lexical level is (was) tricky because \u escapes were allowed only for Unicode literals, and the lexer had no knowledge of the prefix preceding a literal. (In 3k, it's still similar, because \U escapes have no effect in bytes and raw literals).

Still, even if it is "only" handled at the parsing level, I don't see why it needs to be a codec. Instead, implementing escapes in the compiler would still allow for proper diagnostics (notice that in the AST the original lexical form of the string literal is gone).

> (Both the location of the bad literal in the source file, and the origin of the error in the parser. :-) Can someone come up with a fix?

The language definition makes it difficult to fix it where I would consider the "proper" place, i.e. in the tokenization:

http://docs.python.org/ref/strings.html

says that escapeseq is "\" <any ASCII character>. So "\x" is a valid shortstring.

Then it becomes fuzzy: it says that any unrecognized escape sequences are left in the string. While that appears to be a clear specification, it is not implemented (and has not been since Python 2.0). According to the spec, '\U12345678' is well-formed, and denotes the same string as '\\U12345678'.
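
The gap between the spec and the implementation can be observed with a small sketch. It assumes a current CPython 3 interpreter (behaviour on other versions may differ), and `evaluate_literal` is a hypothetical helper. An unrecognized escape such as \q is left in the string, whereas an incomplete \x is rejected rather than being left in place.

```python
import warnings


def evaluate_literal(source):
    """Evaluate one string literal given as source text.

    Returns the resulting value, or the SyntaxError raised while compiling.
    """
    with warnings.catch_warnings():
        # Recent CPython versions warn about unrecognized escapes like \q.
        warnings.simplefilter("ignore")
        try:
            return eval(source)
        except SyntaxError as exc:
            return exc


kept = evaluate_literal(r"'\q'")      # unrecognized escape
rejected = evaluate_literal(r"'\x'")  # \x without two hex digits

print(repr(kept))
print(type(rejected).__name__)
```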

I now see the following choices:

1. Restore implementing the spec. Stop complaining about invalid escapes for \x and \U, and just interpret the \ as '\\'. In this case, the current design could be left in place, and the codecs would just stop raising these errors.

2. Change the spec to make it an error if \x is not followed by two hex digits, \u not by four hex digits, \U not by eight, or the value denoted by the \U digits is out of range. In this case, I would propose to move the lexical analysis back into the parser, or just make an internal API that will raise a proper SyntaxError (it will be tricky to compute the column in the original source line, though).

3. Change the spec to constrain escapeseq, giving up the rule that uninterpreted escapes silently become two characters. That's difficult to write down in EBNF, so it should be formulated through constraints in natural language. The lexer would have to keep track of what kind of literal it is processing, and reject invalid escapes directly at the source level.

There are probably other options as well.
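
As a rough sketch of what a check that raises a proper SyntaxError might look like: the function `check_escapes` and its parameters are hypothetical, not an existing CPython API. It assumes the caller passes the raw text between the quotes of a non-raw string literal, together with the literal's position in the file, and it only validates the \x, \u and \U forms discussed above.

```python
# Hypothetical helper, not part of CPython.
_HEX_ESCAPE_LENGTHS = {"x": 2, "u": 4, "U": 8}
_HEX_DIGITS = set("0123456789abcdefABCDEF")


def check_escapes(body, filename="<string>", lineno=1, col_offset=0):
    """Raise SyntaxError for a malformed \\x, \\u or \\U escape in `body`.

    `col_offset` is the 0-based column at which `body` starts in its line.
    """
    i = 0
    while i < len(body):
        if body[i] != "\\":
            i += 1
            continue
        kind = body[i + 1 : i + 2]
        width = _HEX_ESCAPE_LENGTHS.get(kind)
        if width is None:
            i += 2  # some other escape: skip the backslash and next char
            continue
        digits = body[i + 2 : i + 2 + width]
        if len(digits) != width or not set(digits) <= _HEX_DIGITS:
            msg = f"\\{kind} must be followed by {width} hex digits"
        elif int(digits, 16) > 0x10FFFF:
            msg = f"\\{kind}{digits} is outside the Unicode range"
        else:
            i += 2 + width
            continue
        # SyntaxError offsets are 1-based columns.
        raise SyntaxError(msg, (filename, lineno, col_offset + i + 1, body))
```

A caller could then report the file, line, and column of the bad escape, for example `check_escapes(r"ab\U12345678", filename="demo.py", lineno=3, col_offset=4)`. Computing `col_offset` for literals that span several lines is the tricky part and is left out here.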

Regards, Martin
