Issue 27582: Mispositioned SyntaxError caret for unknown code points (original) (raw)

Created on 2016-07-21 11:54 by ncoghlan, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
method1-change-cur.patch Rosuav,2016-07-21 14:19 review
method2-change-cur-and-inp.patch Rosuav,2016-07-21 14:19 review
method3-change-all-errors.patch Rosuav,2016-07-21 14:19 review
method4-change-all-errors-if-possible.patch Rosuav,2016-07-21 14:19 review
where-did-that-error-come-from.patch Rosuav,2016-07-21 16:33 review
combined.patch Rosuav,2016-07-21 16:35 review
Messages (12)
msg270914 - (view) Author: Alyssa Coghlan (ncoghlan) * (Python committer) Date: 2016-07-21 11:54
Reporting by Rustom Mody on python-ideas, the SyntaxError caret is confusingly mispositioned when an invalid Unicode codepoint appears as part of a larger sequence of invalid codepoints and/or valid identifier characters: >>> varname = “d“a”t”apoint File "", line 1 varname = “d“a”t”apoint ^ SyntaxError: invalid character in identifier >>> varname = “d“a”t”apoint.evidence File "", line 1 varname = “d“a”t”apoint.evidence ^ SyntaxError: invalid character in identifier >>> varname = “d“a”t”apoint[evidence] File "", line 1 varname = “d“a”t”apoint[evidence] ^ SyntaxError: invalid character in identifier >>> varname = “d“a”t”apoint(evidence) File "", line 1 varname = “d“a”t”apoint(evidence) ^ SyntaxError: invalid character in identifier If the invalid character is a non-identifiers ASCII character, the error message loses the trailing "in identifier" phrase and points to the correct place: >>> varname = ddda$t$apoint File "", line 1 varname = ddda$t$apoint ^ SyntaxError: invalid syntax >>> varname = d$a$t$apoint File "", line 1 varname = d$a$t$apoint ^ SyntaxError: invalid syntax >>> varname = ^d$a$t$apoint File "", line 1 varname = ^d$a$t$apoint ^ SyntaxError: invalid syntax >>> varname = !d$a$t$apoint File "", line 1 varname = !d$a$t$apoint ^ SyntaxError: invalid syntax >>> varname = `d$a$t$apoint File "", line 1 varname = `d$a$t$apoint ^ SyntaxError: invalid syntax
msg270921 - (view) Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2016-07-21 13:26
This looks like a duplicate of issue 2382.
msg270922 - (view) Author: Chris Angelico (Rosuav) * Date: 2016-07-21 13:28
The question was raised that there might be a problem with (UTF-8) bytes vs characters, but that's definitely not it - pythonrun.c:1362 UTF-8-decodes the line of source and then gets its character length to use as the new offset. So I don't think this is a duplicate of 2382. (Side point: There appears to be quite a bit of complexity inside the CPython parser to cope with the fact that it does everything in UTF-8 bytes rather than simply decoding to text and lexing that. I presume that's for the sake of efficiency - that it'd be too slow to work through PyUnicode everywhere?) Am looking into the rest.
msg270927 - (view) Author: Chris Angelico (Rosuav) * Date: 2016-07-21 14:19
Actually pinpointing the invalid character may be impractical, as there are two boolean situations: either a UnicodeDecodeError (because you had an invalid UTF-8 stream), or PyUnicode_IsIdentifier returns false. Either way, it applies to the whole identifier. So there are a few possibilities, corresponding to the patches I'm attaching. 1) Change the way this one specific error is handled, in tokenizer.c verify_identifier(). If it finds an error, adjust tok->cur to point to the beginning of it. No new failures in test suite. 2) As above, but also change tok->inp, because of this comment in tokenizer.h:31 /* NB If done != E_OK, cur must be == inp!!! */ which I have no idea about the meaning of. This results in truncated error messages, but suggests that method 1 might be breaking an invariant that results in breakage elsewhere. If there are, though, they're not exercised by 'make test', which itself may be a problem. No new test failures. 3) Change the handling of ALL parser errors, in parsetok.c parsetok(), so now they all point to tok->start. Octal literals with 8s or 9s in them now get the caret pointing to the invalid digit, rather than the end of the literal. Unterminated strings point to the opening quote. And some forms of IndentationError now segfault Python. Test suite fails (unsurprisingly). 4) In response to the above segfault, hack it back to the old way of doing things if there's no tok->start. Maybe the condition should be done differently? No new failures in the test suite. I'd ideally like to use the technique from method 3 (either as patch 4 or with some other guard condition). Failing that, can anyone explain the "NB" above, and what ought to be done to comply with it?
msg270929 - (view) Author: Chris Angelico (Rosuav) * Date: 2016-07-21 14:22
BTW, here's how a session looks using method 4's change: >>> varname = asdf“d“a”t”apoint File "", line 1 varname = asdf“d“a”t”apoint ^ SyntaxError: invalid character in identifier >>> varname = asdf“d“a”t”apoint.evidence File "", line 1 varname = asdf“d“a”t”apoint.evidence ^ SyntaxError: invalid character in identifier >>> varname = ddda$t$apoint File "", line 1 varname = ddda$t$apoint ^ SyntaxError: invalid syntax >>> varname = asdf$d$a$t$apoint File "", line 1 varname = asdf$d$a$t$apoint ^ SyntaxError: invalid syntax >>> varname = 0o1234987654 + 123 File "", line 1 varname = 0o1234987654 + 123 ^ SyntaxError: invalid syntax >>> varname = "asdf" + "qwer File "", line 1 varname = "asdf" + "qwer ^ SyntaxError: EOL while scanning string literal
msg270933 - (view) Author: Alyssa Coghlan (ncoghlan) * (Python committer) Date: 2016-07-21 15:06
Looking at issue 2382, I agree that's a different problem (I'm seeing the current misbehaviour even though everything is consistently encoded as UTF-8) The main case we're interested in here is the PyUnicode_IsIdentifier one, so if we wanted to do better than "start or end of the token", we could introduce a new internal "_PyUnicode_FindNonIdentifier" that reported the position of the first non-identifier character (or -1 if it's a valid identifier). Unfortunately, I'm not at all familiar with parsetok.c myself (my own work with the code generator has been from the AST on), so I don't have a ready answer for your other questions.
msg270940 - (view) Author: Chris Angelico (Rosuav) * Date: 2016-07-21 16:33
Hmm, that'd be curious. The code to do that is actually pretty simple - see attached patch - but actually using that to affect error messages is a bit harder. Is it safe to mess with tok->start?
msg270941 - (view) Author: Chris Angelico (Rosuav) * Date: 2016-07-21 16:35
Attached is a combined patch that has the new private function for IsIdentifier, method 4's error handling change, and a bit of glue in the middle to make use of it. The result is a passing test suite (bar test_site which was already failing on my system) and a caret that points to the errant Unicode character, or the beginning of the identifier if it's a decode problem.
msg270946 - (view) Author: Stephen J. Turnbull (sjt) * (Python triager) Date: 2016-07-21 17:15
I still think the easiest thing to do would be to make all non-ASCII characters instances of "invalid_character_token", self-delimiting in the same way that operators are. That would automatically point to exactly the right place in the token stream, and requires zero changes to the error handling code. I don't have time to look at the code, but I suspect that you could handle this exactly the same way that ? and $ are handled, and maybe even use the same token type.
msg270986 - (view) Author: Adam Bartoš (Drekin) * Date: 2016-07-22 13:30
Maybe this is related: http://bugs.python.org/issue26152.
msg271138 - (view) Author: Alyssa Coghlan (ncoghlan) * (Python committer) Date: 2016-07-24 08:30
Drekin, agreed, that looks like the same problem. Since this one has draft patches attached, I'll mark the other one as a duplicate.
msg406869 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-11-23 17:42
This seems to have been fixed by now, I get this on 3.11: >>> varname = “d“a”t”apoint File "", line 1 varname = “d“a”t”apoint ^ SyntaxError: invalid character '“' (U+201C) >>> varname = “d“a”t”apoint.evidence File "", line 1 varname = “d“a”t”apoint.evidence ^ SyntaxError: invalid character '“' (U+201C) >>> varname = “d“a”t”apoint[evidence] File "", line 1 varname = “d“a”t”apoint[evidence] ^ SyntaxError: invalid character '“' (U+201C) >>> varname = “d“a”t”apoint(evidence) File "", line 1 varname = “d“a”t”apoint(evidence) ^ SyntaxError: invalid character '“' (U+201C)
History
Date User Action Args
2022-04-11 14:58:34 admin set github: 71769
2021-12-03 17:16:24 iritkatriel set status: pending -> closedstage: needs patch -> resolved
2021-11-23 17:42:44 iritkatriel set status: open -> pendingnosy: + iritkatrielmessages: + resolution: fixed
2016-07-24 08:33:41 ncoghlan link issue26152 superseder
2016-07-24 08:30:51 ncoghlan set messages: +
2016-07-22 13:30:44 Drekin set nosy: + Drekinmessages: +
2016-07-21 17:15:04 sjt set nosy: + sjtmessages: +
2016-07-21 16:35:15 Rosuav set files: + combined.patchmessages: +
2016-07-21 16:33:21 Rosuav set files: + where-did-that-error-come-from.patchmessages: +
2016-07-21 15:06:14 ncoghlan set messages: +
2016-07-21 14:22:41 Rosuav set messages: +
2016-07-21 14:19:36 Rosuav set files: + method4-change-all-errors-if-possible.patch
2016-07-21 14:19:22 Rosuav set files: + method3-change-all-errors.patch
2016-07-21 14:19:14 Rosuav set files: + method2-change-cur-and-inp.patch
2016-07-21 14:19:04 Rosuav set files: + method1-change-cur.patchkeywords: + patchmessages: +
2016-07-21 13:28:41 Rosuav set messages: +
2016-07-21 13:26:46 berker.peksag set nosy: + berker.peksagmessages: +
2016-07-21 11:57:50 Rosuav set nosy: + Rosuav
2016-07-21 11:54:14 ncoghlan create