Issue 27582: Mispositioned SyntaxError caret for unknown code points (original) (raw)

Created on 2016-07-21 11:54 by ncoghlan, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
method1-change-cur.patch	Rosuav,2016-07-21 14:19	review
method2-change-cur-and-inp.patch	Rosuav,2016-07-21 14:19	review
method3-change-all-errors.patch	Rosuav,2016-07-21 14:19	review
method4-change-all-errors-if-possible.patch	Rosuav,2016-07-21 14:19	review
where-did-that-error-come-from.patch	Rosuav,2016-07-21 16:33	review
combined.patch	Rosuav,2016-07-21 16:35	review

Messages (12)
msg270914 - (view)	Author: Alyssa Coghlan (ncoghlan) *	Date: 2016-07-21 11:54
Reporting by Rustom Mody on python-ideas, the SyntaxError caret is confusingly mispositioned when an invalid Unicode codepoint appears as part of a larger sequence of invalid codepoints and/or valid identifier characters: >>> varname = “d“a”t”apoint File "", line 1 varname = “d“a”t”apoint ^ SyntaxError: invalid character in identifier >>> varname = “d“a”t”apoint.evidence File "", line 1 varname = “d“a”t”apoint.evidence ^ SyntaxError: invalid character in identifier >>> varname = “d“a”t”apoint[evidence] File "", line 1 varname = “d“a”t”apoint[evidence] ^ SyntaxError: invalid character in identifier >>> varname = “d“a”t”apoint(evidence) File "", line 1 varname = “d“a”t”apoint(evidence) ^ SyntaxError: invalid character in identifier If the invalid character is a non-identifiers ASCII character, the error message loses the trailing "in identifier" phrase and points to the correct place: >>> varname = ddda$t$apoint File "", line 1 varname = ddda$t$apoint ^ SyntaxError: invalid syntax >>> varname = d$a$t$apoint File "", line 1 varname = d$a$t$apoint ^ SyntaxError: invalid syntax >>> varname = ^d$a$t$apoint File "", line 1 varname = ^d$a$t$apoint ^ SyntaxError: invalid syntax >>> varname = !d$a$t$apoint File "", line 1 varname = !d$a$t$apoint ^ SyntaxError: invalid syntax >>> varname = `d$a$t$apoint File "", line 1 varname = `d$a$t$apoint ^ SyntaxError: invalid syntax
msg270921 - (view)	Author: Berker Peksag (berker.peksag) *	Date: 2016-07-21 13:26
This looks like a duplicate of issue 2382.
msg270922 - (view)	Author: Chris Angelico (Rosuav) *	Date: 2016-07-21 13:28
The question was raised that there might be a problem with (UTF-8) bytes vs characters, but that's definitely not it - pythonrun.c:1362 UTF-8-decodes the line of source and then gets its character length to use as the new offset. So I don't think this is a duplicate of 2382. (Side point: There appears to be quite a bit of complexity inside the CPython parser to cope with the fact that it does everything in UTF-8 bytes rather than simply decoding to text and lexing that. I presume that's for the sake of efficiency - that it'd be too slow to work through PyUnicode everywhere?) Am looking into the rest.
msg270927 - (view)	Author: Chris Angelico (Rosuav) *	Date: 2016-07-21 14:19
Actually pinpointing the invalid character may be impractical, as there are two boolean situations: either a UnicodeDecodeError (because you had an invalid UTF-8 stream), or PyUnicode_IsIdentifier returns false. Either way, it applies to the whole identifier. So there are a few possibilities, corresponding to the patches I'm attaching. 1) Change the way this one specific error is handled, in tokenizer.c verify_identifier(). If it finds an error, adjust tok->cur to point to the beginning of it. No new failures in test suite. 2) As above, but also change tok->inp, because of this comment in tokenizer.h:31 /* NB If done != E_OK, cur must be == inp!!! */ which I have no idea about the meaning of. This results in truncated error messages, but suggests that method 1 might be breaking an invariant that results in breakage elsewhere. If there are, though, they're not exercised by 'make test', which itself may be a problem. No new test failures. 3) Change the handling of ALL parser errors, in parsetok.c parsetok(), so now they all point to tok->start. Octal literals with 8s or 9s in them now get the caret pointing to the invalid digit, rather than the end of the literal. Unterminated strings point to the opening quote. And some forms of IndentationError now segfault Python. Test suite fails (unsurprisingly). 4) In response to the above segfault, hack it back to the old way of doing things if there's no tok->start. Maybe the condition should be done differently? No new failures in the test suite. I'd ideally like to use the technique from method 3 (either as patch 4 or with some other guard condition). Failing that, can anyone explain the "NB" above, and what ought to be done to comply with it?
msg270929 - (view)	Author: Chris Angelico (Rosuav) *	Date: 2016-07-21 14:22
BTW, here's how a session looks using method 4's change: >>> varname = asdf“d“a”t”apoint File "", line 1 varname = asdf“d“a”t”apoint ^ SyntaxError: invalid character in identifier >>> varname = asdf“d“a”t”apoint.evidence File "", line 1 varname = asdf“d“a”t”apoint.evidence ^ SyntaxError: invalid character in identifier >>> varname = ddda$t$apoint File "", line 1 varname = ddda$t$apoint ^ SyntaxError: invalid syntax >>> varname = asdf$d$a$t$apoint File "", line 1 varname = asdf$d$a$t$apoint ^ SyntaxError: invalid syntax >>> varname = 0o1234987654 + 123 File "", line 1 varname = 0o1234987654 + 123 ^ SyntaxError: invalid syntax >>> varname = "asdf" + "qwer File "", line 1 varname = "asdf" + "qwer ^ SyntaxError: EOL while scanning string literal
msg270933 - (view)	Author: Alyssa Coghlan (ncoghlan) *	Date: 2016-07-21 15:06
Looking at issue 2382, I agree that's a different problem (I'm seeing the current misbehaviour even though everything is consistently encoded as UTF-8) The main case we're interested in here is the PyUnicode_IsIdentifier one, so if we wanted to do better than "start or end of the token", we could introduce a new internal "_PyUnicode_FindNonIdentifier" that reported the position of the first non-identifier character (or -1 if it's a valid identifier). Unfortunately, I'm not at all familiar with parsetok.c myself (my own work with the code generator has been from the AST on), so I don't have a ready answer for your other questions.
msg270940 - (view)	Author: Chris Angelico (Rosuav) *	Date: 2016-07-21 16:33
Hmm, that'd be curious. The code to do that is actually pretty simple - see attached patch - but actually using that to affect error messages is a bit harder. Is it safe to mess with tok->start?
msg270941 - (view)	Author: Chris Angelico (Rosuav) *	Date: 2016-07-21 16:35
Attached is a combined patch that has the new private function for IsIdentifier, method 4's error handling change, and a bit of glue in the middle to make use of it. The result is a passing test suite (bar test_site which was already failing on my system) and a caret that points to the errant Unicode character, or the beginning of the identifier if it's a decode problem.
msg270946 - (view)	Author: Stephen J. Turnbull (sjt) *	Date: 2016-07-21 17:15
I still think the easiest thing to do would be to make all non-ASCII characters instances of "invalid_character_token", self-delimiting in the same way that operators are. That would automatically point to exactly the right place in the token stream, and requires zero changes to the error handling code. I don't have time to look at the code, but I suspect that you could handle this exactly the same way that ? and $ are handled, and maybe even use the same token type.
msg270986 - (view)	Author: Adam Bartoš (Drekin) *	Date: 2016-07-22 13:30
Maybe this is related: http://bugs.python.org/issue26152.
msg271138 - (view)	Author: Alyssa Coghlan (ncoghlan) *	Date: 2016-07-24 08:30
Drekin, agreed, that looks like the same problem. Since this one has draft patches attached, I'll mark the other one as a duplicate.
msg406869 - (view)	Author: Irit Katriel (iritkatriel) *	Date: 2021-11-23 17:42
This seems to have been fixed by now, I get this on 3.11: >>> varname = “d“a”t”apoint File "", line 1 varname = “d“a”t”apoint ^ SyntaxError: invalid character '“' (U+201C) >>> varname = “d“a”t”apoint.evidence File "", line 1 varname = “d“a”t”apoint.evidence ^ SyntaxError: invalid character '“' (U+201C) >>> varname = “d“a”t”apoint[evidence] File "", line 1 varname = “d“a”t”apoint[evidence] ^ SyntaxError: invalid character '“' (U+201C) >>> varname = “d“a”t”apoint(evidence) File "", line 1 varname = “d“a”t”apoint(evidence) ^ SyntaxError: invalid character '“' (U+201C)

History
Date	User	Action	Args
2022-04-11 14:58:34	admin	set	github: 71769
2021-12-03 17:16:24	iritkatriel	set	status: pending -> closedstage: needs patch -> resolved
2021-11-23 17:42:44	iritkatriel	set	status: open -> pendingnosy: + iritkatrielmessages: + resolution: fixed
2016-07-24 08:33:41	ncoghlan	link	issue26152 superseder
2016-07-24 08:30:51	ncoghlan	set	messages: +
2016-07-22 13:30:44	Drekin	set	nosy: + Drekinmessages: +
2016-07-21 17:15:04	sjt	set	nosy: + sjtmessages: +
2016-07-21 16:35:15	Rosuav	set	files: + combined.patchmessages: +
2016-07-21 16:33:21	Rosuav	set	files: + where-did-that-error-come-from.patchmessages: +
2016-07-21 15:06:14	ncoghlan	set	messages: +
2016-07-21 14:22:41	Rosuav	set	messages: +
2016-07-21 14:19:36	Rosuav	set	files: + method4-change-all-errors-if-possible.patch
2016-07-21 14:19:22	Rosuav	set	files: + method3-change-all-errors.patch
2016-07-21 14:19:14	Rosuav	set	files: + method2-change-cur-and-inp.patch
2016-07-21 14:19:04	Rosuav	set	files: + method1-change-cur.patchkeywords: + patchmessages: +
2016-07-21 13:28:41	Rosuav	set	messages: +
2016-07-21 13:26:46	berker.peksag	set	nosy: + berker.peksagmessages: +
2016-07-21 11:57:50	Rosuav	set	nosy: + Rosuav
2016-07-21 11:54:14	ncoghlan	create