Issue 26152: A non-breaking space in a source

Created on 2016-01-19 12:01 by Drekin, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (9)
msg258584	Author: Adam Bartoš (Drekin) *	Date: 2016-01-19 12:01
Consider the following code:

>>> 1, 2
  File "<stdin>", line 1
    1, 2
       ^
SyntaxError: invalid character in identifier

The error is due to the fact that the space before "2" is actually a non-breaking space. The error message and the position of the caret are misleading. The tokenize module gives an ERRORTOKEN at the position of the space, so shouldn't the message be more like "invalid syntax" with the correct position, or even something more appropriate?
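To inspect the reported position programmatically rather than by eye, a minimal sketch along these lines may help. The helper name `syntax_error_for` and the filename `<example>` are made up for illustration; the exact message and offset depend on the Python version, so none are asserted here.

```python
def syntax_error_for(source):
    """Compile `source` as an expression; return the SyntaxError, or None."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError as err:
        return err
    return None


# "\xa0" is U+00A0 NO-BREAK SPACE, used in place of the ordinary space.
err = syntax_error_for("1,\xa02")
if err is not None:
    # The message and offset vary between Python versions.
    print(err.msg, err.lineno, err.offset)

# With an ordinary space the expression compiles and no error is returned.
print(syntax_error_for("1, 2"))
```

The `lineno` and `offset` attributes are the same values that appear in `err.args[1][1:3]`.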
msg258616	Author: Andrew Barnert (abarnert) *	Date: 2016-01-19 18:53
Ultimately, this is because the tokenizer works byte by byte instead of character by character, as far as possible. Since any byte >= 128 must be part of some non-ASCII character, and the only legal use for non-ASCII characters outside of quotes and comments is as part of an identifier, the tokenizer assumes (see the macros at the top of tokenizer.c, and the top of the again block in tok_get) that any byte >= 128 is part of an identifier, and then checks the whole string with PyUnicode_IsIdentifier at the end. This actually gives a better error for more visible glyphs, especially ones that look letter-like but aren't in XID_Continue, but it is kind of weird for a few, like the non-breaking space.

If this needs to be fixed, I think the simplest thing is to special-case things: if the first non-valid-identifier character is in category Z, set an error about invalid whitespace instead of invalid identifier character. (This would probably require adding a PyUnicode_CheckIdentifier that, instead of just returning 0 for failure as PyUnicode_IsIdentifier does, returns -n for a non-identifier character with code point n.)
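The special-casing proposed here could look roughly like the following in pure Python. This is only a sketch of the idea: `check_identifier` is a hypothetical stand-in for the proposed PyUnicode_CheckIdentifier (which does not exist), `describe_problem` and its messages are invented for illustration, and the per-character test via `str.isidentifier()` is an assumption about how one might approximate the C check.

```python
import unicodedata


def check_identifier(text):
    """Hypothetical analogue of the proposed PyUnicode_CheckIdentifier.

    Returns 1 for a valid identifier, 0 for an empty string, and -n where
    n is the code point of the first character that is not allowed at its
    position. (A NUL character would be ambiguous with 0; ignored here.)
    """
    if not text:
        return 0
    for index, ch in enumerate(text):
        # The first character must be able to start an identifier; later
        # characters only need to be able to continue one.
        ok = ch.isidentifier() if index == 0 else ("a" + ch).isidentifier()
        if not ok:
            return -ord(ch)
    return 1


def describe_problem(text):
    """Pick an error message based on the category of the offending character."""
    result = check_identifier(text)
    if result > 0:
        return "ok"
    if result == 0:
        return "empty"
    ch = chr(-result)
    # Categories Zs, Zl and Zp are the Unicode separator ("Z") categories.
    if unicodedata.category(ch).startswith("Z"):
        return "invalid whitespace U+%04X" % ord(ch)
    return "invalid character in identifier U+%04X" % ord(ch)


print(describe_problem("\xa02"))  # the NBSP followed by "2"
print(describe_problem("a\u20ac"))  # a currency sign is not whitespace
```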
msg258679	Author: Adam Bartoš (Drekin) *	Date: 2016-01-20 13:05
That explains the message. But why is the caret at a wrong place?
msg258711	Author: Martin Panter (martin.panter) *	Date: 2016-01-20 20:05
The caret always points to the end of the token, I think.
msg258713	Author: Adam Bartoš (Drekin) *	Date: 2016-01-20 20:31
We have one particular invalid token, so why should it point to the next token rather than to the invalid one?
msg258714	Author: Martin Panter (martin.panter) *	Date: 2016-01-20 20:40
Assuming Andrew is correct, it sounds like the tokenizer is treating the NBSP and the “2” as part of the same token, because NBSP is non-ASCII.
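A simplified Python model of the byte-level scan described earlier in the thread shows why the NBSP and the "2" could end up in one candidate token. This is a sketch under assumptions, not the tokenizer.c code: the two predicates imitate the macros mentioned above as far as they are described in this thread, and `scan_candidate` is a made-up helper.

```python
def is_potential_identifier_start(byte):
    # ASCII letters, underscore, or any byte of a non-ASCII character.
    return (ord("a") <= byte <= ord("z")
            or ord("A") <= byte <= ord("Z")
            or byte == ord("_")
            or byte >= 128)


def is_potential_identifier_char(byte):
    # Same as above, plus ASCII digits.
    return (is_potential_identifier_start(byte)
            or ord("0") <= byte <= ord("9"))


def scan_candidate(data, start):
    """Return the end index of the identifier-like run beginning at `start`."""
    if not is_potential_identifier_start(data[start]):
        return start
    end = start + 1
    while end < len(data) and is_potential_identifier_char(data[end]):
        end += 1
    return end


data = "1,\xa02".encode("utf-8")  # the NBSP becomes the two bytes C2 A0
end = scan_candidate(data, 2)
candidate = data[2:end].decode("utf-8")
# The run covers the NBSP and the digit, and fails the identifier check.
print(repr(candidate), candidate.isidentifier())
```

In this model the run ends after the "2", which is consistent with a caret placed at the end of the token.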
msg258715	Author: Adam Bartoš (Drekin) *	Date: 2016-01-20 20:48
It could still point to the first or the last byte of the invalid token rather than to the start of the next token. Also, with the Python implementation of the tokenizer in the tokenize module, we get an ERRORTOKEN containing a non-breaking space followed by a number token containing 2.
msg258722	Author: Andrew Barnert (abarnert) *	Date: 2016-01-20 22:01
> Assuming Andrew is correct, it sounds like the tokenizer is treating the NBSP and the “2” as part of the same token, because NBSP is non-ASCII.

It's more complicated than that. When you get an invalid character, it splits the token up. So, in this case, you get a separate `ERRORTOKEN` from cols 2-3 and a `NUMBER` token from cols 3-4. Even in the case of `1, a\xa0\xa02`, you get a `NAME` token for the `a`, a separate `ERRORTOKEN` for each nbsp, and a `NUMBER` token for the `2`.

But I think the code that generates the `SyntaxError` must be trying to re-generate the "intended token" from the broken one. For example:

>>> eval('1\xa0\xa02a')
  File "<string>", line 1
    1  2a
        ^
SyntaxError: invalid character in identifier

And if you capture the error and look at it, `e.args[1][1:3]` is 1, 5, which matches what you see. But if you tokenize it (e.g., `list(tokenize.tokenize(io.BytesIO('1\xa0\xa02a'.encode('utf-8')).readline))`, but you'll probably want to wrap that up in a function if you're playing with it a lot...), you get a `NUMBER` from 0-1, an `ERRORTOKEN` from 1-2, another `ERRORTOKEN` from 2-3, a `NUMBER` from 3-4, and a `NAME` from 4-5.

So, why does the `SyntaxError` point at the `NAME` instead of the first `ERRORTOKEN`? Presumably there's some logic that tries to work out that the two `ERRORTOKEN`s, `NUMBER`, and `NAME` were all intended to be one big identifier and points at that instead.
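For anyone experimenting with this, the tokenize call above can be wrapped in a small helper along these lines. The name `dump_tokens` and its return shape are made up for illustration. The sketch also assumes that some Python versions may raise an exception from `tokenize` instead of yielding an `ERRORTOKEN`, so it returns the exception alongside whatever tokens were produced.

```python
import io
import tokenize


def dump_tokens(source):
    """Tokenize `source` and return (tokens, error).

    tokens is a list of (type name, string, start column, end column);
    error is the exception raised by the tokenizer, or None.
    """
    tokens = []
    readline = io.BytesIO(source.encode("utf-8")).readline
    try:
        for tok in tokenize.tokenize(readline):
            tokens.append((tokenize.tok_name[tok.type], tok.string,
                           tok.start[1], tok.end[1]))
    except (tokenize.TokenError, SyntaxError) as exc:
        # Keep the tokens collected so far and report the failure.
        return tokens, exc
    return tokens, None


tokens, error = dump_tokens("1\xa0\xa02a")
for entry in tokens:
    print(entry)
if error is not None:
    print("tokenizer stopped:", error)
```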
msg271139	Author: Alyssa Coghlan (ncoghlan) *	Date: 2016-07-24 08:33
http://bugs.python.org/issue27582 is a later mention of the same problem that attracted patches before Adam noticed it was a repeat of this issue. Marking this as the duplicate, since the problem applies to more than just Unicode whitespace, and the problems being discussed there should also help with this subproblem.

History
Date	User	Action	Args
2022-04-11 14:58:26	admin	set	github: 70340
2016-07-24 08:33:56	ncoghlan	set	status: open -> closed; resolution: duplicate
2016-07-24 08:33:41	ncoghlan	set	superseder: Mispositioned SyntaxError caret for unknown code points; messages: +; nosy: + ncoghlan
2016-01-20 22:01:44	abarnert	set	messages: +
2016-01-20 20:48:07	Drekin	set	messages: +
2016-01-20 20:40:13	martin.panter	set	messages: +
2016-01-20 20:31:52	Drekin	set	messages: +
2016-01-20 20:05:58	martin.panter	set	nosy: + martin.panter; messages: +
2016-01-20 13:05:51	Drekin	set	messages: +
2016-01-19 18:53:52	abarnert	set	nosy: + abarnert; messages: +
2016-01-19 12:01:37	Drekin	create