Issue 32982: Parse out invisible Unicode characters?

Created on 2018-03-02 06:04 by leewz, last changed 2022-04-11 14:58 by admin.

Messages (4)
msg313127 - (view)	Author: Franklin? Lee (leewz)	Date: 2018-03-02 06:04
The following line should have a character that trips up the compiler.

    ‎indices = range(5)

The character is \u200e, and was inserted by Google Keep. (I've already reported the issue to Google as a regression.)

Here's the error message:

"""
  File "", line 3
    ‎indices = range(5)
    ^
SyntaxError: invalid character in identifier
"""

Depending on the terminal or editor, it may not be possible to tell the problem just from looking. Without knowledge/experience of Unicode, it may not be possible to figure out the problem at all.

Since Python source now uses Unicode by default, should certain invisible characters be stripped out during compilation?
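For anyone hitting the same error, a minimal sketch of how the offending character could be located before compiling. It assumes that "invisible" means Unicode general category Cf (format characters), which covers \u200e but is only one possible definition; `find_format_chars` and `sample` are hypothetical names used for illustration, not part of Python or of this report.

```python
import unicodedata


def find_format_chars(source):
    """Yield (line_no, col, char) for each character in category 'Cf'.

    Assumption: 'Cf' (format) is used here as a stand-in for "invisible".
    Line numbers are 1-based, columns are 0-based.
    """
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line):
            if unicodedata.category(ch) == "Cf":
                yield line_no, col, ch


# Hypothetical sample: the second line starts with U+200E.
sample = "x = 1\n\u200eindices = range(5)\n"
hits = list(find_format_chars(sample))
for line_no, col, ch in hits:
    print(f"line {line_no}, column {col}: U+{ord(ch):04X}")
```

This only reports the characters; it does not strip them, so the decision about what to do with them stays with the user.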
msg313155 - (view)	Author: Matthew Barnett (mrabarnett) *	Date: 2018-03-02 19:10
For the record, '\u200e' is '\N{LEFT-TO-RIGHT MARK}'.
msg313159 - (view)	Author: Glenn Linderman (v+python) *	Date: 2018-03-02 19:46
Characters should not be stripped during compilation. But I can see where it might be helpful if the error message included the code point of the character, and its printed form in case it is printable, as well as having the ^ pointer pointing to it.
msg313629 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2018-03-12 00:42
I think it sounds like a good idea to put the printed representation as a repr'ed string, followed by the code point representation in parentheses, in that message after "invalid character".
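A rough sketch of what such a message could look like, written in Python for illustration. This is not how the compiler builds its error text; `describe_invalid_char` is a hypothetical helper, and appending the Unicode name is an extra assumption on top of the suggestion above.

```python
import unicodedata


def describe_invalid_char(ch):
    """Build text like: invalid character '<repr>' (U+XXXX, NAME).

    Hypothetical helper: repr() escapes non-printable characters, so an
    invisible character still shows up in the message.
    """
    name = unicodedata.name(ch, "<unnamed>")  # fallback for unnamed code points
    return f"invalid character {ch!r} (U+{ord(ch):04X}, {name})"


message = describe_invalid_char("\u200e")
print(message)  # invalid character '\u200e' (U+200E, LEFT-TO-RIGHT MARK)
```

Because repr() leaves printable characters as they are, the same format would show a visible but disallowed character directly, with the code point alongside it.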