[Python-Dev] Why aren't escape sequences in literal strings handled by the tokenizer?
Guido van Rossum guido at python.org
Thu May 17 21:51:56 EDT 2018
To answer Larry's question, there's an overwhelming number of different
options -- bytes/unicode, raw/cooked, and (in Py2) from __future__ import
unicode_literals. So it's easier to do the actual semantic conversion in a
later stage -- then the lexer only has to worry about hopping over
backslashes.
On Thu, May 17, 2018 at 3:38 PM, Eric V. Smith <eric at trueblade.com> wrote:
On 5/17/2018 3:01 PM, Larry Hastings wrote:
I fed this into tokenize.tokenize():

    b''' x = "\u1234" '''

I was a bit surprised to see the \u1234 escape in the output.  Particularly
because the output (t.string) was a string and not bytes.

For those (like me) who have no idea how to use tokenize.tokenize's wacky
interface, the test code is:

    list(tokenize.tokenize(io.BytesIO(b''' x = "\u1234" ''').readline))

Maybe I'm making a parade of my ignorance, but I assumed that string
literals were parsed by the parser--just like everything else is parsed by
the parser, hey it seems like a good place for it--and in particular that
the escape sequence substitutions would be done in the tokenizer.  Having
stared at it a little, I now detect a whiff of "this design solved a real
problem".  So... what was the problem, and how does this design solve it?

I assume the intent is to not throw away any information in the lexer, and
give the parser full access to the original string.  But that's just a
guess.

BTW, my use case is that I hoped to use CPython's tokenizer to parse some
Python-ish-looking text and handle double-quoted strings for me.
Especially all the escape sequences--leveraging all CPython's support for
funny things like \N{penguin}.  The current behavior of the tokenizer makes
me think it'd be easier to roll my own!

Can you feed the token text to the ast?

>>> ast.literal_eval('"\u1234"')
'ሴ'

Eric
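[A hedged sketch of how Eric's suggestion could serve Larry's use case:
let tokenize find the string literals in the Python-ish text, then hand
each STRING token to ast.literal_eval so it applies all of CPython's
escape handling. The helper name below is made up for illustration.]

    # Illustrative combination of tokenize + ast.literal_eval, assuming
    # the input is close enough to Python for tokenize to accept it.
    import ast
    import io
    import tokenize

    def cooked_strings(text):
        """Yield the decoded value of every string literal in `text`."""
        readline = io.BytesIO(text.encode("utf-8")).readline
        for tok in tokenize.tokenize(readline):
            if tok.type == tokenize.STRING:
                yield ast.literal_eval(tok.string)

    print(list(cooked_strings('x = "\\u1234"\ny = "\\N{PENGUIN}"\n')))
    # ['ሴ', '🐧']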
Python-Dev mailing list
Python-Dev at python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: https://mail.python.org/mailman/options/python-dev/guido%40python.org
--
--Guido van Rossum (python.org/~guido)