Issue 25311: Add f-string support to tokenize.py (original) (raw)

Created on 2015-10-04 17:23 by skrah, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (13)

msg252274 - (view)

Author: Stefan Krah (skrah) * (Python committer)

Date: 2015-10-04 17:23

I think tokenize.py needs to be updated to support f-strings.

BTW, the f-string implementation seems to be incredibly robust. Nice work!

msg252275 - (view)

Author: Eric V. Smith (eric.smith) * (Python committer)

Date: 2015-10-04 17:34

Thanks for noticing tokenize.py. And thanks for the kind note!

msg252295 - (view)

Author: Martin Panter (martin.panter) * (Python committer)

Date: 2015-10-05 00:42

I was just about to make the same bug report :) I guess it would be fine to tokenize F-strings as the same string tokens as others; it probably just needs an F added to the right regular expression.

$ ./python -btWall -m tokenize
"string"
1,0-1,8:            STRING         '"string"'
1,8-1,9:            NEWLINE        '\n'
b"string"
3,0-3,9:            STRING         'b"string"'
3,9-3,10:           NEWLINE        '\n'
f"string"
4,0-4,1:            NAME           'f'
4,1-4,9:            STRING         '"string"'
4,9-4,10:           NEWLINE        '\n'
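For reference, the split shown above can be reproduced directly with the tokenize module. A minimal sketch (on a 3.5-era interpreter this prints NAME 'f' followed by STRING '"string"'; after the fix it is a single STRING token, and 3.12+ emits dedicated FSTRING_* tokens instead):

import io
import tokenize

# Tokenize an f-string literal the same way `-m tokenize` does.
for tok in tokenize.generate_tokens(io.StringIO('f"string"\n').readline):
    print(tok.start, tok.end, tokenize.tok_name[tok.type], repr(tok.string))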

msg252479 - (view)

Author: Nan Wu (Nan Wu) *

Date: 2015-10-07 17:02

Added 'f'/'F' to the StringPrefix regex and also updated the quote dictionary.
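The patch file itself (tokenize.patch) is not inlined in the tracker, but the shape of the change is roughly the following, assuming the 3.5-era pattern; the f-prefixed quote forms would also need matching entries in the triple_quoted and single_quoted dictionaries:

# before: only u/b/r prefixes are recognized
StringPrefix = r'(?:[bB][rR]?|[rR][bB]?|[uU])?'

# after: f/F accepted as a standalone prefix as well
StringPrefix = r'(?:[bB][rR]?|[rR][bB]?|[fF]|[uU])?'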

msg252485 - (view)

Author: Martin Panter (martin.panter) * (Python committer)

Date: 2015-10-07 21:08

Thanks for the patch. Do you want to try adding a test case? See TokenizeTest.test_string() at Lib/test/test_tokenize.py:189 for a guide, though I would suggest a new test_fstring() method.

Also, F-strings can be combined with the raw string syntax. I wonder if you need to add support for things like rf"..." and FR'''...'''.
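A much-simplified sketch of what such a test_fstring() method might assert (the real tests in Lib/test/test_tokenize.py compare full token dumps via a helper; this version only checks that the prefix stays attached to a single STRING token, which holds for 3.6 through 3.11 but not for 3.12+, where f-strings get FSTRING_* tokens):

import io
import tokenize
import unittest

class FStringTokenizeTest(unittest.TestCase):
    def assert_one_string(self, source):
        # The whole literal, prefix included, should be one STRING token.
        toks = list(tokenize.generate_tokens(io.StringIO(source + '\n').readline))
        self.assertEqual(toks[0].type, tokenize.STRING)
        self.assertEqual(toks[0].string, source)

    def test_fstring(self):
        for src in ('f"x"', 'F"x"', 'rf"x"', 'Rf"x"', "FR'''x'''"):
            with self.subTest(src=src):
                self.assert_one_string(src)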

msg252522 - (view)

Author: Eric V. Smith (eric.smith) * (Python committer)

Date: 2015-10-08 09:28

Yes, both 'fr' and 'rf' need to be supported (and all upper/lower variants). And in the future, maybe 'fb' (and 'rfb', 'bfr', ...).

Unfortunately, the regex doesn't scale well for all of the combinations.

msg252610 - (view)

Author: Eric V. Smith (eric.smith) * (Python committer)

Date: 2015-10-09 13:34

I think the best way to approach this is to generate (in code) all of the places where string prefixes appear. There's StringPrefix, endpats, triple_quotes, and single_quoted.

With the currently valid combinations of f, b, r, and u, I count 24 combinations: ['B', 'BR', 'Br', 'F', 'FR', 'Fr', 'R', 'RB', 'RF', 'Rb', 'Rf', 'U', 'b', 'bR', 'br', 'f', 'fR', 'fr', 'r', 'rB', 'rF', 'rb', 'rf', 'u']

If I add "fb" strings (plus raw), I count 72 combinations: ['B', 'BFR', 'BFr', 'BR', 'BRF', 'BRf', 'BfR', 'Bfr', 'Br', 'BrF', 'Brf', 'F', 'FBR', 'FBr', 'FR', 'FRB', 'FRb', 'FbR', 'Fbr', 'Fr', 'FrB', 'Frb', 'R', 'RB', 'RBF', 'RBf', 'RF', 'RFB', 'RFb', 'Rb', 'RbF', 'Rbf', 'Rf', 'RfB', 'Rfb', 'U', 'b', 'bFR', 'bFr', 'bR', 'bRF', 'bRf', 'bfR', 'bfr', 'br', 'brF', 'brf', 'f', 'fBR', 'fBr', 'fR', 'fRB', 'fRb', 'fbR', 'fbr', 'fr', 'frB', 'frb', 'r', 'rB', 'rBF', 'rBf', 'rF', 'rFB', 'rFb', 'rb', 'rbF', 'rbf', 'rf', 'rfB', 'rfb', 'u']

Coding these combinations by hand seems insane.
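The arithmetic behind these counts: no valid prefix repeats a letter, so a lowercase prefix of length n contributes n! orderings, each in 2**n casings. A quick check (the 72 above is short by the eight bare 'fb' casings, as the next message notes):

from math import factorial

def variants(prefix):
    # n! orderings of the letters, each letter independently upper/lower
    n = len(prefix)
    return factorial(n) * 2 ** n

current = ['b', 'r', 'u', 'f', 'br', 'fr']
print(sum(variants(p) for p in current))                  # 24
print(sum(variants(p) for p in current + ['fb', 'fbr']))  # 80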

msg252613 - (view)

Author: Eric V. Smith (eric.smith) * (Python committer)

Date: 2015-10-09 14:06

Oops, make that 80 combinations (I forgot the various 'fb' ones):

['B', 'BF', 'BFR', 'BFr', 'BR', 'BRF', 'BRf', 'Bf', 'BfR', 'Bfr', 'Br', 'BrF', 'Brf', 'F', 'FB', 'FBR', 'FBr', 'FR', 'FRB', 'FRb', 'Fb', 'FbR', 'Fbr', 'Fr', 'FrB', 'Frb', 'R', 'RB', 'RBF', 'RBf', 'RF', 'RFB', 'RFb', 'Rb', 'RbF', 'Rbf', 'Rf', 'RfB', 'Rfb', 'U', 'b', 'bF', 'bFR', 'bFr', 'bR', 'bRF', 'bRf', 'bf', 'bfR', 'bfr', 'br', 'brF', 'brf', 'f', 'fB', 'fBR', 'fBr', 'fR', 'fRB', 'fRb', 'fb', 'fbR', 'fbr', 'fr', 'frB', 'frb', 'r', 'rB', 'rBF', 'rBf', 'rF', 'rFB', 'rFb', 'rb', 'rbF', 'rbf', 'rf', 'rfB', 'rfb', 'u']

import itertools as _itertools

def _all_string_prefixes():
    # The valid string prefixes. Only contain the lower case versions,
    # and don't contain any permutations (include 'fr', but not
    # 'rf'). The various permutations will be generated.
    _valid_string_prefixes = ['b', 'r', 'u', 'f', 'br', 'fr', 'fb', 'fbr']
    result = set()
    for prefix in _valid_string_prefixes:
        for t in _itertools.permutations(prefix):
            # create a list with upper and lower versions of each
            # character
            for u in _itertools.product(*[(c, c.upper()) for c in t]):
                result.add(''.join(u))
    return result
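As a sanity check, the generated set matches the counts above:

print(len(_all_string_prefixes()))  # 80, with 'fb' and 'fbr' included
print({'rf', 'FR', 'fB'} <= _all_string_prefixes())  # True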

msg252619 - (view)

Author: Eric V. Smith (eric.smith) * (Python committer)

Date: 2015-10-09 15:27

My first attempt. Many more tests are needed.

I'm going to need to spend some time trying to figure out how parts of tokenize.py actually work. I'm not sure, for example, that endpats is initialized correctly. There definitely aren't enough tests, since if I comment out parts of endpats the tests still pass.

msg253109 - (view)

Author: Eric V. Smith (eric.smith) * (Python committer)

Date: 2015-10-17 00:47

Multi-line string tests were added in changeset 91c44dc35dfd. That will make changes for this issue safer. Updated patch to come.

msg253236 - (view)

Author: Eric V. Smith (eric.smith) * (Python committer)

Date: 2015-10-20 17:22

This patch cleans up string matching in tokenize.py, and adds f-string support.

msg253461 - (view)

Author: Roundup Robot (python-dev) (Python triager)

Date: 2015-10-26 08:38

New changeset 21f6c4378846 by Eric V. Smith in branch 'default':
Issue 25311: Add support for f-strings to tokenize.py. Also added some
comments to explain what's happening, since it's not so obvious.
https://hg.python.org/cpython/rev/21f6c4378846

msg253463 - (view)

Author: Eric V. Smith (eric.smith) * (Python committer)

Date: 2015-10-26 08:44

I've fixed this particular problem, but the tokenize module definitely has some other issues. It recompiles regexes very often when it doesn't need to, it treats single- and triple-quoted strings differently (leading to some code bloat), etc. I may open another issue to address some of these problems.
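A hedged sketch of the kind of recompilation cleanup alluded to here: compile each end-of-string pattern once and reuse it, rather than re-running re.compile() for every string token encountered. The names below are illustrative, not the actual tokenize.py internals:

import re
from functools import lru_cache

# illustrative stand-in for tokenize.py's endpats mapping
endpats = {"'": r"[^'\\]*(?:\\.[^'\\]*)*'",
           '"': r'[^"\\]*(?:\\.[^"\\]*)*"'}

@lru_cache(maxsize=None)
def compiled_endpat(quote):
    # re.compile runs once per quote style; later calls hit the cache
    return re.compile(endpats[quote])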

And I'll be adding more tests. tokenize is still woefully under-tested.

History

Date                 User           Action  Args
2022-04-11 14:58:22  admin          set     github: 69498
2015-10-26 08:44:45  eric.smith     set     keywords: - patch; status: open -> closed; stage: patch review -> resolved
2015-10-26 08:44:04  eric.smith     set     resolution: fixed; messages: +
2015-10-26 08:38:11  python-dev     set     nosy: + python-dev; messages: +
2015-10-20 17:22:24  eric.smith     set     files: + issue25311-1.diff; messages: +
2015-10-17 00:47:53  eric.smith     set     messages: +
2015-10-09 20:38:16  @nkit          set     nosy: - @nkit
2015-10-09 20:17:01  @nkit          set     nosy: + @nkit
2015-10-09 15:27:08  eric.smith     set     files: + issue25311.diff; messages: +
2015-10-09 14:06:43  eric.smith     set     messages: +
2015-10-09 13:34:02  eric.smith     set     messages: +
2015-10-08 09:28:38  eric.smith     set     messages: +
2015-10-07 21:08:48  martin.panter  set     messages: +; stage: needs patch -> patch review
2015-10-07 17:02:48  Nan Wu         set     files: + tokenize.patch; nosy: + Nan Wu; messages: +; keywords: + patch
2015-10-05 00:42:55  martin.panter  set     keywords: + easy; nosy: + martin.panter; messages: +
2015-10-04 17:34:49  eric.smith     set     assignee: eric.smith; messages: +
2015-10-04 17:23:27  skrah          create