Issue 25311: Add f-string support to tokenize.py
Created on 2015-10-04 17:23 by skrah, last changed 2022-04-11 14:58 by admin. This issue is now closed.
Messages (13)
Author: Stefan Krah (skrah) *
Date: 2015-10-04 17:23
I think tokenize.py needs to be updated to support f-strings.
BTW, the f-string implementation seems to be incredibly robust. Nice work!
Author: Eric V. Smith (eric.smith) *
Date: 2015-10-04 17:34
Thanks for noticing tokenize.py. And thanks for the kind note!
Author: Martin Panter (martin.panter) *
Date: 2015-10-05 00:42
I was just about to make the same bug report :) I guess it would be fine to tokenize F-strings as the same STRING tokens as other string literals; it probably just needs an F added to the right regular expression.
$ ./python -btWall -m tokenize
"string"
1,0-1,8: STRING '"string"'
1,8-1,9: NEWLINE '\n'
b"string"
3,0-3,9: STRING 'b"string"'
3,9-3,10: NEWLINE '\n'
f"string"
4,0-4,1: NAME 'f'
4,1-4,9: STRING '"string"'
4,9-4,10: NEWLINE '\n'
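For illustration, a minimal sketch of the kind of one-line regex change being suggested. The StringPrefix name matches the one used in tokenize.py, but this simplified pattern is a stand-in, not the module's real one:

    import re

    # Simplified stand-in for tokenize.py's StringPrefix pattern;
    # adding [fF] to the alternatives is the change suggested above.
    StringPrefix = r'(?:[bB][rR]?|[rR][bB]?|[uU]|[fF])?'
    string_re = re.compile(StringPrefix + r'"[^"\n]*"')

    print(string_re.match('f"string"'))  # the full literal now matches
    print(string_re.match('b"string"'))  # existing prefixes still match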
Author: Nan Wu (Nan Wu) *
Date: 2015-10-07 17:02
Added 'f'/'F' to the StringPrefix regex and also updated the quote dictionary.
Author: Martin Panter (martin.panter) *
Date: 2015-10-07 21:08
Thanks for the patch. Do you want to try adding a test case? See TokenizeTest.test_string() at /Lib/test/test_tokenize.py:189 for a guide, though I would suggest a new test_fstring() method.
Also, F-strings can be combined with the raw string syntax. I wonder if you need to add support for things like rf". . ." and FR'''. . .'''.
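For illustration, a standalone sketch of what such a test could assert. This is not the test_fstring() that ended up in test_tokenize.py, and it assumes an interpreter whose tokenize already accepts these prefixes:

    import io
    import unittest
    from tokenize import tokenize, STRING

    class FStringTokenizeTest(unittest.TestCase):
        def test_fstring_prefixes(self):
            for src in ['f"x"', 'F"x"', 'rf"x"', "FR'''x'''"]:
                toks = list(tokenize(io.BytesIO(src.encode()).readline))
                # The whole literal should come back as a single STRING token.
                self.assertTrue(any(t.type == STRING and t.string == src
                                    for t in toks))

    if __name__ == '__main__':
        unittest.main()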
Author: Eric V. Smith (eric.smith) *
Date: 2015-10-08 09:28
Yes, both 'fr' and 'rf' need to be supported (and all upper/lower variants). And in the future, maybe 'fb' (and 'rfb', 'bfr', ...).
Unfortunately, the regex doesn't scale well for all of the combinations.
Author: Eric V. Smith (eric.smith) *
Date: 2015-10-09 13:34
I think the best way to approach this is to generate (in code) all of the places where string prefixes appear. There's StringPrefix, endpats, triple_quotes, and single_quoted.
With the currently valid combinations of f, b, r, and u, I count 24 combinations: ['B', 'BR', 'Br', 'F', 'FR', 'Fr', 'R', 'RB', 'RF', 'Rb', 'Rf', 'U', 'b', 'bR', 'br', 'f', 'fR', 'fr', 'r', 'rB', 'rF', 'rb', 'rf', 'u']
If I add "fb" strings (plus raw), I count 72 combinations: ['B', 'BFR', 'BFr', 'BR', 'BRF', 'BRf', 'BfR', 'Bfr', 'Br', 'BrF', 'Brf', 'F', 'FBR', 'FBr', 'FR', 'FRB', 'FRb', 'FbR', 'Fbr', 'Fr', 'FrB', 'Frb', 'R', 'RB', 'RBF', 'RBf', 'RF', 'RFB', 'RFb', 'Rb', 'RbF', 'Rbf', 'Rf', 'RfB', 'Rfb', 'U', 'b', 'bFR', 'bFr', 'bR', 'bRF', 'bRf', 'bfR', 'bfr', 'br', 'brF', 'brf', 'f', 'fBR', 'fBr', 'fR', 'fRB', 'fRb', 'fbR', 'fbr', 'fr', 'frB', 'frb', 'r', 'rB', 'rBF', 'rBf', 'rF', 'rFB', 'rFb', 'rb', 'rbF', 'rbf', 'rf', 'rfB', 'rfb', 'u']
Coding these combinations by hand seems insane.
Author: Eric V. Smith (eric.smith) *
Date: 2015-10-09 14:06
Oops, make that 80 combinations (I forgot the various 'fb' ones):
['B', 'BF', 'BFR', 'BFr', 'BR', 'BRF', 'BRf', 'Bf', 'BfR', 'Bfr', 'Br', 'BrF', 'Brf', 'F', 'FB', 'FBR', 'FBr', 'FR', 'FRB', 'FRb', 'Fb', 'FbR', 'Fbr', 'Fr', 'FrB', 'Frb', 'R', 'RB', 'RBF', 'RBf', 'RF', 'RFB', 'RFb', 'Rb', 'RbF', 'Rbf', 'Rf', 'RfB', 'Rfb', 'U', 'b', 'bF', 'bFR', 'bFr', 'bR', 'bRF', 'bRf', 'bf', 'bfR', 'bfr', 'br', 'brF', 'brf', 'f', 'fB', 'fBR', 'fBr', 'fR', 'fRB', 'fRb', 'fb', 'fbR', 'fbr', 'fr', 'frB', 'frb', 'r', 'rB', 'rBF', 'rBf', 'rF', 'rFB', 'rFb', 'rb', 'rbF', 'rbf', 'rf', 'rfB', 'rfb', 'u']
import itertools as _itertools

def _all_string_prefixes():
    # The valid string prefixes. Only contain the lower case versions,
    # and don't contain any permutations (include 'fr', but not
    # 'rf'). The various permutations will be generated.
    _valid_string_prefixes = ['b', 'r', 'u', 'f', 'br', 'fr', 'fb', 'fbr']
    result = set()
    for prefix in _valid_string_prefixes:
        for t in _itertools.permutations(prefix):
            # create a list with upper and lower versions of each
            # character
            for u in _itertools.product(*[(c, c.upper()) for c in t]):
                result.add(''.join(u))
    return result
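For illustration (this check is not part of the message above), the generator reproduces the 80 combinations listed earlier, and its output could feed a prefix regex such as StringPrefix:

    prefixes = _all_string_prefixes()
    print(len(prefixes))         # 80, matching the list above
    print(sorted(prefixes)[:5])  # ['B', 'BF', 'BFR', 'BFr', 'BR']

    # Longest alternatives first, so e.g. 'Rfb' is preferred over just 'R'.
    import re
    StringPrefix = '(?:%s)?' % '|'.join(
        sorted(prefixes, key=len, reverse=True))
    assert re.match(StringPrefix + '"', 'Rfb"hello"')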
Author: Eric V. Smith (eric.smith) *
Date: 2015-10-09 15:27
My first attempt. Many more tests are needed.
I'm going to need to spend some time trying to figure out how parts of tokenize.py actually work. I'm not sure, for example, that endpats is initialized correctly. There definitely aren't enough tests, since if I comment out parts of endpats the tests still pass.
Author: Eric V. Smith (eric.smith) *
Date: 2015-10-17 00:47
Multi-line string tests were added in changeset 91c44dc35dfd. That will make changes for this issue safer. Updated patch to come.
Author: Eric V. Smith (eric.smith) *
Date: 2015-10-20 17:22
This patch cleans up string matching in tokenize.py, and adds f-string support.
Author: Roundup Robot (python-dev)
Date: 2015-10-26 08:38
New changeset 21f6c4378846 by Eric V. Smith in branch 'default': Issue 25311: Add support for f-strings to tokenize.py. Also added some comments to explain what's happening, since it's not so obvious. https://hg.python.org/cpython/rev/21f6c4378846
Author: Eric V. Smith (eric.smith) *
Date: 2015-10-26 08:44
I've fixed this particular problem, but the tokenize module definitely has some other issues. It recompiles regexes very often when it doesn't need to, it treats single- and triple-quoted strings differently (leading to some code bloat), etc. I may open another issue to address some of these problems.
And I'll be adding more tests. tokenize is still woefully under-tested.
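For illustration, the usual remedy for the regex-recompilation problem mentioned above is to cache compiled patterns; a minimal sketch, not necessarily what a follow-up issue would do:

    import functools
    import re

    @functools.lru_cache(maxsize=None)
    def _compile(pattern):
        # Each distinct pattern string is compiled only once.
        return re.compile(pattern, re.UNICODE)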
History
Date                 User            Action  Args
2022-04-11 14:58:22  admin           set     github: 69498
2015-10-26 08:44:45  eric.smith      set     keywords: - patch; status: open -> closed; stage: patch review -> resolved
2015-10-26 08:44:04  eric.smith      set     resolution: fixed; messages: +
2015-10-26 08:38:11  python-dev      set     nosy: + python-dev; messages: +
2015-10-20 17:22:24  eric.smith      set     files: + issue25311-1.diff; messages: +
2015-10-17 00:47:53  eric.smith      set     messages: +
2015-10-09 20:38:16  @nkit           set     nosy: - @nkit
2015-10-09 20:17:01  @nkit           set     nosy: + @nkit
2015-10-09 15:27:08  eric.smith      set     files: + issue25311.diff; messages: +
2015-10-09 14:06:43  eric.smith      set     messages: +
2015-10-09 13:34:02  eric.smith      set     messages: +
2015-10-08 09:28:38  eric.smith      set     messages: +
2015-10-07 21:08:48  martin.panter   set     messages: +; stage: needs patch -> patch review
2015-10-07 17:02:48  Nan Wu          set     files: + tokenize.patch; nosy: + Nan Wu; messages: +; keywords: + patch
2015-10-05 00:42:55  martin.panter   set     keywords: + easy; nosy: + martin.panter; messages: +
2015-10-04 17:34:49  eric.smith      set     assignee: eric.smith; messages: +
2015-10-04 17:23:27  skrah           create
create