Issue 33338: [lib2to3] Synchronize token.py and tokenize.py with the standard library


Created on 2018-04-23 01:04 by lukasz.langa, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Pull Requests
URL       Status   Linked
PR 6572   closed   lukasz.langa, 2018-04-23 01:09
PR 6573   merged   lukasz.langa, 2018-04-23 01:12
PR 8950            monson, 2018-09-15 17:37
Messages (5)
msg315639 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2018-04-23 01:04
lib2to3's token.py and tokenize.py were initially copies of the respective files from the standard library. They were copied to allow Python 3 to read Python 2's grammar. Since 2006, lib2to3 has grown to be widely used as a Concrete Syntax Tree, also for parsing Python 3 code. Support for the Python 3 grammar was added, but sadly the main token.py and tokenize.py diverged.

This change brings them back together, minimizing the differences to the bare minimum that is in fact required by lib2to3. Before this change, almost every line in lib2to3/pgen2/tokenize.py was different from tokenize.py. After this change, the diff between the two files is only 175 lines long and consists entirely of the relevant Python 2 compatibility bits.

Merging the implementations brings numerous fixes to the lib2to3 tokenizer:
+ docstrings made as similar as possible
+ ported `TokenInfo`
+ ported `tokenize.tokenize()` and `tokenize.open()`
+ removed Python 2-only implementation cruft
+ fixed Unicode identifier handling
+ fixed string prefix handling
+ fixed Ellipsis handling
+ backported Untokenizer bugfixes:
  - 5e6db313686c200da425a54d2e0c95fa40107b1d
  - 9dc3a36c849c15c227a8af218cfb215abe7b3c48
  - 5b8d2c3af76e704926cf5915ad0e6af59a232e61
  - e411b6629fb5f7bc01bec89df75737875ce6d8f5
  - BPO-2495
+ the tokenizer no longer crashes on a missing newline at the end of the stream (added \Z (end of string) to PseudoExtras) - BPO-16152 (see the sketch after this message)
+ `find_cookie` includes the file name in error messages, if available
+ `find_cookie` raises SyntaxError on invalid encodings: BPO-14990

Improvements to lib2to3/pgen2/token.py:
+ taken from the current Lib/token.py
+ tokens renumbered to match Lib/token.py
+ `__all__` properly defined
+ ASYNC, AWAIT and BACKQUOTE exist under different numbers (100 + old number)
+ ELLIPSIS added
+ ENCODING added
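A minimal sketch of the \Z fix mentioned above, using simplified stand-in patterns and names rather than the actual lib2to3/pgen2 definitions; it only shows why an end-of-string branch lets the final, newline-less line still match:

import re

# Illustrative stand-ins, not the real tokenize patterns: only the effect of
# adding \Z (end of string) to the "extras" alternation is shown here.
Comment = r'#[^\r\n]*'
PseudoExtras_without_Z = r'(?:\\\r?\n|' + Comment + r')'
PseudoExtras_with_Z = r'(?:\\\r?\n|\Z|' + Comment + r')'

line = 'x = 1'      # final line of a file with no trailing newline
pos = len(line)     # the tokenizer has consumed "x = 1" and looks for extras

print(bool(re.compile(PseudoExtras_without_Z).match(line, pos)))  # False: nothing matches at end of string
print(bool(re.compile(PseudoExtras_with_Z).match(line, pos)))     # True: \Z matches the end of the string

Without the end-of-string branch, the pseudo-token match fails on a last line that lacks a trailing newline, which is what used to trip up the lib2to3 tokenizer (BPO-16152).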
msg315640 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2018-04-23 01:05
### Diff between files

The unified diff between the tokenize implementations is here: https://gist.github.com/ambv/679018041d85dd1a7497e6d89c45fb86

It clocks in at 275 lines, but that's because it includes context. The actual diff is 175 lines long. To make it that small, I needed to move some insignificant bits in Lib/tokenize.py; that is what the other PR on this issue is about.
msg315650 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2018-04-23 08:07
New changeset c2d384dbd7c6ed9bdfaac45f05b463263c743ee7 by Łukasz Langa in branch 'master':
bpo-33338: [tokenize] Minor code cleanup (#6573)
https://github.com/python/cpython/commit/c2d384dbd7c6ed9bdfaac45f05b463263c743ee7
msg315802 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-04-26 14:50
It seems to me that the regular expressions used in the lib2to3 version are more complex, but also more efficient.

$ ./python -m timeit -s 'import re; p = re.compile(r"0[bB](?:_?[01])+"); s = "0b"+"_0101"*16' 'p.match(s)'
100000 loops, best of 5: 2.45 usec per loop
$ ./python -m timeit -s 'import re; p = re.compile(r"0[bB]_?[01]+(?:_[01]+)*"); s = "0b"+"_0101"*16' 'p.match(s)'
200000 loops, best of 5: 1.08 usec per loop
$ ./python -m timeit -s 'import re; p = re.compile(r"0[xX](?:_?[0-9a-fA-F])+[lL]?"); s = "0x_0123_4567_89ab_cdef"' 'p.match(s)'
500000 loops, best of 5: 815 nsec per loop
$ ./python -m timeit -s 'import re; p = re.compile(r"0[xX]_?[\da-fA-F]+(?:_[\da-fA-F]+)*[lL]?"); s = "0x_0123_4567_89ab_cdef"' 'p.match(s)'
500000 loops, best of 5: 542 nsec per loop

Since the performance of lib2to3 is important, it is better to keep the current regexps. But using \d in Python 3 is a bug; it should be replaced with [0-9]. This also speeds up the regex:

$ ./python -m timeit -s 'import re; p = re.compile(r"0[xX]_?[0-9a-fA-F]+(?:_[0-9a-fA-F]+)*[lL]?"); s = "0x_0123_4567_89ab_cdef"' 'p.match(s)'
500000 loops, best of 5: 471 nsec per loop
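A minimal sketch that reproduces the same comparison with the timeit module instead of shell one-liners; the labels and loop count are arbitrary, both patterns are quoted from the message above (with \d already replaced by [0-9]), and absolute timings will vary by machine and Python version:

import re
import timeit

# The two hex-literal regexes compared above.
candidates = [
    ("grouped repetition", r"0[xX](?:_?[0-9a-fA-F])+[lL]?"),
    ("lib2to3 style",      r"0[xX]_?[0-9a-fA-F]+(?:_[0-9a-fA-F]+)*[lL]?"),
]
s = "0x_0123_4567_89ab_cdef"

for name, pattern in candidates:
    p = re.compile(pattern)
    assert p.match(s)  # both accept the same literal
    seconds = timeit.timeit(lambda: p.match(s), number=100_000)
    print(f"{name:>18}: {seconds / 100_000 * 1e9:.0f} ns per match")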
msg315807 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2018-04-26 17:01
I agree with you, Serhiy; there are a number of things I want to make faster. But first I'd like to merge the implementations so there is a clear one-way diff ("this is what we updated in lib2to3 to make it consistent with Lib/tokenize.py"). Then I want to optimize.
History
Date User Action Args
2022-04-11 14:58:59 admin set github: 77519
2021-10-20 22:50:30 iritkatriel set status: open -> closed; superseder: Close 2to3 issues and list them here; resolution: wont fix; stage: patch review -> resolved
2018-09-15 17:37:33 monson set pull_requests: + pull_request8757
2018-04-26 17:01:08 lukasz.langa set messages: +
2018-04-26 14:50:37 serhiy.storchaka set nosy: + serhiy.storchaka; messages: +
2018-04-23 08:07:19 lukasz.langa set messages: +
2018-04-23 01:12:32 lukasz.langa set pull_requests: + pull_request6274
2018-04-23 01:09:05 lukasz.langa set keywords: + patch; stage: patch review; pull_requests: + pull_request6269
2018-04-23 01:05:31 lukasz.langa set messages: +
2018-04-23 01:04:56 lukasz.langa create