Issue 33338: [lib2to3] Synchronize token.py and tokenize.py with the standard library


Created on 2018-04-23 01:04 by lukasz.langa, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Pull Requests
URL       Status   Linked
PR 6572   closed   lukasz.langa, 2018-04-23 01:09
PR 6573   merged   lukasz.langa, 2018-04-23 01:12
PR 8950            monson, 2018-09-15 17:37
Messages (5)
msg315639 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2018-04-23 01:04
lib2to3's token.py and tokenize.py were initially copies of the respective files from the standard library. They were copied to allow Python 3 to read Python 2's grammar. Since 2006, lib2to3 has grown to be widely used as a Concrete Syntax Tree, also for parsing Python 3 code. Support for the Python 3 grammar was added, but sadly the main token.py and tokenize.py diverged.

This change brings them back together, minimizing the differences to the bare minimum that is in fact required by lib2to3. Before this change, almost every line in lib2to3/pgen2/tokenize.py was different from tokenize.py. After this change, the diff between the two files is only 175 lines long and consists entirely of the relevant Python 2 compatibility bits.

Merging the implementations brings numerous fixes to the lib2to3 tokenizer:
+ docstrings made as similar as possible
+ ported `TokenInfo`
+ ported `tokenize.tokenize()` and `tokenize.open()`
+ removed Python 2-only implementation cruft
+ fixed Unicode identifier handling
+ fixed string prefix handling
+ fixed Ellipsis handling
+ backported Untokenizer bugfixes:
  - 5e6db313686c200da425a54d2e0c95fa40107b1d
  - 9dc3a36c849c15c227a8af218cfb215abe7b3c48
  - 5b8d2c3af76e704926cf5915ad0e6af59a232e61
  - e411b6629fb5f7bc01bec89df75737875ce6d8f5
  - BPO-2495
+ the tokenizer no longer crashes on a missing newline at the end of the stream (added \Z (end of string) to PseudoExtras) - BPO-16152 (see the sketch after this message)
+ `find_cookie` includes the file name in error messages, if available
+ `find_cookie` raises SyntaxError on invalid encodings: BPO-14990

Improvements to lib2to3/pgen2/token.py:
+ taken from the current Lib/token.py
+ tokens renumbered to match Lib/token.py
+ `__all__` properly defined
+ ASYNC, AWAIT and BACKQUOTE exist under different numbers (100 + old number)
+ ELLIPSIS added
+ ENCODING added
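A minimal sketch of the \Z fix mentioned above, using simplified stand-in patterns and names rather than the actual lib2to3/pgen2 definitions; it only shows why an end-of-string branch lets the final, newline-less line still match:

import re

# Illustrative stand-ins, not the real tokenize patterns: only the effect of
# adding \Z (end of string) to the "extras" alternation is shown here.
Comment = r'#[^\r\n]*'
PseudoExtras_without_Z = r'(?:\\\r?\n|' + Comment + r')'
PseudoExtras_with_Z = r'(?:\\\r?\n|\Z|' + Comment + r')'

line = 'x = 1'      # final line of a file with no trailing newline
pos = len(line)     # the tokenizer has consumed "x = 1" and looks for extras

print(bool(re.compile(PseudoExtras_without_Z).match(line, pos)))  # False: nothing matches at end of string
print(bool(re.compile(PseudoExtras_with_Z).match(line, pos)))     # True: \Z matches the end of the string

Without the end-of-string branch, the pseudo-token match fails on a last line that lacks a trailing newline, which is what used to trip up the lib2to3 tokenizer (BPO-16152).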
msg315640 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2018-04-23 01:05
### Diff between files

The unified diff between the tokenize implementations is here: https://gist.github.com/ambv/679018041d85dd1a7497e6d89c45fb86

It clocks in at 275 lines, but that's because it includes context. The actual diff is 175 lines long. To make it that small, I needed to move some insignificant bits in Lib/tokenize.py; that is what the other PR on this issue is about.
msg315650 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2018-04-23 08:07
New changeset c2d384dbd7c6ed9bdfaac45f05b463263c743ee7 by Łukasz Langa in branch 'master':
bpo-33338: [tokenize] Minor code cleanup (#6573)
https://github.com/python/cpython/commit/c2d384dbd7c6ed9bdfaac45f05b463263c743ee7
msg315802 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-04-26 14:50
It seems to me that the regular expressions used in the lib2to3 version are more complex, but also more efficient.

$ ./python -m timeit -s 'import re; p = re.compile(r"0[bB](?:_?[01])+"); s = "0b"+"_0101"*16' 'p.match(s)'
100000 loops, best of 5: 2.45 usec per loop
$ ./python -m timeit -s 'import re; p = re.compile(r"0[bB]_?[01]+(?:_[01]+)*"); s = "0b"+"_0101"*16' 'p.match(s)'
200000 loops, best of 5: 1.08 usec per loop
$ ./python -m timeit -s 'import re; p = re.compile(r"0[xX](?:_?[0-9a-fA-F])+[lL]?"); s = "0x_0123_4567_89ab_cdef"' 'p.match(s)'
500000 loops, best of 5: 815 nsec per loop
$ ./python -m timeit -s 'import re; p = re.compile(r"0[xX]_?[\da-fA-F]+(?:_[\da-fA-F]+)*[lL]?"); s = "0x_0123_4567_89ab_cdef"' 'p.match(s)'
500000 loops, best of 5: 542 nsec per loop

Since the performance of lib2to3 is important, it is better to keep the current regexps. But using \d in Python 3 is a bug; it should be replaced with [0-9]. This also speeds up the regex:

$ ./python -m timeit -s 'import re; p = re.compile(r"0[xX]_?[0-9a-fA-F]+(?:_[0-9a-fA-F]+)*[lL]?"); s = "0x_0123_4567_89ab_cdef"' 'p.match(s)'
500000 loops, best of 5: 471 nsec per loop
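A minimal sketch that reproduces the same comparison with the timeit module instead of shell one-liners; the labels and loop count are arbitrary, both patterns are quoted from the message above (with \d already replaced by [0-9]), and absolute timings will vary by machine and Python version:

import re
import timeit

# The two hex-literal regexes compared above.
candidates = [
    ("grouped repetition", r"0[xX](?:_?[0-9a-fA-F])+[lL]?"),
    ("lib2to3 style",      r"0[xX]_?[0-9a-fA-F]+(?:_[0-9a-fA-F]+)*[lL]?"),
]
s = "0x_0123_4567_89ab_cdef"

for name, pattern in candidates:
    p = re.compile(pattern)
    assert p.match(s)  # both accept the same literal
    seconds = timeit.timeit(lambda: p.match(s), number=100_000)
    print(f"{name:>18}: {seconds / 100_000 * 1e9:.0f} ns per match")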
msg315807 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2018-04-26 17:01
I agree with you, Serhiy; there are a number of things I want to make faster. But first I'd like to merge the implementations so there is a clear one-way diff ("this is what we updated in lib2to3 to make it consistent with Lib/tokenize.py"). Then I want to optimize.
History
Date User Action Args
2022-04-11 14:58:59 admin set github: 77519
2021-10-20 22:50:30 iritkatriel set status: open -> closed; superseder: Close 2to3 issues and list them here; resolution: wont fix; stage: patch review -> resolved
2018-09-15 17:37:33 monson set pull_requests: + pull_request8757
2018-04-26 17:01:08 lukasz.langa set messages: +
2018-04-26 14:50:37 serhiy.storchaka set nosy: + serhiy.storchaka; messages: +
2018-04-23 08:07:19 lukasz.langa set messages: +
2018-04-23 01:12:32 lukasz.langa set pull_requests: + pull_request6274
2018-04-23 01:09:05 lukasz.langa set keywords: + patch; stage: patch review; pull_requests: + pull_request6269
2018-04-23 01:05:31 lukasz.langa set messages: +
2018-04-23 01:04:56 lukasz.langa create