Issue 12063: tokenize module appears to treat unterminated single and double-quoted strings inconsistently
Tokenizing ' 1 2 3
versus ''' 1 2 3
yields different results.
Tokenizing ' 1 2 3
gives:
1,0-1,1:    ERRORTOKEN  "'"
1,2-1,3:    NUMBER      '1'
1,4-1,5:    NUMBER      '2'
1,6-1,7:    NUMBER      '3'
2,0-2,0:    ENDMARKER   ''
while tokenizing ''' 1 2 3
yields:
Traceback (most recent call last):
  File "prog.py", line 4, in <module>
    tokenize.tokenize(iter(["''' 1 2 3"]).next)
  File "/usr/lib/python2.6/tokenize.py", line 169, in tokenize
    tokenize_loop(readline, tokeneater)
  File "/usr/lib/python2.6/tokenize.py", line 175, in tokenize_loop
    for token_info in generate_tokens(readline):
  File "/usr/lib/python2.6/tokenize.py", line 296, in generate_tokens
    raise TokenError, ("EOF in multi-line string", strstart)
tokenize.TokenError: ('EOF in multi-line string', (1, 0))
Apparently tokenize resumes tokenizing after the erroneous quote in the single-quote case, but not in the triple-quote case. I guess this is because re-tokenizing the rest of the file after an unclosed triple quote would be expensive; however, I've also been told it's very strange and possibly wrong for tokenize to be inconsistent this way.
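The two behaviors can be reproduced with a small script (a sketch in Python 3 spelling of the 2.6 session above; the `describe` helper is just for illustration, and note that CPython 3.12's rewritten tokenizer reportedly raises an error for both inputs, so the exact inconsistency is version-dependent):

```python
import io
import tokenize

def describe(source):
    """Tokenize `source`, returning token names or the error message."""
    try:
        toks = tokenize.generate_tokens(io.StringIO(source).readline)
        return [tokenize.tok_name[tok.type] for tok in toks]
    except (tokenize.TokenError, SyntaxError) as exc:
        return "error: %s" % exc.args[0]

# On the interpreters discussed in this issue, the single quote yields an
# ERRORTOKEN and tokenizing continues; the triple quote raises TokenError.
print(describe("' 1 2 3"))
print(describe("''' 1 2 3"))
```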
If this is the right behavior, I'd at least like it to be documented: it is confusing and potentially misleading for users of the tokenize module. When I saw how single quotes were handled, I incorrectly assumed that all quotes were handled that way.
tokenize processes a line at a time, and noticing that an ending triple quote is missing would, in the worst case, mean reading the whole file. Since tokenize works in a generator-like fashion, it's probably not desirable to cache all the input just to be able to restart from some previous line.
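The line-at-a-time, generator-like behavior is easy to observe by wrapping the readline callable (a small sketch; the `readline` wrapper and `consumed` list are just for illustration):

```python
import tokenize

lines = iter(["x = 1\n", "y = 2\n"])
consumed = []

def readline():
    # Record each line that generate_tokens pulls from us.
    line = next(lines, "")
    consumed.append(line)
    return line

gen = tokenize.generate_tokens(readline)
first = next(gen)      # tokenizing starts by pulling the first line
print(first.string)    # the NAME token 'x'
print(consumed)        # only the lines read so far
```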
So, I'd go with documenting the behavior.