[Python-Dev] issue2180 and using 'tokenize' with Python 3 'str's

Nick Coghlan ncoghlan at gmail.com
Tue Sep 28 14:09:48 CEST 2010


On Tue, Sep 28, 2010 at 9:29 PM, Michael Foord
<fuzzyman at voidspace.org.uk> wrote:
>  On 28/09/2010 12:19, Antoine Pitrou wrote:
>> On Mon, 27 Sep 2010 23:45:45 -0400
>> Steve Holden <steve at holdenweb.com> wrote:
>>> On 9/27/2010 11:27 PM, Benjamin Peterson wrote:
>>>> Tokenize only works on bytes. You can open a feature request if you
>>>> desire.
>>> Working only on bytes does seem rather perverse.
>> I agree, the morality of bytes objects could have been better :)
>> The reason for working with bytes is that source data can only be
>> correctly decoded to text once the encoding is known. The encoding is
>> determined by reading the encoding cookie.
> I certainly wouldn't be opposed to an API that accepts a string as
> well though.
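
As background, detect_encoding is the piece that reads that cookie. A
minimal sketch of calling it directly (detect_encoding is a real public
function in the 3.x tokenize module; the sample source here is just
made up for illustration):

    import io
    import tokenize

    source = b"# -*- coding: utf-8 -*-\nx = 1\n"
    # detect_encoding reads at most two lines looking for a BOM or a
    # PEP 263 coding cookie, and returns the encoding name plus the
    # raw lines it consumed.
    encoding, consumed = tokenize.detect_encoding(io.BytesIO(source).readline)
    print(encoding)  # -> 'utf-8'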

A very quick scan of _tokenize suggests it is designed to support detect_encoding returning None, indicating that the line iterator returns already decoded lines. This is confirmed by the fact that the standard library itself uses it that way (via generate_tokens).
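
In other words, if I'm reading it correctly, tokenizing already decoded
text is possible today through generate_tokens (present, though
undocumented, in the 3.x tokenize module); a quick sketch:

    import io
    import tokenize

    source = "x = 1 + 2\n"
    # generate_tokens takes a readline callable that yields str lines,
    # so no encoding detection is involved.
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        print(tok)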

An API that accepts a string, wraps a StringIO around it, and then calls _tokenize with an encoding of None would appear to be the answer here. A feature request on the tracker is the best way to make that happen.
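
Roughly like this (tokenize_str is a hypothetical name, and _tokenize
is a private helper whose (readline, encoding) signature could change,
so treat this as a sketch of the idea rather than a finished patch):

    import io
    from tokenize import _tokenize

    def tokenize_str(source):
        # Wrap the already decoded source in a StringIO and pass
        # encoding=None, which tells _tokenize the lines need no
        # decoding (the same trick generate_tokens uses).
        return _tokenize(io.StringIO(source).readline, None)

    for tok in tokenize_str("print('hello')\n"):
        print(tok)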

Cheers, Nick.

-- Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia


