[Python-Dev] What does a double coding cookie mean?

Guido van Rossum guido at python.org
Thu Mar 17 15:11:04 EDT 2016


On Thu, Mar 17, 2016 at 9:50 AM, Serhiy Storchaka <storchaka at gmail.com> wrote:

On 17.03.16 16:55, Guido van Rossum wrote:

On Thu, Mar 17, 2016 at 5:04 AM, Serhiy Storchaka <storchaka at gmail.com> wrote:

Should we recommend that everyone use tokenize.detect_encoding()?

Likely. However, the interface of tokenize.detect_encoding() is not very simple. I just found that out yesterday. You have to give it a readline() function, which is cumbersome if all you have is a (byte) string and you don't want to split it on lines just yet. And the readline() function raises SyntaxError when the encoding isn't right. I wish there were a lower-level helper that just took a line and told you what the encoding in it was, if any. Then the rest of the logic can be handled by the caller (including the logic of trying up to two lines).

The simplest way to detect the encoding of a bytes string:

lines = data.splitlines()
encoding = tokenize.detect_encoding(iter(lines).__next__)[0]

This will raise SyntaxError if the encoding is unknown. That needs to be caught in mypy's case and then it needs to get the line number from the exception. I tried this and it was too painful, so now I've just changed the regex that mypy uses to use non-eager matching (https://github.com/python/mypy/commit/b291998a46d580df412ed28af1ba1658446b9fe5).
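To illustrate the pain point above, here is a minimal sketch (my own, not from the thread) of the catch-the-SyntaxError dance a caller like mypy has to do; the helper name `detect` is hypothetical:

```python
import io
import tokenize

def detect(data):
    """Return the source encoding of a bytes blob, or None when the
    coding cookie names an unknown codec (detect_encoding raises
    SyntaxError in that case)."""
    try:
        return tokenize.detect_encoding(io.BytesIO(data).readline)[0]
    except SyntaxError:
        return None

print(detect(b"# -*- coding: utf-8 -*-\npass\n"))          # utf-8
print(detect(b"# -*- coding: no-such-codec -*-\npass\n"))  # None
```

Swallowing the exception loses the line number that mypy wanted to report, which is why a lower-level per-line helper would be nicer.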

If you don't want to split all data on lines, the most efficient way in Python 3.5 is:

encoding = tokenize.detect_encoding(io.BytesIO(data).readline)[0]

In Python 3.5 io.BytesIO(data) has constant complexity.

Ditto with the SyntaxError though.

In older versions, to detect the encoding without copying the data or splitting all of it into lines, you should write a line iterator. For example:

def iterlines(data):
    start = 0
    while True:
        end = data.find(b'\n', start) + 1
        if not end:
            break
        yield data[start:end]
        start = end
    yield data[start:]

encoding = tokenize.detect_encoding(iterlines(data).__next__)[0]

or

it = (m.group() for m in re.finditer(b'.*\n?', data))
encoding = tokenize.detect_encoding(it.__next__)[0]

I don't know which approach is more efficient.
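Both iterator variants can be checked against each other with a runnable sketch (mine, not from the thread; the sample `data` is made up). Since detect_encoding() reads at most two lines, neither variant ever materializes the whole buffer as a list:

```python
import re
import tokenize

data = b"#!/usr/bin/env python\n# coding: utf-8\nx = 1\n"

def iterlines(data):
    # Yield one line at a time without splitting the whole buffer up front.
    start = 0
    while True:
        end = data.find(b'\n', start) + 1
        if not end:
            break
        yield data[start:end]
        start = end
    yield data[start:]

enc_a = tokenize.detect_encoding(iterlines(data).__next__)[0]

# Regex variant: each match is one line (with its trailing newline, if any).
it = (m.group() for m in re.finditer(b'.*\n?', data))
enc_b = tokenize.detect_encoding(it.__next__)[0]

print(enc_a, enc_b)  # utf-8 utf-8
```

The cookie sits on the second line here, so detect_encoding() consumes exactly two lines from either iterator before returning.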

Having my own regex was simpler. :-(

--
--Guido van Rossum (python.org/~guido)


