[Python-Dev] What does a double coding cookie mean?

M.-A. Lemburg mal at egenix.com
Thu Mar 17 14:35:02 EDT 2016


On 17.03.2016 18:53, Serhiy Storchaka wrote:

> On 17.03.16 19:23, M.-A. Lemburg wrote:

>> On 17.03.2016 15:02, Serhiy Storchaka wrote:

>>> On 17.03.16 15:14, M.-A. Lemburg wrote:

>>>> On 17.03.2016 01:29, Guido van Rossum wrote:

>>>>> Should we recommend that everyone use tokenize.detect_encoding()?

>>>> I'd prefer a separate utility for this somewhere, since
>>>> tokenize.detect_encoding() is not available in Python 2. I've attached
>>>> an example implementation with tests, which works in Python 2.7 and 3.

>>> Sorry, but this code doesn't match the behaviour of Python interpreter,
>>> nor other tools. I suggest to backport tokenize.detect_encoding() (but
>>> be aware that the default encoding in Python 2 is ASCII, not UTF-8).

>> Yes, I got the default for Python 3 wrong. I'll fix that. Thanks for the
>> note. What other aspects are different than what Python implements ?

> 1. If there is a BOM and coding cookie, the source encoding is
> "utf-8-sig".

Ok, that makes sense (even though it's not mandated by the PEP; the utf-8-sig codec didn't exist yet).
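For reference, a quick check of what detect_encoding() reports on Python 3
when both a BOM and a cookie are present (just a sketch, Python 3 only):

import io
import tokenize

# File starting with a UTF-8 BOM followed by a utf-8 coding cookie
source = b'\xef\xbb\xbf# -*- coding: utf-8 -*-\nprint("hi")\n'
encoding, lines = tokenize.detect_encoding(io.BytesIO(source).readline)
print(encoding)   # 'utf-8-sig'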

> 2. If there is a BOM and coding cookie is not 'utf-8', this is an error.

It's an error for Python, but why should a detection function always raise an error for this case ? It would probably be a good idea to have an errors parameter to leave this to the user to decide.

Same for unknown encodings.
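Something along these lines is what I have in mind (just a sketch; the
function name and signature are made up for illustration, this is not the
attached code):

import codecs

def resolve_bom_and_cookie(data, cookie, errors='strict'):
    # Sketch: decide the final source encoding from the raw bytes and an
    # already-parsed coding cookie (or None if there was no cookie).
    has_bom = data.startswith(codecs.BOM_UTF8)
    if has_bom and cookie is not None and codecs.lookup(cookie).name != 'utf-8':
        if errors == 'strict':
            raise ValueError('BOM conflicts with coding cookie %r' % cookie)
        # errors='ignore': trust the BOM and carry on
        return 'utf-8-sig'
    if has_bom:
        return 'utf-8-sig'
    return cookie or 'utf-8'   # default; 'ascii' for Python 2 semantics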

> 3. If the first line is not blank or comment line, the coding cookie is
> not searched in the second line.

Hmm, the PEP does allow having the coding cookie in the second line, even if the first line is not a comment. Perhaps that's not really needed.
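A quick check with tokenize on Python 3 (if I read the current code right,
the second-line cookie is only honored when the first line is blank or a
comment):

import io
import tokenize

# First line is code, cookie only on the second line:
source = b'import sys\n# -*- coding: latin-1 -*-\n'
encoding, lines = tokenize.detect_encoding(io.BytesIO(source).readline)
print(encoding)   # 'utf-8' - the second-line cookie is not picked up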

> 4. Encoding name should be canonized. "UTF8", "utf8", "utf_8" and "utf-8"
> is the same encoding (and all are changed to "utf-8-sig" with BOM).

Well, that's cosmetics :-) The codec system will take care of this when needed.
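The codec registry already does the normalization when you ask it to, e.g.:

import codecs

for name in ('UTF8', 'utf8', 'utf_8', 'utf-8'):
    print(name, '->', codecs.lookup(name).name)
# all four print 'utf-8' as the canonical name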

> 5. There isn't the limit of 400 bytes. Actually there is a bug with
> handling long lines in current code, but even with this bug the limit is
> larger.

I think it's a reasonable limit, since shebang lines may only be 127 characters long on at least Linux (and probably several other Unix systems as well).

But just in case, I made this configurable :-)

> 6. I made a mistake in the regular expression, missed the underscore.

I added it.
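For reference, the cookie pattern in tokenize.py is roughly the following;
the \w takes care of the underscore:

import re

# roughly tokenize.py's cookie_re; \w covers letters, digits and '_'
cookie_re = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)')

m = cookie_re.match('# -*- coding: utf_8 -*-')
print(m.group(1))   # 'utf_8'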

> tokenize.detect_encoding() is the closest imitation of the behavior of
> Python interpreter.

Probably, but that doesn't help us on Python 2, right ?
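For the simple cases, a small fallback would already go a long way (a
sketch only; it skips the BOM and first-line checks and is not a full
backport):

import io
import re

try:
    from tokenize import detect_encoding   # Python 3
except ImportError:                         # Python 2: minimal fallback
    _cookie_re = re.compile(r'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)')

    def detect_encoding(readline):
        lines = [readline(), readline()]
        for line in lines:
            m = _cookie_re.match(line.decode('ascii', 'replace'))
            if m:
                return m.group(1), lines
        return 'ascii', lines   # Python 2 default

encoding, _ = detect_encoding(io.BytesIO(b'# coding: latin-1\n').readline)
print(encoding)   # the cookie's encoding (normalized by the stdlib version)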

I'll upload the script to github later today or tomorrow to continue development.

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Experts (#1, Mar 17 2016)

Python Projects, Coaching and Consulting ... http://www.egenix.com/
Python Database Interfaces ................. http://products.egenix.com/
Plone/Zope Database Interfaces ............. http://zope.egenix.com/


2016-03-07: Released eGenix pyOpenSSL 0.13.14 ... http://egenix.com/go89
2016-02-19: Released eGenix PyRun 2.1.2 ......... http://egenix.com/go88

::: We implement business ideas - efficiently in both time and costs :::

eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
http://www.egenix.com/company/contact/
http://www.malemburg.com/


