[Python-Dev] open(): set the default encoding to 'utf-8' in Python 3.3?
M.-A. Lemburg mal at egenix.com
Wed Jun 29 12:20:42 CEST 2011
Victor Stinner wrote:
> On Wednesday, 29 June 2011 at 10:18 +0200, M.-A. Lemburg wrote:
>> Victor Stinner wrote:
>>> On Tuesday, 28 June 2011 at 16:02 +0200, M.-A. Lemburg wrote:
>>>> How about a more radical change: have open() in Py3 default to opening the file in binary mode if no encoding is given (even if the mode doesn't include 'b')?
>>> I tried your suggested change: Python doesn't start.
>> No surprise there: it's an incompatible change, but one that undoes a wart introduced in the Py3 transition. Guessing encodings should be avoided whenever possible.
> It means that all programs written for Python 3.0, 3.1, 3.2 will stop working with the new 3.x version (let's say 3.3). Users will have to migrate from Python 2 to Python 3.2, and then migrate again from Python 3.2 to Python 3.3 :-(
I wasn't suggesting doing this for 3.3, but we may want to start the usual feature change process to make the change eventually happen.
> I would prefer a ResourceWarning (emitted if the encoding is not specified), hidden by default: it doesn't break compatibility, and -Werror gives exactly the behaviour that you expect.
ResourceWarning is the wrong type of warning for this. I'd suggest using a UnicodeWarning or perhaps creating a new EncodingWarning instead.
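The warning-based approach discussed above could look roughly like the following. This is only a sketch: open_checked is a hypothetical wrapper name, not an existing function, and UnicodeWarning is used simply because it already exists as a builtin category.

```python
import warnings


def open_checked(file, mode="r", encoding=None, **kwargs):
    """Hypothetical wrapper around open().

    Emits a UnicodeWarning when a text-mode call does not name an
    encoding, then delegates to the builtin open() unchanged.
    """
    if "b" not in mode and encoding is None:
        warnings.warn(
            "text-mode open() called without an explicit encoding",
            UnicodeWarning,
            stacklevel=2,  # attribute the warning to the caller
        )
    return open(file, mode, encoding=encoding, **kwargs)
```

Running a program with -W error::UnicodeWarning would turn such a warning into an exception, which approximates the stricter behaviour without breaking existing code by default. A dedicated EncodingWarning category, as suggested above, would slot in the same way.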
>> This demonstrates that Python's stdlib is still not being explicit about the encoding issues. I suppose that things just happen to work because we mostly use ASCII files for configuration and setup.
> I did more tests. I found some mistakes, and sometimes binary mode can be used, but most functions really expect the locale encoding (it is the correct encoding to read and write files). I agree that it would be good to have an explicit encoding="locale", but making it mandatory is a little bit rude.
Again: using a locale-based default encoding will not work out in the long run. We've had those discussions many times in the past.
I don't think there's anything bad about requiring the user to set an encoding if they want to read text. It makes them think twice about the encoding issue, which is good.
And, of course, the stdlib should start using this explicit-is-better-than-implicit approach as well.
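As an illustration of what being explicit means in practice, here is a small sketch. The file name and sample text are made up for the example; it assumes only the builtin open() and the standard locale module.

```python
import locale
import os
import tempfile

text = "na\u00efve caf\u00e9"  # sample text containing non-ASCII characters

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")

    # The writer names the encoding instead of relying on the locale.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # The reader names the same encoding, so the round trip is deterministic.
    with open(path, encoding="utf-8") as f:
        round_trip = f.read()

    # Binary mode does no decoding at all: the caller gets bytes.
    with open(path, "rb") as f:
        raw = f.read()

# If the locale encoding really is what is wanted, it can be named explicitly.
locale_encoding = locale.getpreferredencoding(False)
```

Either way the choice is visible at the call site, and no guessing happens behind the caller's back.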
>>> Then I tried my suggestion (use "utf-8" by default): Python starts correctly, I can build it (run "make") and... the full test suite passes without any change. (I'm testing on Linux, my locale encoding is UTF-8.)
>> I bet it would also with "ascii" in most cases. Which then just means that the Python build process and test suite are not a good test case for choosing a default encoding. Linux is also a poor test candidate for this, since most user setups will use UTF-8 as locale encoding. Windows, OTOH, uses all sorts of code page encodings (usually not UTF-8), so you are likely to hit the real problem cases a lot more easily.
> I also ran the test suite on my patched Python (open uses UTF-8 by default) with an ASCII locale encoding (LANG=C): the test suite also passes. Many tests use non-ASCII characters; some of them are skipped if the locale encoding is unable to encode the tested text.
Thanks for checking. So the build process and test suite are indeed not suitable test cases for the problem at hand. With just ASCII files to decode, Python will simply never fail to decode the content, regardless of whether you use an ASCII, UTF-8 or some Windows code page as locale encoding.
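The point about ASCII-only content can be checked directly. The sketch below uses made-up sample bytes and takes cp1252 as one example of a Windows code page; it decodes the same bytes under three encodings and compares the results.

```python
encodings = ["ascii", "utf-8", "cp1252"]

# Pure ASCII content, as in a typical configuration file.
ascii_bytes = b"[section]\nname = value\n"
ascii_results = {enc: ascii_bytes.decode(enc) for enc in encodings}
all_same = len(set(ascii_results.values())) == 1

# The same experiment with one non-ASCII character, stored as UTF-8.
non_ascii_bytes = "caf\u00e9".encode("utf-8")

non_ascii_results = {}
for enc in encodings:
    try:
        non_ascii_results[enc] = non_ascii_bytes.decode(enc)
    except UnicodeDecodeError:
        non_ascii_results[enc] = None  # this encoding cannot decode the data

print(all_same)
print(non_ascii_results)
```

For the ASCII bytes all three decodings agree, so a test run over such files says nothing about the default. With the non-ASCII bytes, "ascii" fails, "utf-8" recovers the text, and "cp1252" silently produces different characters.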
-- Marc-Andre Lemburg eGenix.com