Issue 1608805: Py_FileSystemDefaultEncoding can be non-canonical
On Linux/Unix it is possible for Py_FileSystemDefaultEncoding to be set to a non-canonical encoding such as "UTF-8" instead of "utf-8". This happens when it is set from codeset in Py_InitializeEx() in pythonrun.c.
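To illustrate what "canonical" means here, the following is a minimal Python-level sketch, not the C code in pythonrun.c. It assumes the codec registry's normalized name (as returned by codecs.lookup()) is the canonical form; the helper names canonical_encoding_name and locale_codeset are hypothetical and used only for illustration.

```python
import codecs
import locale


def canonical_encoding_name(name):
    """Return the codec registry's normalized name for an encoding alias."""
    # codecs.lookup() accepts aliases in any case and raises LookupError
    # for unknown encodings; CodecInfo.name is the normalized spelling.
    return codecs.lookup(name).name


def locale_codeset():
    """Return the locale's codeset string, or None if it is unavailable."""
    # nl_langinfo() and CODESET do not exist on every platform.
    if hasattr(locale, "nl_langinfo") and hasattr(locale, "CODESET"):
        return locale.nl_langinfo(locale.CODESET)
    return None


if __name__ == "__main__":
    print(canonical_encoding_name("UTF-8"))
    print(locale_codeset())
```

The value reported by locale_codeset() depends on the environment and on whether the locale has been initialized, so it is shown here only as the kind of raw string that would need normalizing.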
This becomes a problem when this value is propagated through to PyUnicode_Decode() or PyUnicode_AsEncodedString() in unicodeobject.c. One such code path starts in os.listdir() and goes via PyUnicode_FromEncodedObject().
In that case, the common-case optimizations fail because the encoding name does not match the canonical spelling they check for. I noticed this in a case where the PyCodec_Decode() call used instead was failing. Normally I think this just amounts to a broken optimization, but given the likelihood of other such code being added in the future, I feel it's best to fix Py_FileSystemDefaultEncoding so that it is always in canonical form.
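The effect can be modelled in a few lines of Python. This is a toy sketch under the assumption that the shortcut is an exact, case-sensitive comparison against "utf-8"; it is not the actual code in unicodeobject.c, and decode_with_fast_path is a hypothetical name.

```python
import codecs


def decode_with_fast_path(data, encoding):
    """Toy model of a decoder that special-cases an exact encoding name."""
    if encoding == "utf-8":
        # Shortcut: taken only when the name matches exactly.
        return "fast", data.decode("utf-8")
    # Fallback: generic lookup through the codec registry.
    decoder = codecs.lookup(encoding).decode
    text, _consumed = decoder(data)
    return "generic", text


if __name__ == "__main__":
    print(decode_with_fast_path(b"abc", "utf-8")[0])
    print(decode_with_fast_path(b"abc", "UTF-8")[0])
```

Both calls decode the same bytes correctly, but only the canonical spelling takes the shortcut; the non-canonical one depends on the generic codec machinery being available and working.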
One possible way to fix it is attached as a patch.
It appears to be specific to 2.x and does not occur under Python 3.0:
Python 3.0 (r30:67503, Jan 15 2009, 09:27:16)
[GCC 4.0.3 (Ubuntu 4.0.3-1ubuntu5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getfilesystemencoding()
'utf-8'

Python 2.6.1 (r261:67515, Dec 11 2008, 11:59:39)
[GCC 4.0.3 (Ubuntu 4.0.3-1ubuntu5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getfilesystemencoding()
'UTF-8'

Python 2.5.4 (r254:67916, Mar 16 2009, 09:34:35)
[GCC 4.0.3 (Ubuntu 4.0.3-1ubuntu5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getfilesystemencoding()
'UTF-8'
(This is on an Ubuntu system where LANG=en_US.UTF-8 is the default.)