[Python-Dev] My work on Python3 and non-ascii paths is done

Victor Stinner victor.stinner at haypocalc.com
Tue Oct 19 03:53:34 CEST 2010

Hi,

Seven months after my first commit related to this issue, the full test suite of Python 3.2 passes with ASCII, ISO-8859-1 and UTF-8 locale encodings in a non-ASCII source directory. It means that Python 3.2 now correctly processes filenames in all modules, build scripts and other utilities, with any locale encoding.

General changes:

Encode/decode filenames with the locale encoding, instead of utf-8, until the filesystem encoding is set
mbcs encoding (Windows filesystem encoding) is now strict by default, whereas it ignores unencodable characters and replaces undecodable bytes in Python 3.1. The old behaviour is still available with the right error handler: 'ignore' to encode, 'replace' to decode.
tarfile uses utf-8 encoding on Windows (instead of mbcs), and the surrogateescape error handler on all OSes
sys.getfilesystemencoding() cannot be None anymore
Don't accept bytearray as filenames anymore
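
To illustrate the surrogateescape error handler mentioned in the list above, here is a minimal sketch. The file name `caf\xff.txt` is made up for the example, and utf-8 is hard-coded only to keep the result independent of the locale; it is not a claim about what encoding a given system uses.

```python
import sys

# sys.getfilesystemencoding() returns a codec name, never None.
fs_encoding = sys.getfilesystemencoding()
assert isinstance(fs_encoding, str)

# A made-up byte string that is not valid UTF-8 (0xff never occurs in UTF-8).
raw_name = b"caf\xff.txt"

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
decoded = raw_name.decode("utf-8", "surrogateescape")
print(ascii(decoded))

# Encoding with the same handler restores the original bytes.
roundtrip = decoded.encode("utf-8", "surrogateescape")
assert roundtrip == raw_name

# Without the handler, decoding is strict and fails.
try:
    raw_name.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

The round trip is what lets undecodable filenames survive a bytes to str to bytes conversion without loss.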

Changes of the Python API:

Add os.environb: bytes version of os.environ, os.getenvb() function and os.supports_bytes_environ constant
Add os.fsencode() and os.fsdecode() functions
Remove sys.setfilesystemencoding() function
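
A short sketch of how the new Python functions fit together. The file names and the `DEMO_FS_VAR` variable are hypothetical, and the undecodable-byte round trip is only exercised on POSIX, where the filesystem codec is assumed to use surrogateescape.

```python
import os

# str -> bytes -> str round trip for a plain ASCII name.
assert os.fsdecode(os.fsencode("setup.py")) == "setup.py"

# bytes pass through fsencode() unchanged, str through fsdecode() unchanged.
assert os.fsencode(b"setup.py") == b"setup.py"
assert os.fsdecode("setup.py") == "setup.py"

posix_roundtrip_ok = None
if os.name == "posix":
    # An arbitrary byte string that may not be decodable in the locale encoding.
    raw = b"data\xff.bin"
    text = os.fsdecode(raw)
    posix_roundtrip_ok = os.fsencode(text) == raw

bytes_env_value = None
if os.supports_bytes_environ:
    # os.environ and os.environb are two views of the same environment.
    os.environ["DEMO_FS_VAR"] = "value"
    bytes_env_value = os.environb[b"DEMO_FS_VAR"]
    assert os.getenvb(b"DEMO_FS_VAR") == bytes_env_value
    del os.environ["DEMO_FS_VAR"]
```

On platforms where `os.supports_bytes_environ` is false (Windows), `os.environb` and `os.getenvb()` are not available, hence the guard.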

Changes of the C API:

Add PyUnicode_EncodeFSDefault() function
Add PyUnicode_FSDecoder() ParseTuple converter
Add PySys_FormatStdout(), PySys_FormatStderr() and PyErr_WarnFormat() functions
Add PyUnicode_AsWideCharString() function: no buffer size is needed.
Add Py_UNICODE_strrchr(), Py_UNICODE_strcat(), PyUnicode_AsUnicodeCopy() and Py_UNICODE_strncmp() functions
PyUnicode_DecodeFSDefault() and PyUnicode_DecodeFSDefaultAndSize() use the surrogateescape error handler
File utilities: add _Py_wchar2char() (reverse of Py_char2wchar()), _Py_stat() and _Py_fopen() functions; move all file utilities to Python/fileutils.c
The format string of PyUnicode_FromFormat() and PyErr_Format() is now pure ASCII: an error is raised on a non-ASCII character
PyUnicode_FSConverter() doesn't accept bytearray anymore

Bugfixes:

Fix modules: tarfile, pickle, pickletools, ctypes, subprocess, bz2, ssl, profile, xmlrpclib, platform, libpython (gdb plugin), sqlite, distutils.log, locale, _warnings, zipimport, imp
Fix functions: os.exec*(), os.system(), ctypes.dlopen(), os.getenv(), os.get_exec_path()
Fix tests: test_gdb, test_httpservers, test_cmd_line, test_size, test_generic_path, test_subprocess, test_doctest, test_cmd_line_script
Fix utf-8 encoder to support error handlers that produce a unicode string (e.g. 'backslashreplace')
Fix conversion from unicode to a wide character string if Py_UNICODE and wchar_t have different sizes: UTF-16 => UTF-32 or UTF-32 => UTF-16
Fix Python command line parser if the command line contains surrogates
Avoid _PyUnicode_AsString() because it returns NULL if the string contains surrogates, or catch the error
Fix regrtest.py to support surrogate characters in the current working directory and in the tracebacks
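
As an illustration of the utf-8 encoder fix listed above, the following sketch encodes a string containing a lone surrogate (the kind produced by surrogateescape). The sample string is made up; the point is only that an error handler returning a unicode string, such as 'backslashreplace', can be used with the utf-8 encoder.

```python
# A made-up string containing the lone surrogate U+DCFF.
text = "caf\udcff"

# Strict utf-8 encoding refuses lone surrogates.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# 'backslashreplace' returns a str replacement (the escape sequence),
# which the encoder then writes out as ASCII bytes.
escaped = text.encode("utf-8", "backslashreplace")
print(escaped)
```

This is handy for printing tracebacks or paths that contain surrogates without raising a second error.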

I also wrote some tests and documentation.

The most difficult part was debugging Python initialization (Py_InitializeEx and calculate_path) and the import machinery (import.c, zipimport.c), because gdb sometimes crashes (for various reasons) and because the import machinery is fragile and difficult to understand.

Special thanks to Marc-Andre Lemburg, Martin v. Löwis, Antoine Pitrou and Amaury Forgeot d'Arc for their help, useful advice and code reviews!

-- Bonus: short story of PYTHONFSENCODING ---

In the middle of August, I created the PYTHONFSENCODING environment variable, as suggested by Marc-Andre Lemburg. Because of this variable and because Python used utf-8 until the filesystem encoding is known, I had to write ugly and fragile "redecode" functions to redecode all filenames of all objects (sys.path, sys.meta_path, sys.executable, sys.modules, all code objects, etc.).

Then I found 4 issues related to PYTHONFSENCODING, all inconsistencies between the filesystem encoding and the locale encoding. It was not easy to decide how to fix these issues, but in the end we chose to drop the PYTHONFSENCODING variable, use the locale encoding as the filesystem encoding, and always use utf-8 as the filesystem encoding on Mac OS X.

-- Victor Stinner http://www.haypocalc.com/
