[Python-Dev] My work on Python3 and non-ascii paths is done
Victor Stinner victor.stinner at haypocalc.com
Tue Oct 19 03:53:34 CEST 2010
- Previous message: [Python-Dev] Digital video basics tutorial
- Next message: [Python-Dev] My work on Python3 and non-ascii paths is done
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Hi,
Seven months after my first commit related to this issue, the full test suite of Python 3.2 passes with ASCII, ISO-8859-1 and UTF-8 locale encodings in a non-ASCII source directory. It means that Python 3.2 now correctly processes filenames in all modules, build scripts and other utilities, with any locale encoding.
General changes:
- Encode/decode filenames with the locale encoding, instead of utf-8, until the filesystem encoding is set
- mbcs encoding (Windows filesystem encoding) is now strict by default, whereas it ignored unencodable characters and replaced undecodable bytes in Python 3.1. The old behaviour is still available with the right error handler: 'ignore' to encode, 'replace' to decode.
- tarfile uses utf-8 encoding on Windows (instead of mbcs), and the surrogateescape error handler on all OSes
- sys.getfilesystemencoding() cannot be None anymore
- Don't accept bytearray as filenames anymore
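As an illustration of the surrogateescape error handler mentioned in the list above, here is a minimal sketch written against a current Python 3. The byte string and the helper names decode_filename() and encode_filename() are made up for this example, and UTF-8 is hard-coded instead of being taken from the locale.

```python
import sys

# Hypothetical filename bytes: 0xff is never valid in UTF-8.
RAW_NAME = b"caf\xff.txt"


def decode_filename(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes to str; each undecodable byte becomes a lone
    surrogate in the range U+DC80..U+DCFF instead of raising an error."""
    return raw.decode(encoding, "surrogateescape")


def encode_filename(name: str, encoding: str = "utf-8") -> bytes:
    """Reverse of decode_filename(): lone surrogates are turned back
    into the original bytes."""
    return name.encode(encoding, "surrogateescape")


name = decode_filename(RAW_NAME)
restored = encode_filename(name)

# The filesystem encoding is always a str (never None).
fs_encoding = sys.getfilesystemencoding()

print(ascii(name), restored, fs_encoding)
```

The point of the round trip is that a filename which cannot be decoded is not lost: the str form can be passed around and still maps back to the exact original bytes.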
Changes of the Python API:
- Add os.environb: bytes version of os.environ, os.getenvb() function and os.supports_bytes_environ constant
- Add os.fsencode() and os.fsdecode() functions
- Remove sys.setfilesystemencoding() function
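A short sketch of how the new Python functions listed above can be used, assuming a current Python 3. The file name "report.txt" and the helper path_entries_as_bytes() are only for illustration; os.environb and os.getenvb() exist only where os.supports_bytes_environ is true, so the sketch falls back to os.environ otherwise.

```python
import os

# os.fsdecode() converts bytes to str and os.fsencode() converts str to
# bytes, both using the filesystem encoding.
as_text = os.fsdecode(b"report.txt")
as_bytes = os.fsencode("report.txt")

# bytearray is not accepted as a filename: os.fsencode() raises TypeError.
try:
    os.fsencode(bytearray(b"report.txt"))
    bytearray_accepted = True
except TypeError:
    bytearray_accepted = False


def path_entries_as_bytes() -> list:
    """Return the entries of PATH as bytes objects (hypothetical helper)."""
    if os.supports_bytes_environ:
        # Bytes view of the environment, no decoding involved.
        raw = os.getenvb(b"PATH", b"")
    else:
        # No bytes environment on this platform: encode the str value.
        raw = os.fsencode(os.environ.get("PATH", ""))
    separator = os.fsencode(os.pathsep)
    return [entry for entry in raw.split(separator) if entry]


print(as_text, as_bytes, bytearray_accepted)
print(path_entries_as_bytes()[:3])
```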
Changes of the C API:
- Add PyUnicode_EncodeFSDefault() function
- Add PyUnicode_FSDecoder() ParseTuple converter
- Add PySys_FormatStdout(), PySys_FormatStderr() and PyErr_WarnFormat() functions
- Add PyUnicode_AsWideCharString() function: the caller doesn't need to pass a buffer size.
- Add Py_UNICODE_strrchr(), Py_UNICODE_strcat(), PyUnicode_AsUnicodeCopy() and Py_UNICODE_strncmp() functions
- PyUnicode_DecodeFSDefault() and PyUnicode_DecodeFSDefaultAndSize() use the surrogateescape error handler
- File utilities: add _Py_wchar2char() (reverse of Py_char2wchar()), _Py_stat() and _Py_fopen() functions; move all file utilities to Python/fileutils.c
- The format string of PyUnicode_FromFormat() and PyErr_Format() is now pure ASCII: an error is raised on any non-ASCII character
- PyUnicode_FSConverter() doesn't accept bytearray anymore
Bugfixes:
- Fix modules: tarfile, pickle, pickletools, ctypes, subprocess, bz2, ssl, profile, xmlrpclib, platform, libpython (gdb plugin), sqlite, distutils.log, locale, _warnings, zipimport, imp
- Fix functions: os.exec*(), os.system(), ctypes.dlopen(), os.getenv(), os.get_exec_path()
- Fix tests: test_gdb, test_httpservers, test_cmd_line, test_size, test_generic_path, test_subprocess, test_doctest, test_cmd_line_script
- Fix utf-8 encoder to support error handlers producing a unicode string (e.g. 'backslashreplace')
- Fix conversion from unicode to a wide character string if Py_UNICODE and wchar_t have different sizes: UTF-16 => UTF-32 or UTF-32 => UTF-16
- Fix Python command line parser if the command line contains surrogates
- Avoid _PyUnicode_AsString() because it returns NULL if the string contains surrogates, or catch the error
- Fix regrtest.py to support surrogate characters in the current working directory and in the tracebacks
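To make the surrogate-related fixes above more concrete, here is a small sketch (current Python 3, with a made-up sample string) showing why a str containing a lone surrogate needs care: the strict utf-8 encoder rejects it, while error handlers such as 'backslashreplace' and 'surrogateescape' produce usable output.

```python
# A str holding a lone surrogate, as produced when an undecodable byte
# (here 0xff) is decoded with the surrogateescape error handler.
text = "abc\udcff"

# The strict utf-8 encoder refuses lone surrogates.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# 'backslashreplace' returns a replacement *string* for the bad character,
# which the utf-8 encoder then has to encode itself.
escaped = text.encode("utf-8", "backslashreplace")

# 'surrogateescape' restores the original byte instead.
original = text.encode("utf-8", "surrogateescape")

print(strict_ok, escaped, original)
```

'backslashreplace' is convenient for display (for example in a traceback), while 'surrogateescape' is the one to use when the bytes must reach the operating system unchanged.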
I also wrote some tests and documentation.
The most difficult part was debugging Python initialization (Py_InitializeEx and calculate_path) and the import machinery (import.c, zipimport.c), because gdb sometimes crashes (for various reasons) and because the import machinery is fragile and difficult to understand.
Special thanks to Marc-Andre Lemburg, Martin v. Löwis, Antoine Pitrou and Amaury Forgeot d'Arc for their help, useful advice and code reviews!
-- Bonus: short story of PYTHONFSENCODING ---
In the middle of August, I created the PYTHONFSENCODING environment variable, as suggested by Marc-Andre Lemburg. Because of this variable and because Python used utf-8 until the filesystem encoding is known, I had to write ugly and fragile "redecode" functions to redecode all filenames of all objects (sys.path, sys.meta_path, sys.executable, sys.modules, all code objects, etc.).
Then I found 4 issues related to PYTHONFSENCODING, all inconsistencies between the filesystem encoding and the locale encoding. It was not easy to decide how to fix these issues, but in the end, we chose to drop the PYTHONFSENCODING variable, use the locale encoding as the filesystem encoding, and always use utf-8 as the filesystem encoding on Mac OS X.
-- Victor Stinner http://www.haypocalc.com/