[Python-Dev] My work on Python3 and non-ascii paths is done

Victor Stinner victor.stinner at haypocalc.com
Wed Oct 20 02:11:35 CEST 2010


On Tuesday 19 October 2010 16:12:56, Barry Warsaw wrote:

> Going forward, is there adequate documentation, guidelines, and safeguards for future coders so that they Do The Right Thing with new code? Perhaps a short How To in the standard documentation would be helpful, with links to it from any old/bad API calls?

Hum, as usual, I suggest decoding all inputs to Unicode as early as possible, and encoding back to bytes (or another native format) at the last moment. For filenames, this means that PyUnicode_FSDecoder() is better than PyUnicode_FSConverter(), because it gives a Unicode object (instead of a byte string), so the function will support unencodable characters.

Use PyUnicode_EncodeFSDefault() / PyUnicode_DecodeFSDefault() and os.fsencode() / os.fsdecode() to encode/decode filenames instead of your own function, to support PEP 383 (undecodable bytes <=> surrogate characters).
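A minimal sketch of the os.fsdecode() / os.fsencode() round trip on the Python side. The filename bytes are a made-up example, and the exact decoded string depends on the filesystem encoding of the machine; the sketch assumes a POSIX system, where the surrogateescape error handler is used.

```python
import os

# Hypothetical filename bytes: 0xE9 is "é" in Latin-1, but it is not
# valid UTF-8 on its own, so it may be undecodable on this system.
raw = b"caf\xe9.txt"

# Decode as early as possible: bytes -> str using the filesystem encoding.
# On POSIX, undecodable bytes become lone surrogate characters (PEP 383).
name = os.fsdecode(raw)

# Encode at the last moment: str -> bytes. Surrogates produced by
# fsdecode() are turned back into the original bytes.
roundtrip = os.fsencode(name)

print(ascii(name))
print(roundtrip == raw)
```

The point of using these two functions rather than a hand-written .decode()/.encode() pair is that they pick the filesystem encoding and error handler for you.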

Also be careful to support undecodable bytes (on OSes other than Windows), e.g. try a filename with a non-ASCII character under the C locale (ASCII locale encoding). Even with a utf-8 filesystem encoding, this problem may occur on a system that is not correctly configured (e.g. a USB key with a FAT filesystem using the "wrong" encoding).
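To see what an undecodable byte looks like without changing your locale, the ASCII case can be simulated with the codec machinery directly. This is only an illustration: the filename is invented, and the "ascii" codec plus the surrogateescape handler stand in for what Python does with an ASCII locale encoding.

```python
# Hypothetical filename as stored on disk, containing the non-ASCII byte 0xE9.
raw = b"caf\xe9.txt"

# Simulate an ASCII filesystem encoding with the PEP 383 error handler:
# the byte 0xE9 cannot be decoded, so it is mapped to the lone surrogate U+DCE9.
decoded = raw.decode("ascii", "surrogateescape")

# Encoding with the same codec and handler restores the original byte.
restored = decoded.encode("ascii", "surrogateescape")

print(ascii(decoded))   # 'caf\udce9.txt'
print(restored == raw)  # True
```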

If you would like to avoid all encoding issues on filenames on UNIX/BSD, use bytes: os.environb, os.listdir(b'.'), os.getcwdb(), etc.
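A small sketch of the bytes variants mentioned above. It assumes a POSIX system; os.environb only exists when os.supports_bytes_environ is true, so the sketch checks that flag first. The HOME variable is just an example key.

```python
import os

# Current directory as bytes, with no decoding involved.
cwd = os.getcwdb()

# A bytes argument gives bytes results: filenames are returned exactly
# as the OS stores them.
entries = os.listdir(cwd)

# os.environb is only available on platforms with a bytes environment.
if os.supports_bytes_environ:
    home = os.environb.get(b"HOME")  # bytes or None
else:
    home = None

print(cwd)
print(entries[:3])
print(home)
```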

Be careful with the utf-8 codec: its default mode (strict error handler) refuses to encode surrogate characters. E.g. print(filename) may raise a UnicodeEncodeError. Use repr(filename) to escape surrogate characters.
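For illustration, here is the strict utf-8 failure and the repr() workaround on a made-up filename that contains one surrogate character, as os.fsdecode() or os.listdir() might return it:

```python
# Hypothetical filename containing the lone surrogate U+DCE9.
filename = "caf\udce9.txt"

# The strict utf-8 codec refuses to encode a lone surrogate.
try:
    filename.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# repr() escapes the surrogate, so the result is safe to print or log.
safe = repr(filename)

# With the surrogateescape handler, the original byte is recovered instead.
original = filename.encode("utf-8", "surrogateescape")

print(strict_ok)  # False
print(safe)       # 'caf\udce9.txt'
print(original)   # b'caf\xe9.txt'
```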

I plan to fix the Python documentation: specify the encoding used to decode all byte string arguments of the C API. I have already written a draft patch: issue #9738. This lack of documentation was a big problem for me, because I had to follow the function calls to find the encoding.

-- Victor Stinner http://www.haypocalc.com/
