[Python-Dev] Adding the 'path' module (was Re: Some RFE for review)

Neil Hodgson nyamatongwe at gmail.com
Mon Jul 11 15:55:34 CEST 2005


Guido van Rossum:

In some sense the safest approach from this POV would be to return Unicode as soon as it can't be encoded using the global default encoding. IOW normally this would return Unicode for all names containing non-ASCII characters.

On unicode versions of Windows, for functions and attributes like os.listdir, os.getcwd, sys.argv, and os.environ, which can usefully return unicode strings, there are 4 options I see:

  1. Always return unicode. This is the option I'd be happiest to use, myself, but expect this choice would change the behaviour of existing code too much and so produce much unhappiness.

  2. Return unicode when the text can not be represented in ASCII. This will cause a change of behaviour for existing code which deals with non-ASCII data.

  3. Return unicode when the text can not be represented in the default code page. While this change can lead to breakage because of combining byte string and unicode strings, it is reasonably safe from the point of view of data integrity as current code is returning garbage strings that look like '?????'.

  4. Provide two versions of the attribute, one with the current name returning byte strings and a second with a "u" suffix returning unicode. This is the least intrusive, requiring explicit changes to code to receive unicode data. For patch #1231336 I chose this approach producing sys.argvu and os.environu.

    For os.listdir the current behaviour of returning unicode when its argument is unicode can be retained but that is not extensible to, for example, sys.argv.
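
To make the difference between (2) and (3) concrete, here is a rough sketch of the decision each one makes. It is written for current Python 3, so bytes stands in for the byte string and str for unicode. The names lossy_name and native_name are made up for illustration, and "cp1252" is only an assumed example of a default code page; the real one depends on the machine.

```python
# Sketch only: bytes plays the role of the 2005 byte string, str of unicode.
# lossy_name and native_name are hypothetical helpers, not part of any patch.

def lossy_name(name, encoding):
    """Roughly the current behaviour: unencodable characters become '?'."""
    return name.encode(encoding, errors="replace")


def native_name(name, encoding):
    """Options (2) and (3): byte string when lossless, otherwise unicode.

    Pass "ascii" for option (2), or the default code page for option (3).
    """
    try:
        return name.encode(encoding)
    except UnicodeEncodeError:
        return name


if __name__ == "__main__":
    names = ["readme.txt", "caf\u00e9.txt", "\u65e5\u672c\u8a9e.txt"]
    for name in names:
        # ascii() keeps the output printable on any console encoding.
        print(ascii(lossy_name(name, "cp1252")),
              ascii(native_name(name, "ascii")),
              ascii(native_name(name, "cp1252")))
```

With these assumptions a name like "caf\u00e9.txt" comes back as unicode under (2) but stays a byte string under (3), which is why (3) disturbs less existing code on a machine whose code page covers the local language.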

Since this issue may affect many attributes a common approach should be chosen.

For experimenting with os.listdir, there is a patch for posixmodule.c at http://www.scintilla.org/difft.txt which implements (2). To specify the US-ASCII code page, the number 20127 is used as there is no definition for this in the system headers. To change to (3), comment out the line with 20127 and uncomment the line with CP_ACP. With either setting, unicode arguments still produce unicode results.

Neil


