[Python-Dev] Adding the 'path' module (was Re: Some RFE for review)

M.-A. Lemburg mal at egenix.com
Tue Jul 12 10:37:14 CEST 2005

Previous message: [Python-Dev] Adding the 'path' module (was Re: Some RFE for review)
Next message: [Python-Dev] Adding the 'path' module (was Re: Some RFE for review)
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Neil,

2) Return unicode when the text can not be represented in ASCII. This will cause a change of behaviour for existing code which deals with non-ASCII data.

+1 on this one (s/ASCII/Python's default encoding). I assume you mean the result of sys.getdefaultencoding() here.

Yes.

The default encoding is the encoding that Python assumes when auto-converting a string to Unicode. It is normally set to ASCII, but a user may want to use a different encoding.

However, we've always made it very clear that the user is on his own when changing the ASCII default to something else.

Unless much of the Python library is modified to use the default encoding, this will break. The problem is that different implicit encodings are being used for reading data and for accessing files. When calling a function, such as open, with a byte string, Python passes that byte string through to Windows, which interprets it as being encoded in CP_ACP. When this differs from sys.getdefaultencoding() there will be a mismatch.

As I said: code pages are evil :-)

Say I have been working on a machine set up for Australian English (or other Western European locale) but am working with Russian data so have set Python's default encoding to cp1251. With this simple script, g.py:

    import sys
    print file(sys.argv[1]).read()

I process a file called '€.txt' with contents "European Euro" to produce

    C:\zed>pythond g.py €.txt
    European Euro

With the proposed modification, sys.argv[1] u'\u20ac.txt' is converted through cp1251

Actually, it is not: if you pass in a Unicode argument to one of the file I/O functions and the OS supports Unicode directly or at least provides the notion of a file system encoding, then the file I/O should use the Unicode APIs of the OS or convert the Unicode argument to the file system encoding. AFAIK, this is how posixmodule.c already works (more or less).

I was suggesting that OS filename output APIs such as os.listdir() should return strings, if the filename matches the default encoding, and Unicode, if not.
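
The output rule suggested above can be sketched roughly as follows. This is only an illustration in present-day Python 3 terms, where `bytes` stands in for the 2005 string type and `str` for Unicode; `present_filename` is a hypothetical helper, not an existing API, and os.listdir() does not behave this way.

```python
def present_filename(name, default_encoding="ascii"):
    """Hypothetical helper: return a byte string when the name fits the
    default encoding, and the Unicode text itself otherwise."""
    try:
        return name.encode(default_encoding)
    except UnicodeEncodeError:
        # The name cannot be represented in the default encoding,
        # so hand back Unicode instead of a lossy byte string.
        return name


for name in ["readme.txt", "\u20ac.txt"]:
    print(repr(present_filename(name)))
```

With the default encoding left at ASCII, only the plain ASCII name comes back as a byte string; if the assumed default were cp1251, the Euro name would fit as well.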

On input, file I/O APIs should accept both strings using the default encoding and Unicode. How these inputs are then converted to suit the OS is up to the OS abstraction layer, e.g. posixmodule.c.

Note that the posixmodule currently does not recode string arguments: it simply passes them to the OS as-is, assuming that they are already encoded using the file system encoding. Changing this is easy, though: instead of using the "et" getargs format specifier, you'd have to use "es". The latter recodes strings based on the default encoding assumption to whatever other encoding you specify.
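
As a rough sketch of what "es"-style recoding would mean on input, again in Python 3 terms (`bytes` for the old string type, `str` for Unicode): `to_fs_bytes` and its parameters are hypothetical names chosen for illustration, and this is not how posixmodule.c is actually written.

```python
def to_fs_bytes(name, default_encoding="ascii", fs_encoding="utf-8"):
    """Hypothetical helper: normalise a filename argument to bytes in the
    file system encoding, whichever kind of string was passed in."""
    if isinstance(name, bytes):
        # Byte strings are assumed to use the default encoding, so decode
        # them first rather than passing them through untouched.
        name = name.decode(default_encoding)
    # Unicode (or freshly decoded text) is encoded for the OS.
    return name.encode(fs_encoding)


# A cp1251 byte string recoded for a cp1252 file system encoding.
print(to_fs_bytes(b"\x88.txt", "cp1251", "cp1252"))
```

The "et" behaviour corresponds to skipping the decode step and returning the byte string unchanged.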

to '\x88.txt' as the Euro is located at 0x88 in CP1251. The operating system is then asked to open '\x88.txt' which it interprets through CP_ACP to be u'\u02c6.txt' ('ˆ.txt') which then fails. If you are very unlucky there will be a file called 'ˆ.txt' so the call will succeed and produce bad data.
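
The byte-level mismatch described here can be reproduced without touching the file system. A minimal sketch in Python 3 syntax, assuming cp1252 as the Windows ANSI code page of a Western European machine (the variable names are illustrative only):

```python
# The Unicode filename as it arrives on the command line.
euro_name = "\u20ac.txt"

# Encoding through the default encoding chosen for the Russian data.
as_cp1251 = euro_name.encode("cp1251")

# The OS reinterprets the same bytes through its own code page.
misread = as_cp1251.decode("cp1252")

print(repr(as_cp1251))   # bytes handed to the OS
print(ascii(misread))    # name the OS actually looks for
```

The round trip turns the Euro sign into U+02C6, which is why the open call either fails or silently picks up a different file.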

Simulating with str(sys.argvu[1]):

    C:\zed>pythond g.py €.txt
    Traceback (most recent call last):
      File "g.py", line 2, in ?
        print file(str(sys.argvu[1])).read()
    IOError: [Errno 2] No such file or directory: '\x88.txt'

See above: this is what I'd consider a bug in posixmodule.c

-1: code pages are evil and the reason why Unicode was invented in the first place. This would be a step back in history.

Features used to specify files (sys.argv, os.environ, ...) should match functions used to open and perform other operations with files as they do currently. This means their encodings should match.

Right. However, most of these APIs currently either don't make any assumption about the strings' contents and simply pass them around, or they assume that these strings use the file system encoding - which, like in the example you gave above, can be different from the default encoding.

To untie this Gordian Knot, we should use strings and Unicode like they are supposed to be used (in the context of text data):

* strings are fine for text data that is encoded using the default encoding
* Unicode should be used for all text data that is not or cannot be encoded in the default encoding

Later on in Py3k, all text data should be stored in Unicode and all binary data in some new binary type.

-- Marc-Andre Lemburg eGenix.com

Professional Python Services directly from the Source (#1, Jul 12 2005)

Python/Zope Consulting and Support ...        http://www.egenix.com/
mxODBC.Zope.Database.Adapter ...             http://zope.egenix.com/
mxODBC, mxDateTime, mxTextTools ...        http://python.egenix.com/

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! ::::
