[Python-Dev] Import and unicode: part two

Victor Stinner victor.stinner at haypocalc.com
Thu Jan 20 03:51:05 CET 2011

On Wednesday 19 January 2011 at 18:07 -0800, Toshio Kuratomi wrote:

Saying that multiple encodings on a single system is a misconfiguration every time it comes up does not make it true.

Yes, each filesystem can have its own encoding. For example, this is supported by Linux. Python doesn't support such a configuration, but this limitation is wider than the import machinery. If you consider it important enough, please open an issue.

To the existing list I'd add getting a package from pypi -- neither tar nor zip files contain encoding information about the filenames.

ZIP contains a flag to indicate the filename encoding: cp437 or UTF-8.
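
As a minimal sketch of that flag (not part of the patch under discussion): the standard `zipfile` module exposes the general purpose flags as `ZipInfo.flag_bits`, and bit 11 (0x800) marks a UTF-8 filename. The helper name `filename_flags` and the constant `UTF8_FLAG` are made up for this illustration.

```python
import io
import zipfile

# General purpose bit 11 of the ZIP format: "filename is UTF-8".
# The constant name is an assumption for this sketch.
UTF8_FLAG = 0x800


def filename_flags(names):
    """Write empty members into an in-memory ZIP, then report for each
    member whether the UTF-8 filename flag is set when reading it back."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, "")
    buf.seek(0)
    with zipfile.ZipFile(buf) as zf:
        return {info.filename: bool(info.flag_bits & UTF8_FLAG)
                for info in zf.infolist()}


flags = filename_flags(["plain.py", "café.py"])
```

With `zipfile` as the writer, the ASCII name is stored without the flag and the non-ASCII name is stored as UTF-8 with the flag set. Archives produced by other tools may not set it.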

TAR has an extension called "PAX" which stores filenames as UTF-8. But yes, most tarballs store filenames as raw byte strings.
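
A small sketch of the PAX case, assuming the standard `tarfile` module and its `PAX_FORMAT` constant. The helper name `pax_round_trip` is hypothetical. It writes one empty member with a non-ASCII name, checks that the UTF-8 bytes of the name appear in the archive, and reads the name back.

```python
import io
import tarfile


def pax_round_trip(name="café.py"):
    """Store an empty member in an in-memory PAX tarball and read it back.

    Returns the member names found on reading, and whether the UTF-8
    encoded name appears in the raw archive bytes.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        info = tarfile.TarInfo(name)
        info.size = 0
        tar.addfile(info, io.BytesIO(b""))
    raw = buf.getvalue()
    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r") as tar:
        names = tar.getnames()
    return names, name.encode("utf-8") in raw


names, stored_as_utf8 = pax_round_trip()
```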

Anyway, if you would like to share your code on PyPI, you should not use non-ASCII module names (or any other non-ASCII name/identifier :-)).

Python 3 supports non-ASCII identifiers (PEP 3131), but it is up to the developer to decide whether to use them, depending on the audience. For a lesson at school, it is nice to write examples in the native language, instead of using "raw" English with ASCII identifiers and filenames. In a school, you can use the same configuration (encoding) on all computers.

> > * Specify an encoding per platform and stick to that.
>
> It doesn't work: on UNIX/BSD, the user chooses its own encoding and all
> programs will use it.

(...) This prevents getting a mixture of encodings of modules (...)

If you have an issue with encodings, you have to fix it when you create a module (on disk), not when you load a module (by then it is too late).

(...) I mean something at the python code level::

import café encodedas('latin1')

Import a module using its byte name? You mean that the café filename was encoded not with the Python filesystem encoding but with another (wrong) encoding when the module was created. As written before, you should fix your filename, instead of using an (ugly) workaround in Python.

I haven't looked at your patch so perhaps you have an ingenious method of translating from the unicode representation of the module in the import statement to the bytes in arbitrary encodings on the filesystem that I haven't thought of.

On Windows, my patch tries to avoid any conversion: it uses unicode everywhere.

On other OSes, it uses the Python filesystem encoding to encode a module name (as is done for any other operation on the filesystem with a unicode filename).

Python 3 supports bytes filenames in order to read, copy, or delete undecodable filenames: filenames stored in an encoding different from the system encoding, or otherwise broken filenames. It is also possible to access these files with unicode strings using PEP 383 (surrogate characters). This is useful for running Python on an old system.
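
To illustrate the PEP 383 mechanism, here is a minimal sketch that uses the `surrogateescape` error handler directly, with UTF-8 assumed as the filesystem encoding so the result does not depend on the local configuration. The helper names `decode_filename` and `encode_filename` are made up for the example.

```python
def decode_filename(raw, encoding="utf-8"):
    """Decode a bytes filename; each undecodable byte becomes a lone
    surrogate in the range U+DC80..U+DCFF (PEP 383)."""
    return raw.decode(encoding, "surrogateescape")


def encode_filename(name, encoding="utf-8"):
    """Encode back to bytes; lone surrogates are restored to the
    original undecodable bytes."""
    return name.encode(encoding, "surrogateescape")


# A filename created with the "wrong" encoding: Latin-1 instead of UTF-8.
latin1_bytes = "café".encode("latin-1")   # b'caf\xe9'
text = decode_filename(latin1_bytes)      # the byte 0xE9 becomes U+DCE9
round_tripped = encode_filename(text)
```

The round trip is lossless, so such a file can still be opened or removed, but the decoded name is not the text "café", which is why an import by name cannot find the file.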

If you don't, however, then really - ASCII-only seems like the sanest of the three solutions I can think of.

But a (Python 3) module is not supposed to have a broken filename. If it does, you had better fix its name, instead of trying to fix the problem later (in Python).

With a UTF-8 filesystem encoding (e.g. on Mac OS X and most Linux setups), it is already possible to use non-ASCII module names.
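
A rough sketch of that last point, assuming a Python 3 interpreter whose filesystem encoding can represent the module name. The helper `import_non_ascii_module` and the `GREETING` attribute are invented for the illustration. It creates a `café.py` file in a temporary directory and imports it by name.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_non_ascii_module(name="café"):
    """Create <name>.py in a temporary directory and import it.

    Returns the module's GREETING attribute, or None when the filesystem
    encoding cannot encode the name (e.g. an ASCII-only configuration).
    """
    try:
        name.encode(sys.getfilesystemencoding())
    except UnicodeEncodeError:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp, name + ".py")
        source.write_text("GREETING = 'bonjour'\n", encoding="utf-8")
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            module = importlib.import_module(name)
            return module.GREETING
        finally:
            # Undo the changes to the interpreter state.
            sys.path.remove(tmp)
            sys.modules.pop(name, None)


result = import_non_ascii_module()
```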

Victor
