On 1/26/2011 4:47 PM, Toshio Kuratomi wrote:
> There's one further case that I am worried about that has no real
> "transfer". Since people here seem to think that unicode module names are
> the future (for instance, the comments about redefining the C locale to
> include utf-8 and the comments about archiving tools needing to support
> encoding bits), there are eventually going to be unicode modules that become
> dependencies of other modules and programs. These will need to be installed
> on systems. Linux distributions that ship these will need to choose
> a filesystem encoding for the filenames of these. Likely the sensible thing
> for them to do is to use utf-8 since all the ones I can think of default to
> utf-8. But, as Stephen and Victor have pointed out, users change their
> locale settings to things that aren't utf-8 and save their modules using
> filenames in that encoding. When they update their OS to a version that has
> utf-8 python module names, they will find that they have to make a choice.
> They can either change their locale settings to a utf-8 encoding and have
> the system installed modules work or they can leave their encoding on their
> non-utf-8 encoding and have the modules that they've created on-site work.
>
> This is not a good position to put users of these systems in.





The way this case should work is that programs that install files
(installation is a form of transfer) should transform the file
names from the encoding used in the transfer medium to the encoding
of the filesystem on which they are installed.



Python3 should access the files, transforming the names from the
encoding of the filesystem on which they are installed to Unicode
for use by the program.
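This is, in fact, what Python 3 does via PEP 383: byte-string file names are decoded with the filesystem encoding plus the "surrogateescape" error handler, so even bytes that are invalid in that encoding survive a round trip through str and back. A minimal illustration:

```python
import os

# Python 3 (PEP 383) decodes byte-string file names with the
# filesystem encoding plus the "surrogateescape" error handler,
# so even bytes that are invalid in that encoding round-trip
# losslessly between bytes and str.
raw = b"caf\xe9.txt"             # latin-1 bytes; invalid as UTF-8
name = os.fsdecode(raw)          # a str, possibly with surrogates
assert os.fsencode(name) == raw  # lossless round trip
print(repr(name))
```

Under a UTF-8 locale the undecodable 0xE9 byte becomes the lone surrogate U+DCE9 rather than raising an error.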



I think Python3 is trying to do its part, and Victor is trying to
make that more robust on more platforms, specifically Windows.



The programs that install files (which may include programs that
install Python files; I don't know) may or may not be doing their
part, but clearly there are cases where they do not.



Systems that have different encodings for names on the same or
different file systems need to have a way to obtain the encoding for
the file names, so they can be properly decoded.  If they don't have
such a way, they are broken.



=====

The rest of this is an attempt to describe the problem of Linux and
other systems which use byte strings instead of character strings as
file names.  This is no problem as long as programs allow byte
strings as file names.  Python3 does not, for the import statement;
thus the problem is relevant for discussion here, as has been
ongoing.

=====



Since file names are defined to be byte strings, there is no way to
obtain the encoding for file names, so they cannot always be
decoded, and sometimes not properly decoded, because no one knows
which encoding was used to create them, _if any_.



Hence, Linux programs that use character strings as file names
internally and expect them to match the byte strings in the file
system are promoting a fiction: that there is a transformation
(encoding) from character strings to byte strings that will match.



When using ASCII character strings, they can be transformed to bytes
using a simple transformation: identity... but that isn't
necessarily correct, if the files were created using EBCDIC
(unlikely on Linux systems, but not impossible, since Linux files
are byte strings).
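The identity claim and its failure mode can be shown in two lines (cp037 is one of Python's standard EBCDIC codecs):

```python
# The "identity" transformation holds only if the file names were
# created as ASCII: the characters and the bytes then coincide.
assert "spam".encode("ascii") == b"spam"

# A name created under EBCDIC (here, code page 037) yields
# entirely different bytes for the same characters.
assert "spam".encode("cp037") != b"spam"
print("spam".encode("cp037"))
```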



When using non-ASCII character strings, the fiction promoted is even
bigger, and the transformation even harder.  Any 8-bit character
encoding can pretend that identity is the correct transformation,
but the result is mojibake if it isn't.  Unicode and other multi-byte
encodings have an even harder job, because there can be 8-bit
sequences that are not legal for some transformations, but are legal
for others.  This is when the fiction is exposed!
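A concrete demonstration: every byte string decodes "successfully" as latin-1, but the same bytes may be illegal UTF-8, which is the moment the fiction is exposed:

```python
# Every byte string decodes "successfully" as latin-1, but the
# same bytes may be illegal in a multi-byte encoding like UTF-8.
data = b"\xc3\x28"

try:
    data.decode("utf-8")        # 0xC3 starts a 2-byte sequence,
except UnicodeDecodeError:      # but 0x28 is not a continuation byte
    print("not valid UTF-8")

print(data.decode("latin-1"))   # mojibake: 'Ã('
```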



As the recent description of glib points out, when the file names
are read as bytes, and shown to the user for selection, possibly
using some mojibake-generating transformation to characters, the
user has a fighting chance to pick the right file, with less chance
if the transformation is lossy ('?' substitutions, etc.) and/or the
names differ only in the characters that were lost.



However, when the specification of the name is in characters (such
as for Python import, or file names specified as character constants
in any application system that provides/permits such), and there are
large numbers of transformations that could be used to convert
characters to bytes, the problem is harder, and error-prone...
programs that want to promote the fiction of using characters for
filenames must work harder.  It seems that Python on Linux is such a
program.



One technique is to have conventions agreed on by applications and
users to limit the number of encodings used on a particular system
to one (optimal) or a few, the latter requires understanding that
files created in one encoding may not be accessible by systems that
use a different one... until they are renamed.  Subsets of
applications and users can then happily share files with others of
their encoding, and with the subset of files that can be decoded
successfully by their encoding, even though it is not correct
(often ASCII, or a few mojibake characters learned for cross-subset
usage).  When multiple encodings are used without such conventions,
chaos results.



Another technique that would be amusing is to use Base64 (as Oleg
suggested), URL-encoding, or some other mapping that transforms
non-ASCII names to ASCII character sequences and the identity
mapping to obtain bytes, and then Python could ship such files to
any system, as long as it always included that mapping as one of the
encodings it would try to find files.  This would probably be the
most powerful solution, but would only need to be applied to those
systems that do not use characters for filenames.  It could, in
fact, be applied on any system that uses a subset of characters for
filenames, and hence transcends the need for Unicode support in a
file system to use Unicode names in Python3 import statements.  It
would likely be problematical for use with 3rd-party libraries,
however.
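A hypothetical sketch of such a mapping (nothing Python actually ships): percent-encode a Unicode module name into a pure-ASCII file name, for which the identity mapping to bytes is then safe on any system.

```python
from urllib.parse import quote, unquote

# Hypothetical mapping functions, for illustration only: any
# Unicode name becomes pure ASCII, and the original name is
# recoverable without knowing the filesystem's encoding.
def to_ascii_filename(module_name: str) -> str:
    return quote(module_name, safe="")

def from_ascii_filename(filename: str) -> str:
    return unquote(filename)

name = "módulo"
encoded = to_ascii_filename(name)   # 'm%C3%B3dulo' (UTF-8 based)
assert encoded.isascii()
assert from_ascii_filename(encoded) == name
```

Base64 would work the same way; percent-encoding has the advantage of leaving ASCII names mostly readable.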



Another technique would be to try each possible encoding in turn, in
some defined order, and the filesystem searched for that byte string
as a file name, possibly matching files that shouldn't have been
matched.  To limit that search, such programs could allow
configuration of a smaller ordered list of encodings to be tried to
limit the search, and a specific one to be used for the creation of
new files; this opens up the possibility of not trying the "right"
encoding, for some rogue file name.
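The search could be sketched as follows (a hypothetical helper, not Python's import machinery), including the risk that the first match is the wrong file:

```python
import os

# Sketch of the "try each encoding in turn" search: given a
# character-string name and a configured, ordered list of
# candidate encodings, look for a matching byte-string entry
# on disk.  (Hypothetical helper, for illustration only.)
def find_file(directory: bytes, name: str,
              encodings=("utf-8", "latin-1")):
    entries = set(os.listdir(directory))   # byte-string names
    for enc in encodings:
        try:
            candidate = name.encode(enc)
        except UnicodeEncodeError:
            continue    # this encoding cannot represent the name
        if candidate in entries:
            return candidate   # first match wins; it may still
    return None                # be the "wrong" file, as noted
```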



This would be an issue, and an implementation concern, for Linux
systems, but would not need to be used on systems such as MacOS or
Windows, each of which defines a particular encoding.  When mounting
filesystems that use byte string file names on systems with a
defined encoding, it should be the responsibility of the mounting
system to do such transformations, to possibly have such
configurations, mappings, or renaming facilities, and to possibly
prohibit access to files whose names cannot be transformed (of
course, one can always punt by configuring latin-1 or another
encoding that can match any byte string, but that produces mojibake,
and then there is no surety that particular files will appear to
have the name that programs expect).
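On Linux, this mount-time transformation already exists for filesystems that define an on-disk encoding; for example, vfat accepts an iocharset option (a sketch; option names vary by filesystem and kernel version):

```shell
# Tell the kernel to translate FAT's on-disk names into UTF-8
# byte strings as seen by userspace (vfat-specific options;
# other filesystems spell this differently, e.g. ntfs's nls).
mount -t vfat -o iocharset=utf8,codepage=437 /dev/sdb1 /mnt/usb
```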



Of course, Victor's patch is addressing Windows issues, and Windows
has defined encodings; it is just a matter of using the proper APIs
to see them.  The patch should be accepted.



It sounds like the current situation on Linux is that Python can
access the subset of files whose names match the locale encoding
under which it is run.  It sounds like it would be inappropriate for
Python to begin shipping files with non-ASCII names as part of its
Linux distribution, unless facilities are created or tools used to
remap non-ASCII names to the local locale encoding.  Locales that
are not ASCII supersets (in character repertoire, not encoding)
could not be supported.  Locales that do not support all the
characters used in files shipped with Python could not be supported.
Since locales vary wildly in their available non-ASCII characters,
that limits Python either to shipping ASCII names only, or to
restricting the supported locales to those that support the
characters used.



I suppose that Victor's patch would point out most or all of the
places where such transformations would have to be implemented, if
it is important to support systems having byte string file names
whose users cannot agree on a single encoding for transforming
to/from characters.