[Python-Dev] PEP 383 and GUI libraries

Zooko O'Whielacronx zookog at gmail.com
Sat May 2 03:42:47 CEST 2009


Folks:

Being new to the use of gmail, I accidentally sent the following only to MvL and not to the list. He promptly replied with a helpful counterexample showing that my design can suffer collisions. :-)

Regards,

Zooko

On Fri, May 1, 2009 at 10:38 AM, "Martin v. Löwis" <martin at v.loewis.de> wrote:

> > Requirement: either the unicode string or the bytes are faithfully transmitted from one system to another.
>
> I don't understand this requirement very well, in particular not the "faithfully" part.
>
> > That is: if you read a filename from the filesystem, and transmit that filename to another system and use it, then there are two cases:
>
> What do you mean by "use it"? Things like opening files? How does that work? In general, a file name valid on one system is invalid on a different system - or, at least, refers to a different file over there. This is independent of encodings.

Tahoe is a backup and filesharing program, so you might, for example, execute "tahoe cp -r Motörhead tahoe:" to copy all the contents of your "Motörhead" directory to your Tahoe filesystem. Later you, or a friend, might execute "tahoe cp -r tahoe:Motörhead ." to copy everything from that directory within your Tahoe filesystem to your local filesystem. So in this case the flow of information is local_system_1 -> Tahoe -> local_system_2.

Requirement 1 is that for each filename encountered which is validly encoded in local_system_1, the resulting (unicode) name is transmitted through the Tahoe filesystem and then written out into local_system_2 in the expected way (i.e. just by using the Python unicode APIs and passing the unicode object to them).

Requirement 2 is that for each filename encountered which is not validly encoded in local_system_1, the original bytes are transmitted through the Tahoe filesystem and then, if the target system is a byte-oriented system such as Linux, the original bytes are written into the target filesystem. (If the target is not Linux then mojibake! But we don't have to go into that now.)

Does that make sense?
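
Requirements 1 and 2 can be sketched roughly as follows. This is only an illustration, not Tahoe's code and not the plan in [1]; the function name `name_for_transmission` and the tagged-tuple return value are hypothetical, and the sketch assumes Python 3, where filename bytes are `bytes` objects.

```python
def name_for_transmission(raw_name, source_encoding):
    """Decide what to transmit for one filename read as bytes.

    Hypothetical helper: returns ("unicode", str) when the bytes decode
    cleanly in the source system's encoding (Requirement 1), and
    ("bytes", bytes) with the untouched original otherwise (Requirement 2).
    """
    try:
        # Strict decoding: any invalid byte sequence raises.
        return ("unicode", raw_name.decode(source_encoding))
    except UnicodeDecodeError:
        # Not valid in the source encoding: keep the original bytes as-is.
        return ("bytes", raw_name)


# A name that is valid UTF-8 is carried as unicode.
valid = name_for_transmission("Mot\u00f6rhead".encode("utf-8"), "utf-8")

# The same name in Latin-1 is not valid UTF-8, so the bytes are carried.
invalid = name_for_transmission(b"Mot\xf6rhead", "utf-8")
```

On the receiving side, a "unicode" result would be passed to the unicode filesystem APIs, and a "bytes" result would be written unchanged only if the target is byte-oriented.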

> In all your descriptions, I'm puzzled as to where exactly you get the source bytes from. If you use the PEP 383 interfaces, you will start with character strings, not byte strings, always.

On Mac and Windows, we use the Python unicode APIs, e.g. os.listdir(u"Motörhead"). On Linux and Solaris, we use the Python bytestring APIs, e.g. os.listdir("Motörhead".encode(sys.getfilesystemencoding())).
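
A minimal sketch of that platform split, assuming Python 3 (where passing `bytes` to `os.listdir` returns `bytes` names). The function name `list_dir_names` and the use of `sys.platform` to pick the branch are assumptions for illustration, not Tahoe's actual code.

```python
import os
import sys


def list_dir_names(dirname):
    """List a directory, choosing the API flavour by platform.

    Hypothetical helper: `dirname` is a str. On Mac and Windows the
    unicode API is used and names come back as str; elsewhere the
    bytes API is used and names come back as the raw bytes on disk.
    """
    if sys.platform in ("win32", "darwin"):
        return os.listdir(dirname)
    # Byte-oriented systems: encode the path so listdir returns bytes.
    return os.listdir(dirname.encode(sys.getfilesystemencoding()))
```

With the bytes branch, a name that is not valid in the filesystem encoding still comes back intact, which is where the source bytes in Requirement 2 would come from.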

> > Okay, I find it surprisingly easy to make subtle errors in this encoding stuff, so please let me know if you spot one. Is it true that srcbytes.encode(srcencoding, 'python-escape').decode('utf-8', 'python-escape') will always produce srcbytes?
>
> I think you mixed up bytes and unicode here: if srcbytes is indeed a bytes object, then you can't apply .encode to it.

Yep, I reversed the order of encode() and decode(). However, my whole statement was utterly wrong and shows that I still don't fully get it. I have flip-flopped again and currently think that PEP 383 is useless for this use case and that my original plan [1] is still the way to go. Please let me know if you spot a flaw in my plan or a ridiculousity in my requirements, or if you see a way that PEP 383 can help me.
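
A short sketch of why the statement fails even with decode() and encode() in the right order. It assumes Python 3.1 or later and uses the `surrogateescape` error handler, on the assumption that this is the name under which the handler called 'python-escape' in this thread was eventually released. The helper `roundtrip` is hypothetical.

```python
# "Motörhead" encoded as Latin-1; the byte 0xF6 is not valid UTF-8.
src = b"Mot\xf6rhead"


def roundtrip(raw, decode_with, encode_with):
    """Decode bytes to str, then encode back, escaping undecodable bytes."""
    text = raw.decode(decode_with, "surrogateescape")
    return text.encode(encode_with, "surrogateescape")


# Same codec in both directions: the original bytes come back.
same_codec = roundtrip(src, "utf-8", "utf-8")

# Source encoding on the way in, UTF-8 on the way out: 0xF6 decodes to
# a real character, which UTF-8 then encodes as two different bytes.
mixed_codec = roundtrip(src, "latin-1", "utf-8")
```

So the round trip preserves the bytes only when the same codec is used in both directions, which does not hold when the source and target systems use different encodings.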

Thank you very much.

Regards,

Zooko

[1] http://allmydata.org/trac/tahoe/ticket/534#comment:47


