[Python-Dev] PEP 383 and Tahoe [was: GUI libraries]

Stephen J. Turnbull stephen at xemacs.org
Sun May 3 11:32:38 CEST 2009


Zooko O'Whielacronx writes:

However, it is moot because Tahoe is not a new system. It is currently at v1.4.1, has a strong policy of backwards-compatibility, and already has lots of data, lots of users, and programmers building on top of it.

Cool!

Question: is there a way to negotiate versions, or better yet, features?

I see I'm not explaining the Tahoe requirements clearly. It's probably that I'm not understanding them clearly myself.

Well, it's a high-dimensional problem. Keeping track of all the variables is hard. That's why something like PEP 383 can be important to you even though it's only a partial solution; it eliminates one variable.

Suppose you have run "tahoe cp -r myfiles/ tahoe:" on a Linux system and then you inspect the files in the Tahoe filesystem, such as by examining the web interface [1] or by running "tahoe ls", either of which you could do either from the same machine where you ran "tahoe cp" or from a different machine (which could be using any operating system). We have the following requirements about what ends up in your Tahoe directory after that cp -r.

Whoa! Slow down! Where's "my" "Tahoe directory"? Do you mean the directory listing? A copy to whatever system I'm on? The bytes that the Tahoe host has just loaded into a network card buffer to tell me about it? The bytes on disk at the Tahoe host? You'll find it a lot easier to explain things if you adopt a precise, consistent terminology.

Requirement 1 (unicode): Each filename that you see needs to be valid unicode

What does "see" mean? In directory listings? Under what circumstances, if any, can what I see be different from what I get?

Requirement 2 (faithful if unicode): For each filename (byte string) in your myfiles directory,

My local myfiles directory, or my Tahoe myfiles directory?

if that bytestring is the valid encoding of some string in your stated locale,

Who stated the locale? How? Are you referring to what getfilesystemencoding returns? This is a "(unicode) string", right?

then the resulting filename in Tahoe is that (unicode) string. Nobody ever doesn't want this, right? Well, maybe some people don't want this sometimes, [...]. However, what's the alternative? Guessing that their locale shouldn't be set to latin-1 and instead decoding their bytes some other way?

Sure. Emacsen do that, you know. Of course it's hard to guess something else if ISO-8859/1 is the preferred encoding, but it does happen. This probably cannot be done accurately enough for Tahoe, though.

It seems like we're not going to do better than requirement 2 (faithful if unicode).

Requirement 3 (no file left behind): For each filename (byte string) in your myfiles directory, whether or not that byte string is the valid encoding of anything in your stated locale, then that file will be added into the Tahoe filesystem under some name (a good candidate would be mojibake, e.g. decode the bytes with latin-1, but that is not the only possibility).

That's not even a possibility, actually. Technically, Latin-1 has a "hole" from U+0080 to U+009F. You need to add the C1 controls to fill in that gap. (I don't think it actually matters in practice, everybody seems to implement ISO-8859/1 as though it contained the control characters ... except when detecting encodings ... but it pays to be precise in these things ....)
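
A quick illustration, assuming Python 3, whose 'latin-1' codec fills the hole the way everybody does in practice:

    # Python's 'latin-1' codec maps every byte 0x00-0xFF to the code point
    # with the same value, C1 controls included, so the decode never fails
    # and always round-trips.
    c1 = bytes(range(0x80, 0xA0))
    decoded = c1.decode('latin-1')
    assert decoded == ''.join(chr(b) for b in range(0x80, 0xA0))
    assert decoded.encode('latin-1') == c1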

Now already we can say that these three requirements mean that there can be collisions -- for example a directory could have two entries, one of which is not a valid encoding in the locale, and whatever unicode string we invent to name it with in order to satisfy requirements 3 (no file left behind) and 1 (unicode) might happen to be the same as the (correctly-encoded) name of the other file.

This is false with rather high probability, but you need some extra structure to deal with it. First, claim the Unicode private planes for Tahoe. Then allocate characters from the private planes on demand as encountered, including such characters encountered in external file names to be stored in Tahoe and the surrogates used by PEP 383. "Display names" using these private characters would be valid Unicode, but not very useful. However, an algorithmically generated font (like the 4-hex-digit-square used to give a glyph to unknown code points in the BMP) could be used by those who care.

Also store mappings from (system encoding, UTF-8b representation) to private char and back. For simplicity, that could be global on your server (IIRC, there are at least two private planes up there, so you'd need to run into almost 128Ki unique such characters to run out).
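
A minimal sketch of that registry, assuming Python 3 and PEP 383's surrogateescape handler. None of the names below (PrivateRegistry, display_name) are real Tahoe or Python APIs, and persistence of the registry is hand-waved:

    import sys

    PRIVATE_START = 0xF0000   # first code point of the plane-15 private-use area
    PRIVATE_END = 0x10FFFD    # last code point of the plane-16 private-use area

    class PrivateRegistry:
        """Hypothetical global registry: (encoding, raw byte) <-> private char."""
        def __init__(self):
            self._next = PRIVATE_START
            self._forward = {}   # (encoding, raw byte) -> code point
            self._reverse = {}   # code point -> (encoding, raw byte)

        def allocate(self, encoding, raw):
            key = (encoding, raw)
            if key not in self._forward:
                if self._next > PRIVATE_END:
                    raise RuntimeError("private-use planes exhausted")
                self._forward[key] = self._next
                self._reverse[self._next] = key
                self._next += 1
            return chr(self._forward[key])

        def lookup(self, char):
            return self._reverse[ord(char)]

    def display_name(name_bytes, registry, encoding=None):
        """Valid-Unicode display name: decode with surrogateescape, then trade
        each lone surrogate (one undecodable byte) for a registered
        private-use character."""
        encoding = encoding or sys.getfilesystemencoding()
        out = []
        for ch in name_bytes.decode(encoding, 'surrogateescape'):
            if 0xDC80 <= ord(ch) <= 0xDCFF:          # PEP 383 surrogate
                raw = bytes([ord(ch) - 0xDC00])      # the original byte
                out.append(registry.allocate(encoding, raw))
            else:
                out.append(ch)
        return ''.join(out)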

I guess you'd be subject to a DoS attack where somebody decided to map all 80,000-odd CNS characters into private space, and then write 80,000 files, each with a different 1-character name ....

Note that Martin does not do this in PEP 383 because PEP 383 only cares about the semantics that a filename read from a directory can be used to access the file associated with it in that directory. For that, a private, non-Unicode encoding is perfectly acceptable. But you want valid Unicode. This scheme gives it to you.

The registry of characters is somewhat unpleasant, but it does allow you to detect filenames that are the same reliably.

Possible Requirement 4 (faithful bytes if not unicode, a.k.a. "round-tripping"):

PEP 383 gives you this, but you must store the encoding used for each such file name.

One reason to be skeptical is that about a third of the Russian files will happen to decode cleanly as shift-jis anyway, and will therefore come out as something entirely different if the target filesystem's encoding is something other than shift-jis.

The only way to handle this is to store the encoding used to convert to Unicode as part of every file's metadata. This could also be used in Tahoe to warn the user that the current system encoding does not match the alleged_encoding used to make the backup. Some users might prefer to use the alleged_encoding on restore.
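
In code, and assuming PEP 383's surrogateescape handler is available, the round-trip plus the warning might look something like this (the metadata keys are illustrative, not Tahoe's actual schema):

    import sys

    def decode_for_storage(name_bytes):
        # Record which encoding produced the Unicode form; surrogateescape
        # keeps undecodable bytes as lone surrogates instead of failing.
        enc = sys.getfilesystemencoding()
        return {'filename': name_bytes.decode(enc, 'surrogateescape'),
                'alleged_encoding': enc}

    def encode_for_restore(entry):
        enc = sys.getfilesystemencoding()
        if enc != entry['alleged_encoding']:
            print("warning: stored with %r, restoring under %r"
                  % (entry['alleged_encoding'], enc))
        # Re-encoding with the *stored* encoding reproduces the exact
        # original bytes for the usual stateless locale encodings.
        return entry['filename'].encode(entry['alleged_encoding'],
                                        'surrogateescape')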

But an even worse problem -- the show-stopper for me -- is that I don't want what Tahoe shows when you do "tahoe ls" or view it in a web browser to differ from what it writes out when you do "tahoe cp -r tahoe: newfiles/".

But as a requirement, that's incoherent. What you are "seeing" is Unicode, what it will write out is bytes. That means that if multiple locales are in use on both the backup and restore systems, and the nominal system encodings are different, people whose personal default locales are not the same as the system's will see what they expect on the backup system (using system ls), mojibake on Tahoe (using tahoe ls), and different mojibake on the restore system (system ls, again).

Note that "use Tahoe, not system, ls" doesn't help at all (unless the weirdo has learned to read mojibake, which actually does happen, but it's not worth betting on).

How likely is that? Hate to tell you this: if you need the "unknown bytes" scheme at all, this scenario is extremely likely. How do you think that KOI8-R got into a directory on a Shift-JIS system in the first place? Yup, a Russian visiting professor in Tokyo who set his personal locale to ru_RU.KOI8-R wrote it there. And he's very likely to have the same personal locale on a very up-to-date system with a UTF-8 system encoding when he gets back to Moscow. Bingo! It's mojibake all the way to Moscow.

Now about the "metadata" part which is separate from the filename itself. I have another requirement:

Requirement 5 (no loss of information): I don't want Tahoe to destroy information -- every transformation should be (in principle) reversible by some future computer-augmented archaeologist. For example, if a bytestring decodes cleanly with the locale's suggested encoding, and we use the resulting unicode as the filename, then we also store the original byte string in the metadata since we don't know if the locale's suggested encoding was good.

UTF-8b would be just as good for storing the original bytestring, as long as you keep the original encoding. It's actually probably preferable if PEP 383 can be assumed to be implemented in the versions of Python you use.
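
For example (assuming Python 3.1's surrogateescape handler), a KOI8-R byte string that is not valid UTF-8 still comes back byte-for-byte from its utf-8b form, so nothing is lost; the recorded encoding is only needed to interpret those bytes later:

    raw = '\u0444\u0430\u0439\u043b'.encode('koi8_r')   # "файл" in KOI8-R
    utf8b = raw.decode('utf-8', 'surrogateescape')      # lone surrogates, one per bad byte
    assert utf8b.encode('utf-8', 'surrogateescape') == raw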

This allows the later invention of a tool

It will be called "Emacs", by the way.

which shows the user what the filename would have been with other encodings and let the user choose one that makes sense.

To copy an entry from a local filesystem into Tahoe:

  1. On Windows or Mac read the filename with the unicode APIs. Normalize the string with filename = unicodedata.normalize('NFC', filename). Leave the "original_bytes" key and the "failed_decode" flag out of the metadata.

NFD is probably better for fuzzy matching and display on legacy terminals.
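
A tiny illustration of the difference, using Python's unicodedata module:

    import unicodedata

    decomposed = 'e\u0301'      # 'e' + COMBINING ACUTE ACCENT, as HFS+ tends to return names
    nfc = unicodedata.normalize('NFC', decomposed)
    assert nfc == '\u00e9'                                   # one precomposed code point
    assert unicodedata.normalize('NFD', nfc) == decomposed   # NFD goes the other way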

  2. On Linux or Solaris read the filename with the string APIs, and store the result in the "original_bytes" part of the metadata. Call sys.getfilesystemencoding() to get an alleged_encoding. Then, call bytes.decode(alleged_encoding, 'strict') to try to get a unicode object.

2.a. If this decoding succeeds then normalize the unicode filename with filename = unicodedata.normalize('NFC', filename), store the resulting filename and leave the "failed_decode" flag out of the metadata.

Per the koi8-lucky example, you don't know if it succeeded for the right reason or the wrong reason. You really should store the alleged_encoding used in the metadata, always.

Note that you should also store the failed_decode flag, because the presence of multiple failed_decodes is a very strong indication that some of the users had default encoding != system encoding. If you use the scheme I propose above, of course you have the same information by scanning the file name for Tahoe-only private use characters, but that would be relatively expensive.

2.b. If this decoding fails, then we decode it again with bytes.decode('latin-1', 'strict'). Do not normalize it. Store the resulting unicode object into the "filename" part, set the "failed_decode" flag to True. This is mojibake!

Not necessarily. Most ISO-8859/X names will fail to decode if the alleged_encoding is UTF-8, for example, but many (even for X != 1) will be correctly readable because of the policy of trying to share code points across Latin-X encodings. Certainly ISO-8859/1 (and much ISO-8859/15) will be correct.
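
For instance (byte values from the standard ISO-8859 tables):

    # 0xE9 is 'é' in both Latin-1 and Latin-9/15, so this Latin-9 name
    # survives the latin-1 fallback intact ...
    assert b'caf\xe9'.decode('latin-1') == b'caf\xe9'.decode('iso8859_15') == 'caf\xe9'
    # ... but the encodings do differ in a few positions, e.g. 0xA4
    # (currency sign in Latin-1, euro sign in Latin-9/15).
    assert b'\xa4'.decode('latin-1') != b'\xa4'.decode('iso8859_15')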

  3. (handling collisions) In either case 2.a or 2.b the resulting unicode string may already be present in the directory. If so, check the failed_decode flags on the current entry and the new entry. If they are both set or both unset then the new entry overwrites the old entry -- they had the same name.

If both are set, you're OK, because you are forcing ISO-8859/1. If both are unset, however, you don't know for sure because alleged_encoding is not necessarily a constant.
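
Putting steps 2 through 3 together as code, assuming Python 3; the function names, metadata keys, and the policy for the mixed set/unset collision case are illustrative, not what Tahoe actually does:

    import sys
    import unicodedata

    def entry_for_posix_name(name_bytes):
        """Steps 2, 2.a and 2.b for a name read with the byte-string APIs."""
        alleged_encoding = sys.getfilesystemencoding()
        try:
            filename = name_bytes.decode(alleged_encoding, 'strict')
            filename = unicodedata.normalize('NFC', filename)     # 2.a
            failed_decode = False
        except UnicodeDecodeError:
            filename = name_bytes.decode('latin-1', 'strict')     # 2.b: mojibake, not normalized
            failed_decode = True
        return {'filename': filename,
                'original_bytes': name_bytes,
                'alleged_encoding': alleged_encoding,   # stored per the comments above
                'failed_decode': failed_decode}

    def add_entry(directory, entry):
        """Step 3: overwrite only when both names were produced the same way."""
        old = directory.get(entry['filename'])
        if old is None or old['failed_decode'] == entry['failed_decode']:
            directory[entry['filename']] = entry
        else:
            # One name came from a clean decode, the other from the latin-1
            # fallback; the thread leaves this conflict policy open.
            raise ValueError('collision on %r' % entry['filename'])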

To copy an entry from Tahoe into a local filesystem:

Always use the Python unicode API. The original_bytes field and the failed_decode field in the metadata are not consulted.
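
A minimal sketch of that direction (hypothetical helper, not Tahoe code): the Unicode name is handed straight to Python's filesystem APIs, which encode it for the local system.

    import os

    def restore_entry(target_dir, entry, data):
        # Only the Unicode 'filename' is consulted; original_bytes and
        # failed_decode stay in the metadata.
        path = os.path.join(target_dir, entry['filename'])
        with open(path, 'wb') as f:
            f.write(data)
        return path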

Now a question for python-dev people: could utf-8b or PEP 383 be useful for requirements like the four requirements listed above? If not, what requirements does PEP 383 help with?

By giving you a standard, invertible way to represent anything that the OS can throw at you, it helps with all of them.

I'm not sure that it can help if you are going to store the results of your os.listdir() persistently or if you are going to transmit them over a network. Indeed, using the results that way could lead to unpleasant surprises.

No more than any other system for giving a canonical Unicode spelling to the results of an OS call.


