
Representing Unix filenames in Unicode


From: "Antoine Leca" <Antoine10646@Leca-Marti.org>
> What I meant was more like "Humans should not do things that (perhaps
> other) humans cannot understand later."
>
> And as Doug Ewell said, trying to correct these kinds of actions is often
> fruitless, and very often misleading if not broken.

Completely agree. Trying to fix UTF-8 for such a thing is bogus at the level
of its basic design, because it breaks UTF-8's inherent stability and
completeness for its intended purpose.

What Chris Jacobs and Hans Aberg are trying to defend is a bad design
decision: trying to mix in the same representation two things that belong to
distinct implementation levels. UTF-8 is meant to represent Unicode-encoded
texts. Nothing more.

If you need to represent other kinds of data in some text representation, you
need an upper-layer protocol on top of UTF-8, but you MUST NOT break UTF-8
itself by relaxing some of its encoding rules. (When doing that, you may think
you are creating a bijection, but you are wrong as soon as you admit that
there are exceptions: those unhandled names are even more dangerous from a
security perspective!)

Upper-layer protocols already do exist today, and they do provide a TRUE
(and PROVEN) bijection with ALL possible filenames supported by ALL
filesystems:
* shell escaping syntaxes
* various MIME encodings (including "Quoted-Printable", although I don't like
the way it uses the = sign, as it interacts very badly with shell commands)
* URL encoding syntaxes (notably with the "file:" URI namespace prefix)

I would recommend the third option for interaction with filesystems,
because it can be degraded cleanly to simpler (and user-friendly) syntaxes
for filenames that do not cause problems, notably filenames that use
strict UTF-8 encoding, in the stable NFC form, not starting with "file:" and
not containing confusable or invisible format control characters. For
filenames that do not respect those conditions, the URL encoding will always
be non-confusable.
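
A rough Python sketch of this "degrade cleanly" rule follows. The function names to_text and to_bytes are hypothetical. The set of rejected Unicode general categories is an assumption that stands in for "invisible format control characters"; detecting confusable characters would need additional data and is not attempted here.

```python
import unicodedata
from urllib.parse import quote_from_bytes, unquote_to_bytes

PREFIX = "file:"
# Assumption: general categories treated as unsafe to show literally
# (controls, format characters, unassigned, private use, surrogates,
# line and paragraph separators).
RISKY_CATEGORIES = {"Cc", "Cf", "Cn", "Co", "Cs", "Zl", "Zp"}


def to_text(name: bytes) -> str:
    """Show a filename literally when it is harmless, else as 'file:' + escapes."""
    try:
        text = name.decode("utf-8")  # strict decoding, no relaxed rules
    except UnicodeDecodeError:
        text = None
    if (
        text is not None
        and unicodedata.normalize("NFC", text) == text
        and not text.startswith(PREFIX)
        and not any(unicodedata.category(c) in RISKY_CATEGORIES for c in text)
    ):
        return text
    # Fallback: percent-encode every byte outside the unreserved set.
    return PREFIX + quote_from_bytes(name, safe="")


def to_bytes(text: str) -> bytes:
    """Inverse of to_text: recover the exact filename bytes."""
    if text.startswith(PREFIX):
        return unquote_to_bytes(text[len(PREFIX):])
    return text.encode("utf-8")


print(to_text("r\u00e9sum\u00e9.txt".encode("utf-8")))  # shown literally
print(to_text(b"caf\xe9"))                              # file:caf%E9
print(to_text(b"file:x"))                               # file:file%3Ax
```

Because a literal name can never start with "file:", the two forms cannot be mistaken for each other, and to_bytes(to_text(name)) returns the original bytes in both cases.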

The third option also interacts cleanly with shell commands under Unix
(it inherently allows escaping further characters that may have special
syntactic meanings in a shell, such as quotation marks, dollar signs, braces,
pipes...).

For Windows, where the "%" sign has a special meaning in command lines, one
could replace it with "$", and make sure that literal % and $ signs in
filenames are both URL-encoded ("$" is also used under Unix/Linux shells for
variable substitution, in a way quite comparable to "%" on Windows). Yes, it
breaks the normal URL encoding, but this would only create an alternate
URL-encoding form; alternatively, one could use a shell other than COMMAND or
CMD. In any case, this would not affect filesystem APIs that accept URLs
instead of native filenames.
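
The "$" variant could be sketched as below. This is only an illustration of the idea, not an existing standard: the names encode_dollar and decode_dollar are hypothetical, and Python's urllib.parse is assumed as the underlying percent-encoder. Since literal "%" and "$" are both escaped first, the only "%" signs left in the encoded string are escape markers, so swapping them for "$" stays reversible.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def encode_dollar(name: bytes) -> str:
    """Percent-encode, then use '$' instead of '%' as the escape marker."""
    # safe="" guarantees that literal '%' and '$' become %25 and %24.
    return quote_from_bytes(name, safe="").replace("%", "$")


def decode_dollar(text: str) -> bytes:
    """Inverse of encode_dollar; expects input produced by encode_dollar."""
    return unquote_to_bytes(text.replace("$", "%"))


sample = b"100% $HOME.txt"  # hypothetical filename
alt = encode_dollar(sample)
print(alt)                           # 100$25$20$24HOME.txt
print(decode_dollar(alt) == sample)  # True
```

The result contains no "%" at all, which is the point for CMD-style command lines; it would still need quoting in a Unix shell, where "$" is the special character.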



This archive was generated by hypermail 2.1.5: Wed Nov 30 2005 - 09:52:20 CST