Representing Unix filenames in Unicode


From: "Marcin 'Qrczak' Kowalczyk" <qrczak@knm.org.pl>
> "Philippe Verdy" <verdy_p@wanadoo.fr> writes:
>> If you want to keep the compatibility with null-ended byte strings,
>> may be the alternative using really non-character code points might
>> help.
>
> What do you mean by "compatibility with null-ended byte streams"?
>
> The point in using U+0000 as the escape character is that it does not
> appear when filenames are converted to Unicode using pure UTF-8. And
> it's the only such code point (unless we count surrogates, but abusing
> them would be worse).
>
> This means that any filename which can be decoded using pure UTF-8,
> decodes to the same string using UTF-8-with-escaped-bytes. And any
> string which can be encoded into a filename using pure UTF-8 at all
> (i.e. consisting only of code points U+0001..U+D7FF or U+E000..U+10FFFF)
> encodes to the same string using UTF-8-with-escaped-bytes.
>
>> Really, you cannot reach a full bijection for those cases:
>
> Actually it would be possible, but it's hard to design a bijection
> with sensible properties like preserving concatenation and preserving
> ASCII fragments.
>
> But I don't need a bijection: it's acceptable when there are Unicode
> strings which can't be used as filenames. It's already the case in
> pure UTF-8 (due to U+0000 and "/").

Note that the following definition of dirent is already compatible with
extensions that would allow storing meta-data after the filename in the same
dirent entry:

struct dirent {
    ino_t d_ino;             /* inode number in the volume */
    off_t d_off;             /* byte offset of the next non-empty directory
                                entry in the actual file system directory
                                (this can be used for seeking into the
                                directory at absolute positions) */
    unsigned short d_reclen; /* total length of THIS record (including
                                padding bytes) */
    char d_name[1];          /* variable-length data for the name, ending
                                with a nul byte. Max length for the name
                                including the nul is MAXNAMLEN, but this
                                does not include possible alignment
                                padding bytes */
};

Note how d_reclen already includes the required terminating nul byte and the
padding bytes. Nothing forbids the filesystem from including more "padding"
bytes and using them to store metadata, such as an indicator of the encoding
with which the filename was created.
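
To make the arithmetic concrete, here is a minimal sketch in C of how the
spare bytes after the name could be counted. It assumes a record laid out
like the struct above; struct demo_dirent and demo_padding_len() are
hypothetical names invented for this illustration, and real systems lay out
their dirent differently.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the struct dirent discussed in this message.
   Fixed basic types are used so that the sketch does not depend on any
   particular system's <dirent.h>. */
struct demo_dirent {
    unsigned long  d_ino;
    long           d_off;
    unsigned short d_reclen;
    char           d_name[1];
};

/* Returns the number of bytes in the record that follow the terminating
   nul of d_name, or 0 if the record is malformed. */
static size_t demo_padding_len(const unsigned char *rec)
{
    size_t name_off = offsetof(struct demo_dirent, d_name);
    unsigned short reclen;
    const unsigned char *name, *nul;

    assert(rec != NULL);

    /* memcpy avoids alignment assumptions about the raw buffer. */
    memcpy(&reclen, rec + offsetof(struct demo_dirent, d_reclen),
           sizeof reclen);
    if (reclen <= name_off)
        return 0;                 /* no room for a name at all */

    name = rec + name_off;
    nul = memchr(name, '\0', reclen - name_off);
    if (nul == NULL)
        return 0;                 /* name is not terminated inside the record */

    return (size_t)((rec + reclen) - (nul + 1));
}
```

Whatever this returns is the room that a filesystem or C library could, under
the proposal, use for metadata without changing the visible structure.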

Note also that the dirent structure is not the physical one used in UFS (in
UFS the "d_off" field is not stored, and the other fields may be ordered
differently.) The application is not exposed to the physical format of
directory entries.

The OS only provides "d_off" as a way to allow seeking to an absolute
position in the directory file, but it states nothing about how d_off is
correlated with d_reclen. So d_off may be a simple counter incremented by 1
for each directory entry, or it may be a block number, where each dirent
structure is allocated on the filesystem as an exact multiple of a block
size which is not exposed here. Nothing in this structure indicates which
encoding is actually used in the underlying filesystem either.
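
As an illustration of treating the position as opaque, the following sketch
uses the POSIX calls telldir() and seekdir() rather than interpreting d_off
itself. demo_seek_roundtrip() is a hypothetical helper written for this
message, and it assumes the directory is not modified while it runs.

```c
#define _XOPEN_SOURCE 700   /* ask for telldir()/seekdir() declarations */
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if seeking back to a saved position yields the same entry
   again, 0 if it does not, and -1 on error. */
static int demo_seek_roundtrip(const char *path)
{
    char saved[1024];
    struct dirent *ent;
    long pos;
    int same;
    DIR *dir;

    assert(path != NULL);
    dir = opendir(path);
    if (dir == NULL)
        return -1;

    pos = telldir(dir);            /* opaque cookie; its value is not interpreted */
    ent = readdir(dir);
    if (ent == NULL) {
        closedir(dir);
        return -1;
    }
    snprintf(saved, sizeof saved, "%s", ent->d_name);

    while (readdir(dir) != NULL)   /* walk on to the end of the directory */
        ;

    seekdir(dir, pos);             /* go back using only the saved cookie */
    ent = readdir(dir);
    same = (ent != NULL && strcmp(ent->d_name, saved) == 0);

    closedir(dir);
    return same;
}
```

The code never does arithmetic on the position, which is the only safe
approach given that its relation to d_reclen is unspecified.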

So how can this structure be used in applications? Simple: d_reclen is the
total size of the filesystem-independent record, including the "d_ino",
"d_off", and "d_reclen" fields, and up to MAXNAMLEN bytes for the
nul-terminated name in d_name. But d_name can be longer than MAXNAMLEN on
the actual filesystem. The above structure is never directly used physically.
One can include another field in it to store meta-data info (such as the
encoding of the name in d_name...)

On older kernels, this meta-data field could be a single additional byte
stored in the padding area after the first nul byte in d_name, with a magic
value: for example 08 for UTF-8, given that padding bytes after that first
nul should all be zeroes.
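
A rough sketch of how a library might read such a marker follows. Everything
named demo_* or DEMO_* is hypothetical, the record layout is the simplified
one from the struct above, and no existing kernel or C library is known to
implement this convention; only the value 08 for UTF-8 comes from the
proposal itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_TAG_NONE 0x00   /* padding is all zeroes: encoding unknown */
#define DEMO_TAG_UTF8 0x08   /* magic value proposed above for UTF-8 */

/* Hypothetical stand-in for the struct dirent discussed in this message. */
struct demo_dirent {
    unsigned long  d_ino;
    long           d_off;
    unsigned short d_reclen;
    char           d_name[1];
};

/* Returns the byte stored right after the first nul of d_name, or
   DEMO_TAG_NONE if the record has no padding byte there. */
static unsigned char demo_encoding_tag(const unsigned char *rec)
{
    size_t name_off = offsetof(struct demo_dirent, d_name);
    unsigned short reclen;
    const unsigned char *nul;

    assert(rec != NULL);

    memcpy(&reclen, rec + offsetof(struct demo_dirent, d_reclen),
           sizeof reclen);
    if (reclen <= name_off)
        return DEMO_TAG_NONE;

    nul = memchr(rec + name_off, '\0', reclen - name_off);
    if (nul == NULL || nul + 1 >= rec + reclen)
        return DEMO_TAG_NONE;    /* unterminated name, or no byte after the nul */

    return nul[1];
}
```

An entry written by an older system would have zeroes in its padding and
report DEMO_TAG_NONE, so in this sketch untagged names remain
distinguishable from tagged ones.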

Really, Unix filesystems can be fixed, and I don't see why this is not done
so that applications become aware of that feature (for example, GLIBC could
interpret the presence of the magic byte above to know how to convert the
dirent entry unambiguously to the encoding currently set in the user's POSIX
locale, and applications can/should use POSIX functions to create files
under those conventions).



This archive was generated by hypermail 2.1.5: Tue Nov 29 2005 - 12:42:36 CST