On Apr 6, 2016 1:26 AM, "Chris Angelico" <rosuav@gmail.com> wrote:
\>
\> On Wed, Apr 6, 2016 at 3:37 PM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
\> > Chris Angelico writes:
\> >
\> > > Outside of deliberate tests, we don't create files on our disks
\> > > whose names are strings of random bytes;
\> >
\> > Wishful thinking. First, names made of control characters have often
\> > been deliberately used by miscreants to conceal their warez. Second,
\> > in some systems it's all too easy to create paths with components in
\> > different locales (the place I've seen it most frequently is in NFS
\> > mounts). I think that's much less true today, but perhaps that's only
\> > because my employer figured out that it was much less pain if system
\> > paths were pure ASCII so that it mostly didn't matter what encoding
\> > users chose for their subtrees.
\>
\> Control characters are still characters, though. You can take a
\> bytestring consisting of byte values less than 32, decode it as UTF-8,
\> and have a series of codepoints to work with.
\>
\> If your employer has "solved" the problem by restricting system paths
\> to ASCII, that's a fine solution for a single system with a single
\> ASCII-compatible encoding; a better solution is to mandate UTF-8 as
\> the file system encoding, as that's what most people are expecting
\> anyway.
\>
\> > It remains important to be able to handle nearly arbitrary bytestrings
\> > in file names as far as I can see. Please note that 100 million
\> > Japanese and 1 billion Chinese by and large still prefer their
\> > homegrown encodings (plural!!) to Unicode, while many systems are now
\> > defaulting filenames to UTF-8. There's plenty of room remaining for
\> > copying bytestrings to arguments of open and friends.
\>
\> Why exactly do they prefer these other encodings? Are they
\> representing characters that Unicode doesn't contain? If so, we have a
\> fundamental problem (no Python program is going to be able to cope
\> with these, without a third party library or some stupid mess of local
\> code); if not, you can always represent it as Unicode and encode it as
\> UTF-8 when it reaches the file system. Re-encoding is something that's
\> easy when you treat something as text, and impossible when you treat
\> it as bytes.
\>
\> So far, you're still actually agreeing with me: paths are \*text\*, but
\> sometimes we don't know the encoding (and that's a problem to be
\> solved).
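To make the quoted points concrete, here is a minimal Python 3 sketch. The file names and the choice of Shift JIS are illustrative assumptions, not taken from the thread. It shows that control bytes decode cleanly as UTF-8, that re-encoding is simple once a name is text, and that the surrogateescape error handler lets a name of unknown encoding round-trip through str without losing bytes.

```python
# Control characters are still characters: byte values below 32 are valid UTF-8.
control_bytes = bytes(range(1, 32))
control_text = control_bytes.decode("utf-8")

# Re-encoding is easy once the name is text (illustrative Shift JIS file name).
sjis_name = "日本語.txt".encode("shift_jis")
text_name = sjis_name.decode("shift_jis")   # bytes -> text, encoding known
utf8_name = text_name.encode("utf-8")       # text -> bytes for the file system

# When the encoding is NOT known, surrogateescape smuggles undecodable bytes
# through str as lone surrogates so the original bytes can be recovered.
raw = b"caf\xe9.txt"                        # Latin-1 bytes, invalid as UTF-8
as_text = raw.decode("utf-8", "surrogateescape")
round_tripped = as_text.encode("utf-8", "surrogateescape")
```

The last step is lossless but does not recover the intended characters; that still requires knowing (or guessing) the original encoding.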
Re: bytestrings, Unicode, and encodings after e.g. os.path.split / Path.split:

From "\[Python-ideas\] Type hints for text/binary data in Python 2+3 code":
https://mail.python.org/pipermail/python-ideas/2016-March/038869.html
\>> Would/will it be possible to use typing.Text as a base class for
\>> even-more abstract string types?

https://mail.python.org/pipermail/python-ideas/2016-March/039016.html
\>> \* Text.encoding
\>> \* Text.lang (urn:ietf:rfc:3066)

... I forgot to CC:
\>> \* https://tools.ietf.org/html/rfc5646
\>> "Tags for Identifying Languages"
\>> urn:ietf:rfc:5646

Is this (Path) a narrower case of string types (#strypes), because after transformations we want to preserve string metadata such as the encoding?
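A rough sketch of what "preserving string metadata after transformations" could look like, assuming a hypothetical str subclass. TaggedText, its encoding and lang attributes, and split\_path are names invented here for illustration; nothing like this exists in the typing module.

```python
import posixpath


class TaggedText(str):
    """Hypothetical str subclass that carries encoding and language metadata."""

    def __new__(cls, value, encoding=None, lang=None):
        self = super().__new__(cls, value)
        self.encoding = encoding  # e.g. "shift_jis"
        self.lang = lang          # e.g. an RFC 5646 language tag such as "ja"
        return self

    def _retag(self, value):
        # Re-attach this object's metadata to a derived plain string.
        return TaggedText(value, self.encoding, self.lang)

    def split_path(self):
        # Like os.path.split (POSIX flavour), but both parts keep the metadata.
        head, tail = posixpath.split(str(self))
        return self._retag(head), self._retag(tail)


name = TaggedText("/home/user/report.txt", encoding="shift_jis", lang="ja")
head, tail = name.split_path()
```

Note that inherited str methods such as name.upper() return a plain str and drop the metadata, which is the problem the question above is pointing at: every transformation has to be wrapped explicitly.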
I'd vote for:
\* adding DirEntry.\_\_path\_\_ as a proxy to DirEntry.path
\* standardizing on \_\_path\_\_ (over .path),
  because this operation \*is\* fundamentally similar to e.g. \_\_str\_\_
\* operator.path (or pathify, pathifize)
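As a sketch only, the proposal above might look like this. DirEntryLike is a hypothetical stand-in for os.DirEntry, and pathify is one of the names floated above; neither is an existing API, and the \_\_path\_\_ method is the proposed spelling, not something the standard library defines for this purpose.

```python
class DirEntryLike:
    """Hypothetical stand-in for os.DirEntry, which exposes a .path attribute."""

    def __init__(self, path):
        self.path = path

    def __path__(self):
        # Proposed protocol method: proxy to the existing .path attribute.
        return self.path


def pathify(obj):
    """Hypothetical helper: return a str/bytes path, much as str() uses __str__."""
    if isinstance(obj, (str, bytes)):
        return obj
    # Look the method up on the type, as Python does for special methods.
    method = getattr(type(obj), "__path__", None)
    if method is None:
        raise TypeError("expected str, bytes, or an object with __path__, "
                        "not " + type(obj).__name__)
    result = method(obj)
    if not isinstance(result, (str, bytes)):
        raise TypeError("__path__ must return str or bytes")
    return result


entry = DirEntryLike("/tmp/example.txt")
path = pathify(entry)
```

With a helper like this, open and friends could call pathify() on their argument and accept str, bytes, and path-like objects uniformly, without each caller reaching for .path.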
\>
\> ChrisA