[Python-Dev] casefolding in pathlib (PEP 428)
Guido van Rossum guido at python.org
Fri Apr 12 00:42:00 CEST 2013
On Thu, Apr 11, 2013 at 2:27 PM, Antoine Pitrou <solipsis at pitrou.net> wrote:
> On Thu, 11 Apr 2013 14:11:21 -0700
> Guido van Rossum <guido at python.org> wrote:
>> Hey Antoine,
>>
>> Some of my Dropbox colleagues just drew my attention to the occurrence
>> of case folding in pathlib.py. Basically, case folding as an approach
>> to comparing pathnames is fatally flawed. The issues include:
>>
>> - most OSes these days allow the mounting of both case-sensitive and
>>   case-insensitive filesystems simultaneously
>>
>> - the case-folding algorithm on some filesystems is burned into the
>>   disk when the disk is formatted
>
> The problem is that:
> - if you always make the comparison case-sensitive, you'll get false
>   negatives
> - if you make the comparison case-insensitive under Windows, you'll get
>   false positives
>
> My assumption was that, globally, the number of false positives in case
> (2) is much less than the number of false negatives in case (1).
>
> On the other hand, one could argue that all comparisons should be
> case-sensitive and the proper way to test for "identical" paths is to
> access the filesystem. Which makes me think, perhaps concrete paths
> should get a "samefile" method as in os.path.samefile().
>
> Hmm, I think I'm tending towards the latter right now.
Python on OSX has been using (1) for a decade now without major problems.
Perhaps it would be best if the code never called lower() or upper() (not even indirectly via os.path.normcase()). Then any case-folding and path-normalization bugs are the responsibility of the application, and we won't have to worry about how to fix the stdlib without breaking backwards compatibility if we ever figure out how to fix this (which I somehow doubt we ever will anyway :-).
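To illustrate the difference between comparing spellings and asking the filesystem, here is a minimal sketch. The helper name `compare_spellings` and the file name are hypothetical, and the second spelling uses a redundant "." component rather than a case difference so that the result does not depend on whether the filesystem is case-sensitive.

```python
import os
import tempfile


def compare_spellings(path_a, path_b):
    """Return (equal_as_strings, equal_after_lower, same_file_on_disk).

    Hypothetical helper for illustration only; both paths must exist,
    because os.path.samefile() has to stat them.
    """
    return (
        path_a == path_b,                   # plain case-sensitive comparison
        path_a.lower() == path_b.lower(),   # naive case folding
        os.path.samefile(path_a, path_b),   # asks the filesystem
    )


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "Readme.txt")
    with open(target, "w") as f:
        f.write("hello\n")

    # A second spelling of the same file that no case folding reconciles.
    alias = os.path.join(tmp, ".", "Readme.txt")

    result = compare_spellings(target, alias)
    print(result)
```

Both string comparisons report a mismatch here, while the filesystem check reports the same file. The cost is that the filesystem check only works for paths that exist and can be accessed.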
Some other issues to be mindful of:
- On Linux, paths are really bytes; on Windows (at least NTFS), they
  are really (16-bit) Unicode; on Mac, they are UTF-8 in a specific
  normal form (except on some external filesystems).

- On Windows, short names are still supported, making the number of
  ways to spell the path for any given file even larger.
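As a rough illustration of the bytes and normal-form issue, the sketch below shows one visible file name that has two different Unicode spellings and therefore two different UTF-8 byte strings. The file name is hypothetical, and using NFD to stand in for the Mac normal form is an assumption for illustration. The last lines show that lower() and casefold() disagree, which is one reason "the" case-folding algorithm is not a single well-defined thing.

```python
import unicodedata

# Hypothetical file name; the accented letter starts out precomposed.
name = "caf\u00e9.txt"

nfc = unicodedata.normalize("NFC", name)  # composed: one code point for the accented e
nfd = unicodedata.normalize("NFD", name)  # decomposed: 'e' plus a combining accent

same_text = nfc == nfd            # the two spellings compare unequal as str
nfc_bytes = nfc.encode("utf-8")   # what a bytes-based filesystem would store
nfd_bytes = nfd.encode("utf-8")

# Case conversion is not unique either: lower() leaves the sharp s alone,
# casefold() expands it to "ss".
lower_equal = "STRASSE".lower() == "stra\u00dfe".lower()
casefold_equal = "STRASSE".casefold() == "stra\u00dfe".casefold()

print(same_text, nfc_bytes, nfd_bytes, lower_equal, casefold_equal)
```

A filesystem that treats names as raw bytes sees the two spellings as two distinct files, while one that normalizes names sees a single file.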
--
--Guido van Rossum (python.org/~guido)