[Python-Dev] Filename as byte string in python 2.6 or 3.0?
M.-A. Lemburg mal at egenix.com
Mon Sep 29 13:16:11 CEST 2008
On 2008-09-29 12:50, Ulrich Eckhardt wrote:
> On Sunday 28 September 2008, Gregory P. Smith wrote:
>> "broken" systems will always exist. Code to deal with them must be possible to write in python 3.0.
>>
>> since any given path (not just fs) can have its own encoding it makes the most sense to me to let the OS deal with the errors and not try to enforce bytes vs string encoding type at the python lib. level.
>
> Actually I'm afraid that that isn't really useful. I, too, would like to kick people's backs in order to get them to fix their systems or use the proper codepage while mounting etc, etc, but that is not going to happen soon. Just ignoring those broken systems is tempting, but alienating a large group of users isn't IMHO worth it.
>
> Instead, I'd like to present a different approach:
>
> 1. For POSIX platforms (using a byte string for the path): Here, the first approach is to convert the path to Unicode, according to the locale's CTYPE category. Hopefully, it will be UTF-8, but also codepages should work. If there is a segment (a byte sequence between two path separators) where it doesn't work, it uses an ASCII mapping where possible and codepoints from the "Private Use Area" (PUA) of Unicode for the non-decodable bytes. In order to pass this path to fopen(), each segment would be converted to a byte string again, using the locale's CTYPE category except for segments which use the PUA where it simply encodes the original bytes.
I'm not sure how this would work. How would you map the private use code points back to bytes? By using a special codec that knows about these code points? And how would fopen() know to use that special codec instead of, e.g., the UTF-8 codec?
BTW: Private use areas in Unicode are meant for, e.g., company-specific code points. Using them for escaping purposes is likely to cause problems due to assignment clashes.
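To make the round trip and the clash problem concrete, here is a minimal sketch of the per-segment escaping described above. The offset U+E000, the UTF-8 default and all function names are assumptions made for illustration; the proposal itself does not specify any of them.

```python
# Hypothetical mapping: an undecodable byte b becomes code point U+E000 + b.
PUA_BASE = 0xE000


def decode_segment(seg: bytes, encoding: str = "utf-8") -> str:
    """Decode one path segment, escaping the whole segment if it is invalid."""
    try:
        return seg.decode(encoding)
    except UnicodeDecodeError:
        # ASCII bytes map to themselves, everything else goes to the PUA.
        return "".join(chr(b) if b < 0x80 else chr(PUA_BASE + b) for b in seg)


def encode_segment(seg: str, encoding: str = "utf-8") -> bytes:
    """Encode one segment; segments containing escape code points get raw bytes."""
    if any(PUA_BASE <= ord(c) <= PUA_BASE + 0xFF for c in seg):
        # Raises ValueError if the segment also holds other non-ASCII text,
        # which is one way an assignment clash can surface.
        return bytes(
            ord(c) - PUA_BASE if PUA_BASE <= ord(c) <= PUA_BASE + 0xFF else ord(c)
            for c in seg
        )
    return seg.encode(encoding)


def decode_path(path: bytes, encoding: str = "utf-8") -> str:
    """Decode a POSIX path segment by segment."""
    return "/".join(decode_segment(s, encoding) for s in path.split(b"/"))


def encode_path(path: str, encoding: str = "utf-8") -> bytes:
    """Inverse of decode_path, as long as no clash occurs."""
    return b"/".join(encode_segment(s, encoding) for s in path.split("/"))
```

A path such as b"/home/caf\xe9/x.txt" (a Latin-1 segment on a UTF-8 system) survives decode_path followed by encode_path unchanged. A file whose name legitimately contains U+E0E9, stored as valid UTF-8, does not: it decodes cleanly, is then mistaken for an escaped segment and comes back as the single byte 0xE9. That is the kind of clash meant above.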
Regarding the subject of file names:
On Unix, it's quite possible to have to deal with two or three different file systems mounted on one machine. Each of those may use a different file name encoding, or not support file name encodings at all.
If the OS doesn't guarantee a consistent file name encoding, then why should Python try to emulate one on top of the OS?
I think it's more important to be able to open a file than to have a readable file name when printing it to stdout. For example, I wouldn't be able to tell whether some Chinese file name makes sense or not, but if I know that all files in a directory are meant for processing, I should be able to iterate over them regardless of whether they make sense or not.
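As a rough illustration of that priority, the sketch below processes every file in a directory without ever decoding a file name. It assumes that os.listdir() returns bytes names when given a bytes path and that open() accepts bytes paths, as current Python 3 does on POSIX; the function names are made up for the example.

```python
import os


def iter_files_bytes(directory: bytes):
    """Yield the full path (as bytes) of every regular file in directory."""
    for name in os.listdir(directory):  # bytes in, bytes out: nothing is decoded
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            yield path


def total_size(directory: bytes) -> int:
    """Open and read every file, whatever encoding its name happens to use."""
    total = 0
    for path in iter_files_bytes(directory):
        with open(path, "rb") as f:
            total += len(f.read())
    return total
```

Only code that wants to show a name to a human has to pick an encoding, and it can do so with a lossy error handler such as name.decode("utf-8", "replace") without affecting the ability to open the file.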
> 2. For win32 platforms, the path is already Unicode (UTF-16) and the whole problem is solved or not solved by the OS.
>
> In the end, both approaches yield a path represented by a Unicode string for intermediate use, which provides maximum flexibility. Further, it preserves "broken" encodings by simply mapping their byte-values to the PUA of Unicode.
>
> Maybe not using a string to represent a path would be a good idea, too. At least it would make it very clear that the string is not completely free-form.
-- 
Marc-Andre Lemburg
eGenix.com