[Python-Dev] Bytes path support
Chris Barker chris.barker at noaa.gov
Fri Aug 22 20:51:20 CEST 2014
On Fri, Aug 22, 2014 at 10:09 AM, Glenn Linderman <v+python at g.nevcal.com> wrote:
What encoding does a text file have (an HTML file, to be precise) with text in utf-8, ads in cp1251 (ad blocks were included from different files) and comments in koi8-r? Well, I must admit the HTML was rather an exception, but having a text file with some strange characters (binary strings, or paragraphs in different encodings) is not that exceptional.
That's not a text file. That's a binary file containing (hopefully delimited, and documented) sections of encoded text in different encodings.
Allow me to disagree. For me, this is a text file which I can (and do) view with a pager, edit with a text editor, list on a console, search with grep and so on. If it is not a text file by strict Python3 standards then these standards are too strict for me. Either I find a simple workaround in Python3 to work with such texts or find a different tool. I cannot avoid such files because my reality is much more complex than strict text/binary dichotomy in Python3.
First -- we're getting OT here -- this thread was about file and path names, not the contents of files. But I suppose I brought that in when I talked about writing file names to files...
The first I'll mention is the one that follows from my description of what your file really is: Python3 allows opening files in binary mode, and then decoding various sections of it using whatever encoding you like, using the bytes.decode() operation on various sections of the file. Determination of which sections are in which encodings is beyond the scope of this description of the technique, and is application dependent.
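A minimal sketch of that binary-mode technique, for illustration only. The layout is an assumption: three sections separated by a blank line, in a known order of encodings. The helper name decode_sections, the delimiter, and the sample text are all hypothetical, and io.BytesIO stands in for a real file opened with open(path, "rb").

```python
import io

# Assumed, application-specific knowledge: the encoding of each section,
# in file order. Nothing in the bytes themselves tells us this.
SECTION_ENCODINGS = ["utf-8", "cp1251", "koi8-r"]
DELIMITER = b"\n\n"  # hypothetical section separator


def decode_sections(raw, encodings, delimiter=DELIMITER):
    """Split raw bytes on the delimiter and decode each part separately."""
    parts = raw.split(delimiter)
    if len(parts) != len(encodings):
        raise ValueError("expected %d sections, found %d"
                         % (len(encodings), len(parts)))
    return [part.decode(enc) for part, enc in zip(parts, encodings)]


# Build a sample "mixed" file in memory; BytesIO behaves like a file
# object opened in binary mode.
sample = DELIMITER.join([
    "Привет".encode("utf-8"),
    "реклама".encode("cp1251"),
    "комментарий".encode("koi8-r"),
])
with io.BytesIO(sample) as f:
    raw = f.read()  # bytes, no decoding has happened yet

sections = decode_sections(raw, SECTION_ENCODINGS)
for enc, text in zip(SECTION_ENCODINGS, sections):
    print(enc, text)
```

The hard part, working out where the sections are and which encoding each one uses, is exactly what the sketch assumes away.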
right -- and you would have wanted to open such a file in binary mode with py2 as well, but in that case, you'd have the contents in a py2 string object, which has a few more convenient ways to work with text (at least ascii-compatible) than the py3 bytes object does.
The third is to specify UTF-8 with the surrogateescape error handler.
This allows non-UTF-8 codes to be loaded into memory. You, or algorithms as smart as you, could perhaps be developed to detect and manipulate the resulting "lone surrogate" codes in meaningful ways, or could simply allow them to ride along without interpretation, and be emitted as the original, into other files.
Just so I'm clear here -- if you write that back out, encoded as utf-8 -- you'll get the exact same binary blob out as came in?
I wonder if this would make it hard to preserve byte boundaries, though.
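A quick sketch of that round trip, using a made-up sample blob (a UTF-8 line followed by a cp1251 line). As far as I can tell, the bytes come back unchanged as long as the same error handler is used when encoding, and each undecodable byte becomes one lone surrogate in the range U+DC80..U+DCFF. The character count and the byte count no longer match, though, because the valid UTF-8 part decodes to fewer characters than bytes.

```python
# Hypothetical sample data: valid UTF-8, a newline, then cp1251 bytes
# that are not valid UTF-8.
blob = "текст".encode("utf-8") + b"\n" + "реклама".encode("cp1251")

# Decode: undecodable bytes are smuggled through as lone surrogates.
text = blob.decode("utf-8", errors="surrogateescape")

# Encode with the same handler: lone surrogates turn back into the
# original bytes.
round_tripped = text.encode("utf-8", errors="surrogateescape")

# The escaped bytes are the code points U+DC80..U+DCFF.
lone = [ch for ch in text if "\udc80" <= ch <= "\udcff"]

# Without the handler, encoding the lone surrogates is an error.
try:
    text.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print("round trip ok:", round_tripped == blob)
print("chars:", len(text), "bytes:", len(blob), "escaped:", len(lone))
print("strict encode fails:", strict_failed)
```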
By the way, IIUC, you can also use Python's latin-1 decoder. Every byte value maps to exactly one code point in latin-1, so decoding never fails: text that really is latin-1 comes through correctly, anything else comes in as garbage, but if you re-encode with latin-1 the original bytes are preserved. I think this also preserves a 1:1 relationship between character count and byte count, which could be handy.
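A small sketch of the latin-1 trick, again with a made-up sample blob. It checks the two properties mentioned above: the bytes survive a decode/encode round trip, and character index i in the decoded string corresponds to byte offset i in the original data.

```python
# Hypothetical sample data: UTF-8 text, a newline, then cp1251 text.
blob = "текст".encode("utf-8") + b"\n" + "реклама".encode("cp1251")

# latin-1 maps byte N to code point N, so this cannot raise.
text = blob.decode("latin-1")

same_length = len(text) == len(blob)            # 1:1 character/byte count
same_bytes = text.encode("latin-1") == blob     # lossless round trip
offsets_match = all(ord(ch) == b for ch, b in zip(text, blob))

# The same holds for every possible byte value, not just this sample.
all_bytes = bytes(range(256))
full_range_ok = all_bytes.decode("latin-1").encode("latin-1") == all_bytes

print(same_length, same_bytes, offsets_match, full_range_ok)
```

The non-latin-1 parts of the decoded string are mojibake, so this is only useful for carrying the data around or for working on the ASCII-compatible parts.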
-Chris
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception
Chris.Barker at noaa.gov