[Python-Dev] Bytes path support
Glenn Linderman v+python at g.nevcal.com
Fri Aug 22 22:17:44 CEST 2014
- Previous message: [Python-Dev] Bytes path support
- Next message: [Python-Dev] Bytes path support
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On 8/22/2014 11:50 AM, Oleg Broytman wrote:
> On Fri, Aug 22, 2014 at 10:09:21AM -0700, Glenn Linderman <v+python at g.nevcal.com> wrote:
>> On 8/22/2014 9:52 AM, Oleg Broytman wrote:
>>> On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman <v+python at g.nevcal.com> wrote:
>>>> On 8/22/2014 8:51 AM, Oleg Broytman wrote:
>>>>> What encoding does a text file (an HTML file, to be precise) have with text in utf-8, ads in cp1251 (ad blocks were included from different files) and comments in koi8-r? Well, I must admit the HTML was rather an exception, but having a text file with some strange characters (binary strings, or paragraphs in different encodings) is not that exceptional.
>>>> That's not a text file. That's a binary file containing (hopefully delimited, and documented) sections of encoded text in different encodings.
>>> Allow me to disagree. For me, this is a text file which I can (and do) view with a pager, edit with a text editor, list on a console, search with grep, and so on. If it is not a text file by strict Python3 standards, then these standards are too strict for me. Either I find a simple workaround in Python3 to work with such texts or I find a different tool. I cannot avoid such files because my reality is much more complex than the strict text/binary dichotomy in Python3.
>> I was not declaring your file not to be a "text file" by any definition obtained from the Python3 documentation, just by a common-sense definition of "text file".
> And in my opinion those files are perfect text. The files consist of lines separated by EOL characters (not necessarily the EOL characters of my OS, because it could be a text file produced on a different OS), lines consist of words, and words of characters.
Until you know or can deduce the encoding of a file, it is binary. If it has multiple, different, embedded encodings of text, it is still binary. In my opinion. So these are just opinions, and naming conventions. If you call it text, you have a different definition of text file than I do.
>> Looking at it from Python3, though, it is clear that when opening a file in "text" mode, an encoding may be specified or will be assumed. That is one encoding, applying to the whole file, not 3 encodings, with declarations on when to switch between them. So I think, in general, Python3 assumes or defines a definition of text file that matches my "common sense" definition.
> I don't have problems with Python3 text. I have problems with Python3 trying to get rid of byte strings and treating bytes as strict non-text.
Python3 is not trying to get rid of byte strings. But to some extent, it does want to treat bytes as non-text: bytes can be encoded text, but they are not text until they are decoded. Some processing can be done on encoded text, but in many cases it has to be done differently than processing done on (non-encoded) text.
One difference is that which character a given code represents varies from encoding to encoding, so if the processing requires understanding the characters, then the encoding must be known.
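The point that the same byte means different things under different encodings can be seen directly in Python (the byte value below is arbitrary, chosen just for illustration):

```python
# One byte, three different characters depending on the assumed encoding:
raw = b"\xe1"

print(raw.decode("cp1251"))   # Cyrillic small letter be: "б"
print(raw.decode("koi8-r"))   # Cyrillic capital letter a: "А"
print(raw.decode("latin-1"))  # Latin small letter a with acute: "á"
```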
On the other hand, if it suffices to detect blocks of opaque text delimited by a known set of delimiter codes (EOL: CR, LF, or combinations thereof), then that can be done relatively easily on binary data, as long as the encoding doesn't have data puns, where a multibyte encoded character might contain the code for the delimiter as one of the bytes of the character's encoding.
>> On the other hand, Python3 provides various facilities for working with such files.
>> The first I'll mention is the one that follows from my description of what your file really is: Python3 allows opening files in binary mode, and then decoding various sections of the file using whatever encoding you like, via the bytes.decode() operation. Determining which sections are in which encodings is beyond the scope of this description of the technique, and is application dependent.
> This is perhaps the most promising approach. If I can open a text file in binary mode, iterate it line by line, split every line of non-ascii bytes with .split() and process them, that'd satisfy my needs. But still there are dragons. If I read a filename from such a file, I read it as bytes, not str, so I can only use low-level APIs to manipulate those filenames. Pity.
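The binary-mode technique can be sketched as follows; the file name, contents, and the rule deciding which line is in which encoding are all invented for the example:

```python
# Build a file whose two lines are in different encodings, then read it
# back in binary mode and decode each section separately.
with open("mixed.txt", "wb") as f:           # hypothetical file name
    f.write("текст в utf-8\n".encode("utf-8"))
    f.write("реклама в cp1251\n".encode("cp1251"))

with open("mixed.txt", "rb") as f:
    raw_lines = f.read().split(b"\n")

# Application-dependent knowledge: line 0 is utf-8, line 1 is cp1251.
first = raw_lines[0].decode("utf-8")
second = raw_lines[1].decode("cp1251")
assert (first, second) == ("текст в utf-8", "реклама в cp1251")
```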
If the file names are in an unknown encoding, both in the directory and in the encoded text in the file listing, then unless you can deduce the encoding, you would be limited to doing manipulations with file APIs that support bytes, the low-level ones, yes. If you can deduce the encoding, then you are freed from that limitation.
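The bytes-accepting APIs mentioned here live in the os module: passing a bytes path makes listing calls return bytes, so no encoding is ever assumed. The directory and cp1251 filename below are fabricated for the sketch:

```python
import os

os.makedirs(b"demo_dir", exist_ok=True)      # hypothetical directory
name = b"\xf4\xe0\xe9\xeb.mp3"               # "файл.mp3" in cp1251 bytes
open(os.path.join(b"demo_dir", name), "wb").close()

# A bytes argument to os.listdir() yields bytes names, undecoded:
entries = os.listdir(b"demo_dir")
assert name in entries
assert all(isinstance(e, bytes) for e in entries)
```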
> Let's see a perfectly normal situation I am quite often in. A person sent me a directory full of MP3 files. The transport doesn't matter; it could be FTP, or rsync, or a zip file sent by email, or bittorrent. What matters is that the filenames and content are in alien encodings. Most often it's cp1251 (the encoding used in Russian Windows), but it can be koi8 or utf8. There is a playlist among the files -- a text file that lists the MP3 files, one file per line, usually with full paths ("C:\Audio\some.mp3"). Now I want to read filenames from the file and process the filenames (strip paths) and files (verify the existence of the files, or renumber the files, or extract ID3 tags [Russian ID3 tags, whatever the ID3 standard says, are also in cp1251 of utf-8 encoding]...whatever).
"cp1251 of utf-8 encoding" is nonsensical. Either it is cp1251 or it is utf-8, but it is not both. Maybe you meant "or" instead of "of".
> I don't know the encoding of the playlist, but I know it corresponds to the encoding of the filenames, so I can expect those files to exist on my filesystem; they have strange-looking unreadable names, but they exist. Just a small example of why I do want to process filenames from a text file in an alien encoding, without knowing the encoding in advance.
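The playlist scenario above can be handled without ever decoding: read the playlist in binary mode, strip the Windows-style path from each entry, and test whether a file with exactly those bytes exists. Every file and directory name here is invented for the sketch.

```python
import ntpath   # playlist entries are Windows paths like C:\Audio\some.mp3
import os

def existing_tracks(playlist, directory=b"."):
    """Return the playlist entries (as raw bytes) that exist in directory."""
    found = []
    with open(playlist, "rb") as f:
        for line in f:
            # ntpath handles the backslashes; bytes in, bytes out.
            name = ntpath.basename(line.strip())
            if name and os.path.exists(os.path.join(directory, name)):
                found.append(name)
    return found

# Demo data: one track that exists (its name is "песня.mp3" in cp1251
# bytes), and one that doesn't.
os.makedirs(b"mp3s", exist_ok=True)
open(os.path.join(b"mp3s", b"\xef\xe5\xf1\xed\xff.mp3"), "wb").close()
with open("playlist.m3u", "wb") as f:
    f.write(b"C:\\Audio\\\xef\xe5\xf1\xed\xff.mp3\r\nC:\\Audio\\missing.mp3\r\n")

assert existing_tracks("playlist.m3u", b"mp3s") == [b"\xef\xe5\xf1\xed\xff.mp3"]
```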
An interesting example, for sure. Life will be easier when everyone converts to Unicode and UTF-8.
>> The second is to specify an error handler that, like you, is trained to recognize the other encodings and convert them appropriately. I'm not aware that such an error handler has been or could be written, myself not having your training.
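Such a handler can in fact be registered with codecs.register_error(); whether its guesses are right is another matter. Here the "training" is simply "assume cp1251 on failure", which is this sketch's assumption, not a general rule:

```python
import codecs

def cp1251_fallback(err):
    # Re-decode the bytes that UTF-8 rejected as cp1251 (the assumed
    # fallback), and resume decoding after them.
    if isinstance(err, UnicodeDecodeError):
        bad = err.object[err.start:err.end]
        return bad.decode("cp1251"), err.end
    raise err

codecs.register_error("cp1251-fallback", cp1251_fallback)

mixed = "utf-8 text, ".encode("utf-8") + "потом cp1251".encode("cp1251")
repaired = mixed.decode("utf-8", errors="cp1251-fallback")
assert repaired == "utf-8 text, потом cp1251"
```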
>> The third is to specify UTF-8 with the surrogateescape error handler. This allows non-UTF-8 bytes to be loaded into memory. You, or algorithms as smart as you, could perhaps detect and manipulate the resulting "lone surrogate" codes in meaningful ways, or could simply allow them to ride along without interpretation and be emitted as the original bytes into other files.
> Yes, these are different workarounds. Oleg.
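The surrogateescape round-trip looks like this; the byte string is an invented cp1251 fragment:

```python
raw = b"title: \xf1\xf2\xf0\xee\xea\xe0"   # trailing bytes are "строка" in cp1251

# Undecodable bytes are smuggled through str as lone surrogates
# U+DC80..U+DCFF instead of raising UnicodeDecodeError...
text = raw.decode("utf-8", errors="surrogateescape")
assert text.startswith("title: ")
assert "\udcf1" in text          # the 0xf1 byte rides along as U+DCF1

# ...and encoding with the same handler restores the bytes exactly.
assert text.encode("utf-8", errors="surrogateescape") == raw
```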