
This has all gotten a bit complicated because everyone has been thinking in terms of actual encodings and actual text files. But I think the use-case here is something different:

A file with a bunch of bytes in it, _some_ of which are ascii, and the rest are other bytes (maybe binary data, maybe non-ascii-encoded text).

I think this is the use-case that "just worked" in py2, but doesn't in py3 -- i.e. in py3 you have to choose either the binary interpretation or the ascii one, but you can't have both. If you choose ascii, it will barf when you try to decode the non-ascii bytes; if you choose binary, you lose the ability to do simple stuff with the ascii subset -- parsing, substitution, etc.

Some folks have suggested using latin-1 (or other 8-bit encoding) -- is that guaranteed to work with any binary data, and round-trip accurately?

And will surrogateescape work for arbitrary binary data?
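Both questions are easy to check empirically. Here is a minimal sketch (variable names are mine, and it only tests the codecs on the interpreter you run it on): push every possible byte value through each codec and back, and see whether the same bytes come out.

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# latin-1 maps each byte to the code point with the same number.
as_latin1 = all_bytes.decode("latin-1")
latin1_ok = as_latin1.encode("latin-1") == all_bytes

# ascii + surrogateescape: bytes >= 0x80 become lone surrogates
# U+DC80..U+DCFF on decode and are restored to the same bytes on encode.
as_escaped = all_bytes.decode("ascii", errors="surrogateescape")
escaped_ok = as_escaped.encode("ascii", errors="surrogateescape") == all_bytes

print(latin1_ok, escaped_ok)
```

One practical difference: the surrogateescape string contains lone surrogates, so encoding it without the same error handler (writing it to a UTF-8 stream, for example) raises an error, while the latin-1 string is ordinary text that may simply display as garbage.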

If this is a common need, then it would be nice for py3 to address it. I know that I work with a couple of file formats that have text headers followed by binary data (not as hard to deal with, but still harder in py3). And from this discussion, it seems that "wire protocols" commonly mix ascii and binary.
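For the header-plus-binary case, a rough sketch of what this looks like in py3 today. The format here is made up for illustration (ascii "key: value" lines, a blank line, then little-endian unsigned 16-bit integers), and split_header is a hypothetical helper, not anything from the standard library:

```python
import struct


def split_header(blob, separator=b"\n\n"):
    """Split blob into (header_text, payload_bytes) at the first separator."""
    header, sep, payload = blob.partition(separator)
    if not sep:
        raise ValueError("no header/payload separator found")
    # Only the header is decoded; the payload is never treated as text.
    return header.decode("ascii"), payload


# Hypothetical file contents: two ascii header lines, then three uint16 values.
blob = b"name: demo\ncount: 3\n\n" + struct.pack("<3H", 1, 2, 65535)

header, payload = split_header(blob)
fields = dict(line.split(": ", 1) for line in header.splitlines())
values = struct.unpack("<%dH" % int(fields["count"]), payload)
print(fields, values)
```

This works because the boundary between text and binary is known up front. It gets harder when ascii and binary are interleaved throughout the data.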

So the decisions to be made:

Is this a use-case worth supporting in the standard library?

If so, how?

1) add some of the basic stuff to the bytes object - i.e. string formatting, what this all started with.

2) create a custom encoding that could losslessly convert this mixture to/from a unicode object. I'm not sure if that is even possible, but it would be kind of cool.

3) create a new object, neither a string nor a bytes object that did what we want (it would look a lot like the py2 string...)

4) create a module for doing the stuff wanted with a bytes object (not very OO)

Does that clarify the discussion at all?

On Thu, Jan 9, 2014 at 2:15 AM, Kristján Valur Jónsson <kristjan@ccpgames.com> wrote:

This is the python 2 program:

    with open(fn1) as f1:
        with open(fn2, 'w') as f2:
            f2.write(process_text(f1.read()))

I think the key point here is that this worked because a common case was ascii text mixed with arbitrary binary. As long as all the process_text() stuff is ascii only, that would work, either with arbitrary binary data or with an ascii-compatible encoding. The fact that it would NOT work with arbitrarily encoded data doesn't mean it's not useful for this special, but perhaps common, case.
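For comparison, here is one way the same program might be spelled in py3 using the latin-1 suggestion. This is only a sketch: process_text is a hypothetical stand-in that touches nothing but ascii characters, and copy_processed is a name I made up to wrap the two with statements.

```python
import os
import tempfile


def process_text(text):
    # Hypothetical stand-in: only ever touches ascii characters.
    return text.replace("HELLO", "hello")


def copy_processed(fn1, fn2):
    # latin-1 decodes any byte; newline="" turns off newline translation
    # so "\r\n" and lone "\r" bytes pass through unchanged.
    with open(fn1, encoding="latin-1", newline="") as f1:
        with open(fn2, "w", encoding="latin-1", newline="") as f2:
            f2.write(process_text(f1.read()))


# Try it on a file mixing ascii text with arbitrary bytes.
with tempfile.TemporaryDirectory() as tmp:
    fn1 = os.path.join(tmp, "in.dat")
    fn2 = os.path.join(tmp, "out.dat")
    original = b"HELLO \xff\xfe\r\n\x00 world"
    with open(fn1, "wb") as f:
        f.write(original)
    copy_processed(fn1, fn2)
    with open(fn2, "rb") as f:
        result = f.read()

print(result)
```

The explicit encoding and newline arguments are the extra knowledge py3 asks for. Leave either one out and the result depends on the platform default encoding and newline handling.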

--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax

Seattle, WA 98115 (206) 526-6317 main reception

Chris.Barker@noaa.gov