[Python-Dev] PEP 460: allowing %d and %f and mojibake

Glenn Linderman v+python at g.nevcal.com
Tue Jan 14 07:01:35 CET 2014


On 1/13/2014 8:58 PM, Stephen J. Turnbull wrote:

Glenn Linderman writes:
> On 1/13/2014 6:43 AM, Stephen J. Turnbull wrote:
>> Glenn Linderman writes:

>>> "smuggled binary" (great term borrowed from a different
>>> subthread) muddies the waters of what you are dealing with.
>> Not really. The "mud" is one or more of the serious deficiencies.
>> It can be removed, I believe (and Nick apparently does, too).
>> "asciistr" is one way to try that.
> Yes really. Use of smuggled binary means the str containing it can
> no longer be treated completely as a str. That is "muddier" than
> having a str that is only a str.

You don't seem to understand what asciistr is: it's a *different type* that is simultaneously compatible in operation with bytes and str, by automatically converting to whichever it is used with. If we used asciistr, str would no longer be muddy (except in cases where we would have used surrogateescape anyway).

No, I haven't fully understood what asciistr is, only Nick's several descriptions of it.

I do understand it is a different type, and can interact with both bytes and str.

If it automatically converts, then it sounds terribly inefficient with long data. I didn't hear Nick say that, but maybe I missed it.

You mentioned asciistr in the snippet above, but most of what you have been writing about smuggled binary was using str... I hadn't grokked that you were now a full-fledged proponent of asciistr, and were now proposing to put your smuggled binary into asciistr.

You also don't seem to understand that bytes are conceptually pure mud. Anything that is pushed to bytes because you don't know what type it is (or because at the time the program is written, the type can't be known) is no longer subject to duck-typing.

If you are talking str, then bytes are mud. If you are talking bytes, then str is mud.

I wouldn't think of "pushing something to bytes" (whatever that means) because I don't know what it is... I may manipulate bytes because I know what they are, and that is the most appropriate form for that piece of data for the present manipulations; if something is text, I want to transform the bytes to str if I need to manipulate it, parse it, or present it. If I don't know what something is, it is because it didn't meet my expectations of what it should be, and I want to present an error, which may include some representation (probably hex) of some of the bytes that cannot be understood.
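As a rough sketch of that last case (the function name `text_or_error` and the choice of UTF-8 as the expected encoding are assumptions for illustration, not anything proposed in this thread), a strict decode can report the offending bytes in hex:

```python
def text_or_error(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw as text, or raise an error that shows the bad bytes in hex."""
    try:
        # errors='strict' is the default, so anything unexpected raises.
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start and exc.end delimit the bytes the codec could not decode.
        bad = raw[exc.start:exc.end]
        raise ValueError(
            f"expected {encoding} text; "
            f"undecodable bytes {bad.hex()} at offset {exc.start}"
        ) from exc
```

Bytes that meet the expectation come back as str; anything else stops at the boundary with a message that names the bytes involved.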

But if I'm "pushing to bytes", which I would interpret as creating a byte stream, then I know what I have, and I need to convert it to bytes either to store it in a file, or communicate it to another process. That's far from not knowing what it is.

So the question is "how is mud best handled?" Obviously, incorporating it in str with .decode('latin1') is inappropriate.

Glad to hear you say that; I thought that was what you were promoting, when you said, in an earlier message:

On 1/12/2014 4:08 PM, Stephen J. Turnbull wrote:

Glenn Linderman writes:

> the proposals to embed binary in Unicode by abusing Latin-1
> encoding.

Those aren't "proposals", they are currently feasible techniques in Python 3 for some use cases.

Back to this one, though.

However, if you use .decode('ascii') you have your choice of error handlers. If you use errors='strict' then no mud can get in. Use of any other error handler is obviously a "consenting adults" behavior; it should only be done when you expect that you can keep the muddy str from leaking into places where it might be passed to an I/O function. (Note that the internal processing of an application that never outputs such a str is completely conformant to the Unicode Standard. That's not a goal of Python, since surrogateescape is designed to be used on output too. But if the developer applies that standard to each program component, he's going to be in pretty good shape.)
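For concreteness, here is a small sketch of the difference between the two error handlers described above. The sample bytes and variable names are made up for illustration; the codec behavior is standard Python 3.

```python
data = b"Subject: caf\xe9"  # ASCII keyword followed by one non-ASCII byte

# errors='strict' (the default): no mud can get in, decoding simply fails.
try:
    data.decode("ascii", errors="strict")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors='surrogateescape': byte 0xE9 is smuggled in as lone surrogate U+DCE9.
muddy = data.decode("ascii", errors="surrogateescape")
smuggled = muddy[-1]

# The same handler on output restores the original bytes exactly.
round_trip = muddy.encode("ascii", errors="surrogateescape")

# If the muddy str leaks to an ordinary strict encode, it is rejected.
try:
    muddy.encode("utf-8")
    leaked = True
except UnicodeEncodeError:
    leaked = False
```

The last step is the "leak" case: a str carrying an escaped byte cannot be written out by code that does not know to use the same handler.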

If you use asciistr, then you're pretty much in complete control. The exception is operations that munge individual characters (case conversion). If you have a protocol with ASCII keywords but their case is specified, you'll need to define another type to remove the case-munging methods if you want that level of safety.

The above doesn't sound like a use case I care about, much. If I get a garbled file without an accurate definition of what it contains, then I probably want to stick it in the trash. The only "processing" that can be done is to pass on the garbage to someone else, and stink up their system, and that can be done purely as bytes.

If, as in your proposal, bytes are tagged with descriptions, you are effectively creating types on the fly. But if the program doesn't anticipate that, they're mud.

Interpreting a file format or wire protocol requires parsing and manipulating an incoming byte stream, and converting it to useful types in the program... if it can't be converted to useful types, then why bother parsing it? So the rest of my discussion was not talking about creating types on the fly, but on a systematic way of converting a well-specified byte stream (file format, or wire protocol) to a collection of useful types, in an organized manner, that might be verifiable, rather than with ad-hoc coding. And similarly in reverse... after manipulating the objects to perform useful transformations, possibly based on user input (that's what a program does), then to write them back out to a byte stream in modified form, in an organized manner, that might be verifiable, rather than with ad-hoc coding.

If the program doesn't anticipate all of them, those descriptions that are unhandled become mud, too. ISTM that the "syntax descriptor" feature is already present in Python, and it's called "class". So, IMHO, simply converting to an appropriate Python type on input is what should be done, but in any case, I don't see how adding a "syntax descriptor" attribute to bytes is going to improve the situation significantly.

Syntax descriptors would be a description of the substructures of a file format (think TIFF files) or wire protocol, and might allow parsing of binary files similarly to the way computer languages are parsed, producing errors when encountering mud. What you dismiss as "converting to an appropriate Python type on input" can be quite complex for complex file formats, but it is the process of converting to such a hierarchy of Python objects that was to be described by the syntax descriptors.
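To make the idea a little more concrete, here is a minimal sketch using the standard `struct` module and only the 8-byte TIFF header (a byte-order mark, the number 42, and the offset of the first image file directory). The descriptor layout and the names `TIFF_HEADER`, `parse` and `parse_tiff_header` are hypothetical, invented for this illustration; a real syntax-descriptor facility would need to handle nesting, variable-length fields and much more.

```python
import struct
from collections import namedtuple

# Hypothetical "syntax descriptor": field names paired with struct format codes.
TIFF_HEADER = (
    ("magic", "H"),       # always 42 in a TIFF file
    ("ifd_offset", "I"),  # byte offset of the first image file directory
)

def parse(descriptor, name, data, byte_order):
    """Unpack the start of data, as described by descriptor, into a namedtuple."""
    fmt = byte_order + "".join(code for _, code in descriptor)
    size = struct.calcsize(fmt)
    if len(data) < size:
        raise ValueError(f"{name}: need {size} bytes, got {len(data)}")
    record = namedtuple(name, [field for field, _ in descriptor])
    return record(*struct.unpack(fmt, data[:size]))

def parse_tiff_header(data):
    """Parse the TIFF header, raising ValueError on anything unexpected (mud)."""
    orders = {b"II": "<", b"MM": ">"}  # little-endian / big-endian marks
    try:
        byte_order = orders[data[:2]]
    except KeyError:
        raise ValueError(f"not a TIFF byte-order mark: {data[:2].hex()}") from None
    header = parse(TIFF_HEADER, "TiffHeader", data[2:], byte_order)
    if header.magic != 42:
        raise ValueError(f"bad TIFF magic number: {header.magic}")
    return header
```

The generic `parse` function knows nothing about TIFF; only the descriptor does. That separation is the point of the sketch: the per-format code shrinks to a table plus a few validity checks.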

Note that such a class can postpone parsing for efficiency or lack of information reasons, and store the object as bytes until needed. But this is not the same as passing around naked bytes, because the class can ensure that bytes can't get out, only parsed objects.
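A minimal sketch of such a class might look like the following. The name `LazyText` and the default of UTF-8 are assumptions for illustration only, and in Python the leading underscore is a convention rather than real enforcement.

```python
class LazyText:
    """Hypothetical wrapper: holds raw bytes privately, hands out only parsed str."""

    def __init__(self, raw: bytes, encoding: str = "utf-8"):
        self._raw = raw            # kept as bytes until someone needs the text
        self._encoding = encoding
        self._parsed = None

    @property
    def value(self) -> str:
        # Parsing is postponed to first access and uses strict decoding,
        # so undecodable input raises here instead of escaping as bytes.
        if self._parsed is None:
            self._parsed = self._raw.decode(self._encoding)
        return self._parsed
```

Constructing the object is cheap and never fails; the cost and the possible error both move to the first use of `value`.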

Sure, it could. My proposal is suggesting that the distribution of bytes to objects in a hierarchy might be automated in the sense of parsing the binary format, so that instead of writing "a class" for the whole, that class would be pre-written, based on the syntax description of the file, and matching that with the syntax descriptions of the component types. It is really a topic for python-ideas, to flesh it out further, but it seemed related as a use case: a class that would live on the bytes processing boundary, producing other objects, some of which may be text strings, in an organized, probably hierarchical, collection of objects.
