[Python-Dev] PEP 460: allowing %d and %f and mojibake

Stephen J. Turnbull stephen at xemacs.org
Tue Jan 14 05:58:56 CET 2014


Glenn Linderman writes:

> On 1/13/2014 6:43 AM, Stephen J. Turnbull wrote:
> > Glenn Linderman writes:
> > > "smuggled binary" (great term borrowed from a different subthread) muddies the waters of what you are dealing with.
> >
> > Not really. The "mud" is one or more of the serious deficiencies. It can be removed, I believe (and Nick apparently does, too). "asciistr" is one way to try that.
>
> Yes really. Use of smuggled binary means the str containing it can no longer be treated completely as a str. That is "muddier" than having a str that is only a str.

You don't seem to understand what asciistr is: it's a different type that is simultaneously compatible in operation with bytes and str, by automatically converting to whichever it is used with. If we used asciistr, str would no longer be muddy (except in cases where we would have used surrogateescape anyway).
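To make the "compatible with both bytes and str" idea concrete, here is a minimal sketch. It is not the real asciistr implementation: the class name `AsciiStr` and its behaviour are assumptions made for illustration, and only concatenation is covered.

```python
class AsciiStr(str):
    """Hypothetical sketch: ASCII-only text that joins with either str or bytes."""

    def __new__(cls, value=""):
        if isinstance(value, bytes):
            value = value.decode("ascii")   # strict: non-ASCII bytes are rejected
        value.encode("ascii")               # strict: non-ASCII text is rejected
        return super().__new__(cls, value)

    def __add__(self, other):
        if isinstance(other, bytes):
            # Used with bytes: behave as bytes.
            return self.encode("ascii") + other
        if isinstance(other, AsciiStr):
            return AsciiStr(str.__add__(self, other))
        # Used with str: behave as str.
        return str.__add__(self, other)

    def __radd__(self, other):
        if isinstance(other, bytes):
            return other + self.encode("ascii")
        if isinstance(other, str):
            return other + str(self)
        return NotImplemented


request_line = AsciiStr("GET ") + b"/index.html"   # bytes result
greeting = AsciiStr("Hello, ") + "wörld"           # str result
```

Because construction is strict, nothing non-ASCII can enter through this type, so the result of mixing it with str is ordinary text and the result of mixing it with bytes is ordinary bytes.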

You also don't seem to understand that bytes are conceptually pure mud. Anything that is pushed to bytes because you don't know what type it is (or because at the time the program is written, the type can't be known) is no longer subject to duck-typing.

So the question is "how is mud best handled?" Obviously, incorporating it in str with .decode('latin1') is inappropriate. However, if you use .decode('ascii') you have your choice of error handlers. If you use errors='strict' then no mud can get in. Use of any other error handler is obviously a "consenting adults" behavior; it should only be done when you expect that you can keep the muddy str from leaking into places where it might be passed to an I/O function. (Note that the internal processing of an application that never outputs such a str is completely conformant to the Unicode Standard. That's not a goal of Python, since surrogateescape is designed to be used on output too. But if the developer applies that standard to each program component, he's going to be in pretty good shape.)
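The difference between the error handlers can be seen in a few lines. This is only an illustration using the standard `bytes.decode` and `str.encode` methods; the sample data is made up.

```python
raw = b"caf\xe9 au lait"  # Latin-1 bytes; 0xE9 is not ASCII

# errors='strict' (the default): no mud can get in.
try:
    raw.decode("ascii", errors="strict")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors='surrogateescape': the stray byte is smuggled in as a lone surrogate.
muddy = raw.decode("ascii", errors="surrogateescape")

# The same handler on output restores the original bytes exactly.
restored = muddy.encode("ascii", errors="surrogateescape")

# If the muddy str leaks to an encoder that does not expect it, it fails there.
try:
    muddy.encode("utf-8")
    leaked = True
except UnicodeEncodeError:
    leaked = False
```

The last step is the "leak" case: the str looks ordinary until it reaches an output path that was not written with surrogateescape in mind.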

If you use asciistr, then you're pretty much in complete control. The exception is operations that munge individual characters (case conversion). If you have a protocol with ASCII keywords but their case is specified, you'll need to define another type to remove the case-munging methods if you want that level of safety.
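As a rough sketch of "another type" without the case-munging methods, one could subclass and disable them. The class name `CaseFixedKeyword` is hypothetical, and it subclasses plain str here only to keep the example self-contained.

```python
class CaseFixedKeyword(str):
    """Hypothetical keyword type whose case-conversion methods are disabled."""

    def _refuse(self, *args, **kwargs):
        raise TypeError("case conversion is disabled for protocol keywords")

    # Override every str method that rewrites the case of characters.
    upper = lower = capitalize = title = swapcase = casefold = _refuse


keyword = CaseFixedKeyword("Content-Type")
try:
    keyword.upper()
    munged = True
except TypeError:
    munged = False
```

This is a "consenting adults" guard rather than a sandbox: calling `str.upper(keyword)` directly still works, but accidental case conversion through the normal method call is caught.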

If, as in your proposal, bytes are tagged with descriptions, you are effectively creating types on the fly. But if the program doesn't anticipate that, they're mud. If the program doesn't anticipate all of them, the descriptions that go unhandled become mud, too. It seems to me that the "syntax descriptor" feature is already present in Python, and it's called "class". So, IMHO, simply converting to an appropriate Python type on input is what should be done, but in any case, I don't see how adding a "syntax descriptor" attribute to bytes is going to improve the situation significantly.

Note that such a class can postpone parsing, either for efficiency or because the information needed to parse is not yet available, and store the object as bytes until needed. But this is not the same as passing around naked bytes, because the class can ensure that bytes can't get out, only parsed objects.
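A minimal sketch of such a class, assuming a field whose wire form is an ASCII decimal integer. The name `LazyContentLength` and the choice of field are hypothetical.

```python
class LazyContentLength:
    """Hypothetical wrapper: holds the wire bytes privately, exposes only an int."""

    def __init__(self, raw: bytes):
        self._raw = raw          # kept as received; parsing is postponed
        self._parsed = None

    @property
    def value(self) -> int:
        # Parse on first access; malformed input raises here, not downstream.
        if self._parsed is None:
            self._parsed = int(self._raw.decode("ascii"))
        return self._parsed


length = LazyContentLength(b"42")
total = length.value + 1   # callers only ever see the parsed int
```

The raw bytes sit behind a private attribute and there is no public accessor for them, so by convention the only thing that leaves the object is a parsed value.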
