On 1/12/2014 4:08 PM, Stephen J. Turnbull wrote:
> Glenn Linderman writes:
>  > the proposals to embed binary in Unicode by abusing Latin-1
>  > encoding.
>
> Those aren't "proposals", they are currently feasible techniques in
> Python 3 for *some* use cases.
>
> The question is why infecting Python 3 with the byte/character
> confoundance virus is preferable to such techniques, especially if
> their (serious!) deficiencies are removed by creating a new type such
> as asciistr.
"smuggled binary" (great term borrowed from a different subthread)
muddies the waters of what you are dealing with. As long as the
actual data is only Latin-1 and smuggled binary, the technique
probably isn't too bad... you can define the "smuggled binary"
as a "decoding" of binary to text, sort of like base64 "decodes"
binary to ASCII. And it can be a useful technique.
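For instance (a minimal interpreter sketch; the byte values are
arbitrary), the Latin-1 codec maps every byte to a code point and back,
so the smuggled form round-trips losslessly:

    >>> raw = b"\x00\x7f\xfe\xff"
    >>> smuggled = raw.decode("latin-1")   # every byte maps to one code point
    >>> smuggled.encode("latin-1") == raw  # and maps back unchanged
    True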
As soon as you introduce "smuggled non-ASCII, non-Latin-1 text"
encodings into the mix, it gets thoroughly confusing... just as
confusing as the Python 2 text model. It takes an encode+decode just
to smuggle the text, plus another encode to push it to the boundary,
and you end up with text that you know is text but, because of the
required smuggling techniques, you can't operate on it or view it
properly as the text that it should be.
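To illustrate (a minimal sketch; the word and the choice of UTF-8 as
the eventual wire encoding are just examples):

    >>> text = "café"
    >>> smuggled = text.encode("utf-8").decode("latin-1")  # encode + decode just to smuggle
    >>> smuggled                    # no longer readable or operable as the text it is
    'cafÃ©'
    >>> smuggled.encode("latin-1")  # one more encode at the boundary
    b'caf\xc3\xa9'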
The "byte/character confoundance virus" is a hobgoblin of paranoid
perception. In another post, I pointed out that
''' b"%d" % 25 ''' is not equivalent to ''' "%d" % 25 ''' because
of the "b" in the first case. So the "implicit" encoding that
everyone on that side of the fence was talking about was not at all
implicit, but explicit. The numeric characters produced by %d are
clearly in the ASCII subset of text, so having b"%d" % 25 produce
pre-encoded ASCII text is explicit and practical.
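Concretely (an interpreter sketch, assuming the bytes interpolation
being discussed here is accepted):

    >>> "%d" % 25     # str result
    '25'
    >>> b"%d" % 25    # the "b" makes the ASCII encoding explicit
    b'25'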
My only concern was what b"%s" % 'abc' should do, because in
general, str may not contain only ASCII. (generalize to b"%s" %
str(...) ). Guido solved that one nicely. Of course, at this
point, I could punt the whole argument off to "Guido said so", but
since you asked me, I felt it appropriate to respond from my
perspective... and I'm not sure Guido specifically addressed your
smuggled binary proposal.
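For completeness, here is how I understand that resolution would behave
(a sketch only, not a quote of Guido; the exact error text is an
assumption): bytes-like arguments interpolate, str is rejected rather
than implicitly encoded.

    >>> b"%s" % b"abc"
    b'abc'
    >>> b"%s" % "abc"
    Traceback (most recent call last):
      ...
    TypeError: ...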
When the mixture of text and binary is done as encoded text inside
binary, it is obvious that only limited text processing can be
performed, and getting the text there requires explicitly encoding it
(hopefully per the binary specification being created) so that it
becomes binary. And there are no extra, confusing Latin-1
encode/decode operations required.
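A small sketch of what that looks like in practice (the length-prefixed
field layout here is invented purely for illustration):

    import struct

    name = "café".encode("utf-8")               # explicit encode, per the blob's spec
    blob = struct.pack("!H", len(name)) + name  # length-prefixed text field in a binary record

    # Reading it back is equally explicit: slice the bytes, then decode.
    (length,) = struct.unpack_from("!H", blob, 0)
    text = blob[2:2 + length].decode("utf-8")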
From a higher-level perspective, I think it would be great to have a
module, perhaps called "boundary" (let's call it that for now), that
allows some definition syntax (augmented BNF? ABNF?) to
explain the format of a binary blob. And then provide methods for
generating and parsing it to/from Python objects. Obviously, the
ABNF couldn't understand Python objects; instead, Python objects
might define the ABNF to which they correspond, and methods for
accepting binary and producing the object (factory method?) and
methods for generating the binary. As objects build upon other
objects, the ABNF to which they correspond could be constructed, and
perhaps even proven to be capable of parsing all valid blobs
corresponding to the specification, and perhaps even proven to be
capable of generating only valid blobs (although I'm not a software
proof guru; last I heard there were definite limits on the ability
to do proofs, but maybe this is a limited enough domain it could
work).
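As a very rough sketch of how such per-object parse/generate methods
might look (the "boundary" module itself, the Record class, and its
layout are all hypothetical; this uses plain struct rather than any
real ABNF machinery):

    import struct

    class Record:
        """Hypothetical blob piece: a length-prefixed UTF-8 name plus a 32-bit count."""

        # Rough ABNF-style description this object would contribute (illustrative only).
        ABNF = ("record   = name-len name count\n"
                "name-len = 2OCTET\n"
                "count    = 4OCTET")

        def __init__(self, name, count):
            self.name = name
            self.count = count

        def to_bytes(self):
            raw = self.name.encode("utf-8")
            return struct.pack("!H", len(raw)) + raw + struct.pack("!I", self.count)

        @classmethod
        def from_bytes(cls, blob):           # factory method: binary in, object out
            (nlen,) = struct.unpack_from("!H", blob, 0)
            name = blob[2:2 + nlen].decode("utf-8")
            (count,) = struct.unpack_from("!I", blob, 2 + nlen)
            return cls(name, count)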
Then all blobs could be operated on sort of like web browsers
operate on the DOM, or some XML parsing libraries, by defining each
blob as a collection of objects for the pieces. XML is far too wordy
for practical use (but hey! it is readable) but perhaps it could be
practical if tokenized, and then the tokenized representation could
be converted to a DOM just like XML and HTML are. (This is mostly to
draw the parallel in the parsing and processing techniques; I'm not
seriously suggesting a binary version of XML, but there is a strong
parallel, and it could be done.) Given a DOM-like structure, a
validator could be written to operate on it, providing, if not a
proof, at least a sanity check. And, given the DOM-like
structure, one call to the top-level object to generate the blob
format would walk over all of them, generating the whole blob.
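Continuing the sketch (all names invented; self-contained and not tied
to the earlier example), one call on the top-level object walks the
children to emit the whole blob, and a validator can walk the same
tree:

    import struct

    class Field:
        def __init__(self, text):
            self.text = text

        def to_bytes(self):
            raw = self.text.encode("utf-8")
            return struct.pack("!H", len(raw)) + raw

        def validate(self):
            return len(self.text.encode("utf-8")) <= 0xFFFF  # fits the 2-byte length

    class Blob:
        def __init__(self, fields):
            self.fields = fields

        def to_bytes(self):                    # one call walks the whole tree
            return b"".join(f.to_bytes() for f in self.fields)

        def validate(self):                    # a sanity check, not a proof
            return all(f.validate() for f in self.fields)

    blob = Blob([Field("héllo"), Field("world")])
    assert blob.validate()
    wire = blob.to_bytes()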
Off I go, drifting into Python ideas.... but I have a program I want
to rewrite that could surely use some of these techniques (and
probably will), because it wants to read several legacy formats, and
produce several legacy formats, as well as a new, more comprehensive
format. So the objects will be required to parse/generate 4
different blob structures, one of which has its own set of several
legacy variations.