[Python-Dev] PEP 460: allowing %d and %f and mojibake

Stephen J. Turnbull stephen at xemacs.org
Mon Jan 13 04:02:05 CET 2014

Ethan Furman writes:

Are you saying it's okay to be insulting when frustrated? I also find this mega-thread frustrating, but I'm trying very hard not to be insulting.

OK, no. Understandable, yes.

If you are going to use my name, please be certain of the facts [1]. More below.

MAL posted straight out that the Python 2 model of text makes it easier for him to write some programs, so he's all for reintroducing it. And that is the whole truth of the matter. Although I disagree with him, I appreciate his honesty.

If you have an example of me lying (even if it's just a possibility), please refer to it directly so I can either try to explain the misunderstanding or apologize.

Praising one person for honesty doesn't imply anybody else is lying.

As for the Artist Currently Posting as Ethan Furman, he's not in the "disingenuous" group. I don't think you understand the issues at stake (among other things, as I've discussed elsewhere, I think your use case is different from the use cases of most of those who are asking for bytes formatting). And there's a crucial terminology difference:

In only one case did I use the word "text" loosely,

From my point of view, you consistently do so. Bytes are never Python 3 text in my terminology, and I think that is generally accepted on these channels. "ASCII-encoded text" as you call it (and repeatedly do so), and want to manipulate using str-like methods on bytes, is exactly the Python 2 model of text. But you deny that the effect of your proposals (eg, b"%d" % (12,)) is to reintroduce Python 2's bytes/character confusion, don't you?
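To make the disputed operation concrete, here is a minimal sketch of the interpolation in question next to the str-based route. It assumes Python 3.5 or later, where PEP 461 eventually added %-formatting to bytes; on the 3.x releases current when this thread was written, the bytes expressions raise TypeError. The variable names are illustrative only.

```python
# Assumes Python 3.5+ (PEP 461); on Python 3.0-3.4, bytes % tuple raises TypeError.
length_field = b"%d" % (12,)                  # integer rendered as ASCII digits
header = b"Content-Length: %d\r\n" % (123,)   # interpolation directly into bytes

# The str route to the same bytes: format as text, then encode explicitly.
header_via_str = "Content-Length: {}\r\n".format(123).encode("ascii")
```

Both routes produce identical bytes; the disagreement in this thread is over whether the first one invites treating bytes as text.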

Yes, I've used "ASCII-compatible text" in some of my posts, but I recognize that as "loose usage", too, and would stop if requested. Note I'm not asking you to stop -- I think we all understand what you mean, even though for some of us it's loose terminology. What I do hope you will recognize is that adding str-like methods to bytes is precisely the Python 2 model of text processing[1], and that like MAL you will say, "OK, I don't see a problem with reintroducing Python 2's byte/character confusion." (Well, I really want you to see the light, and retract your proposal for b'%d' format. But that hardly seems likely. :-)

But don't lie to me (as Nick tried to) and say that "In particular, the bytes type is, and always will be, designed for pure binary manipulation" when it has methods like .center().

I hardly think Nick is lying, any more than you are. AFAICT, you're both wrong. According to PEP 3137[2] by Guido van Rossum, the idea of the immutable bytes type was suggested (in various aspects which combined to overcome Guido's initial opposition) by Gregory P. Smith, Jeffrey Yasskin, and Talin. Guido then chose to implement it by grabbing the Python 2 code, and removing .encode, and removing locale-dependent definitions of character classes. This was with a view to supporting ports of code that implements wire protocols or uses bytes as encoded text:

It also makes it possible to efficiently create hash tables using
bytes for keys; this may be useful when parsing protocols like
HTTP or SMTP which are based on bytes representing text.

Porting code that manipulates binary data (or encoded text) in
Python 2.x will be easier using the new design than using the
original 3.0 design with mutable bytes; simply replace str with
bytes and change '...' literals into b'...' literals.

IIRC, only later was regex support added to bytes (by Nick himself, again IIRC). And despite the quote above, I don't think Guido meant to encourage use of bytes as text in wire protocol development, at least not at that time.

Note that Nick has already admitted that permitting even methods that can be implemented purely as numerical manipulations:

def is_uppercase(b):
    # Indexing a bytes object yields an int in Python 3, so every
    # comparison here is between integers:
    return ord('A') <= b[0] <= ord('Z')

was in retrospect a mistake (in his opinion). So I don't think it was a lie, merely a difference in your definitions of "pure binary manipulation". (Which isn't surprising, given that ultimately everything in computers as we know them today eventually reduces to "pure binary manipulations".[3] Drawing the line is going to involve personal taste to some extent.) I think his interpretation that bytes were designed that way is a bit strained given PEP 3137. I also don't know what was discussed at language summits, and don't recall the python-dev conversations about it at all.
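As a rough illustration of where that line gets drawn, the sketch below puts the integer-only check next to a few of the str-like methods that bytes already carries in Python 3. The sample values are made up for the example; the method behavior shown (ASCII-only case mapping, single-byte fill for center) is how the built-in bytes type behaves.

```python
def is_uppercase(b):
    # Indexing bytes yields an int in Python 3, so this is integer comparison.
    return ord("A") <= b[0] <= ord("Z")

first = b"Zebra"[0]                      # an int, not a length-1 bytes object
custom = is_uppercase(b"Zebra")          # numeric check on the first byte
builtin = b"Zebra"[:1].isupper()         # str-like method on a bytes slice
padded = b"abc".center(7, b"*")          # fill argument must be a single byte
mixed = "café".encode("utf-8").upper()   # only ASCII letters are changed
```

Note that upper() leaves the two UTF-8 bytes of the accented letter untouched, which is the sense in which these methods are numeric operations on the ASCII range rather than text operations.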

A final remark: Be very careful in interpreting Guido's words in these "practical vs. pure" matters. I've discovered his offhand comments on these matters are often both subtle and deep (that probably doesn't surprise you), and that the idea behind them is usually extremely precise though his expression may be informal or even casual (and here be dragons -- taking the expression too literally may lead you astray).

I think some of the misunderstanding (which you also seem to suffer from) is that we (or at least I) /ever/ want a unicode string back from bytes interpolation. I don't!

Please tell me why you think I suffer from that misunderstanding. I certainly don't think you want Unicode strings. You've been quite strident about the fact that you don' need no steekin' yooneekode (for these purposes).

What I want to find out is why your use case can't be handled with Python 3 str. That's why I provide examples (mostly parallel to yours) that return str in Python 3 (I can't speak for anyone else).

To summarize, I used the term text when referring to unicode text (str), ASCII or ASCII-encoded text to refer to bytes that are to be used in a place that requires ASCII bytes for communication (such as content length or field type).

I've never been confused about that, but your use of the word "text" in a way different from others in the thread seems to confuse you about what they mean.

But did you get that I'm worried that programmers in Omaha will use that same functionality to communicate American English (for which it is basically sufficient, and which also requires ASCII when bytes are used for communication)?

My definition is not ambiguous at all. If this particular part of the byte stream is defined to contain ASCII-encoded text, then I can use the bytes text methods to work with it.

But how is Python supposed to know that? The point of having types in a programming language is so that either the interpreter can just DTRT, or raise an exception if TRT is ambiguous, without explicit specification by the programmer. This is precisely what asciistr is for: it knows that it is both unicode and bytes compatible, and morphs automatically to whichever it is combined with. And does so efficiently (because they're all immutable, any combination of these types in Python involves copying "code units", and for asciistr that copy is always of bytes, thus reducing eventually to memcpy for bytes and latin1-only str).

But under your definition, you need to make the decision, or explicitly code the decision, on the basis of context.

When it's convenient for them to use text-processing operations on bytes, they'll say "oh, yes, these are conventionally considered text-processing features, but that's just an accident of the particular configuration of bytes -- yup, bytes -- I'm processing."

If that particular configuration of bytes is because it's ASCII-encoded text, then sure.

Once again, you are advocating precisely the Python 2 model of text.

To use, for example, bytes.upper on data that wasn't ASCII-encoded text (even if it happened to look like it was) would be the height of stupidity. Please don't include me in such accusations.

I have no idea why you think I think anybody would be that stupid. That never occurred to me. It's precisely "magic numbers" that happen to look like English words when interpreted as ASCII coded characters that I don't want manipulated by str-like methods that interpret text (such as full-featured format or %).

If b"Content-Length: 123" is (ASCII-encoded) text, then it should be created as, or decoded to, internal text and handled that way. If it's binary, then handle it as binary.
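A minimal sketch of the "decode to internal text" route for that header, assuming the field really is ASCII by protocol definition. The variable names and the trailing CRLF are illustrative, not taken from any particular library.

```python
raw = b"Content-Length: 123\r\n"

# Boundary in: the protocol says this field is ASCII, so decode it as such.
text = raw.decode("ascii")

# Inside the program, work with str and int, not bytes.
name, _, value = text.strip().partition(": ")
length = int(value)

# Boundary out: format as text, then encode explicitly.
rebuilt = "{}: {:d}\r\n".format(name, length + 1).encode("ascii")
```

The decode and encode calls mark exactly where the program claims the bytes are ASCII text, so a stray non-ASCII byte raises an error instead of passing through silently.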

ambiguous form". IMO, with the proposed changes, that is likely to continue indefinitely, negating some of the gains I expected to receive from Python 3. :-(

This would be a good reason to reject PEP 460, if that danger was deemed more likely than the good it would bring.

Depends on which version. I earlier opposed PEP 460 in any form, but I'm persuaded by Nick's particular definition of "pure binary manipulation" and agree that PEP 460 as revised by Antoine is harmless to my goals. Although I personally am unlikely to find any great convenience from it (both as a matter of style and to a great extent a lack of use cases, although I'd like to get involved in the email module).

Note: there are a lot of high-level frameworks like Django that even in Python 2 basically went to Unicode everywhere internally. I don't deny that. I think that Python 3 as currently constituted makes it a lot easier to make an appropriate decision of where to convert, and should take some of the burden off the high-level frameworks. Approving this PEP, especially in a maximalist form, will blur the lines.

I understand your point, but I disagree. When I open a file (in binary mode, obviously, as otherwise I'd get massive corruption)

Obviously, you would open the file in binary mode, but by definition of the latin1 codec and the surrogateescape handler, I can definitely avoid any corruption when reading such files as text. (This may require painful contortions if one does any nontrivial processing, but then again it may not.)
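The two lossless round trips referred to here can be sketched as follows. The sample data is invented; the codec name and the surrogateescape error handler are standard Python 3 features.

```python
# latin-1 maps every byte value 0-255 to the code point with the same number,
# so any byte string survives a decode/encode round trip unchanged.
data = bytes(range(256))
as_latin1 = data.decode("latin-1")
back_from_latin1 = as_latin1.encode("latin-1")

# surrogateescape smuggles undecodable bytes through str as lone surrogates
# and restores them on encode with the same handler.
mixed = b"Subject: caf\xe9\r\n"
as_text = mixed.decode("ascii", errors="surrogateescape")
back_from_escape = as_text.encode("ascii", errors="surrogateescape")

# Text processing touches only the real characters; the escaped byte survives.
shouted = as_text.upper().encode("ascii", errors="surrogateescape")
```

The contortions mentioned above begin when the processing has to look inside the escaped regions, since a lone surrogate is not a meaningful character.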

I get back a bunch of bytes. When working with tcp, I get back a bunch of bytes. bytes are /already/ the boundary type.

No, they are not. Clearly there are "just bytes" on the "outside" of I/O in each of your examples here, and they are "just copied" to the inside of Python. But in Nick's sense, this is the "outside," not the "inside", of your program! On the "inside", you want "a bool, an int, a float, a date, or, even, a str" (I'm quoting!). What Nick means by a "boundary type" is a type that works seamlessly with the types on each side of the boundary as a helper in the conversion. So when you use a struct to pack a bool, an int, and a date into a bytes, the struct is the boundary type. And if there's a helper type to work with bytes and/or str simultaneously, that's a boundary type, eg, asciistr. But bytes itself is not a boundary type, it's just a type with no internal structure, not even characters.
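To illustrate struct in the boundary role, here is a small sketch that packs a bool, an int, and a date into bytes and recovers them. The record layout (little-endian, date stored as its proleptic ordinal) and the function names are assumptions made up for this example.

```python
import struct
from datetime import date

# Hypothetical layout: bool, signed 32-bit int, date ordinal as unsigned 32-bit.
# "<" selects little-endian standard sizes with no padding.
RECORD = struct.Struct("<?iI")

def pack_record(flag, count, day):
    """Convert inside-the-program values to bytes for the outside."""
    return RECORD.pack(flag, count, day.toordinal())

def unpack_record(buf):
    """Convert bytes from the outside back to typed values."""
    flag, count, ordinal = RECORD.unpack(buf)
    return flag, count, date.fromordinal(ordinal)

wire = pack_record(True, 42, date(2014, 1, 13))
decoded = unpack_record(wire)
```

Here wire is plain bytes with no internal structure of its own; the knowledge of what the bytes mean lives in RECORD, which is what makes the struct, rather than bytes, the helper at the boundary.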

If we have to make a third type for proper boundary processing it's an admission that bytes failed in its role.

That admission was made in PEP 3100.

Or, more precisely, bytes was never considered as a boundary type in Python 3.

Footnotes: [1] To be precise, one of two models, the other one being the unicode type.

[2] http://www.python.org/dev/peps/pep-3137/

[3] OK, OK, I still have my Daddy's K&E loglog slide rule. Not everything is binary!
