[Python-Dev] PEP 460: allowing %d and %f and mojibake

Stephen J. Turnbull stephen at xemacs.org
Mon Jan 13 11:48:50 CET 2014


Ethan Furman writes:

The part that you don't seem to acknowledge (sorry if I missed it) is that there are str-like methods already on bytes.

I haven't expressed myself well, but I don't much care about that. It's what Knuth would classify as a seminumerical method. What I do care about is that the methods that convert other types to text (including format) not work for bytes. That's where I consider text to "start".

is exactly the Python 2 model of text. But you deny that the effect of your proposals (eg, b"%d" % (12,)) is to reintroduce Python 2's bytes/character confusion, don't you?
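For contrast with the proposed b"%d" % (12,), here is a minimal sketch of the explicit route: format as text, then encode once at the boundary. The variable names and the choice of ASCII as the wire encoding are assumptions made for illustration only.

```python
# Explicit alternative to formatting directly into bytes:
# the number becomes text first, and the text becomes bytes
# only at the point where the encoding is actually known.
number = 12

text = "%d" % (number,)        # str: conversion of an int to text
wire = text.encode("ascii")    # bytes: the encoding is stated explicitly

print(repr(text))  # '12'
print(repr(wire))  # b'12'
```

The extra step is the whole point of the argument: the encoding decision is written down instead of being implied by the type of the format string.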

Given that the default (and only) text type in Py3 is str, which is unicode, I don't think any confusion will be as severe, but I acknowledge that there could be some.

I fear it will be quite severe where I live, in Shift JIS/GB18030 land. (The two most obnoxious encodings known to man, except perhaps the syntax of Brainf!ck.)

My definition is not ambiguous at all. If this particular part of the byte stream is defined to contain ASCII-encoded text, then I can use the bytes text methods to work with it.

But how is Python supposed to know that?

Python doesn't need to.

... because you know it. But the ideal of object-oriented programming (and duck-typing) is that you shouldn't need to; the object should know how to produce appropriate behavior itself.

But under your definition, you need to make the decision, or explicitly code the decision, on the basis of context.

Exactly so. I even have to do that in Py2.

"Even." This is exactly where PBP and EIBTI part company, I think. EIBTI thinks it's a bad idea to pass around bytes that are implicitly some other type, and Python 3 should be good enough to make that unnecessary. I'm convinced, and Nick is convinced, that we can make that true for 90% of the cases that it isn't now, if we could just figure out what's hard about the use cases where Python 3 isn't up to snuff yet (and figure out which use cases we need to handle to get us up to 90%!).

PBP doesn't think it's a great idea to pass around bytes that are implicitly some other type, but didn't mind it (or got used to it) in Python 2, and so they're not looking at that as a problem that Python 3 can solve. They're looking at Python 3 as the problem that prevents them from doing what worked fine in Python 2. I understand that point of view; I just think we should be able to do better in Python 3, and should give it a serious try before giving in. Remember, "Special cases aren't special enough to break the rules" comes before "Although practicality beats purity". Not to forget that "Explicit is better than implicit" is second[1] on the list. ;-)

After looking at this thread, I feel that (due to misunderstandings on both sides) purity hasn't really been tried yet.

If that particular configuration of bytes is because it's ASCII-encoded text, then sure.

Once again, you are advocating precisely the Python 2 model of text.

Not exactly, because what I get back is bytes, which cannot directly be mixed with unicode (str) as it was in Py2. I think this is a key difference.
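The "key difference" can be seen in a short sketch: in Python 3, combining bytes with str raises an error instead of coercing silently. The header name and value below are made up for illustration.

```python
# In Python 3, bytes and str do not mix implicitly.
header = b"Content-Length: "   # hypothetical wire-format fragment
value = "42"                   # text (str)

try:
    header + value             # mixing bytes and str
except TypeError as exc:
    mixed_error = exc          # Python 3 refuses the implicit mix
else:
    mixed_error = None

# The mix has to be made explicit by choosing an encoding.
joined = header + value.encode("ascii")

print(type(mixed_error).__name__)
print(joined)
```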

You're in good company there; that was Guido's rationale for not worrying, too. I agree it's "key" (and I'm sure Nick will, on reflection if not already). But I worry (a lot) that it's not enough.

This confuses me somewhat. It's okay to use b'ethan'.upper(), which only makes semantic sense as ASCII-encoded text,

Not really OK. In theory, because it doesn't require serialization/encoding of a primitive type, it doesn't matter. In practice, without powerful formatting, it isn't even a major attraction. In practice, with powerful formatting, it adds to the attraction.

Note that regex doesn't require type conversions (matches have methods to return positions in the target or subsequences of the target, not values of other types), which is why I (and I suspect Nick for the same reason) am comfortable with polymorphic regex but not with bytes formatting.
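A small sketch of the polymorphic regex point, using the standard re module: the same pattern, written once as str and once as bytes, yields positions (plain integers) and subsequences of the same type as the target. Nothing is converted to another type along the way. The sample input is invented for illustration.

```python
import re

# Same pattern, once for text and once for bytes.
text_match = re.search(r"\d+", "id=123;")
bytes_match = re.search(rb"\d+", b"id=123;")

# Positions are integers in both cases.
print(text_match.span(), bytes_match.span())

# The matched subsequence has the same type as the target:
# str in, str out; bytes in, bytes out.
print(repr(text_match.group()))
print(repr(bytes_match.group()))
```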

(Aside, I'm perfectly comfortable with "ASCII-encoded text" because if you took u'ethan'.encode('ascii') you would get b'ethan'. If it was some other encoding, such as cp1251, I would call that particular byte stream "cp1251-encoded text".

Even though "ethan" is perfectly good ASCII-encoded text (as well as the integer 435,744,694,638 on a big-endian machine with 5-byte words), and you have no way of knowing whether it was user data (CP1251), a metadata keyword (ASCII), or the US national debt in 1967 dollars (integer) when b'ethan' shows up in a trace?

And if there were methods that worked directly on a cp1251-encoded byte stream I would not have any problem using them on cp1251-encoded text.)
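The ambiguity of b'ethan' raised in this exchange can be reproduced directly. The sketch below reads the same five bytes three ways; the variable names are illustrative, and the bytes object itself gives no hint as to which reading was intended.

```python
raw = b"ethan"

# Reading 1: a metadata keyword in ASCII.
as_ascii = raw.decode("ascii")

# Reading 2: user data in cp1251 (identical here, because cp1251
# agrees with ASCII on bytes below 128).
as_cp1251 = raw.decode("cp1251")

# Reading 3: a 5-byte big-endian integer.
as_integer = int.from_bytes(raw, byteorder="big")

print(as_ascii, as_cp1251, as_integer)  # ethan ethan 435744694638
```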

I was afraid of that: all of those methods (except the case methods[2]) will work fine on a cp1251-encoded text. And because they only know that the string is bytes, the case methods will silently corrupt your "text" as soon as they get a chance. That bothers me, even if it doesn't bother you. Purity again, if you like. (But you'd take a safe .upper if you got it for free, no?)
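A hedged sketch of what the case methods actually do to encoded text. bytes.upper() only changes the ASCII lowercase letters, so on cp1251 data it leaves the Cyrillic part untouched. The Shift JIS line goes beyond the cp1251 case discussed here and is my own illustration: in that encoding a trail byte can fall in the ASCII letter range, so the same call can turn one character into another without any error. The sample strings are assumptions.

```python
# cp1251: Cyrillic letters are bytes > 127, which bytes.upper() ignores.
mixed = "ethan этан".encode("cp1251")
upper_cp1251 = mixed.upper().decode("cp1251")
print(upper_cp1251)  # only the ASCII half is uppercased

# Shift JIS (illustration beyond the thread's cp1251 example):
# the second byte of this katakana is 0x63, the same byte as ASCII 'c'.
original = "ツ"
encoded = original.encode("shift_jis")
damaged = encoded.upper().decode("shift_jis")
print(encoded, encoded.upper())
print(original, damaged)  # no exception, but no longer the same character
```

A text-aware str.upper() would leave the katakana alone and uppercase the Cyrillic, which is the "safe .upper" alluded to above.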

Okay, I've thought somewhat. Under the definition above would it be fair to say that Db3Table (a class in my dbf module) is a boundary type? It sits between the actual file and the program, and transforms bytes into actual Python types.

Yes, I'd call that a boundary type.
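To make "boundary type" concrete, here is a minimal sketch of a class that sits between raw bytes and the program and hands back real Python types. It is not the dbf module's Db3Table or its API; FixedWidthTable, Person, the 10+3 byte record layout, and the cp1251 default are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Person:
    name: str   # decoded text
    age: int    # parsed number


class FixedWidthTable:
    """Hypothetical boundary type: bytes in, Python objects out."""

    def __init__(self, raw_records, text_encoding="cp1251"):
        self._raw_records = list(raw_records)
        self._text_encoding = text_encoding   # the format's declared encoding

    def __iter__(self):
        for raw in self._raw_records:
            # Assumed layout: 10 bytes of padded text, 3 bytes of ASCII digits.
            name = raw[:10].decode(self._text_encoding).rstrip()
            age = int(raw[10:13].decode("ascii"))
            yield Person(name, age)


# Build one record the way a file on disk might contain it.
records = ["Этан".encode("cp1251").ljust(10) + b" 42"]
people = list(FixedWidthTable(records))
print(people)
```

All knowledge of encodings lives inside the boundary type; code on the program side of it only ever sees str and int.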

Footnotes: [1] Yes, I know what's number 1, but I'm not going to mention it out loud!

[2] Arguably those too, since bytes don't have a locale. They're in C locale and the bytes >127 don't have semantics like case.


