[Python-Dev] Multilingual programming article on the Red Hat Developer blog

Stephen J. Turnbull stephen at xemacs.org
Wed Sep 17 05:28:57 CEST 2014


Glenn Linderman writes:

> Some bytes may decode into characters without needing to be smuggled... maybe not in text-protocols like email, but in the general case. So then some of the bytes that should be interpreted as binary data are not in a disjoint set from characters.

True, but irrelevant. The point is that whoever chose the codec is responsible for getting it right: not only for picking the right encoding, but also for the assumption that the input data was pure encoded text. The rest of the program can now assume that choice was made correctly, and process text as text. The program cannot be blamed for assuming that the person who chose the codec knew what they were about, and so characters can be assumed to have been decoded from bytes representing characters.
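To make the two cases concrete, here is a minimal sketch, assuming Python 3 and assuming that "smuggled" refers to the surrogateescape error handler of PEP 383 (the byte strings are made up for illustration). Undecodable bytes come through as lone surrogates and round-trip, while bytes that happen to be valid in the chosen codec simply become ordinary characters, which is why the choice of codec has to be right in the first place:

```python
# Assumption: "smuggling" means errors="surrogateescape" (PEP 383).

# Case 1: a byte that is not valid UTF-8 is smuggled as a lone surrogate.
raw = b"caf\xe9"  # Latin-1 bytes, not valid UTF-8
text = raw.decode("utf-8", errors="surrogateescape")
smuggled = text[-1]  # code point in the range U+DC80..U+DCFF
round_trip = text.encode("utf-8", errors="surrogateescape")

# Case 2: bytes intended as binary data that are also valid UTF-8
# decode to plain characters; nothing marks them as "not text".
blob = b"\x00\x41\x42"
decoded = blob.decode("utf-8", errors="surrogateescape")
has_surrogate = any(0xDC80 <= ord(ch) <= 0xDCFF for ch in decoded)
```

In case 2 the str object carries no trace that the input was binary; only the code that picked the codec could have known.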

This was not true in Python 2, where it was common practice to represent text internally in its encoded form, implicitly assuming that only one encoding would be encountered in each invocation of the program. That assumption was never true, and with the spread of the Internet and then the WWW, it became a major issue. And that's why we invented Python 3: to let text be text, without the encumbrance of always being aware of encodings and converting when different encodings collide, etc.
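A small sketch of that difference, assuming Python 3 (the sample word and the two encodings are chosen only for illustration). The same text arriving in two encodings looks different as bytes, but compares equal once each input has been decoded at the boundary with its own codec:

```python
word = "caf\u00e9"

# The same text as it might arrive from two different sources.
latin1_bytes = word.encode("latin-1")  # 4 bytes
utf8_bytes = word.encode("utf-8")      # 5 bytes

# Working on encoded bytes: the two inputs do not compare equal.
same_as_bytes = latin1_bytes == utf8_bytes

# Working on text: decode each input once, then forget about encodings.
a = latin1_bytes.decode("latin-1")
b = utf8_bytes.decode("utf-8")
same_as_text = a == b
```

After the two decode calls, nothing downstream needs to know which encoding either input used.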


