[Python-Dev] Which direction is UnTransform? / Unicode is different

Steven D'Aprano steve at pearwood.info
Wed Nov 20 13:03:03 CET 2013

On Tue, Nov 19, 2013 at 05:28:48PM -0800, Jim J. Jewett wrote:

(Fri Nov 15 16:57:00 CET 2013) Stephen J. Turnbull wrote:
> Serhiy Storchaka wrote:
> > If the transform() method will be added, I prefer to have only
> > one transformation method and specify a direction by the
> > transformation name ("bzip2"/"unbzip2").

Me too. Until I consider special cases like "compress", or "lower", and realize that there are enough special cases to become a major wart if generic transforms ever became popular.

I'm not sure I understand this comment. Why are "compress" and "lower" special cases? If there's a "compress" codec, presumably there'll be an "uncompress" or "expand" that reverses it. In the case of "lower", it's not losslessly reversible, but there's certainly a reverse transformation, "upper".
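A rough illustration of the difference between the two cases. It uses the standard zlib module as a stand-in for a hypothetical "compress" codec (the thread does not say how such a codec would be implemented) and plain str methods for "lower" and "upper":

```python
import zlib

data = b"spam and eggs " * 10

# A lossless transform has a true inverse: the round trip restores the input.
compressed = zlib.compress(data)
restored = zlib.decompress(compressed)

# "lower" has a reverse direction ("upper"), but it loses information:
# the original mix of upper and lower case cannot be recovered.
original = "Spam"
lowered = original.lower()
raised = lowered.upper()

print(restored == data)    # True
print(raised == original)  # False: "SPAM" is not "Spam"
```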

Some transformations are their own reverse, e.g. "rot13". In that case, there's no need for an unrot13 codec, since applying it twice undoes it.
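For instance, with the "rot_13" text transform that ships in the standard codecs module (a minimal sketch; the variable names are only for illustration):

```python
import codecs

text = "spam"

# Applying rot13 once scrambles the text; applying it again restores it,
# so the same transform serves as both "encode" and "decode".
once = codecs.encode(text, "rot_13")
twice = codecs.encode(once, "rot_13")

print(once)   # fcnz
print(twice)  # spam
```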

> People think about these transformations as "en- or de-coding", not
> "transforming", most of the time. Even for a transformation that is
> an involution (eg, rot13), people have a very clear idea of what's
> encoded and what's not, and they are going to prefer the names
> "encode" and "decode" for these (generic) operations in many cases.

I think this is one of the major stumbling blocks with unicode. I originally disagreed strongly with what Stephen wrote -- but then I realized that all my counterexamples involved unicode text.

Counterexamples to what? Again, I'm afraid I can't really understand what point you're trying to make here. Perhaps an explicit counterexample, and an explicit statement of what you're disagreeing with (e.g. "I disagree that people have a clear idea of what's encoded and what's not") will help.

[...]

But an 8-bit (even Latin-1, let alone ASCII) bytestring really doesn't seem "encoded", and it doesn't make sense to "decode" a perfectly readable (ASCII) string into a sequence of "code units".

Of course it is encoded. There's nothing "a"-like about the byte 0x61, byte 0x2E is nothing like a period, and there is nothing about the byte 0x0A that forces text editors to start a new line -- or should that be 0x0D, or even possibly 0x85?

There's nothing that distinguishes the text "spam" from the four-byte integer 1936744813 (0x7370616d in hex) except the semantics that we grant it, and that includes an implicit transformation 0x73 <-> "s", etc.
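A small sketch of that point in Python 3. The choice of big-endian byte order and of the cp037 (EBCDIC) codec as a contrasting interpretation are assumptions made for the example, not something the thread specifies:

```python
raw = b"spam"

# The same four bytes read as a big-endian integer or as ASCII text.
as_int = int.from_bytes(raw, "big")
as_text = raw.decode("ascii")
print(hex(as_int), as_int)  # 0x7370616d 1936744813
print(as_text)              # spam

# The byte 0x61 is only "a" under a mapping that says so.
one_byte = bytes([0x61])
as_ascii = one_byte.decode("ascii")
as_ebcdic = one_byte.decode("cp037")  # a different character under EBCDIC
print(as_ascii)                       # a
print(as_ascii == as_ebcdic)          # False
```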

Reading this may help:

www.joelonsoftware.com/articles/Unicode.html

Nor does it help that http://www.unicode.org/glossary/#codeunit defines "code unit" as "The minimal bit combination that can represent a unit of encoded text for processing or interchange. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form. (See definition D77 in Section 3.9, Unicode Encoding Forms.)"

I agree that the official Unicode glossary is unfortunately confusing. It has a huge amount of information, often with confusingly similar terminology (code points and code units are, in a sense, opposites), and it's quite hard for beginners to Unicode to make sense of it all.

I have to read that very carefully to avoid mentally translating it into "Code Units are encoded,

Code units are encoded, in the sense that we say a burger is cooked. Take a raw meat patty and cook it, and you get a burger. Similarly, code units are the product of an encoding process, hence have been encoded.

Code points (think of them as characters, modulo a few technicalities) are encoded into code units, which are stored as one or more bytes each. Which code units you get depends on the encoding form you use, i.e. the codec.

If you start with the character "a", and apply the UTF-8 encoding, you get a single 8-bit (one byte) code unit, 0x61. If you apply the UTF-16 (big endian) encoding, you get a single 16-bit (two bytes) code unit, 0x0061. If you apply the UTF-32be codec, you get a single 32-bit (four bytes) code unit, 0x00000061.
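This can be checked at a Python 3 prompt. A minimal sketch, using the standard codec names "utf-8", "utf-16-be" and "utf-32-be":

```python
ch = "a"

# Encode the same one-character string with three Unicode encoding forms.
encoded = {codec: ch.encode(codec) for codec in ("utf-8", "utf-16-be", "utf-32-be")}

for codec, data in encoded.items():
    # Print the codec, the number of bytes, and the bytes in hex.
    print(codec, len(data), data.hex())
# utf-8 1 61
# utf-16-be 2 0061
# utf-32-be 4 00000061
```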

and there are lots of different complicated encodings that I wouldn't use unless I were doing special processing or interchange."

Very few of those encodings are Unicode. With the exception of a small handful of UTF-* codecs, and maybe one or two others, the vast majority are legacy encodings from the Bad Old Days when just about every computer had its own distinct character set, or sets. If you're a Windows user, the non-UTF codecs (all the Latin-whatever codecs, Big5, cp-whatever, koi8-whatever, there are dozens of them) are basically old Windows code pages and the equivalent from other computer systems.

And yes, it is best to avoid them like the plague except when you need them for interoperability with legacy data.

If I'm not using the network, or if my "interchange format" already looks like readable ASCII, then unicode sure sounds like a complication.

It's not, not compared to the Bad Old Days. If you're like me, you remember when you couldn't exchange text files from Macintosh to Windows and vice versa without data being corrupted. Now, so long as both sides use Unicode, data corruption ought to be a thing of the past.
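A sketch of the kind of corruption meant here. It assumes the Windows side wrote the file as cp1252 and the Macintosh side read it as mac_roman; those two codecs are picked only as plausible examples of the legacy code pages involved:

```python
text = "café"

# Written with one legacy code page, read back with another:
# every byte decodes without error, but the non-ASCII character changes.
windows_bytes = text.encode("cp1252")
misread = windows_bytes.decode("mac_roman")

# With the same Unicode encoding on both sides the round trip is exact.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

print(misread == text)     # False
print(round_trip == text)  # True
```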

It's not, but only because some operating systems still insist on using non-Unicode encodings by default.

I will get confused over which direction is encoding and which is decoding. (Removing .decode() from the (unicode) str type in 3 does help a lot, if I have a Python 3 interpreter running to check against.)

It took me a long time to learn that text encodes to bytes, and bytes decode back to text. Using Python 3 really helped with that.
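In Python 3 the direction is enforced by the types themselves, which a short session makes visible (a minimal sketch; UTF-8 is chosen arbitrarily as the codec):

```python
text = "spam"

# Text (str) encodes to bytes; bytes decode back to text.
data = text.encode("utf-8")
back = data.decode("utf-8")
print(type(data).__name__, type(back).__name__)  # bytes str

# The wrong-direction methods do not exist in Python 3.
str_has_decode = hasattr(str, "decode")
bytes_has_encode = hasattr(bytes, "encode")
print(str_has_decode, bytes_has_encode)  # False False
```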

-- Steven