Information Theory
Last update: 04 Dec 2024 18:47
Imagine that someone hands you a sealed envelope, containing, say, a telegram. You want to know what the message is, but you can't just open it up and read it. Instead you have to play a game with the messenger: you get to ask yes-or-no questions about the contents of the envelope, to which he'll respond truthfully. Question: assuming this rather contrived and boring exercise is repeated many times over, and you get as clever at choosing your questions as possible, what's the smallest number of questions needed, on average, to get the contents of the message nailed down?
This question actually has an answer. Suppose there are only a finite number of messages ("Yes"; "No"; "Marry me?"; "In Reno, divorce final"; "All is known stop fly at once stop"; or just that there's a limit on the length of the messages, say a thousand characters). Then we can number the messages from 1 to \( N \). Call the message we get on this trial \( S \). Since the game is repeated many times, it makes sense to say that there's a probability \( p_i \) of getting message number \( i \) on any given trial, i.e. \( \Pr{(S=i)} = p_i \). Now, the number of yes-no questions needed to pick out any given message is, at most, \( \log{N} \), taking the logarithm to base two. (If you were allowed to ask questions with three possible answers, it'd be log to the base three. Natural logarithms would seem to imply the idea of there being 2.718... answers per question, but nonetheless make sense mathematically.) But one can do better than that: if message \( i \) is more frequent than message \( j \) (if \( p_i > p_j \) ), it makes sense to ask whether the message is \( i \) before considering the possibility that it's \( j \); you'll save time. One can in fact show, with a bit of algebra, that the smallest average number of yes-no questions is \[ -\sum_{i}{p_i\log{p_i}} ~. \] (Strictly, the best strategy may need up to one question more than this per message, but the gap vanishes if you are allowed to ask about long blocks of messages at once.) This gives us \( \log{N} \) when all the \( p_i \) are equal, which makes sense: then there are no preferred messages, and the order of asking doesn't make any difference. The sum is called, variously, the information, the information content, the self-information, the entropy or the Shannon entropy of the message, conventionally written \( H[S] \).
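If you want to check this on a toy example, here is a little sketch (the distribution over four messages is invented, and the questioning strategy is the usual Huffman construction, which is the optimal single-message scheme) comparing the entropy with the average number of questions actually asked:

```python
# A minimal sketch: compare the Shannon entropy of a made-up message distribution
# to the average number of yes/no questions used by a Huffman questioning strategy.
import heapq
from math import log2

def entropy(probs):
    """Shannon entropy in bits, H = -sum p log2 p."""
    return -sum(p * log2(p) for p in probs if p > 0)

def huffman_question_counts(probs):
    """Number of yes/no questions (codeword length) needed for each message."""
    # Each heap entry: (probability, tie-breaker, indices of messages in this subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, ms1 = heapq.heappop(heap)
        p2, _, ms2 = heapq.heappop(heap)
        for m in ms1 + ms2:   # every message in the merged subtree sits one question deeper
            lengths[m] += 1
        heapq.heappush(heap, (p1 + p2, counter, ms1 + ms2))
        counter += 1
    return lengths

# Hypothetical distribution over four messages ("Yes", "No", "Marry me?", "All is known...")
probs = [0.5, 0.25, 0.125, 0.125]
lengths = huffman_question_counts(probs)
avg_questions = sum(p * l for p, l in zip(probs, lengths))
print("H[S] =", entropy(probs), "bits")       # 1.75
print("average questions =", avg_questions)   # 1.75, since these p's are powers of 1/2
print("log2(N) =", log2(len(probs)))          # 2.0, the no-cleverness upper bound
```

For these probabilities (all powers of 1/2) the two numbers agree exactly at 1.75 bits; in general the Huffman average exceeds the entropy by less than one question.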
Now, at this point a natural and sound reaction would be to say "the mathematicians can call it what they like, but what you've described, this ridiculous guessing game, has squat-all to do with information." Alas, would that this were so: it is ridiculous, but it works. More: it was arrived at, simultaneously, by several mathematicians and engineers during World War II (among the Americans, most notably, Claude Shannon and Norbert Wiener), working on very serious and practical problems of coding, code-breaking, communication and automatic control. The real justification for regarding the entropy as the amount of information is that, unsightly though it is, though it's abstracted away all the content of the message and almost all of the context (except for the distribution over messages), it works. You can try to design a communication channel which doesn't respect the theorems of information theory; in fact, people did; you'll fail, as they did.
Of course, nothing really depends on guessing the contents of sealed envelopes; any sort of random variable will do.
The next natural extension is to say, "Well, I've got two envelopes here, and I want to know what all the messages are in both of them; how many questions will that take?" Call the two variables \( S \) and \( T \). (The case of more than two is a pretty simple extension, left to the reader's ingenuity and bored afternoons.) To find out the value of \( S \) takes \( H[S] \) questions; that of \( T \), \( H[T] \); so together we need at most \( H[S] + H[T] \) questions. But some combinations of messages may be more likely than others. If one of them is "Marry me?", the odds are good that the other is "Yes" or "No". So, by the same reasoning as before, we figure out the distribution of pairs of messages, and find its entropy, called the joint entropy, written \( H[S, T] \). Lo and behold, some algebra proves that \( H[S, T] \) is at most \( H[S] + H[T] \), and is always lower if the two variables are statistically dependent. Now suppose that we've figured out the contents of one message, \( S \) let us say (i.e. we've learned it's "Marry me?" or whatever): how many questions will it take us to find out the contents of \( T \)? This is the conditional entropy, the entropy of \( T \) given \( S \), written \( H[T|S] \), and a little thought shows it must be \( H[T, S] - H[S] \), for consistency. This finally leads us to the idea of the mutual information, written \( I[S; T] \), which is the amount we learn about \( T \) from knowing \( S \), i.e., the number of questions it saves us from having to ask, i.e., \( H[T] - H[T|S] \), which is, as it happens, always the same as \( H[S] - H[S|T] \). (Hence "mutual.") The mutual information quantifies how much one variable (say, the signal picked up by the receiver in the field) can tell us about another (say, the signal sent on the other end).
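Here is a similarly minimal sketch (with an invented joint distribution for the two envelopes) that computes all of these quantities from the table of pair probabilities and checks the identities just stated:

```python
# A minimal sketch (made-up joint distribution) checking H[S,T] <= H[S] + H[T],
# H[T|S] = H[S,T] - H[S], and I[S;T] = H[T] - H[T|S] = H[S] - H[S|T].
from math import log2

# Hypothetical joint distribution p(s, t) over pairs of messages.
joint = {("Marry me?", "Yes"): 0.20, ("Marry me?", "No"): 0.20,
         ("All is known", "Yes"): 0.05, ("All is known", "No"): 0.55}

def H(dist):
    """Shannon entropy (in bits) of a dict mapping outcomes to probabilities."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def marginal(joint, index):
    """Marginal distribution of the index-th coordinate of the pair."""
    out = {}
    for pair, p in joint.items():
        out[pair[index]] = out.get(pair[index], 0.0) + p
    return out

pS, pT = marginal(joint, 0), marginal(joint, 1)
H_S, H_T, H_ST = H(pS), H(pT), H(joint)
H_T_given_S = H_ST - H_S        # conditional entropy H[T|S]
I_ST = H_T - H_T_given_S        # mutual information I[S;T]
print(f"H[S]={H_S:.3f}  H[T]={H_T:.3f}  H[S,T]={H_ST:.3f}")
print(f"H[T|S]={H_T_given_S:.3f}  I[S;T]={I_ST:.3f}")
print("subadditivity holds:", H_ST <= H_S + H_T + 1e-12)
print("mutuality holds:", abs(I_ST - (H_S - (H_ST - H_T))) < 1e-12)
```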
I should now talk about the source and channel coding theorems, and error-correcting codes, which are remarkably counter-intuitive beasts, but I don't feel up to it.
I should also talk about the connection to Kolmogorov complexity. Roughly, the Kolmogorov complexity of a sequence of symbols is the length of the shortest computer program which will generate that sequence as its output. For certain classes of random processes, the Kolmogorov complexity per symbol converges, on average, to the entropy per symbol, which in that case is the entropy rate, the entropy of the latest symbol, conditioned on all the previous ones. This gives us a pretty profound result: random sequences are incompressible; and, conversely, an incompressible sequence looks random. In fact it turns out that one can write down formal analogs to almost all the usual theorems about information which talk, not about the entropy, but about the length of the shortest program, which is for this reason also called the algorithmic information. — And now that has a notebook of its own.
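For a concrete case where the entropy rate is easy to get at, take a stationary Markov chain: conditioning on the whole past reduces to conditioning on the current symbol, so the rate is just the entropy of each row of the transition matrix, averaged over the stationary distribution. A sketch, with a made-up two-state chain:

```python
# A minimal sketch (hypothetical two-state chain): the entropy rate of a stationary
# Markov chain is h = sum_i pi_i * H(P[i, :]), with pi the stationary distribution.
import numpy as np

P = np.array([[0.9, 0.1],    # made-up transition probabilities
              [0.4, 0.6]])

# Stationary distribution: left eigenvector of P for eigenvalue 1, normalized.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
pi = pi / pi.sum()

def H(p):
    """Shannon entropy in bits of a probability vector."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

entropy_rate = sum(pi[i] * H(P[i]) for i in range(len(pi)))
print("stationary distribution:", pi)                    # [0.8, 0.2]
print("entropy rate:", entropy_rate, "bits per symbol")  # about 0.57
```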
Norbert Wiener worked out the continuous case of the standard entropy/coding/communication channel part of information theory at the same time as Shannon was doing the discrete version; I don't know whether anything like this exists for algorithmic information theory.
In addition to the use in communications and technology, this stuff is also of some use in statistical physics (where "entropy" came from in the first place), in dynamics (where we use an infinite family of generalizations of the Shannon entropy, the Rényi entropies), and in probability and statistics generally. There are important connections to deep issues about learning and induction, though I think they're often misconceived. (Another rant for another time.) Certainly the occasional people who say "this isn't a communication channel, so you can't use information theory" are wrong.
Equally out of it are physicists who try to use gzip to measure entropy.
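For the record, the Rényi entropies mentioned above form a one-parameter family: for order \( \alpha \geq 0 \), \( \alpha \neq 1 \), \[ H_{\alpha}[S] = \frac{1}{1-\alpha}\log{\sum_{i}{p_i^{\alpha}}} ~, \] and the Shannon entropy \( H[S] \) is recovered in the limit \( \alpha \rightarrow 1 \).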
Relation to other complexity measures, computational mechanics. What are the appropriate extensions to things other than simple time-series, e.g., spatially extended systems?
See also:
- Algorithmic Information Theory
- Directed Information and Transfer Entropy
- Ergodic Theory
- Estimating Entropies and Informations
- Information Geometry
- Information Theory, Large Deviations, and Statistics
- The Minimum Description Length Principle
- Recurrence Times of Stochastic Processes
Recommended, big picture, non- or minimally technical:
- Ray and Charles Eames, A Communications Primer [Short film from, incredibly, 1953]
- J. R. Pierce, Symbols, Signals and Noise [The best non-technical book, indeed, almost the only one which isn't full of nonsense; but I must warn you he does use logarithms in a few places.]
Recommended, close-ups, non-technical:
- William Poundstone, Fortune's Formula
Recommended, big picture, technical:
- Thomas Cover and Joy Thomas, Elements of Information Theory [Is and deserves to be the standard text]
- Dave Feldman, Information Theory, Excess Entropy and Statistical Complexity [Introductory, and at a lower mathematical level than Cover and Thomas. Dave is an old, old friend, but a little log-rolling never hurt anyone...]
- R. M. Gray, Entropy and Information Theory [A mathematically rigorous treatment, using measure-theoretic probability]
- Claude Shannon and Warren Weaver, Mathematical Theory of Communication [The very first work on information theory, highly motivated by very practical problems of communication and coding; it's still interesting to read. (Weaver's essay is also of some historical interest.)]
Recommended, close-ups, asymptotic equipartition/ergodicity/convergence rates therein:
- Paul H. Algoet and Thomas M. Cover, "A Sandwich Proof of the Shannon-McMillan-Breiman Theorem", Annals of Probability 16 (1988): 899--909 [A very slick asymptotic equipartition result for relative entropy, with the usual equipartition theorem as a special case]
- Sergey Bobkov, Mokshay Madiman, "Concentration of the information in data with log-concave distributions", Annals of Probability 39 (2011): 1528--1543, arxiv:1012.5457
- Leo Breiman, "The Individual Ergodic Theorem of Information Theory",Annals of Mathematical Statistics 28 (1957): 809--811
- J.-R. Chazottes and D. Gabrielli, "Large deviations for empirical entropies of Gibbsian sources", math.PR/0406083 = Nonlinearity 18 (2005): 2545--2563 [This is a very cool result which shows that block entropies, and entropy rates estimated from those blocks, obey the large deviation principle even as one lets the length of the blocks grow with the amount of data, provided the block-length doesn't grow too quickly (only logarithmically). I wish I could write papers like this.]
- Matthew T. Harrison, "The Generalized Asymptotic Equipartition Property: Necessary and Sufficient Conditions", IEEE Transactions on Information Theory 54 (2008): 3211--3216, arxiv:0711.2666 Recommended, close-ups, applications to statistics and learning theory:
- Massimiliano Badino, "An Application of Information Theory to the Problem of the Scientific Experiment", Synthese 140 (2004): 355--389 [An interesting attempt to formulate experimental hypothesis testing in information-theoretic terms, with experiments serving as a channel between the world and the scientist. Badino makes what seems to me a very nice point, that if the source is ergodic (because, e.g., experiments are independent replicates), then almost surely a long enough sequence of experimental results will be "typical", in the sense of the asymptotic equipartition property, and so observing what your theory describes as an _a_-typical sequence is reason to reject that theory. Two problems with this, however, are that Badino assumes the theory completely specifies the probability of observations, i.e., no free parameters to be estimated from data, and he doesn't seem to be aware of any of the work relating information theory to hypothesis testing, which goes back at least to Kullback in the 1950s. I think something very interesting could be done here, about testing hypotheses on ergodic (not just IID) sources, but wonder if it hasn't been done already... MS Word preprint]
- David Balduzzi, "Information, learning and falsification",arxiv:1110.3592
- Hrayr Harutyunyan, Maxim Raginsky, Greg Ver Steeg, Aram Galstyan, "Information-theoretic generalization bounds for black-box learning algorithms", forthcoming in NeurIPS 2021, arxiv:2110.01584
- Solomon Kullback, Information Theory and Statistics
- Katalin Marton and Paul C. Shields, "Entropy and the Consistent Estimation of Joint Distributions", Annals of Probability 22 (1994): 960--977
- Jonathan Scarlett and Volkan Cevher, "An Introductory Guide to Fano's Inequality with Applications in Statistical Estimation", arxiv:1901.00555
- Jeffrey E. Steif, "Consistent estimation of joint distributions for sufficiently mixing random fields", Annals of Statistics 25 (1997): 293--304 [Extension of the Marton-Shields result to random fields in higher dimensions]
- Maxim Raginsky, "Divergence-based characterization of fundamental limitations of adaptive dynamical systems", arxiv:1010.2286
- Jorma Rissanen, Stochastic Complexity in Statistical Inquiry [Applications of coding ideas to statistical problems. Review: Less Is More, or Ecce data!]
- Abraham Wald, "Estimation of a Parameter When the Number of Unknown Parameters Increases Indefinitely with the Number of Observations",Annals of Mathematical Statistics 19 (1948): 220--227 Recommended, close-ups, information and evolution:
- Carl T. Bergstrom and Michael Lachmann, "The fitness value of information", q-bio.PE/0510007
- Hillol Kargupta, "Information Transmission in Genetic Algorithm and Shannon's Second Theorem", in Stephanie Forrest (ed.), roceedings of the 5th International Conference on Genetic Algorithms (1993)
- Olivier Rivoire and Stanislas Leibler, "The Value of Information for Populations in Varying Environments", arxiv:1010.5092 Recommended, close-ups, coordination:
- Paul Cuff, Haim Permuter and Thomas Cover, "Coordination Capacity", arxiv:0909.2408 ["...elements of a theory of cooperation and coordination in networks. Rather than considering a communication network as a means of distributing information, or of reconstructing random processes at remote nodes, we ask what dependence can be established among the nodes given the communication constraints. Specifically, in a network with communication rates ... between the nodes, we ask what is the set of all achievable joint distributions ... of actions at the nodes of the network. ... Distributed cooperation can be the solution to many problems such as distributed games, distributed control, and establishing mutual information bounds on the influence of one part of a physical system on another." But the networks considered are very simple, and a lot of them cheat by providing access to a common source of randomness. Still, an interesting direction!]
- Yu-Chi Ho, Marcia P. Kastner and Eugene Wong, "Teams, Signaling, and Information Theory", IEEE Transactions on Automatic Control 23 (1978): 305--312 [Ungated copy via Prof. Wong. Thanks to Maxim Raginsky for the reference.]
- Maxim Raginsky, "Empirical processes, typical sequences and coordinated actions in standard Borel spaces", arxiv:1009.0282 Recommended, close-ups, causality and causal inference:
- Maxim Raginsky, "Directed information and Pearl's causal calculus", arxiv:1110.0718
- Bastian Steudel and Nihat Ay, "Information-theoretic inference of common ancestors", arxiv:1010.5720
Recommended, close-ups, not otherwise, or not yet, classified:
- Riccardo Aragona, Francesca Marzi, Filippo Mignosi, Matteo Spezialetti, "Entropy and Compression: A simple proof of an inequality of Khinchin-Ornstein-Shields", Problems of Information Transmission 56 (2020), arxiv:1907.04713
- Jochen Brocker, "A Lower Bound on Arbitrary fff-Divergences in Terms of the Total Variation" arxiv:0903.1765
- Gavin Brown, Adam Pocock, Ming-Jie Zhao, Mikel Luján, "Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection", Journal of Machine Learning Research 13 (2012): 27--66
- Nicolo Cesa-Bianchi and Gabor Lugosi, Prediction, Learning, and Games [Mini-review]
- Karl-Erik Eriksson and Kristian Lindgren, "Entropy and Correlation in Lattice Systems", technical report, Physical Resource Theory Group, Chalmers University (n.d. but I read it in manuscript around 2000; I should see if it was ever formally published)
- Matthew T. Harrison and Ioannis Kontoyiannis, "Estimation of the Rate-Distortion Function", IEEE Transactions on Information Theory 54 (2008): 3757--3762, arxiv:cs/0702018
- Aleks Jakulin and Ivan Bratko, "Quantifying and Visualizing Attribute Interactions", cs.AI/0308002
- A. I. Khinchin, Mathematical Foundations of Information Theory [An axiomatic approach, for those who like that sort of thing]
- Ioannis Kontoyiannis, L. A. Lastras-Montano and S. P. Meyn, "Relative Entropy and Exponential Deviation Bounds for General Markov Chains", ISIT 2005 [PDF reprint via Prof. Meyn]
- Maxim Raginsky, Roummel F. Marcia, Jorge Silva and Rebecca M. Willett, "Sequential Probability Assignment via Online Convex Programming Using Exponential Families" [ISIT 2009; PDF]
- Maxim Raginsky, Rebecca M. Willett, C. Horn, Jorge Silva and Roummel F. Marcia, "Sequential anomaly detection in the presence of noise and limited feedback", IEEE Transactions on Information Theory 58 (2012): 5544--5562, arxiv:0911.2904
- Alfréd Rényi, "On Measures of Entropy and Information", pp. 547--561 in vol. I of Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability
- Rieke et al., Spikes: Exploring the Neural Code [Review: Cells that Go Ping, or, The Value of the Three-Bit Spike]
- Paul C. Shields, The Ergodic Theory of Discrete Sample Paths [Emphasis on ergodic properties relating to individual sample paths, as well as coding-theoretic arguments. Shields's page on the book.]
- Eric E. Thomson and William B. Kristan, "Quantifying Stimulus Discriminability: A Comparison of Information Theory and Ideal Observer Analysis", Neural Computation 17 (2005): 741--778 [A useful warning against a too-common abuse of information theory. Thanks to Eric for providing me with a pre-print.]
- Hugo Touchette and Seth Lloyd, "Information-Theoretic Approach to the Study of Control Systems", physics/0104007 [Rediscovery and generalization of Ashby's "law of requisite variety" from the 1950s; more applications than he gave, but a more tortuous proof]
- Tim van Erven and Peter Harremoes
- "Renyi Divergence and Its Properties", arxiv:1001.4448
- "Renyi Divergence and Kullback-Leibler Divergence", arxiv:1206.2459
- Ramon van Handel, "Ergodicity, Decisions, and Partial Information", arxiv:1208.3213
- Greg Ver Steeg, Aram Galstyan, Fei Sha, and Simon DeDeo, "Demystifying Information-Theoretic Clustering", arxiv:1310.4210
- Benjamin Weiss, Single Orbit Dynamics
Not recommended:
- Dario Benedetto, Emanuele Caglioti and Vittorio Loreto, "Language Trees and Zipping", cond-mat/0108530 [Though they're no worse than other people who use gzip as an approximation to the Kolmogorov complexity, this example, published in PRL, is especially egregious, and has called forth two separate and conclusive demolitions, cond-mat/0205521 and cond-mat/0202383]
- B. Roy Frieden, Physics from Fisher Information: A Unification [Review: Laboring to Bring Forth a Mouse]
To be shot after a fair trial [the trial will be fair; my not explaining why I list these papers here is not]:
- P. Allegrini, V. Benci, P. Grigolini, P. Hamilton, M. Ignaccolo, G. Menconi, L. Palatella, G. Raffaelli, N. Scafetta, M. Virgilio and J. Jang, "Compression and diffusion: a joint approach to detect complexity", cond-mat/0202123
- Andrea Baronchelli, Emanuele Caglioti and Vittorio Loreto, "Artificial sequences and complexity measures", Journal of Statistical Mechanics: Theory and Experiment (2005): P04002
- Hong-Da Chen, Chang-Heng Chang, Li-Ching Hsieh, and Hoong-Chien Lee, "Divergence and Shannon Information in Genomes", Physical Review Letters 94 (2005): 178103
- P. A. Varotsos, N. V. Sarlis, E. S. Skordas and M. S. Lazaridou
- "Entropy in the natural time-domain", physics/0501117= Physical Review E 70 (2004): 011106
- "Natural entropy fluctuations discriminate similar looking electric signals emitted from systems of different dynamics", physics/0501118= Physical Review E 71 (2005)
- David H. Wolpert, "Information Theory - The Bridge Connecting Bounded Rational Game Theory and Statistical Physics", cond-mat/0402508 [Frankly if it were anyone other than David saying such stuff, I wouldn't even bother to read it.] To read, history:
- Ronald R. Kline, The Cybernetics Moment: Or Why We Call Our Age the Information Age
- Jimmy Soni and Rob Goodman, A Mind at Play: How Claude Shannon Invented the Information Age
To read, applications in biology, particularly evolution/genetics:
- C. T. Bergstrom and M. Rosvall, "The transmission sense of information", arxiv:0810.4168
- David J. C. MacKay. "Rate of Information Acquisition by a Species subjected to Natural Selection" [Link] To read, entropy and divergence of dynamical systems and stochastic processes:
- Armen E. Allahverdyan, "Entropy of Hidden Markov Processes via Cycle Expansion", arxiv:0810.4341
- Claudio Bonanno, "The Manneville map: topological, metric and algorithmic entropy,"math.DS/0107195
- G. Ciuperca, V. Girardin and L. Lhote, "Computation and Estimation of Generalized Entropy Rates for Denumerable Markov Chains", IEEE Transactions on Information Theory 57 (2011): 4026--4034 [The estimation is just plugging in the MLE of the parameters, for finitely-parametrized chains, but they claim to show that works well]
- Felix Creutzig, Amir Globerson and Naftali Tishby, "Past-future information bottleneck in dynamical systems", Physical Review E 79 (2009): 041925
- Gustavo Deco and Bernd Schurmann, Information Dynamics: Foundations and Applications
- Steffen Dereich, "The quantization complexity of diffusion processes", math.PR/0411597
- Tomasz Downarowicz, Entropy in Dynamical Systems
- H. Follmer, "On entropy and information gain in random fields", Z. Wahrsh. verw. Geb. 2691973): 207--217
- Pierre Gaspard, "Time-Reversed Dynamical Entropy and Irreversibility in Markovian Random Processes", Journal of Statistical Physics 117 (2004): 599--615
- George M. Gemelos and Tsachy Weissman, "On the Entropy Rate of Pattern Processes", cs.IT/0504046
- Guangyue Han, "Limit Theorems for the Sample Entropy of Hidden Markov Chains", arxiv:1102.0365
- Guangyue Han and Brian Marcus, "Analyticity of Entropy Rate in Families of Hidden Markov Chains", math.PR/0507235
- Mohamed Kafsi, Matthias Grossglauser, Patrick Thiran, "Entropy of Conditional Markov Trajectories", arxiv:1212.2831
- Brian Marcus, Karl Petersen and Tsachy Weissman (eds.), Entropy of Hidden Markov Processes and Connections to Dynamical Systems
- Ziad Rached, Fady Alajaji and L. Lorne Campbell
- "Rényi's Divergence and Entropy Rates for Finite Alphabet Markov Sources", IEEE Transactions on Information Theory 47 (2001): 1553--1561
- "The Kullback-Leibler Divergence Rate Between Markov Sources", IEEE Transactions on Information Theory 50 (2004): 917--921
- Mohammad Rezaeian, "Hidden Markov Process: A New Representation, Entropy Rate and Estimation Entropy", cs.IT/0606114
- Wojciech Slomczynski, Dynamical Entropy, Markov Operators, and Iterated Function Systems [Many thanks to Prof. Slomczynski for sending a copy of his work]
To read, quantum information and friends:
- Robert Alicki, "Information-theoretical meaning of quantum dynamical entropy", quant-ph/0201012
- Fabio Benatti, Tyll Krueger, Markus Mueller, Rainer Siegmund-Schultze and Arleta Szkola, "Entropy and Algorithmic Complexity in Quantum Information Theory: a Quantum Brudno's Theorem", quant-ph/0506080
- Igor Bjelakovic, Tyll Krueger, Rainer Siegmund-Schultze and Arleta Szkola
- "The Shannon-McMillan Theorem for Ergodic Quantum Lattice Systems,"math.DS/0207121
- "Chained Typical Subspaces - a Quantum Version of Breiman's Theorem,"quant-ph/0301177
- Alexander Stotland, Andrei A. Pomeransky, Eitan Bachmat and Doron Cohen, "The information entropy of quantum mechanical states", quant-ph/0401021 To read, upper and lower bounds to statistical inference:
- Yifeng Chu, Maxim Raginsky, "A unified framework for information-theoretic generalization bounds", arxiv:2305.11042
- Green and Swets, Signal Detection Theory and Psychophysics
- Fredrik Hellström, Giuseppe Durisi, Benjamin Guedj, Maxim Raginsky, "Generalization Bounds: Perspectives from Information Theory and PAC-Bayes", arxiv:2309.04381
- S.-W. Ho and S. Verdu, "On the Interplay Between Conditional Entropy and Error Probability", IEEE Transactions on Information Theory 56 (2010): 5930--5942
- Oliver Johnson and Andrew Barron, "Fisher Information inequalities and the Central Limit Theorem", Probability Theory and Related Fields 129 (2004): 391--409, math.PR/0111020
- F. Liang and A. Barron, "Exact Minimax Strategies for Predictive Density Estimation, Data Compression, and Model Selection", IEEE Transactions on Information Theory 50 (2004): 2708--2726
- Eddy Mayer-Wolf and Moshe Zakai, "Some relations between mutual information and estimation error on Wiener space", math.PR/0610024
- Yuhong Yang and Andrew Barron, "Information-theoretic determination of minimax rates of convergence", Annals of Statistics 27 (1999): 1564--1599
- Tong Zhang, "Information-Theoretic Upper and Lower Bounds for Statistical Estimation", IEEE Transactions on Information Theory 52 (2006): 1307--1321
To read, information-theoretic approaches to model selection, especially variable/feature selection:
- Mario Beraha, Alberto Maria Metelli, Matteo Papini, Andrea Tirinzoni, Marcello Restelli, "Feature Selection via Mutual Information: New Theoretical Insights", arxiv:1907.07384
To read, not otherwise, or not yet, classified:
- Rudolf Ahlswede, "The final form of Tao’s inequality relating conditional expectation and conditional mutual information" [PDF preprint]
- Jose M. Amigo, Matthew B. Kennel and Ljupco Kocarev, "The permutation entropy rate equals the metric entropy rate for ergodic information sources and ergodic dynamical systems", nlin.CD/0503044
- Richard G. Baraniuk, Patrick Flandrin, Augustus J. E. M. Janssen and Olivier J. J. Michel, "Measuring Time-Frequency Information Content Using the Renyi Entropies", IEEE Transactions on Information Theory 47 (2001): 1391--1409
- Andrej Bratko, Gordon V. Cormack, Bogdan Filipic, Thomas R. Lynam, Blaz Zupan, "Spam Filtering Using Statistical Data Compression Models", Journal of Machine Learning Research 7 (2006): 2673--2698
- Paul Bohan Broderick, "On Communication and Computation",Minds and Machines 14 (2004): 1--19 ["The most famous models of computation and communication, Turing Machines and (Shannon-style) information sources, are considered. The most significant difference lies in the types of state-transitions allowed in each sort of model. This difference does not correspond to the difference that would be expected after considering the ordinary usage of these terms."]
- Joachim M. Buhmann, "Information theoretic model validation for clustering", arxiv:1006.0375
- Massimo Cencini and Alessandro Torcini, "A nonlinear marginal stability criterion for information propagation", nlin.CD/0011044
- R. Chaves, L. Luft, T. O. Maciel, D. Gross, D. Janzing, B. Schölkopf, "Inferring latent structures via information inequalities", UAI 2014, arxiv:1407.2256
- J. Chen and T. Berger, "The Capacity of Finite-State Markov Channels with Feedback", IEEE Transactions on Information Theory 51 (2005): 780--798
- Yifeng Chu, Maxim Raginsky, "Majorizing Measures, Codes, and Information", arxiv:2305.02960
- Bob Coecke, "Entropic Geometry from Logic,"quant-ph/0212065
- Imre Csiszar, "The Method of Types", IEEE Tranactions on Information Theory 44 (1998): 2505--2523 [free PDF copy]
- Imre Csiszar and Janos Korner, Information Theory: Coding Theorems for Discrete Memoryless Systems
- Łukasz Dębowski
- "On vocabulary size of grammar-based codes",cs.IT/0701047
- "Variable-Length Coding of Two-sided Asymptotically Mean Stationary Measures", Journal of Theoretical Probability 23 (2009): 237--256
- "Mixing, Ergodic, and Nonergodic Processes with Rapidly Growing Information between Blocks", arxiv:1103.3952
- "Hilberg Exponents: New Measures of Long Memory in the Process", arxiv:1403.1757
- "Is Natural Language a Perigraphic Process? The Theorem about Facts and Words Revisited", Entropy 20 (2018): 85
- Amir Dembo, "Information Inequalities and Concentration of Measure", The Annals of Probability 25 (1997): 927--939
- A. Dembo and I. Kontoyiannis, "Source Coding, Large Deviations, and Approximate Pattern Matching", math.PR/0103007
- Joseph DeStefano and Erik Learned-Miller, "A Probabilistic Upper Bound on Differential Entropy", cs.IT/0504091
- Dowe, Korb and Oliver (eds.), Information, Statistics and Induction in Science
- M. Drmota and W. Szpankowski, "Precise minimax redundancy and regret", IEEE Transactions on Information Theory 50 (2004): 2686--2707
- Ebanks, Sahoo and Sander, Characterization of Information Measures
- Werner Ebeling and Thorsten Poeschel, "Entropy and Long range correlations in literary English", cond-mat/0204108
- Karl-Erik Eriksson, Kristian Lindgren, Bengt Å. Månsson, Structure, Context, Complexity, Organization: Physical Aspects of Information and Value [The sort of title which usually makes me run away, but actually full of content]
- Roger Filliger and Max-Olivier Hongler, "Relative entropy and efficiency measure for diffusion-mediated transport processes", Journal of Physics A: Mathematical and General 38 (2005): 1247--1255
- Flocchini et al. (eds.), Structure, Information and Communication Complexity
- David V. Foster and Peter Grassberger, "Lower bounds on mutual information", Physical Review E 83 (2011): 010101
- Travis Gagie, "Compressing Probability Distributions", cs.IT/0506016 [Abstract(in full): "We show how to store good approximations of probability distributions in small space."]
- Josep Ginebra, "On the Measure of the Information in a Statistical Experiment", Bayesian Analysis 2 (2007): 167--212
- M. Godavarti and A. Hero, "Convergence of Differential Entropies", IEEE Transactions on Information Theory 50 (2004): 171--176
- Goldman, Information Theory [Old (1965) text, but has some interesting time-series stuff which has dropped out of most modern presentations]
- Alexander N. Gorban, Iliya V. Karlin and Hans Christian Ottinger, "The additive generalization of the Boltzmann entropy", cond-mat/0209319 [The abstract sounds like a rediscovery of Renyi entropies --- there's a lot of that going around --- but presumably there's more]
- Christian Gourieroux, Alain Monfort and Eric Renault, "Kullback Causality Measures", Annales d'Economie et de Statistique 6--7 (1987): 369--410 [via Maxim Raginsky]
- A. Greven, G. Keller and G. Warnecke (eds.), Entropy
- Peter Grünwald and Paul Vitányi, "Shannon Information and Kolmogorov Complexity", cs.IT/0410002
- Sudipto Guha, Andrew McGregor and Suresh Venkatasubramanian, "Streaming and Sublinear Approximation of Entropy and Information Distances", 17th ACM-SIAM Symposium on Discrete Algorithms, 2006 [Link via Suresh]
- Michael J. W. Hall, "Universal Geometric Approach to Uncertainty, Entropy and Information", physics/9903045
- Te Sun Han
- "Folklore in Source Coding: Information-Spectrum Approach", IEEE Transactions on Information Theory 51 (2005): 747--753 [From the abstract: "we verify the validity of the folklore that the output from any source encoder working at the optimal coding rate with asymptotically vanishing probability of error looks like almost completely random."]
- "An information-spectrum approach to large deviation theorems", cs.IT/0606104
- Te Sun Han and Kingo Kobayashi, Mathematics of Information and Coding
- Masahito Hayashi, "Second order asymptotics in fixed-length source coding and intrinsic randomness", cs.IT/0503089
- Nicolai T. A. Haydn, "The Central Limit Theorem for uniformly strong mixing measures", arxiv:0903.1325[With the goal being to prove that "the measures of cylinder sets are lognormally distributed", which is relevant to asymptotic equipartition]
- Nicolai Haydn and Sandro Vaienti, "Fluctuations of the Metric Entropy for Mixing Measures", Stochastics and Dynamics 4 (2004): 595--627
- D.-K. He and E.-H. Yang, "The Universality of Grammar-Based Codes for Sources With Countably Infinite Alphabets", IEEE Transactions on Information Theory 51 (2005): 3753--3765
- Torbjorn Helvik, Kristian Lindgren and Mats G. Nordahl, "Continuity of Information Transport in Surjective Cellular Automata" [thanks to Mats for a preprint]
- S.-W. Ho and R. W. Yeung, "The Interplay Between Entropy and Variational Distance", IEEE Transactions on Information Theory 56 (2010): 5906--5929
- Michael Hochman, "Upcrossing Inequalities for Stationary Sequences and Applications to Entropy and Complexity", arxiv:math.DS/0608311 [where "complexity" = algorithmic information content]
- M. Hotta and I. Jochi, "Composability and Generalized Entropy", cond-mat/9906377
- Shunsuke Ihara, Information Theory for Continuous Systems
- K. Iriyama, "Probability of Error for the Fixed-Length Lossy Coding of General Sources", IEEE Transactions on Information Theory 51 (2005): 1498--1507
- K. Iwata, K. Ikeda and H. Sakai, "A Statistical Property of Multiagent Learning Based on Markov Decision Process", IEEE Transactions on Neural Networks 17 (2006): 829--842 [The property is asymptotic equipartition!]
- Herve Jegou and Christine Guillemot, "Entropy coding with Variable Length Re-writing Systems", cs.IT/0508058 [A construction method for codes where "the marginal bit probability converges to 0.5 as the sequence length increases and this is achieved even if the probability distribution function is not known by the encoder."]
- Petr Jizba and Toshihico Arimitsu, "The world according to Renyi: Thermodynamics of multifractal systems,"cond-mat/0207707
- Oliver Johnson
- "A conditional Entropy Power Inequality for dependent variables,"math.PR/0111021
- "Entropy and a generalisation of `Poincare's Observation',"math.PR/0201273
- Ido Kanter and Hanan Rosemarin, "Communication near the channel capacity with an absence of compression: Statistical Mechanical Approach,"cond-mat/0301005
- Holger Kantz and Thomas Schuermann, "Enlarged scaling ranges for the KS-entropy and the information dimension," Chaos 6 (1996): 167--171, cond-mat/0203439
- Matthew B. Kennel, "Testing time symmetry in time series using data compression dictionaries", Physical Review E 69 (2004): 056208
- J. C. Kieffer and E.-H. Yang, "Grammar-Based Lossless Universal Refinement Source Coding", IEEE Transactions on Information Theory 50 (2004): 1415--1424
- Ioannis Kontoyiannis
- "The Complexity and Entropy of Literary Styles"
- "Model Selection via Rate-Distortion Theory"
- "Some information-theoretic computations related to the distribution of prime numbers", arxiv:0710.4076
- Bernard H. Lavenda, "Information and coding discrimination of pseudo-additive entropies (PAE)", cond-mat/0403591
- Jaeho Lee, Maxim Raginsky, "Learning finite-dimensional coding schemes with nonlinear reconstruction maps", arxiv:1812.09658
- Tue Lehn-Schioler, Anant Hegde, Deniz Erdogmus and Jose C. Principe, "Vector quantization using information theoretic concepts", Natural Computing 4 (2005): 39--51
- Kristian Lindgren, "Information Theory for Complex Systems" [Online lecture notes, January 2003]
- Niklas Lüdtke, Stefano Panzeri, Martin Brown, David S. Broomhead, Joshua Knowles, Marcelo A. Montemurro, Douglas B. Kell, "Information-theoretic sensitivity analysis: a general method for credit assignment in complex networks", Journal of the Royal Society: Interface forthcoming (2007)
- E. Lutwak, D. Yang and G. Zhang, "Cramer-Rao and Moment-Entropy Inequalities for Renyi Entropy and Generalized Fisher Information", IEEE Transactions on Information Theory 51 (2005): 473--478
- Christian K. Machens, "Adaptive sampling by information maximization", physics/0112070
- David J. C. MacKay, Information Theory, Inference and Learning Algorithms [Online version]
- Donald MacKay, Information, Mechanism and Meaning [Trying to coax some notion of "meaning" out of information theory; I've not read it yet, but MacKay was quite good.]
- Mokshay Madiman and Prasad Tetali, "Information Inequalities for Joint Distributions, With Interpretations and Applications", IEEE Transactions on Information Theory 56 (2010): 2699--2713, arxiv:0901.0044
- Andrew J. Majda, Rafail V. Abramov and Marcus J. Grote, Information Theory and Stochastics for Multiscale Nonlinear Systems [PDF draft?]
- David Malone and Wayne J. Sullivan, "Guesswork and Entropy", IEEE Transactions on Information Theory 50 (2004): 525--526
- Manolo Martinez, "Representations Are Rate-Distortion Sweet Spots", Philosophy of Science 86 (2019): 1214--1226
- Emin Martinian, Gregory W. Wornell and Ram Zamir, "Source Coding with Encoder Side Information", cs.IT/0512112
- Robert J. McEliece, The Theory of Information and Coding
- Clara Meister, Tiago Pimentel, Gian Wiher, Ryan Cotterell, "Locally Typical Sampling", arxiv:2202.00666
- N. Merhav and M. J. Weinberger, "On Universal Simulation of Information Sources Using Training Data", IEEE Transactions on Information Theory 50 (2004): 5--20; with Addendum, IEEE Transactions on Information Theory 51(2005): 3381--3383
- E. Meron and M. Feder, "Finite-Memory Universal Prediction of Individual Sequences", IEEE Transactions on Information Theory 50 (2004): 1506--1523
- Patrick Mitran, "Typical Sequences for Polish Alphabets",arxiv:1005.2321
- Roberto Monetti, Wolfram Bunk, Thomas Aschenbrenner and Ferdinand Jamitzky, "Characterizing synchronization in time series using information measures extracted from symbolic representations", Physical Review E 79 (2009): 046207
- Guido Montufar, Johannes Rauh, Nihat Ay, "Maximal Information Divergence from Statistical Models defined by Neural Networks", arxiv:1303.0268
- Ziad Naja, Florence Alberge and P. Duhamel, "Geometrical interpretation and improvements of the Blahut-Arimoto's algorithm", arxiv:1001.1915
- Ilya Nemenman, "Information theory, multivariate dependence, and genetic network inference", q-bio.QM/0406015
- E. Ordentlich and M. J. Weinberger, "A Distribution Dependent Refinement of Pinsker's Inequality", IEEE Transactions on Information Theory 51 (2005): 1836--1840 [As you know, Bob, Pinsker's inequality uses the total variation distance between two distributions to put a lower bound on their Kullback-Leibler divergence.]
- Leandro Pardo, Statistical Inference Based on Divergence Measures
- Milan Palus, "Coarse-grained entropy rate for characterization of complex time series", Physica D 93 (1996): 64--77 [Thanks to Prof. Palus for a reprint]
- Hanchuan Peng, Fuhui Long and Chris Ding, "Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy", IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005): 1226--1238 [This sounds like an idea I had in 2002, and was too dumb/lazy to follow up on.]
- Denes Petz, "Entropy, von Neumann and the von Neumann Entropy,"math-ph/0102013
- C.-E. Pfister and W. G. Sullivan, "Renyi entropy, guesswork moments, and large deviations", IEEE Transactions on Information Theory 50 (2004): 2794--2800
- Hong Qian, "Relative Entropy: Free Energy Associated with Equilibrium Fluctuations and Nonequilibrium Deviations", math-ph/0007010= Physical Review E 63 (2001): 042103
- Yaron Rachlin, Rohit Negi and Pradeep Khosla, "Sensing Capacity for Markov Random Fields", cs.IT/0508054
- Maxim Raginsky
- "Achievability results for statistical learning under communication constraints", arxiv:0901.1905
- "Joint universal lossy coding and identification of stationary mixing sources with general alphabets", arxiv:0901.1904
- M. Rao, Y. Chen, B. C. Vemuri and F. Wang, "Cumulative Residual Entropy: A New Measure of Information", IEEE Transactions on Information Theory 50 (2004): 1220--1228
- Johannes Rauh, "Optimally approximating exponential families", arxiv:1111.0483
- Juan Ramón Rico-Juan, Jorge Calera-Rubio and Rafael C. Carrasco, "Smoothing and compression with stochastic k-testable tree languages", Pattern Recognition 38 (2005): 1420--1430
- Reuven Y. Rubinstein, "A Stochastic Minimum Cross-Entropy Method for Combinatorial Optimization and Rare-event Estimation", Methodology and Computing in Applied Probability 7 (2005): 5--50
- Reuven Y. Rubinstein and Dirk P. Kroese, The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning,
- Boris Ryabko, "Applications of Universal Source Coding to Statistical Analysis of Time Series", arxiv:0809.1226
- Boris Ryabko and Jaakko Astola
- "Prediction of Large Alphabet Processes and Its Application to Adaptive Source Coding", cs.IT/0504079
- "Universal Codes as a Basis for Time Series Testing", cs.IT/0602084
- "Universal Codes as a Basis for Nonparametric Testing of Serial Independence for Time Series", cs.IT/0506094
- B. Ya. Ryabko and V. A. Monarev, "Using information theory approach to randomness testing", Journal of Statistical Planning and Inference 133 (2005): 95--110
- Daniil Ryabko, "Characterizing predictable classes of processes", arxiv:0905.4341
- Ines Samengo, "Information loss in an optimal maximum likelihood decoding," physics/0110074
- Thomas Schürmann, Peter Grassberger, "The predictability of letters in written English", Fractals 4 (1996): 1--5, arxiv:0710.4516 [Shades of Zellig Harris]
- Jacek Serafin, "Finitary Codes, a short survey",math.DS/0608252
- Gadiel Seroussi, "On the number of t-ary trees with a given path length", cs.DM/0509046 ["the number of ttt-ary trees with path length ppp estimates the number of universal types, or, equivalently, the number of different possible Lempel-Ziv'78 dictionaries for sequences of length ppp over an alphabet of size ttt."]
- Ofer Shayevitz, "A Note on a Characterization of Rényi Measures and its Relation to Composite Hypothesis Testing", arxiv:1012.4401 ["The R\'enyi information measures are characterized in terms of their Shannon counterparts, and properties of the former are recovered from first principle via the associated properties of the latter."]
- Rajesh Sundaresan, "Guessing under source uncertainty",cs.IT/0603064
- Joe Suzuki, "On Strong Consistency of Model Selection in Classification", IEEE Transactions on Information Theory 52 (2006): 4767--4774 [Based on information-theoretic criteria]
- H. Takashashi, "Redundancy of Universal Coding, Kolmogorov Complexity, and Hausdorff Dimension", IEEE Transactions on Information Theory 50 (2004): 2727--2736
- Vincent Y. F. Tan, Animashree Anandkumar, Lang Tong and Alan S. Willsky, "A Large-Deviation Analysis of the Maximum-Likelihood Learning of Markov Tree Structures", IEEE Transactions on Information Theory 57 (2011): 1714--1735, arxiv:0905.0940[Large deviations for Chow-Liu trees]
- Inder Jeet Taneja
- Generalized Information Measures and Their Applications
- "Inequalities Among Symmetric divergence Measures and Their Refinement", math.ST/0501303
- R. Timo, K. Blackmore and L. Hanlen, "Word-Valued Sources: An Ergodic Theorem, an AEP, and the Conservation of Entropy", IEEE Transactions on Information Theory 56 (2010): 3139--3148
- Pierre Tisseur, "A bilateral version of the Shannon-McMillan-Breiman Theorem", math.DS/0312125
- Gasper Tkacik, "From statistical mechanics to information theory: understanding biophysical information-processing systems", arxiv:1006.4291
- Marc M. Van Hulle, "Edgeworth Approximation of Multivariate Differential Entropy", Neural Computation 17 (2005): 1903--1910
- Nguyen Xuan Vinh, Julien Epps, James Bailey, "Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance", Journal of Machine Learning Research 11 (2010): 2837--2854
- Bin Wang, "Minimum Entropy Approach to Word Segmentation Problems," physics/0008232
- Q. Wang, S. R. Kulkarni, and S. Verdu, "Divergence Estimation of Continuous Distributions Based on Data-Dependent Partitions", IEEE Transactions on Information Theory 51 (2005): 3064--3074 [Sounds cool]
- Watanabe, Knowing and Guessing
- Edward D. Weinberger
- "A Theory of Pragmatic Information and Its Application to the Quasispecies Model of Biological Evolution,"nlin.AO/0105030
- "A Generalization of the Shannon-McMillan-Breiman Theorem and the Kelly Criterion Leading to a Definition of Pragmatic Information", arxiv:0903.2243
- T. Weissman and N. Merhav, "On Causal Source Codes With Side Information", IEEE Transactions on Information Theory 51 (2005): 4003--4013
- Jonas Wildberger, Siyuan Guo, Arnab Bhattacharyya, Bernhard Schölkopf, "On the Interventional Kullback-Leibler Divergence", arxiv:2302.05380
- Paul L. Williams and Randall D. Beer, "Nonnegative Decomposition of Multivariate Information", arxiv:1004.2515
- S. Yang, A. Kavcic and S. Tatikonda, "Feedback Capacity of Finite-State Machine Channels", IEEE Transactions on Information Theory 51 (2005): 799--810
- Jiming Yu and Sergio Verdu, "Schemes for Bidirectional Modeling of Discrete Stationary Sources", IEEE Transactions on Information Theory 52 (2006): 4789--4807
- Lei Yu and Vincent Y. F. Tan, "Common Information, Noise Stability, and Their Extensions", Foundations and Trends in Communications and Information Theory 19 (2022): 107--389
- Jacob Ziv, "A Universal Prediction Lemma and Applications to Universal Data Compression and Prediction", IEEE Transactions on Information Theory 47(2001): 1528--1532
- Jacob Ziv, Neri Merhva, "On context-tree prediction of individual sequences", arxiv:cs/0508127 To write:
- CRS, "State Reconstruction and Source Coding"
- CRS, "Typical Measures of Complexity Grow Like Shannon Entropy"
Previous versions: 2005-11-09 15:33 (but first version was some time in the 1990s --- 1998?)