Information Theory
Last update: 04 Dec 2024 18:47
Imagine that someone hands you a sealed envelope, containing, say, a telegram. You want to know what the message is, but you can't just open it up and read it. Instead you have to play a game with the messenger: you get to ask yes-or-no questions about the contents of the envelope, to which he'll respond truthfully. Question: assuming this rather contrived and boring exercise is repeated many times over, and you get as clever at choosing your questions as possible, what's the smallest number of questions needed, on average, to get the contents of the message nailed down?
This question actually has an answer. Suppose there are only a finite number of messages ("Yes"; "No"; "Marry me?"; "In Reno, divorce final"; "All is known stop fly at once stop"; or just that there's a limit on the length of the messages, say a thousand characters). Then we can number the messages from 1 to \( N \). Call the message we get on this trial \( S \). Since the game is repeated many times, it makes sense to say that there's a probability \( p_i \) of getting message number \( i \) on any given trial, i.e. \( \Pr{(S=i)} = p_i \). Now, the number of yes-no questions needed to pick out any given message is, at most, \( \log{N} \), taking the logarithm to base two. (If you were allowed to ask questions with three possible answers, it'd be log to the base three. Natural logarithms would seem to imply the idea of there being 2.718... answers per question, but nonetheless make sense mathematically.) But one can do better than that: if message \( i \) is more frequent than message \( j \) (if \( p_i > p_j \) ), it makes sense to ask whether the message is \( i \) before considering the possibility that it's \( j \); you'll save time. One can in fact show, with a bit of algebra, that the smallest average number of yes-no questions is \[ -\sum_{i}{p_i\log{p_i}} ~. \] (Strictly, the best strategy may need up to one question more than this per message, but the gap vanishes if you are allowed to ask about long blocks of messages at once.) This gives us \( \log{N} \) when all the \( p_i \) are equal, which makes sense: then there are no preferred messages, and the order of asking doesn't make any difference. The sum is called, variously, the information, the information content, the self-information, the entropy or the Shannon entropy of the message, conventionally written \( H[S] \).
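If you want to check this on a toy example, here is a little sketch (the distribution over four messages is invented, and the questioning strategy is the usual Huffman construction, which is the optimal single-message scheme) comparing the entropy with the average number of questions actually asked:

```python
# A minimal sketch: compare the Shannon entropy of a made-up message distribution
# to the average number of yes/no questions used by a Huffman questioning strategy.
import heapq
from math import log2

def entropy(probs):
    """Shannon entropy in bits, H = -sum p log2 p."""
    return -sum(p * log2(p) for p in probs if p > 0)

def huffman_question_counts(probs):
    """Number of yes/no questions (codeword length) needed for each message."""
    # Each heap entry: (probability, tie-breaker, indices of messages in this subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, ms1 = heapq.heappop(heap)
        p2, _, ms2 = heapq.heappop(heap)
        for m in ms1 + ms2:   # every message in the merged subtree sits one question deeper
            lengths[m] += 1
        heapq.heappush(heap, (p1 + p2, counter, ms1 + ms2))
        counter += 1
    return lengths

# Hypothetical distribution over four messages ("Yes", "No", "Marry me?", "All is known...")
probs = [0.5, 0.25, 0.125, 0.125]
lengths = huffman_question_counts(probs)
avg_questions = sum(p * l for p, l in zip(probs, lengths))
print("H[S] =", entropy(probs), "bits")       # 1.75
print("average questions =", avg_questions)   # 1.75, since these p's are powers of 1/2
print("log2(N) =", log2(len(probs)))          # 2.0, the no-cleverness upper bound
```

For these probabilities (all powers of 1/2) the two numbers agree exactly at 1.75 bits; in general the Huffman average exceeds the entropy by less than one question.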
Now, at this point a natural and sound reaction would be to say "the mathematicians can call it what they like, but what you've described, this ridiculous guessing game, has squat-all to do with information." Alas, would that this were so: it is ridiculous, but it works. More: it was arrived at, simultaneously, by several mathematicians and engineers during World War II (among the Americans, most notably, Claude Shannon and Norbert Wiener), working on very serious and practical problems of coding, code-breaking, communication and automatic control. The real justification for regarding the entropy as the amount of information is that, unsightly though it is, though it's abstracted away all the content of the message and almost all of the context (except for the distribution over messages), it works. You can try to design a communication channel which doesn't respect the theorems of information theory; in fact, people did; you'll fail, as they did.
Of course, nothing really depends on guessing the contents of sealed envelopes; any sort of random variable will do.
The next natural extension is to say, "Well, I've got two envelopes here, and I want to know what all the messages are in both of them; how many questions will that take?" Call the two variables \( S \) and \( T \). (The case of more than two is a pretty simple extension, left to the reader's ingenuity and bored afternoons.) To find out the value of \( S \) takes \( H[S] \) questions; that of \( T \), \( H[T] \); so together we need at most \( H[S] + H[T] \) questions. But some combinations of messages may be more likely than others. If one of them is "Marry me?", the odds are good that the other is "Yes" or "No". So, by the same reasoning as before, we figure out the distribution of pairs of messages, and find its entropy, called the joint entropy, written \( H[S, T] \). Lo and behold, some algebra proves that \( H[S, T] \) is at most \( H[S] + H[T] \), and is always lower if the two variables are statistically dependent. Now suppose that we've figured out the contents of one message, \( S \) let us say (i.e. we've learned it's "Marry me?" or whatever): how many questions will it take us to find out the contents of \( T \)? This is the conditional entropy, the entropy of \( T \) given \( S \), written \( H[T|S] \), and a little thought shows it must be \( H[T, S] - H[S] \), for consistency. This finally leads us to the idea of the mutual information, written \( I[S; T] \), which is the amount we learn about \( T \) from knowing \( S \), i.e., the number of questions it saves us from having to ask, i.e., \( H[T] - H[T|S] \), which is, as it happens, always the same as \( H[S] - H[S|T] \). (Hence "mutual.") The mutual information quantifies how much one variable (say, the signal picked up by the receiver in the field) can tell us about another (say, the signal sent on the other end).
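Here is a similarly minimal sketch (with an invented joint distribution for the two envelopes) that computes all of these quantities from the table of pair probabilities and checks the identities just stated:

```python
# A minimal sketch (made-up joint distribution) checking H[S,T] <= H[S] + H[T],
# H[T|S] = H[S,T] - H[S], and I[S;T] = H[T] - H[T|S] = H[S] - H[S|T].
from math import log2

# Hypothetical joint distribution p(s, t) over pairs of messages.
joint = {("Marry me?", "Yes"): 0.20, ("Marry me?", "No"): 0.20,
         ("All is known", "Yes"): 0.05, ("All is known", "No"): 0.55}

def H(dist):
    """Shannon entropy (in bits) of a dict mapping outcomes to probabilities."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def marginal(joint, index):
    """Marginal distribution of the index-th coordinate of the pair."""
    out = {}
    for pair, p in joint.items():
        out[pair[index]] = out.get(pair[index], 0.0) + p
    return out

pS, pT = marginal(joint, 0), marginal(joint, 1)
H_S, H_T, H_ST = H(pS), H(pT), H(joint)
H_T_given_S = H_ST - H_S        # conditional entropy H[T|S]
I_ST = H_T - H_T_given_S        # mutual information I[S;T]
print(f"H[S]={H_S:.3f}  H[T]={H_T:.3f}  H[S,T]={H_ST:.3f}")
print(f"H[T|S]={H_T_given_S:.3f}  I[S;T]={I_ST:.3f}")
print("subadditivity holds:", H_ST <= H_S + H_T + 1e-12)
print("mutuality holds:", abs(I_ST - (H_S - (H_ST - H_T))) < 1e-12)
```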
I should now talk about the source and channel coding theorems, and error-correcting codes, which are remarkably counter-intuitive beasts, but I don't feel up to it.
I should also talk about the connection to Kolmogorov complexity. Roughly, the Kolmogorov complexity of a sequence of symbols is the length of the shortest computer program which will generate that sequence as its output. For certain classes of random processes, the Kolmogorov complexity per symbol converges, on average, to the entropy per symbol, which in that case is the entropy rate, the entropy of the latest symbol, conditioned on all the previous ones. This gives us a pretty profound result: random sequences are incompressible; and, conversely, an incompressible sequence looks random. In fact it turns out that one can write down formal analogs to almost all the usual theorems about information which talk, not about the entropy, but about the length of the shortest program, which is for this reason also called the algorithmic information. — And now that has a notebook of its own.
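For a concrete case where the entropy rate is easy to get at, take a stationary Markov chain: conditioning on the whole past reduces to conditioning on the current symbol, so the rate is just the entropy of each row of the transition matrix, averaged over the stationary distribution. A sketch, with a made-up two-state chain:

```python
# A minimal sketch (hypothetical two-state chain): the entropy rate of a stationary
# Markov chain is h = sum_i pi_i * H(P[i, :]), with pi the stationary distribution.
import numpy as np

P = np.array([[0.9, 0.1],    # made-up transition probabilities
              [0.4, 0.6]])

# Stationary distribution: left eigenvector of P for eigenvalue 1, normalized.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
pi = pi / pi.sum()

def H(p):
    """Shannon entropy in bits of a probability vector."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

entropy_rate = sum(pi[i] * H(P[i]) for i in range(len(pi)))
print("stationary distribution:", pi)                    # [0.8, 0.2]
print("entropy rate:", entropy_rate, "bits per symbol")  # about 0.57
```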
Norbert Wiener worked out the continuous case of the standard entropy/coding/communication channel part of information theory at the same time as Shannon was doing the discrete version; I don't know whether anything like this exists for algorithmic information theory.
In addition to the use in communications and technology, this stuff is also of some use in statistical physics (where "entropy" came from in the first place), in dynamics (where we use an infinite family of generalizations of the Shannon entropy, the Rényi entropies), and in probability and statistics generally. There are important connections to deep issues about learning and induction, though I think they're often misconceived. (Another rant for another time.) Certainly the occasional people who say "this isn't a communication channel, so you can't use information theory" are wrong.
Equally out of it are physicists who try to use gzip to measure entropy.
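For the record, the Rényi entropies mentioned above form a one-parameter family: for order \( \alpha \geq 0 \), \( \alpha \neq 1 \), \[ H_{\alpha}[S] = \frac{1}{1-\alpha}\log{\sum_{i}{p_i^{\alpha}}} ~, \] and the Shannon entropy \( H[S] \) is recovered in the limit \( \alpha \rightarrow 1 \).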
Relation to other complexity measures, computational mechanics. What are the appropriate extensions to things other than simple time-series, e.g., spatially extended systems?
See also:
- Algorithmic Information Theory
- Directed Information and Transfer Entropy
- Ergodic Theory
- Estimating Entropies and Informations
- Information Geometry
- Information Theory, Large Deviations, and Statistics
- The Minimum Description Length Principle
- Recurrence Times of Stochastic Processes
Recommended, big picture, non- or minimally technical:
- Ray and Charles Eames, A Communications Primer [Short film from, incredibly, 1953]
- J. R. Pierce, Symbols, Signals and Noise [The best non-technical book, indeed, almost the only one which isn't full of nonsense; but I must warn you he does use logarithms in a few places.]
Recommended, close-ups, non-technical:
- William Poundstone, Fortune's Formula
Recommended, big picture, technical:
- Thomas Cover and Joy Thomas, Elements of Information Theory [Is and deserves to be the standard text]
- Dave Feldman, Information Theory, Excess Entropy and Statistical Complexity [Introductory, and at a lower mathematical level than Cover and Thomas. Dave is an old, old friend, but a little log-rolling never hurt anyone...]
- R. M. Gray, Entropy and Information Theory [A mathematically rigorous treatment, using measure-theoretic probability]
- Claude Shannon and Warren Weaver, Mathematical Theory of Communication [The very first work on information theory, highly motivated by very practical problems of communication and coding; it's still interesting to read. (Weaver's essay is also of some historical interest.)]
Recommended, close-ups, asymptotic equipartition/ergodicity/convergence rates therein:
- Paul H. Algoet and Thomas M. Cover, "A Sandwich Proof of the Shannon-McMillan-Breiman Theorem", Annals of Probability 16 (1988): 899--909 [A very slick asymptotic equipartition result for relative entropy, with the usual equipartition theorem as a special case]
- Sergey Bobkov, Mokshay Madiman, "Concentration of the information in data with log-concave distributions", Annals of Probability 39 (2011): 1528--1543, arxiv:1012.5457
- Leo Breiman, "The Individual Ergodic Theorem of Information Theory",Annals of Mathematical Statistics 28 (1957): 809--811
- J.-R. Chazottes and D. Gabrielli, "Large deviations for empirical entropies of Gibbsian sources", math.PR/0406083 = Nonlinearity 18 (2005): 2545--2563 [This is a very cool result which shows that block entropies, and entropy rates estimated from those blocks, obey the large deviation principle even as one lets the length of the blocks grow with the amount of data, provided the block-length doesn't grow too quickly (only logarithmically). I wish I could write papers like this.]
- Matthew T. Harrison, "The Generalized Asymptotic Equipartition Property: Necessary and Sufficient Conditions", IEEE Transactions on Information Theory 54 (2008): 3211--3216, arxiv:0711.2666 Recommended, close-ups, applications to statistics and learning theory:
- Massimiliano Badino, "An Application of Information Theory to the Problem of the Scientific Experiment", Synthese 140 (2004): 355--389 [An interesting attempt to formulate experimental hypothesis testing in information-theoretic terms, with experiments serving as a channel between the world and the scientist. Badino makes what seems to me a very nice point, that if the source is ergodic (because, e.g., experiments are independent replicates), then almost surely a long enough sequence of experimental results will be "typical", in the sense of the asymptotic equipartition property, and so observing what your theory describes as an _a_-typical sequence is reason to reject that theory. Two problems with this, however, are that Badino assumes the theory completely specifies the probability of observations, i.e., no free parameters to be estimated from data, and he doesn't seem to be aware of any of the work relating information theory to hypothesis testing, which goes back at least to Kullback in the 1950s. I think something very interesting could be done here, about testing hypotheses on ergodic (not just IID) sources, but wonder if it hasn't been done already... MS Word preprint]
- David Balduzzi, "Information, learning and falsification",arxiv:1110.3592
- Hrayr Harutyunyan, Maxim Raginsky, Greg Ver Steeg, Aram Galstyan, "Information-theoretic generalization bounds for black-box learning algorithms", forthcoming in NeurIPS 2021, arxiv:2110.01584
- Solomon Kullback, Information Theory and Statistics
- Katalin Marton and Paul C. Shields, "Entropy and the Consistent Estimation of Joint Distributions", Annals of Probability 22 (1994): 960--977
- Jonathan Scarlett and Volkan Cevher, "An Introductory Guide to Fano's Inequality with Applications in Statistical Estimation", arxiv:1901.00555
- Jeffrey E. Steif, "Consistent estimation of joint distributions for sufficiently mixing random fields", Annals of Statistics 25 (1997): 293--304 [Extension of the Marton-Shields result to random fields in higher dimensions]
- Maxim Raginsky, "Divergence-based characterization of fundamental limitations of adaptive dynamical systems", arxiv:1010.2286
- Jorma Rissanen, Stochastic Complexity in Statistical Inquiry [Applications of coding ideas to statistical problems. Review: Less Is More, or Ecce data!]
- Abraham Wald, "Estimation of a Parameter When the Number of Unknown Parameters Increases Indefinitely with the Number of Observations",Annals of Mathematical Statistics 19 (1948): 220--227 Recommended, close-ups, information and evolution:
- Carl T. Bergstrom and Michael Lachmann, "The fitness value of information", q-bio.PE/0510007
- Hillol Kargupta, "Information Transmission in Genetic Algorithm and Shannon's Second Theorem", in Stephanie Forrest (ed.), roceedings of the 5th International Conference on Genetic Algorithms (1993)
- Olivier Rivoire and Stanislas Leibler, "The Value of Information for Populations in Varying Environments", arxiv:1010.5092 Recommended, close-ups, coordination:
- Paul Cuff, Haim Permuter and Thomas Cover, "Coordination Capacity", arxiv:0909.2408 ["...elements of a theory of cooperation and coordination in networks. Rather than considering a communication network as a means of distributing information, or of reconstructing random processes at remote nodes, we ask what dependence can be established among the nodes given the communication constraints. Specifically, in a network with communication rates ... between the nodes, we ask what is the set of all achievable joint distributions ... of actions at the nodes of the network. ... Distributed cooperation can be the solution to many problems such as distributed games, distributed control, and establishing mutual information bounds on the influence of one part of a physical system on another." But the networks considered are very simple, and a lot of them cheat by providing access to a common source of randomness. Still, an interesting direction!]
- Yu-Chi Ho, Marcia P. Kastner and Eugene Wong, "Teams, Signaling, and Information Theory", IEEE Transactions on Automatic Control 23 (1978): 305--312 [Ungated copy via Prof. Wong. Thanks to Maxim Raginsky for the reference.]
- Maxim Raginsky, "Empirical processes, typical sequences and coordinated actions in standard Borel spaces", arxiv:1009.0282 Recommended, close-ups, causality and causal inference:
- Maxim Raginsky, "Directed information and Pearl's causal calculus", arxiv:1110.0718
- Bastian Steudel and Nihat Ay, "Information-theoretic inference of common ancestors", arxiv:1010.5720
Recommended, close-ups, not otherwise, or not yet, classified:
- Riccardo Aragona, Francesca Marzi, Filippo Mignosi, Matteo Spezialetti, "Entropy and Compression: A simple proof of an inequality of Khinchin-Ornstein-Shields", Problems of Information Transmission 56 (2020), arxiv:1907.04713
- Jochen Brocker, "A Lower Bound on Arbitrary fff-Divergences in Terms of the Total Variation" arxiv:0903.1765
- Gavin Brown, Adam Pocock, Ming-Jie Zhao, Mikel Luján, "Conditional Likelihood Maximisation: A Unifying Framework for Information Theoretic Feature Selection", Journal of Machine Learning Research 13 (2012): 27--66
- Nicolo Cesa-Bianchi and Gabor Lugosi, Prediction, Learning, and Games [Mini-review]
- Karl-Erik Eriksson and Kristian Lindgren, "Entropy and Correlation in Lattice Systems", technical report, Physical Resource Theory Group, Chalmers University (n.d. but I read it in manuscript around 2000; I should see if it was ever formally published)
- Matthew T. Harrison and Ioannis Kontoyiannis, "Estimation of the Rate-Distortion Function", IEEE Transactions on Information Theory 54 (2008): 3757--3762, arxiv:cs/0702018
- Aleks Jakulin and Ivan Bratko, "Quantifying and Visualizing Attribute Interactions", cs.AI/0308002
- A. I. Khinchin, Mathematical Foundations of Information Theory [An axiomatic approach, for those who like that sort of thing]
- Ioannis Kontoyiannis, L. A. Lastras-Montano and S. P. Meyn, "Relative Entropy and Exponential Deviation Bounds for General Markov Chains", ISIT 2005 [PDF reprint via Prof. Meyn]
- Maxim Raginsky, Roummel F. Marcia, Jorge Silva and Rebecca M. Willett, "Sequential Probability Assignment via Online Convex Programming Using Exponential Families" [ISIT 2009; PDF]
- Maxim Raginsky, Rebecca M. Willett, C. Horn, Jorge Silva and Roummel F. Marcia, "Sequential anomaly detection in the presence of noise and limited feedback", IEEE Transactions on Information Theory 58 (2012): 5544--5562, arxiv:0911.2904
- Alfréd Rényi, "On Measures of Entropy and Information", pp. 547--561 in vol. I of Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability
- Rieke et al., Spikes: Exploring the Neural Code [Review: Cells that Go Ping, or, The Value of the Three-Bit Spike]
- Paul C. Shields, The Ergodic Theory of Discrete Sample Paths [Emphasis on ergodic properties relating to individual sample paths, as well as coding-theoretic arguments. Shields's page on the book.]
- Eric E. Thomson and William B. Kristan, "Quantifying Stimulus Discriminability: A Comparison of Information Theory and Ideal Observer Analysis", Neural Computation 17 (2005): 741--778 [A useful warning against a too-common abuse of information theory. Thanks to Eric for providing me with a pre-print.]
- Hugo Touchette and Seth Lloyd, "Information-Theoretic Approach to the Study of Control Systems", physics/0104007 [Rediscovery and generalization of Ashby's "law of requisite variety" from the 1950s; more applications than he gave, but a more tortuous proof]
- Tim van Erven and Peter Harremoes
- "Renyi Divergence and Its Properties", arxiv:1001.4448
- "Renyi Divergence and Kullback-Leibler Divergence", arxiv:1206.2459
- Ramon van Handel, "Ergodicity, Decisions, and Partial Information", arxiv:1208.3213
- Greg Ver Steeg, Aram Galstyan, Fei Sha, and Simon DeDeo, "Demystifying Information-Theoretic Clustering", arxiv:1310.4210
- Benjamin Weiss, Single Orbit Dynamics
Not recommended:
- Dario Benedetto, Emanuele Caglioti and Vittorio Loreto, "Language Trees and Zipping", cond-mat/0108530 [Though they're no worse than other people who use gzip as an approximation to the Kolmogorov complexity, this example, published in PRL, is especially egregious, and has called forth two separate and conclusive demolitions, cond-mat/0205521 and cond-mat/0202383]
- B. Roy Frieden, Physics from Fisher Information: A Unification [Review: Laboring to Bring Forth a Mouse]
To be shot after a fair trial [the trial will be fair; my not explaining why I list these papers here is not]:
- P. Allegrini, V. Benci, P. Grigolini, P. Hamilton, M. Ignaccolo, G. Menconi, L. Palatella, G. Raffaelli, N. Scafetta, M. Virgilio and J. Jang, "Compression and diffusion: a joint approach to detect complexity", cond-mat/0202123
- Andrea Baronchelli, Emanuele Caglioti and Vittorio Loreto, "Artificial sequences and complexity measures", Journal of Statistical Mechanics: Theory and Experiment (2005): P04002
- Hong-Da Chen, Chang-Heng Chang, Li-Ching Hsieh, and Hoong-Chien Lee, "Divergence and Shannon Information in Genomes", Physical Review Letters 94 (2005): 178103
- P. A. Varotsos, N. V. Sarlis, E. S. Skordas and M. S. Lazaridou
- "Entropy in the natural time-domain", physics/0501117= Physical Review E 70 (2004): 011106
- "Natural entropy fluctuations discriminate similar looking electric signals emitted from systems of different dynamics", physics/0501118= Physical Review E 71 (2005)
- David H. Wolpert, "Information Theory - The Bridge Connecting Bounded Rational Game Theory and Statistical Physics", cond-mat/0402508 [Frankly if it were anyone other than David saying such stuff, I wouldn't even bother to read it.] To read, history:
- Ronald R. Kline, The Cybernetics Moment: Or Why We Call Our Age the Information Age
- Jimmy Soni and Rob Goodman, A Mind at Play: How Claude Shannon Invented the Information Age
To read, applications in biology, particularly evolution/genetics:
- C. T. Bergstrom and M. Rosvall, "The transmission sense of information", arxiv:0810.4168
- David J. C. MacKay. "Rate of Information Acquisition by a Species subjected to Natural Selection" [Link] To read, entropy and divergence of dynamical systems and stochastic processes:
- Armen E. Allahverdyan, "Entropy of Hidden Markov Processes via Cycle Expansion", arxiv:0810.4341
- Claudio Bonanno, "The Manneville map: topological, metric and algorithmic entropy,"math.DS/0107195
- G. Ciuperca, V. Girardin and L. Lhote, "Computation and Estimation of Generalized Entropy Rates for Denumerable Markov Chains", IEEE Transactions on Information Theory 57 (2011): 4026--4034 [The estimation is just plugging in the MLE of the parameters, for finitely-parametrized chains, but they claim to show that works well]
- Felix Creutzig, Amir Globerson and Naftali Tishby, "Past-future information bottleneck in dynamical systems", Physical Review E 79 (2009): 041925
- Gustavo Deco and Bernd Schurmann, Information Dynamics: Foundations and Applications
- Steffen Dereich, "The quantization complexity of diffusion processes", math.PR/0411597
- Tomasz Downarowicz, Entropy in Dynamical Systems
- H. Follmer, "On entropy and information gain in random fields", Z. Wahrsh. verw. Geb. 2691973): 207--217
- Pierre Gaspard, "Time-Reversed Dynamical Entropy and Irreversibility in Markovian Random Processes", Journal of Statistical Physics 117 (2004): 599--615
- George M. Gemelos and Tsachy Weissman, "On the Entropy Rate of Pattern Processes", cs.IT/0504046
- Guangyue Han, "Limit Theorems for the Sample Entropy of Hidden Markov Chains", arxiv:1102.0365
- Guangyue Han and Brian Marcus, "Analyticity of Entropy Rate in Families of Hidden Markov Chains", math.PR/0507235
- Mohamed Kafsi, Matthias Grossglauser, Patrick Thiran, "Entropy of Conditional Markov Trajectories", arxiv:1212.2831
- Brian Marcus, Karl Petersen and Tsachy Weissman (eds.), Entropy of Hidden Markov Processes and Connections to Dynamical Systems
- Ziad Rached, Fady Alajaji and L. Lorne Campbell
- "Rényi's Divergence and Entropy Rates for Finite Alphabet Markov Sources", IEEE Transactions on Information Theory 47 (2001): 1553--1561
- "The Kullback-Leibler Divergence Rate Between Markov Sources", IEEE Transactions on Information Theory 50 (2004): 917--921
- Mohammad Rezaeian, "Hidden Markov Process: A New Representation, Entropy Rate and Estimation Entropy", cs.IT/0606114
- Wojciech Slomczynski, Dynamical Entropy, Markov Operators, and Iterated Function Systems [Many thanks to Prof. Slomczynski for sending a copy of his work]
To read, quantum information and friends:
- Robert Alicki, "Information-theoretical meaning of quantum dynamical entropy", quant-ph/0201012
- Fabio Benatti, Tyll Krueger, Markus Mueller, Rainer Siegmund-Schultze and Arleta Szkola, "Entropy and Algorithmic Complexity in Quantum Information Theory: a Quantum Brudno's Theorem", quant-ph/0506080
- Igor Bjelakovic, Tyll Krueger, Rainer Siegmund-Schultze and Arleta Szkola
- "The Shannon-McMillan Theorem for Ergodic Quantum Lattice Systems,"math.DS/0207121
- "Chained Typical Subspaces - a Quantum Version of Breiman's Theorem,"quant-ph/0301177
- Alexander Stotland, Andrei A. Pomeransky, Eitan Bachmat and Doron Cohen, "The information entropy of quantum mechanical states", quant-ph/0401021 To read, upper and lower bounds to statistical inference:
- Yifeng Chu, Maxim Raginsky, "A unified framework for information-theoretic generalization bounds", arxiv:2305.11042
- Green and Swets, Signal Detection Theory and Psychophysics
- Fredrik Hellström, Giuseppe Durisi, Benjamin Guedj, Maxim Raginsky, "Generalization Bounds: Perspectives from Information Theory and PAC-Bayes", arxiv:2309.04381
- S.-W. Ho and S. Verdu, "On the Interplay Between Conditional Entropy and Error Probability", IEEE Transactions on Information Theory 56 (2010): 5930--5942
- Oliver Johnson and Andrew Barron, "Fisher Information inequalities and the Central Limit Theorem", Probability Theory and Related Fields 129 (2004): 391--409, math.PR/0111020
- F. Liang and A. Barron, "Exact Minimax Strategies for Predictive Density Estimation, Data Compression, and Model Selection", IEEE Transactions on Information Theory 50 (2004): 2708--2726
- Eddy Mayer-Wolf and Moshe Zakai, "Some relations between mutual information and estimation error on Wiener space", math.PR/0610024
- Yuhong Yang and Andrew Barron, "Information-theoretic determination of minimax rates of convergence", Annals of Statistics 27 (1999): 1564--1599
- Tong Zhang, "Information-Theoretic Upper and Lower Bounds for Statistical Estimation", IEEE Transactions on Information Theory 52 (2006): 1307--1321
To read, information-theoretic approaches to model selection, especially variable/feature selection:
- Mario Beraha, Alberto Maria Metelli, Matteo Papini, Andrea Tirinzoni, Marcello Restelli, "Feature Selection via Mutual Information: New Theoretical Insights", arxiv:1907.07384
To read, not otherwise, or not yet, classified:
- Rudolf Ahlswede, "The final form of Tao’s inequality relating conditional expectation and conditional mutual information" [PDF preprint]
- Jose M. Amigo, Matthew B. Kennel and Ljupco Kocarev, "The permutation entropy rate equals the metric entropy rate for ergodic information sources and ergodic dynamical systems", nlin.CD/0503044
- Richard G. Baraniuk, Patrick Flandrin, Augustus J. E. M. Janssen and Olivier J. J. Michel, "Measuring Time-Frequency Information Content Using the Renyi Entropies", IEEE Transactions on Information Theory 47 (2001): 1391--1409
- Andrej Bratko, Gordon V. Cormack, Bogdan Filipic, Thomas R. Lynam, Blaz Zupan, "Spam Filtering Using Statistical Data Compression Models", Journal of Machine Learning Research 7 (2006): 2673--2698
- Paul Bohan Broderick, "On Communication and Computation",Minds and Machines 14 (2004): 1--19 ["The most famous models of computation and communication, Turing Machines and (Shannon-style) information sources, are considered. The most significant difference lies in the types of state-transitions allowed in each sort of model. This difference does not correspond to the difference that would be expected after considering the ordinary usage of these terms."]
- Joachim M. Buhmann, "Information theoretic model validation for clustering", arxiv:1006.0375
- Massimo Cencini and Alessandro Torcini, "A nonlinear marginal stability criterion for information propagation", nlin.CD/0011044
- R. Chaves, L. Luft, T. O. Maciel, D. Gross, D. Janzing, B. Schölkopf, "Inferring latent structures via information inequalities", UAI 2014, arxiv:1407.2256
- J. Chen and T. Berger, "The Capacity of Finite-State Markov Channels with Feedback", IEEE Transactions on Information Theory 51 (2005): 780--798
- Yifeng Chu, Maxim Raginsky, "Majorizing Measures, Codes, and Information", arxiv:2305.02960
- Bob Coecke, "Entropic Geometry from Logic,"quant-ph/0212065
- Imre Csiszar, "The Method of Types", IEEE Tranactions on Information Theory 44 (1998): 2505--2523 [free PDF copy]
- Imre Csiszar and Janos Korner, Information Theory: Coding Theorems for Discrete Memoryless Systems
- Łukasz Dębowski
- "On vocabulary size of grammar-based codes",cs.IT/0701047
- "Variable-Length Coding of Two-sided Asymptotically Mean Stationary Measures", Journal of Theoretical Probability 23 (2009): 237--256
- "Mixing, Ergodic, and Nonergodic Processes with Rapidly Growing Information between Blocks", arxiv:1103.3952
- "Hilberg Exponents: New Measures of Long Memory in the Process", arxiv:1403.1757
- "Is Natural Language a Perigraphic Process? The Theorem about Facts and Words Revisited", Entropy 20 (2018): 85
- Amir Dembo, "Information Inequalities and Concentration of Measure", The Annals of Probability 25 (1997): 927--939
- A. Dembo and I. Kontoyiannis, "Source Coding, Large Deviations, and Approximate Pattern Matching", math.PR/0103007
- Joseph DeStefano and Erik Learned-Miller, "A Probabilistic Upper Bound on Differential Entropy", cs.IT/0504091
- Dowe, Korb and Oliver (eds.), Information, Statistics and Induction in Science
- M. Drmota and W. Szpankowski, "Precise minimax redundancy and regret", IEEE Transactions on Information Theory 50 (2004): 2686--2707
- Ebanks, Sahoo and Sander, Characterization of Information Measures
- Werner Ebeling and Thorsten Poeschel, "Entropy and Long range correlations in literary English", cond-mat/0204108
- Karl-Erik Eriksson, Kristian Lindgren, Bengt Å. Månsson, Structure, Context, Complexity, Organization: Physical Aspects of Information and Value [The sort of title which usually makes me run away, but actually full of content]
- Roger Filliger and Max-Olivier Hongler, "Relative entropy and efficiency measure for diffusion-mediated transport processes", Journal of Physics A: Mathematical and General 38 (2005): 1247--1255
- Flocchini et al. (eds.), Structure, Information and Communication Complexity
- David V. Foster and Peter Grassberger, "Lower bounds on mutual information", Physical Review E 83 (2011): 010101
- Travis Gagie, "Compressing Probability Distributions", cs.IT/0506016 [Abstract(in full): "We show how to store good approximations of probability distributions in small space."]
- Josep Ginebra, "On the Measure of the Information in a Statistical Experiment", Bayesian Analysis 2 (2007): 167--212
- M. Godavarti and A. Hero, "Convergence of Differential Entropies", IEEE Transactions on Information Theory 50 (2004): 171--176
- Goldman, Information Theory [Old (1965) text, but has some interesting time-series stuff which has dropped out of most modern presentations]
- Alexander N. Gorban, Iliya V. Karlin and Hans Christian Ottinger, "The additive generalization of the Boltzmann entropy", cond-mat/0209319 [The abstract sounds like a rediscovery of Renyi entropies --- there's a lot of that going around --- but presumably there's more]
- Christian Gourieroux, Alain Monfort and Eric Renault, "Kullback Causality Measures", Annales d'Economie et de Statistique 6--7 (1987): 369--410 [via Maxim Raginsky]
- A. Greven, G. Keller and G. Warnecke (eds.), Entropy
- Peter Grünwald and Paul Vitányi, "Shannon Information and Kolmogorov Complexity", cs.IT/0410002
- Sudipto Guha, Andrew McGregor and Suresh Venkatasubramanian, "Streaming and Sublinear Approximation of Entropy and Information Distances", 17th ACM-SIAM Symposium on Discrete Algorithms, 2006 [Link via Suresh]
- Michael J. W. Hall, "Universal Geometric Approach to Uncertainty, Entropy and Information", physics/9903045
- Te Sun Han
- "Folklore in Source Coding: Information-Spectrum Approach", IEEE Transactions on Information Theory 51 (2005): 747--753 [From the abstract: "we verify the validity of the folklore that the output from any source encoder working at the optimal coding rate with asymptotically vanishing probability of error looks like almost completely random."]
- "An information-spectrum approach to large deviation theorems", cs.IT/0606104
- Te Sun Han and Kingo Kobayashi, Mathematics of Information and Coding
- Masahito Hayashi, "Second order asymptotics in fixed-length source coding and intrinsic randomness", cs.IT/0503089
- Nicolai T. A. Haydn, "The Central Limit Theorem for uniformly strong mixing measures", arxiv:0903.1325[With the goal being to prove that "the measures of cylinder sets are lognormally distributed", which is relevant to asymptotic equipartition]
- Nicolai Haydn and Sandro Vaienti, "Fluctuations of the Metric Entropy for Mixing Measures", Stochastics and Dynamics 4 (2004): 595--627
- D.-K. He and E.-H. Yang, "The Universality of Grammar-Based Codes for Sources With Countably Infinite Alphabets", IEEE Transactions on Information Theory 51 (2005): 3753--3765
- Torbjorn Helvik, Kristian Lindgren and Mats G. Nordahl, "Continuity of Information Transport in Surjective Cellular Automata" [thanks to Mats for a preprint]
- S.-W. Ho and R. W. Yeung, "The Interplay Between Entropy and Variational Distance", IEEE Transactions on Information Theory 56 (2010): 5906--5929
- Michael Hochman, "Upcrossing Inequalities for Stationary Sequences and Applications to Entropy and Complexity", arxiv:math.DS/0608311 [where "complexity" = algorithmic information content]
- M. Hotta and I. Jochi, "Composability and Generalized Entropy", cond-mat/9906377
- Shunsuke Ihara, Information Theory for Continuous Systems
- K. Iriyama, "Probability of Error for the Fixed-Length Lossy Coding of General Sources", IEEE Transactions on Information Theory 51 (2005): 1498--1507
- K. Iwata, K. Ikeda and H. Sakai, "A Statistical Property of Multiagent Learning Based on Markov Decision Process", IEEE Transactions on Neural Networks 17 (2006): 829--842 [The property is asymptotic equipartition!]
- Herve Jegou and Christine Guillemot, "Entropy coding with Variable Length Re-writing Systems", cs.IT/0508058 [A construction method for codes where "the marginal bit probability converges to 0.5 as the sequence length increases and this is achieved even if the probability distribution function is not known by the encoder."]
- Petr Jizba and Toshihico Arimitsu, "The world according to Renyi: Thermodynamics of multifractal systems,"cond-mat/0207707
- Oliver Johnson
- "A conditional Entropy Power Inequality for dependent variables,"math.PR/0111021
- "Entropy and a generalisation of `Poincare's Observation',"math.PR/0201273
- Ido Kanter and Hanan Rosemarin, "Communication near the channel capacity with an absence of compression: Statistical Mechanical Approach,"cond-mat/0301005
- Holger Kantz and Thomas Schuermann, "Enlarged scaling ranges for the KS-entropy and the information dimension," Chaos 6 (1996): 167--171, cond-mat/0203439
- Matthew B. Kennel, "Testing time symmetry in time series using data compression dictionaries", Physical Review E 69 (2004): 056208
- J. C. Kieffer and E.-H. Yang, "Grammar-Based Lossless Universal Refinement Source Coding", IEEE Transactions on Information Theory 50 (2004): 1415--1424
- Ioannis Kontoyiannis
- "The Complexity and Entropy of Literary Styles"
- "Model Selection via Rate-Distortion Theory"
- "Some information-theoretic computations related to the distribution of prime numbers", arxiv:0710.4076
- Bernard H. Lavenda, "Information and coding discrimination of pseudo-additive entropies (PAE)", cond-mat/0403591
- Jaeho Lee, Maxim Raginsky, "Learning finite-dimensional coding schemes with nonlinear reconstruction maps", arxiv:1812.09658
- Tue Lehn-Schioler, Anant Hegde, Deniz Erdogmus and Jose C. Principe, "Vector quantization using information theoretic concepts", Natural Computing 4 (2005): 39--51
- Kristian Lindgren, "Information Theory for Complex Systems" [Online lecture notes, January 2003]
- Niklas Lüdtke, Stefano Panzeri, Martin Brown, David S. Broomhead, Joshua Knowles, Marcelo A. Montemurro, Douglas B. Kell, "Information-theoretic sensitivity analysis: a general method for credit assignment in complex networks", Journal of the Royal Society: Interface forthcoming (2007)
- E. Lutwak, D. Yang and G. Zhang, "Cramer-Rao and Moment-Entropy Inequalities for Renyi Entropy and Generalized Fisher Information", IEEE Transactions on Information Theory 51 (2005): 473--478
- Christian K. Machens, "Adaptive sampling by information maximization", physics/0112070
- David J. C. MacKay, Information Theory, Inference and Learning Algorithms [Online version]
- Donald MacKay, Information, Mechanism and Meaning [Trying to coax some notion of "meaning" out of information theory; I've not read it yet, but MacKay was quite good.]
- Mokshay Madiman and Prasad Tetali, "Information Inequalities for Joint Distributions, With Interpretations and Applications", IEEE Transactions on Information Theory 56 (2010): 2699--2713, arxiv:0901.0044
- Andrew J. Majda, Rafail V. Abramov and Marcus J. Grote, Information Theory and Stochastics for Multiscale Nonlinear Systems [PDF draft?]
- David Malone and Wayne J. Sullivan, "Guesswork and Entropy", IEEE Transactions on Information Theory 50 (2004): 525--526
- Manolo Martinez, "Representations Are Rate-Distortion Sweet Spots", Philosophy of Science 86 (2019): 1214--1226
- Emin Martinian, Gregory W. Wornell and Ram Zamir, "Source Coding with Encoder Side Information", cs.IT/0512112
- Robert J. McEliece, The Theory of Information and Coding
- Clara Meister, Tiago Pimentel, Gian Wiher, Ryan Cotterell, "Locally Typical Sampling", arxiv:2202.00666
- N. Merhav and M. J. Weinberger, "On Universal Simulation of Information Sources Using Training Data", IEEE Transactions on Information Theory 50 (2004): 5--20; with Addendum, IEEE Transactions on Information Theory 51(2005): 3381--3383
- E. Meron and M. Feder, "Finite-Memory Universal Prediction of Individual Sequences", IEEE Transactions on Information Theory 50 (2004): 1506--1523
- Patrick Mitran, "Typical Sequences for Polish Alphabets",arxiv:1005.2321
- Roberto Monetti, Wolfram Bunk, Thomas Aschenbrenner and Ferdinand Jamitzky, "Characterizing synchronization in time series using information measures extracted from symbolic representations", Physical Review E 79 (2009): 046207
- Guido Montufar, Johannes Rauh, Nihat Ay, "Maximal Information Divergence from Statistical Models defined by Neural Networks", arxiv:1303.0268
- Ziad Naja, Florence Alberge and P. Duhamel, "Geometrical interpretation and improvements of the Blahut-Arimoto's algorithm", arxiv:1001.1915
- Ilya Nemenman, "Information theory, multivariate dependence, and genetic network inference", q-bio.QM/0406015
- E. Ordentlich and M. J. Weinberger, "A Distribution Dependent Refinement of Pinsker's Inequality", IEEE Transactions on Information Theory 51 (2005): 1836--1840 [As you know, Bob, Pinsker's inequality uses the total variation distance between two distributions to put a lower bound on their Kullback-Leibler divergence.]
- Leandro Pardo, Statistical Inference Based on Divergence Measures
- Milan Palus, "Coarse-grained entropy rate for characterization of complex time series", Physica D 93 (1996): 64--77 [Thanks to Prof. Palus for a reprint]
- Hanchuan Peng, Fuhui Long and Chris Ding, "Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy", IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005): 1226--1238 [This sounds like an idea I had in 2002, and was too dumb/lazy to follow up on.]
- Denes Petz, "Entropy, von Neumann and the von Neumann Entropy,"math-ph/0102013
- C.-E. Pfister and W. G. Sullivan, "Renyi entropy, guesswork moments, and large deviations", IEEE Transactions on Information Theory 50 (2004): 2794--2800
- Hong Qian, "Relative Entropy: Free Energy Associated with Equilibrium Fluctuations and Nonequilibrium Deviations", math-ph/0007010= Physical Review E 63 (2001): 042103
- Yaron Rachlin, Rohit Negi and Pradeep Khosla, "Sensing Capacity for Markov Random Fields", cs.IT/0508054
- Maxim Raginsky
- "Achievability results for statistical learning under communication constraints", arxiv:0901.1905
- "Joint universal lossy coding and identification of stationary mixing sources with general alphabets", arxiv:0901.1904
- M. Rao, Y. Chen, B. C. Vemuri and F. Wang, "Cumulative Residual Entropy: A New Measure of Information", IEEE Transactions on Information Theory 50 (2004): 1220--1228
- Johannes Rauh, "Optimally approximating exponential families", arxiv:1111.0483
- Juan Ramón Rico-Juan, Jorge Calera-Rubio and Rafael C. Carrasco, "Smoothing and compression with stochastic k-testable tree languages", Pattern Recognition 38 (2005): 1420--1430
- Reuven Y. Rubinstein, "A Stochastic Minimum Cross-Entropy Method for Combinatorial Optimization and Rare-event Estimation", Methodology and Computing in Applied Probability 7 (2005): 5--50
- Reuven Y. Rubinstein and Dirk P. Kroese, The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning,
- Boris Ryabko, "Applications of Universal Source Coding to Statistical Analysis of Time Series", arxiv:0809.1226
- Boris Ryabko and Jaakko Astola
- "Prediction of Large Alphabet Processes and Its Application to Adaptive Source Coding", cs.IT/0504079
- "Universal Codes as a Basis for Time Series Testing", cs.IT/0602084
- "Universal Codes as a Basis for Nonparametric Testing of Serial Independence for Time Series", cs.IT/0506094
- B. Ya. Ryabko and V. A. Monarev, "Using information theory approach to randomness testing", Journal of Statistical Planning and Inference 133 (2005): 95--110
- Daniil Ryabko, "Characterizing predictable classes of processes", arxiv:0905.4341
- Ines Samengo, "Information loss in an optimal maximum likelihood decoding," physics/0110074
- Thomas Schürmann, Peter Grassberger, "The predictability of letters in written English", Fractals 4 (1996): 1--5, arxiv:0710.4516 [Shades of Zellig Harris]
- Jacek Serafin, "Finitary Codes, a short survey",math.DS/0608252
- Gadiel Seroussi, "On the number of t-ary trees with a given path length", cs.DM/0509046 ["the number of ttt-ary trees with path length ppp estimates the number of universal types, or, equivalently, the number of different possible Lempel-Ziv'78 dictionaries for sequences of length ppp over an alphabet of size ttt."]
- Ofer Shayevitz, "A Note on a Characterization of Rényi Measures and its Relation to Composite Hypothesis Testing", arxiv:1012.4401 ["The R\'enyi information measures are characterized in terms of their Shannon counterparts, and properties of the former are recovered from first principle via the associated properties of the latter."]
- Rajesh Sundaresan, "Guessing under source uncertainty",cs.IT/0603064
- Joe Suzuki, "On Strong Consistency of Model Selection in Classification", IEEE Transactions on Information Theory 52 (2006): 4767--4774 [Based on information-theoretic criteria]
- H. Takashashi, "Redundancy of Universal Coding, Kolmogorov Complexity, and Hausdorff Dimension", IEEE Transactions on Information Theory 50 (2004): 2727--2736
- Vincent Y. F. Tan, Animashree Anandkumar, Lang Tong and Alan S. Willsky, "A Large-Deviation Analysis of the Maximum-Likelihood Learning of Markov Tree Structures", IEEE Transactions on Information Theory 57 (2011): 1714--1735, arxiv:0905.0940[Large deviations for Chow-Liu trees]
- Inder Jeet Taneja
- Generalized Information Measures and Their Applications
- "Inequalities Among Symmetric divergence Measures and Their Refinement", math.ST/0501303
- R. Timo, K. Blackmore and L. Hanlen, "Word-Valued Sources: An Ergodic Theorem, an AEP, and the Conservation of Entropy", IEEE Transactions on Information Theory 56 (2010): 3139--3148
- Pierre Tisseur, "A bilateral version of the Shannon-McMillan-Breiman Theorem", math.DS/0312125
- Gasper Tkacik, "From statistical mechanics to information theory: understanding biophysical information-processing systems", arxiv:1006.4291
- Marc M. Van Hulle, "Edgeworth Approximation of Multivariate Differential Entropy", Neural Computation 17 (2005): 1903--1910
- Nguyen Xuan Vinh, Julien Epps, James Bailey, "Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance", Journal of Machine Learning Research 11 (2010): 2837--2854
- Bin Wang, "Minimum Entropy Approach to Word Segmentation Problems," physics/0008232
- Q. Wang, S. R. Kulkarni, and S. Verdu, "Divergence Estimation of Continuous Distributions Based on Data-Dependent Partitions", IEEE Transactions on Information Theory 51 (2005): 3064--3074 [Sounds cool]
- Watanabe, Knowing and Guessing
- Edward D. Weinberger
- "A Theory of Pragmatic Information and Its Application to the Quasispecies Model of Biological Evolution,"nlin.AO/0105030
- "A Generalization of the Shannon-McMillan-Breiman Theorem and the Kelly Criterion Leading to a Definition of Pragmatic Information", arxiv:0903.2243
- T. Weissman and N. Merhav, "On Causal Source Codes With Side Information", IEEE Transactions on Information Theory 51 (2005): 4003--4013
- Jonas Wildberger, Siyuan Guo, Arnab Bhattacharyya, Bernhard Schölkopf, "On the Interventional Kullback-Leibler Divergence", arxiv:2302.05380
- Paul L. Williams and Randall D. Beer, "Nonnegative Decomposition of Multivariate Information", arxiv:1004.2515
- S. Yang, A. Kavcic and S. Tatikonda, "Feedback Capacity of Finite-State Machine Channels", IEEE Transactions on Information Theory 51 (2005): 799--810
- Jiming Yu and Sergio Verdu, "Schemes for Bidirectional Modeling of Discrete Stationary Sources", IEEE Transactions on Information Theory 52 (2006): 4789--4807
- Lei Yu and Vincent Y. F. Tan, "Common Information, Noise Stability, and Their Extensions", Foundations and Trends in Communications and Information Theory 19 (2022): 107--389
- Jacob Ziv, "A Universal Prediction Lemma and Applications to Universal Data Compression and Prediction", IEEE Transactions on Information Theory 47(2001): 1528--1532
- Jacob Ziv, Neri Merhva, "On context-tree prediction of individual sequences", arxiv:cs/0508127 To write:
- CRS, "State Reconstruction and Source Coding"
- CRS, "Typical Measures of Complexity Grow Like Shannon Entropy"
Previous versions: 2005-11-09 15:33 (but first version was some time in the 1990s --- 1998?)