The Redundancy of English (original) (raw)
Terry Ritter
ACiphers By Ritter Page
What is the redundancy of English text, and what does the "index of coincidence" have to do with it?
Contents
- 2000-10-30 JPeschel:"The index of coincidence is a statistical measure of the redundancy of text.
- 2000-10-31 John Savard:"The index of coincidence itself is the probability that two letters, from a source stream, will be the same letter. This increases as redundancy increases, but it is not itself the redundancy."
- 2000-11- 1 Bill Unruh:"There are many redundancies."
- 2000-11- 3 Douglas A. Gwyn:"For a correct answer see Shannon's paper."
- 2000-11- 3 JPeschel:"I think this is the applicable paragraph by Shannon."
- 2000-11- 3 Matt Mahoney:"In 1950 Shannon estimated the entropy of written English to be 0.6 to 1.3 bits per character . . . ."
Subject: Re: Calculating the redudancy of english? Date: 30 Oct 2000 23:57:35 GMT From: jpeschel@aol.commune.org (JPeschel) Message-ID: 20001030185735.02982.00000131@ng-cs1.aol.com References: 8tktfv$cem$1@nnrp1.deja.com Newsgroups: sci.crypt Lines: 35
Simon Johnson pabalo@dimension.h3o.org writes:
What is the index of coincidence, Applied crypto doesn't seem to give enough info for me to estimate the redudanct of english.
The index of coincidence is a statistical measure of the redundancy of text. Around 1920, William Friedman published the article, "The Index of Coincidence and its Applications in Cryptography."
http://www.aegeanparkpress.com/books_by_author.html
The expected IC of English is around .667.
Instead of trying to write the equation here, I'll refer you to:
http://members.fortunecity.com/jpeschel/gillog1.htm
where you'll find the equation.
On this page, Jim uses the IC to do a ciphertext-only crack of Enigma messages.
You'll also find C source for an IC program on the "Historic" page of my web site.
Joe
Joe Peschel D.O.E. SysWorks http://members.aol.com/jpeschel/index.htm
Subject: Re: Calculating the redudancy of english? Date: Tue, 31 Oct 2000 01:11:04 GMT From: jsavard@fNrOeSePnAeMt.edmonton.ab.ca (John Savard) Message-ID: 39fe170f.1005468@news.powersurfr.com References: 8tktfv$cem$1@nnrp1.deja.com 20001030171002.18842.00001199@ng-fi1.aol.com Newsgroups: sci.crypt Lines: 79
On Mon, 30 Oct 2000 22:42:08 GMT, Simon Johnson pabalo@dimension.h3o.org wrote, in part:
In article 20001030171002.18842.00001199@ng-fi1.aol.com, jpeschel@aol.commune.org (JPeschel) wrote:
Simon Johnson pabalo@dimension.h3o.org writes:
How does one calculate the redudancy of english?
Use the index of coincidence.
What is the index of coincidence, Applied crypto doesn't seem to give enough info for me to estimate the redudanct of english.
No; Applied Cryptography doesn't discuss things like letter frequency because it is mostly used in attacking pencil and paper ciphers, which are the sort of things your "kid sister" might solve as puzzles - and which are explicitly excluded from coverage in the book.
The index of coincidence, however, will only tell you how redundant English text, encrypted by a secure method of letter transposition, is. The actual redundancy of English text cannot be measured, only lower bounds can be set for it: its single-letter frequencies indicate it must have this much redundancy, its digraph frequencies indicate it must also have this much more, its word frequencies indicate it must have this much more yet.
Thus, getting the "real" redundancy of the English language requires you construct a mathematical model that only generates sensible English text, but isn't limited to only a fraction of the possible English texts.
Still, a formula is required to calculate each one of those lower bounds. And it's something even my web page does not describe. (If you were a mathematician, though, my page on optimized Morse armor would give you enough clues to reinvent it.) "Cryptography: A Primer", by Alan Konheim, does describe it, and so does "Elementary Cryptanalysis", by Abraham Sinkov, and so does "The Codebreakers", by David Kahn.
The index of coincidence itself is the probability that two letters, from a source stream, will be the same letter. This increases as redundancy increases, but it is not itself the redundancy.
Its formula is:
( the sum, over all possible letters i, of n(i) times (n(i) - 1) ) divided by ( N times ( N-1 ) )
where N is the number of letters in a message, and n(i) is the number of times the letter whose index is i occurs in the message.
Given a probability distribution, rather than a specific text, one would use the formula
the sum, over all possible letters i, of ( p(i) squared )
where p(i) is the probability of the letter whose index is i.
But for the redundancy, what you want is the entropy of the source stream.
Each letter of an English text consumes logbase 2 bits, or logbase 10 digits, of bandwidth.
The amount of bandwidth English text deserves, based only on letter frequencies, would be:
the sum, over all possible letters i, of ( p(i) * log( 1/p(i) ) )
where p(i) is the probability of the letter whose index is i.
And the redundancy of English text is given by
1 - ( deserved bandwith / consumed bandwidth ).
So this is how to calculate redundancy from frequency tables.
John Savard http://home.ecn.ab.ca/~jsavard/crypto.htm
Subject: Re: Calculating the redudancy of english? Date: 1 Nov 2000 21:35:43 GMT From: unruh@physics.ubc.ca (Bill Unruh) Message-ID: 8tq2bf$2dt$1@nntp.itservices.ubc.ca References: 8tkosd$84d$1@nnrp1.deja.com Newsgroups: sci.crypt Lines: 11
In 8tkosd$84d$1@nnrp1.deja.com Simon Johnson pabalo@dimension.h3o.org writes:
How does one calculate the redudancy of english?
Depends on what you mean. There are many redundancies. Eg, take the trigrams and calculate Sum Pi ln Pi for the trigrams. or take the frequencies of individual letters and calculate the same thing.
Then you have to decide, are you going to take a typical passage (in which many words occur far far far more frequently than others) or a dictionary ( where all words occur equally-- once).
Subject: Re: Calculating the redudancy of english? Date: Fri, 3 Nov 2000 15:14:43 GMT From: "Douglas A. Gwyn" gwyn@arl.army.mil Message-ID: 3A02D663.20AE4BB5@arl.army.mil References: 20001030171002.18842.00001199@ng-fi1.aol.com 8tkosd$84d$1@nnrp1.deja.com Newsgroups: sci.crypt
JPeschel wrote:
Simon Johnson pabalo@dimension.h3o.org writes:
How does one calculate the redudancy of english? Use the index of coincidence.
No.
For a correct answer see Shannon's paper.
Subject: Re: Calculating the redudancy of english? Date: 03 Nov 2000 20:48:55 GMT From: jpeschel@aol.commune.org (JPeschel) Message-ID: 20001103154855.06266.00001078@ng-fi1.aol.com References: 3A02D663.20AE4BB5@arl.army.mil Newsgroups: sci.crypt Lines: 48
Douglas A. Gwyn writes:
JPeschel wrote:
Simon Johnson pabalo@dimension.h3o.org writes:
How does one calculate the redudancy of english? Use the index of coincidence.
No.
For a correct answer see Shannon's paper.
Oops. Thanks, Doug. I think this is the applicable paragraph by Shannon.
"The ratio of the entropy of a source to the maximum value it could have while still restricted to the same symbols will be called its relative entropy. This is the maximum compression possible when we encode into the same alphabet. One minus the relative entropy is the redundancy. The redundancy of ordinary English, not considering statistical structure over greater distances than about eight letters, is roughly 50%. This means that when we write English half of what we write is determined by the structure of the language and half is chosen freely. The figure 50% was found by several independent methods which all gave results in this neighborhood. One is by calculation of the entropy of the approximations to English. A second method is to delete a certain fraction of the letters from a sample of English text and then let someone attempt to restore them. If they can be restored when 50% are deleted the redundancy must be greater than 50%. A third method depends on certain known results in cryptography."
I'm off to re-read Finnegan's Wake now. :-)
Joe
Joe Peschel D.O.E. SysWorks http://members.aol.com/jpeschel/index.htm
Subject: Re: Calculating the redudancy of english? Date: Fri, 03 Nov 2000 21:21:11 -0500 From: Matt Mahoney matmahoney@yahoo.com Message-ID: 3A037297.12ED6CB5@yahoo.com References: 8tkosd$84d$1@nnrp1.deja.com Newsgroups: sci.crypt Lines: 22
Simon Johnson wrote:
How does one calculate the redudancy of english?
In 1950 Shannon estimated the entropy of written English to be 0.6 to 1.3 bits per character, based on how well people can predict successive characters in text. Cover and King in 1978 estimated 1.3 bpc using a gambling game in which people placed bets on the next letter rather than just ranking them. I believe that this method is flawed because people tend to overestimate the odds of unlikely events (e.g. lotteries), which would overestimate the entropy. I did some simulations of the Shannon game which suggest a value around 1.0 bpc. (See http://cs.fit.edu/~mmahoney/dissertation/entropy1.html).
I'm surprised that more research hasn't been done on this question. But until we solve the AI problem, we have to use human subjects, because a statistical model of English would be equivalent to passing the Turing test. The best model to date is one developed by Rosenfeld in 1996 on hundreds of megabytes of text at 1.2 bpc. Among text compressors, RK is probably the best at 1.8-1.9 bpc. Gzip gets about 3 bpc.
-- Matt Mahoney, matmahoney@yahoo.com, http://cs.fit.edu/~mmahoney/
Terry Ritter, hiscurrent address, and histop page.
Last updated: 2001-06-27