Normal, Chi-Square and Kolmogorov-Smirnov Statistics Functions in JavaScript
Numerical Computations for Cryptography
Computations of combinatoric and statistics functions and inverses which deliver good accuracy over a wide range of values. Accuracy tests allow the functions to be checked in any computing environment.
A Ciphers By Ritter Page
Terry Ritter
Last Update: 1998 June 24
Please send comments and suggestions for improvement to: ritter@ciphersbyritter.com. You may wish to help support this work by patronizing Ritter's Crypto Bookshop.
Contents
- Hypothesis Testing
- Continuous Statistic Distributions
- The Meaning of Statistic Distributions
- Normal
- Chi Square -- compare binned distribution counts
- Kolmogorov-Smirnov -- compare distributions without using bins
Related Pages
- Base Conversion -- base 2..64 to decimal and decimal to base 2..64
- Logs
- Powers
- Permutations -- n things taken k at a time, order matters
- Combinations -- n things taken k at a time, order does not matter
- Binomial -- for success probability p and n trials, the probability of k successes
- Bit Changes -- bit changes from keyed invertible substitution tables or ciphers
- Poisson -- given mean u, the probability of k successes
Hypothesis Testing
One role of statistics is to develop evidence that something _is_ or is not performing as expected. Typically we have many objects, any one of which may be selected at random and then measured. Over many samples, measured values may recur with some fixed probability, and we can construct a frequency distribution or density function experimentally. If we can form a theory or hypothesis of what that distribution should be, we can compare the theoretical distribution to the experimental one. And if the distributions seem to be the same, we can have some confidence that we know what is going on: We have "proven" the hypothesis of a particular distribution.
To compare distributions, we compute some sort of "goodness of fit" statistic on the samples. Now, the experimental objects will have a distribution of values, but the test statistic has its own different distribution. In "goodness of fit," the statistic distribution is usually what we expect if we compare any sample distribution to itself; it is thus the variation of random sampling alone. So if we take a number of experimental samples, we can compute a statistic value. Then we can calculate, from the known distribution for that statistic, how often that result should occur by chance alone, assuming the two distributions really are the same. And if we repeatedly find statistic values which are particularly unlikely, we can conclude that the distribution of experimental objects is not what we expected it to be.
Continuous Statistic Distributions
The normal, chi-square and Kolmogorov-Smirnov statistics are continuous distributions (in contrast to the binomial and Poisson discrete distributions). Because continuous statistics are not limited to discrete values, there is almost no probability that a particular precise value will occur. We ask, therefore, about the probability of getting a particular value _or less_, or the value or more.
The probability of a particular value or less is just the area under the probability "density" curve to the left of the value; this is the "left tail." These statistics are normalized so that the total probability is 1; if we have the area to the left of a value, we can easily find the amount to the right by subtracting what we have from 1. And we can find the area between any two values by finding the area to the left of the larger value, and then subtracting the area to the left of the smaller value. The same statistic probability distributions can be described from either tail, but I try to uniformly use the left tail, which is also the "cumulative distribution function" or c.d.f. The c.d.f. is just the accumulated sum or integral of the density function; it gives the area to the left of the parameter value. Whether we describe a distribution from the left tail or the right tail really does not matter as long as we understand what the associated probability means.
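As a small sketch of that bookkeeping (assuming some left-tail c.d.f. routine is already available; "cdf" below is just a placeholder parameter, not any particular function on this page):

    // Deriving other areas from any left-tail c.d.f.

    // Probability of a value greater than x: the "right tail."
    function rightTail(cdf, x) {
      return 1 - cdf(x);
    }

    // Probability of a value falling between lo and hi (lo <= hi).
    function areaBetween(cdf, lo, hi) {
      return cdf(hi) - cdf(lo);
    }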
The Meaning of Statistic Distributions
The probability distribution for a statistic is usually what we expect to see when we sample a known distribution. Over many independent trials which sample ideal data for that statistic, we expect to find that about half of the trials will have a statistic value below the 0.5 probability value. Similarly, about half of the trials will have a value above the 0.5 probability value.
"Critical" statistic values are often on the right tail, and may indicate, for example, that 95 out of 100 trials should be below some statistic value. But this means that we expect to get 5 trials out of 100 above that critical value, as a result of random sampling. And even though a 5 percent chance is not very high, it is to be expected, and might even occur on the very first experiment.
To the extent that any statistic value can occur with some (even if fantastically unlikely) random sampling, I claim that simply finding a relatively unlikely statistic value yields no conclusion in itself. Of course, a high cost of experimentation may simply demand some decision which is more often right than wrong. We rarely have that situation in cryptography. But if we repeatedly get many more than 5 percent of the trials in the upper 5 percent of the distribution area, this is statistical evidence that the distribution is not what we thought it was.
In addition to worrying about the upper 5 percent or 1 percent of the distribution, we also expect to see about a quarter of the samples in each quarter of the distribution. It can be important to check this, because a valid experimental procedure _must_ behave this way. If we concentrate solely on statistic values which may be too far out on the right tail (so-called "failures"), we could miss very serious problems in the experiment itself. If we repeatedly do not find about a quarter of the samples in each quarter of the distribution, we know the distribution is not what we expected it to be, even if we have very few "failures." Any deviation from what we expect means that we do not know what is going on.
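A rough sketch of that check, assuming each trial has already been reduced to its c.d.f. value (a probability between 0 and 1); the function name is only illustrative:

    // Count how many trial probabilities fall in each quarter of [0, 1].
    // Each count should come out near pValues.length / 4.
    function quarterCounts(pValues) {
      var counts = [0, 0, 0, 0];
      for (var i = 0; i < pValues.length; i++) {
        var q = Math.floor(pValues[i] * 4);
        if (q > 3) q = 3;              // a probability of exactly 1.0
        counts[q]++;
      }
      return counts;
    }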
A big part of the worth of statistics occurs when we can analyze a problem, predict the results, then have measurements agree with our predictions. Repeated agreement between theory and practice means that a lot of analysis and a lot of measurement instrumentation must be working properly. But if the results do not agree, we have only proven that we have not described what is really happening. The difference could be the result of any one of many issues in the analysis, the design of the experiment, or its conduct. So in "goodness-of-fit" testing, we only get "proof" when the two distributions are measurably the same, and _no proof at all_ when they differ.
Normal
The normal distribution is the "bell shaped" curve we associate with grading curves and other population models. The normal distribution -- or something close to it -- appears often in nature. This probably happens because the sum of many "independent identically distributed" random variables tends toward a normal distribution, regardless of the distribution of the variables taken one-by-one.
The normal is actually a whole family of curves, which differ in their mean value and in their standard deviation (the square root of the variance). Here the user can enter either the variance or standard deviation value, and the other will be updated automatically. The standardized result appears in z.
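The page's own routines are not reproduced here, but a minimal sketch of the standardization, together with one well-known approximation to the normal left-tail c.d.f. (Abramowitz and Stegun formula 7.1.26, good to about seven decimal places), might look like this:

    // Standardize a raw value against a given mean and standard deviation.
    function standardize(x, mean, sd) {
      return (x - mean) / sd;
    }

    // Approximate erf(x) (Abramowitz & Stegun 7.1.26).
    function erfApprox(x) {
      var sign = (x < 0) ? -1 : 1;
      x = Math.abs(x);
      var t = 1 / (1 + 0.3275911 * x);
      var y = 1 - (((((1.061405429 * t - 1.453152027) * t) + 1.421413741) * t
                - 0.284496736) * t + 0.254829592) * t * Math.exp(-x * x);
      return sign * y;
    }

    // Normal left-tail c.d.f. of the standardized value z.
    function normalCdf(z) {
      return 0.5 * (1 + erfApprox(z / Math.SQRT2));
    }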
Chi Square
Chi-square is the best known goodness of fit statistic. Chi-square is a way to numerically compare two sampled distributions:
- Some number of "bins" is selected, each typically covering a different but similar range of values.
- Some much larger number of independent observations are taken. Each is measured and classified in some bin, and a count for that bin is incremented.
- Each resulting bin count is compared to an expected value for that bin, based on the expected distribution. Each expectation is subtracted from the corresponding bin count, the difference is squared, then divided by the expectation:
X^2 = SUM over all bins i of ( (Observed[i] - Expected[i])^2 / Expected[i] )
The sum of all the squared normalized differences is the chi-square statistic, and the distribution depends upon the number of bins through the degrees of freedom or df. The df value is normally one less than the number of bins (though this will vary with different test structures). Ideally, we choose the number of bins and the number of samples to get at least ten counts in each bin. For distributions which trail off, it may be necessary to collect the counts (and the expectations) for some number of adjacent bins.
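A direct sketch of the formula above, given matching arrays of observed bin counts and expected bin counts:

    // Chi-square statistic from observed and expected bin counts.
    function chiSquareStatistic(observed, expected) {
      var x2 = 0;
      for (var i = 0; i < observed.length; i++) {
        var d = observed[i] - expected[i];
        x2 += d * d / expected[i];
      }
      return x2;        // compared against df, normally bins - 1
    }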
The chi-square c.d.f. tells us how often a particular value or lower would be seen when sampling the expected distribution. Ideally we expect to see chi-square values on the same order as the df value, but often we see huge values for which there really is little point in evaluating a precise probability.
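The c.d.f. routine used on this page is not shown here, but as a hedged sketch: when df is even (df = 2m) the chi-square left-tail c.d.f. has a simple closed form, 1 - exp(-x/2) * SUM(k=0..m-1) (x/2)^k / k!; odd df needs a general incomplete-gamma routine instead.

    // Chi-square left-tail c.d.f. for even df only (df = 2m).
    function chiSquareCdfEvenDf(x, df) {
      var m = df / 2;            // assumes df is a positive even integer
      var y = x / 2;
      var term = 1;              // (x/2)^0 / 0!
      var sum = 1;
      for (var k = 1; k < m; k++) {
        term *= y / k;           // builds (x/2)^k / k! incrementally
        sum += term;
      }
      return 1 - Math.exp(-y) * sum;
    }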
Kolmogorov-Smirnov
Kolmogorov-Smirnov is another goodness of fit test for comparing two distributions. Here the measurements need not be collected into "bins," but are instead re-arranged and placed in order of magnitude:
- n independent samples are collected and arranged in numerical order in array X as _x_[0].._x_[_n_-1].
- S(_x_[_j_]) is the cumulative distribution of the sample: the fraction of the n observations which are less than or equal to _x_[_j_]. In the ordered array this is just ((j+1)/n).
- F(x) is the reference cumulative distribution, the probability that a random value will be less than or equal to x. Here we want F(_x_[_j_]) for each value _x_[_j_] from the array: the fraction of the reference distribution to the left of that value.
There are actually at least three different K-S statistics, and two different distributions:
The "one-sided" statistics are:
Dn+ = MAX( S(x[j]) - F(x[j]) ) = MAX( ((j+1)/n) - F(x[j]) )
Dn- = MAX( F(x[j]) - S(x[j]) ) = MAX( F(x[j]) - (j/n) )
where "MAX" is computed over all j from 0 to _n_-1. Both of these statistics have the same distribution.
The "two-sided" K-S statistic is:
Dn* = MAX( ABS( S(x[j]) - F(x[j]) ) ) = MAX( Dn+, Dn- )
and this has a somewhat different distribution.
Knuth II multiplies Dn+, Dn- and Dn* by SQRT(n) and calls them Kn+, Kn- and Kn*, so we might say that there are at least six different K-S statistics. So what do we use?
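As a sketch of the definitions above (not the code used on this page), all of these statistics can be computed in one pass over the sorted samples; "refCdf" stands for whatever reference c.d.f. F is being tested against:

    // One-sided and two-sided K-S statistics, plus Knuth's scaled forms.
    function ksStatistics(samples, refCdf) {
      var x = samples.slice().sort(function (a, b) { return a - b; });
      var n = x.length;
      var dPlus = 0, dMinus = 0;
      for (var j = 0; j < n; j++) {
        var f = refCdf(x[j]);
        var dp = (j + 1) / n - f;        // S(x[j]) - F(x[j])
        var dm = f - j / n;              // F(x[j]) - (j/n)
        if (dp > dPlus)  dPlus = dp;
        if (dm > dMinus) dMinus = dm;
      }
      return {
        DnPlus:  dPlus,
        DnMinus: dMinus,
        DnStar:  Math.max(dPlus, dMinus),
        KnPlus:  Math.sqrt(n) * dPlus,
        KnMinus: Math.sqrt(n) * dMinus
      };
    }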
It turns out that the Dn* distribution is hard to compute with accuracy over the full range. There is a good transformation between the two distributions for values well out on the right tail, but this means we lose values for quarters of the distribution, and this is a significant loss. We also lose the ability to compute an arbitrary inverse. So, just like Knuth II, we support the "one-sided" versions.
The c.d.f. computation should give 4 good decimal places. The critical value and inverse computations iterate the c.d.f. with bisection for n under 100, which gives full accuracy, but is very, very slow. The critical values might take half a minute to come up, and the tests might well take a couple of minutes.
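The bisection idea itself is simple; a hedged sketch, assuming a monotonic left-tail c.d.f. and a bracketing interval [lo, hi] known to contain the answer:

    // Invert a monotonic c.d.f. by bisection: find x with cdf(x) ~= p.
    function inverseByBisection(cdf, p, lo, hi) {
      for (var i = 0; i < 60; i++) {     // 60 halvings give ample precision
        var mid = (lo + hi) / 2;
        if (cdf(mid) < p) lo = mid;
        else              hi = mid;
      }
      return (lo + hi) / 2;
    }

Each step halves the interval, so the slowness described above comes from the cost of the repeated c.d.f. evaluations, not from the bisection itself.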
Terry Ritter, his current address, and his top page.