Fano's inequality (original) (raw)

From Wikipedia, the free encyclopedia

Inequality applying to random variables

In information theory, Fano's inequality (also known as the Fano converse and the Fano lemma) relates the average information lost in a noisy channel to the probability of the categorization error. It was derived by Robert Fano in the early 1950s while teaching a Ph.D. seminar in information theory at MIT, and later recorded in his 1961 textbook.

It is used to find a lower bound on the error probability of any decoder as well as the lower bounds for minimax risks in density estimation.

Let the discrete random variables X {\displaystyle X} {\displaystyle X} and Y {\displaystyle Y} {\displaystyle Y} represent input and output messages with a joint probability P ( x , y ) {\displaystyle P(x,y)} {\displaystyle P(x,y)}. Let e {\displaystyle e} {\displaystyle e} represent an occurrence of error; i.e., that X ≠ X ~ {\displaystyle X\neq {\tilde {X}}} {\displaystyle X\neq {\tilde {X}}}, with X ~ = f ( Y ) {\displaystyle {\tilde {X}}=f(Y)} {\displaystyle {\tilde {X}}=f(Y)} being an approximate version of X {\displaystyle X} {\displaystyle X}. Fano's inequality is

H ( X ∣ Y ) ≤ H b ( e ) + P ( e ) log ⁡ ( | X | − 1 ) , {\displaystyle H(X\mid Y)\leq H_{b}(e)+P(e)\log(|{\mathcal {X}}|-1),} {\displaystyle H(X\mid Y)\leq H_{b}(e)+P(e)\log(|{\mathcal {X}}|-1),}

where X {\displaystyle {\mathcal {X}}} {\displaystyle {\mathcal {X}}} denotes the support of X {\displaystyle X} {\displaystyle X}, | X | {\displaystyle |{\mathcal {X}}|} {\displaystyle |{\mathcal {X}}|} denotes the cardinality of (number of elements in) X {\displaystyle {\mathcal {X}}} {\displaystyle {\mathcal {X}}},

H ( X ∣ Y ) = − ∑ i , j P ( x i , y j ) log ⁡ P ( x i ∣ y j ) {\displaystyle H(X\mid Y)=-\sum _{i,j}P(x_{i},y_{j})\log P(x_{i}\mid y_{j})} {\displaystyle H(X\mid Y)=-\sum _{i,j}P(x_{i},y_{j})\log P(x_{i}\mid y_{j})}

is the conditional entropy,

P ( e ) = P ( X ≠ X ~ ) {\displaystyle P(e)=P(X\neq {\tilde {X}})} {\displaystyle P(e)=P(X\neq {\tilde {X}})}

is the probability of the communication error, and

H b ( e ) = − P ( e ) log ⁡ P ( e ) − ( 1 − P ( e ) ) log ⁡ ( 1 − P ( e ) ) {\displaystyle H_{b}(e)=-P(e)\log P(e)-(1-P(e))\log(1-P(e))} {\displaystyle H_{b}(e)=-P(e)\log P(e)-(1-P(e))\log(1-P(e))}

is the corresponding binary entropy.

Define an indicator random variable E {\displaystyle E} {\displaystyle E}, that indicates the event that our estimate X ~ = f ( Y ) {\displaystyle {\tilde {X}}=f(Y)} {\displaystyle {\tilde {X}}=f(Y)} is in error,

E := { 1 if X ~ ≠ X , 0 if X ~ = X . {\displaystyle E:={\begin{cases}1~&{\text{ if }}~{\tilde {X}}\neq X~,\\0~&{\text{ if }}~{\tilde {X}}=X~.\end{cases}}} {\displaystyle E:={\begin{cases}1~&{\text{ if }}~{\tilde {X}}\neq X~,\\0~&{\text{ if }}~{\tilde {X}}=X~.\end{cases}}}

Consider H ( E , X | X ~ ) {\displaystyle H(E,X|{\tilde {X}})} {\displaystyle H(E,X|{\tilde {X}})}. We can use the chain rule for entropies to expand this in two different ways

H ( E , X ∣ X ~ ) = H ( X ∣ X ~ ) + H ( E ∣ X , X ~ ) ⏟ = 0 = H ( E ∣ X ~ ) + H ( X ∣ E , X ~ ) {\displaystyle {\begin{aligned}H(E,X\mid {\tilde {X}})&=H(X\mid {\tilde {X}})+\underbrace {H(E\mid X,{\tilde {X}})} _{=0}\\&=H(E\mid {\tilde {X}})+H(X\mid E,{\tilde {X}})\end{aligned}}} {\displaystyle {\begin{aligned}H(E,X\mid {\tilde {X}})&=H(X\mid {\tilde {X}})+\underbrace {H(E\mid X,{\tilde {X}})} _{=0}\\&=H(E\mid {\tilde {X}})+H(X\mid E,{\tilde {X}})\end{aligned}}}

Equating the two

H ( X ∣ X ~ ) = H ( E ∣ X ~ ) + H ( X ∣ E , X ~ ) {\displaystyle H(X\mid {\tilde {X}})=H(E\mid {\tilde {X}})+H(X\mid E,{\tilde {X}})} {\displaystyle H(X\mid {\tilde {X}})=H(E\mid {\tilde {X}})+H(X\mid E,{\tilde {X}})}

Expanding the right most term, H ( X ∣ E , X ~ ) {\displaystyle H(X\mid E,{\tilde {X}})} {\displaystyle H(X\mid E,{\tilde {X}})}

H ( X ∣ E , X ~ ) = H ( X ∣ E = 0 , X ~ ) ⏟ = 0 ⋅ P ( E = 0 ) + H ( X ∣ E = 1 , X ~ ) ⋅ P ( E = 1 ) ⏟ = P ( e ) = H ( X ∣ E = 1 , X ~ ) ⋅ P ( e ) {\displaystyle {\begin{aligned}H(X\mid E,{\tilde {X}})&=\underbrace {H(X\mid E=0,{\tilde {X}})} _{=0}\cdot P(E=0)+H(X\mid E=1,{\tilde {X}})\cdot \underbrace {P(E=1)} _{=P(e)}\\&=H(X\mid E=1,{\tilde {X}})\cdot P(e)\end{aligned}}} {\displaystyle {\begin{aligned}H(X\mid E,{\tilde {X}})&=\underbrace {H(X\mid E=0,{\tilde {X}})} _{=0}\cdot P(E=0)+H(X\mid E=1,{\tilde {X}})\cdot \underbrace {P(E=1)} _{=P(e)}\\&=H(X\mid E=1,{\tilde {X}})\cdot P(e)\end{aligned}}}

Since E = 0 {\displaystyle E=0} {\displaystyle E=0} means X = X ~ {\displaystyle X={\tilde {X}}} {\displaystyle X={\tilde {X}}}; being given the value of X ~ {\displaystyle {\tilde {X}}} {\displaystyle {\tilde {X}}} allows us to know the value of X {\displaystyle X} {\displaystyle X} with certainty. This makes the term H ( X ∣ E = 0 , X ~ ) = 0 {\displaystyle H(X\mid E=0,{\tilde {X}})=0} {\displaystyle H(X\mid E=0,{\tilde {X}})=0}. On the other hand, E = 1 {\displaystyle E=1} {\displaystyle E=1} means that X ~ ≠ X {\displaystyle {\tilde {X}}\neq X} {\displaystyle {\tilde {X}}\neq X}, hence given the value of X ~ {\displaystyle {\tilde {X}}} {\displaystyle {\tilde {X}}}, we can narrow down X {\displaystyle X} {\displaystyle X} to one of | X | − 1 {\displaystyle |{\mathcal {X}}|-1} {\displaystyle |{\mathcal {X}}|-1} different values, allowing us to upper bound the conditional entropy H ( X ∣ E = 1 , X ~ ) ≤ log ⁡ ( | X | − 1 ) {\displaystyle H(X\mid E=1,{\tilde {X}})\leq \log(|{\mathcal {X}}|-1)} {\displaystyle H(X\mid E=1,{\tilde {X}})\leq \log(|{\mathcal {X}}|-1)}. Hence

H ( X ∣ E , X ~ ) ≤ log ⁡ ( | X | − 1 ) ⋅ P ( e ) . {\displaystyle H(X\mid E,{\tilde {X}})\leq \log(|{\mathcal {X}}|-1)\cdot P(e).} {\displaystyle H(X\mid E,{\tilde {X}})\leq \log(|{\mathcal {X}}|-1)\cdot P(e).}

The other term, H ( E ∣ X ~ ) ≤ H ( E ) {\displaystyle H(E\mid {\tilde {X}})\leq H(E)} {\displaystyle H(E\mid {\tilde {X}})\leq H(E)}, because conditioning reduces entropy. Because of the way E {\displaystyle E} {\displaystyle E} is defined, H ( E ) = H b ( e ) {\displaystyle H(E)=H_{b}(e)} {\displaystyle H(E)=H_{b}(e)}, meaning that H ( E ∣ X ~ ) ≤ H b ( e ) {\displaystyle H(E\mid {\tilde {X}})\leq H_{b}(e)} {\displaystyle H(E\mid {\tilde {X}})\leq H_{b}(e)}. Putting it all together,

H ( X ∣ X ~ ) ≤ H b ( e ) + P ( e ) log ⁡ ( | X | − 1 ) {\displaystyle H(X\mid {\tilde {X}})\leq H_{b}(e)+P(e)\log(|{\mathcal {X}}|-1)} {\displaystyle H(X\mid {\tilde {X}})\leq H_{b}(e)+P(e)\log(|{\mathcal {X}}|-1)}

Because X → Y → X ~ {\displaystyle X\rightarrow Y\rightarrow {\tilde {X}}} {\displaystyle X\rightarrow Y\rightarrow {\tilde {X}}} is a Markov chain, we have I ( X ; X ~ ) ≤ I ( X ; Y ) {\displaystyle I(X;{\tilde {X}})\leq I(X;Y)} {\displaystyle I(X;{\tilde {X}})\leq I(X;Y)} by the data processing inequality, and hence H ( X ∣ X ~ ) ≥ H ( X ∣ Y ) {\displaystyle H(X\mid {\tilde {X}})\geq H(X\mid Y)} {\displaystyle H(X\mid {\tilde {X}})\geq H(X\mid Y)}, giving us

H ( X ∣ Y ) ≤ H b ( e ) + P ( e ) log ⁡ ( | X | − 1 ) {\displaystyle H(X\mid Y)\leq H_{b}(e)+P(e)\log(|{\mathcal {X}}|-1)} {\displaystyle H(X\mid Y)\leq H_{b}(e)+P(e)\log(|{\mathcal {X}}|-1)}

Fano's inequality can be interpreted as a way of dividing the uncertainty of a conditional distribution into two questions given an arbitrary predictor. The first question, corresponding to the term H b ( e ) {\displaystyle H_{b}(e)} {\displaystyle H_{b}(e)}, relates to the uncertainty of the predictor. If the prediction is correct, there is no more uncertainty remaining. If the prediction is incorrect, the uncertainty of any discrete distribution has an upper bound of the entropy of the uniform distribution over all choices besides the incorrect prediction. This has entropy log ⁡ ( | X | − 1 ) {\displaystyle \log(|{\mathcal {X}}|-1)} {\displaystyle \log(|{\mathcal {X}}|-1)}. Looking at extreme cases, if the predictor is always correct the first and second terms of the inequality are 0, and the existence of a perfect predictor implies X {\displaystyle X} {\displaystyle X} is totally determined by Y {\displaystyle Y} {\displaystyle Y}, and so H ( X | Y ) = 0 {\displaystyle H(X|Y)=0} {\displaystyle H(X|Y)=0}. If the predictor is always wrong, then the first term is 0, and H ( X ∣ Y ) {\displaystyle H(X\mid Y)} {\displaystyle H(X\mid Y)} can only be upper bounded with a uniform distribution over the remaining choices.

Alternative formulation

[edit]

Let X {\displaystyle X} {\displaystyle X} be a random variable with density equal to one of r + 1 {\displaystyle r+1} {\displaystyle r+1} possible densities f 1 , … , f r + 1 {\displaystyle f_{1},\ldots ,f_{r+1}} {\displaystyle f_{1},\ldots ,f_{r+1}}. Furthermore, the Kullback–Leibler divergence between any pair of densities cannot be too large,

D K L ( f i ∥ f j ) ≤ β {\displaystyle D_{KL}(f_{i}\parallel f_{j})\leq \beta } {\displaystyle D_{KL}(f_{i}\parallel f_{j})\leq \beta } for all i ≠ j . {\displaystyle i\not =j.} {\displaystyle i\not =j.}

Let ψ ( X ) ∈ { 1 , … , r + 1 } {\displaystyle \psi (X)\in \{1,\ldots ,r+1\}} {\displaystyle \psi (X)\in \{1,\ldots ,r+1\}} be an estimate of the index. Then

sup i P i ( ψ ( X ) ≠ i ) ≥ 1 − β + log ⁡ 2 log ⁡ r {\displaystyle \sup _{i}P_{i}(\psi (X)\not =i)\geq 1-{\frac {\beta +\log 2}{\log r}}} {\displaystyle \sup _{i}P_{i}(\psi (X)\not =i)\geq 1-{\frac {\beta +\log 2}{\log r}}}

where P i {\displaystyle P_{i}} {\displaystyle P_{i}} is the probability induced by f i {\displaystyle f_{i}} {\displaystyle f_{i}}.

The following generalization is due to Ibragimov and Khasminskii (1979), Assouad and Birge (1983).

Let F be a class of densities with a subclass of r + 1 densities ƒ θ such that for any θ ≠ _θ_′

‖ f θ − f θ ′ ‖ L 1 ≥ α , {\displaystyle \|f_{\theta }-f_{\theta '}\|_{L_{1}}\geq \alpha ,} {\displaystyle \|f_{\theta }-f_{\theta '}\|_{L_{1}}\geq \alpha ,}

D K L ( f θ ∥ f θ ′ ) ≤ β . {\displaystyle D_{KL}(f_{\theta }\parallel f_{\theta '})\leq \beta .} {\displaystyle D_{KL}(f_{\theta }\parallel f_{\theta '})\leq \beta .}

Then in the worst case the expected value of error of estimation is bound from below,

sup f ∈ F E ‖ f n − f ‖ L 1 ≥ α 2 ( 1 − n β + log ⁡ 2 log ⁡ r ) {\displaystyle \sup _{f\in \mathbf {F} }E\|f_{n}-f\|_{L_{1}}\geq {\frac {\alpha }{2}}\left(1-{\frac {n\beta +\log 2}{\log r}}\right)} {\displaystyle \sup _{f\in \mathbf {F} }E\|f_{n}-f\|_{L_{1}}\geq {\frac {\alpha }{2}}\left(1-{\frac {n\beta +\log 2}{\log r}}\right)}

where ƒ n is any density estimator based on a sample of size n.