Kullback Leibler (KL) Divergence (original) (raw)

Last Updated : 11 Dec, 2025

Kullback Leibler Divergence is a measure from information theory that quantifies the difference between two probability distributions.

It tells us how much information is lost when we approximate a true distribution P with another distribution Q.
KL divergence is also called relative entropy and is non negative and asymmetric D_{\mathrm{KL}}(P \parallel Q) \neq D_{\mathrm{KL}}(Q \parallel P).
It measures the extra number of bits needed to encode data from P if we use a code optimized for Q instead of the true distribution P.

kullback_leibler_divergence

Divergence Graph

Mathematical Implementation

Mathematical Implementation of KL Divergence for discrete and continuous distributions:

**1. Discrete Distributions:

For two discrete probability distributions P = {p1, p2, ..., pn} and {q1, q2, ...., qn} over the same set:

D_{\mathrm{KL}}(P \parallel Q) = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i}

Step by step:

For each outcome i, compute pi log(pi / qi).
Sum all these terms to get the total KL divergence.

**2. Continuous Distributions:

For continuous probability density functions p(x) and q(x):

D_{\mathrm{KL}}(P \parallel Q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx

Integration replaces summation for continuous variables.
Gives the expected extra information in nats or bits required when assuming q(x) instead of p(x).

**Properties

Properties of KL Divergence are:

**1. Non Negativity: KL divergence is always non negative and equals zero if and only if P=Q almost everywhere.

D_{\mathrm{KL}}(P \parallel Q) \ge 0

**2. Asymmetry: KL divergence is not symmetric so it is not a true distance metric.

D_{\mathrm{KL}}(P \parallel Q) \neq D_{\mathrm{KL}}(Q \parallel P)

**3. Additivity for Independent Distributions: If X and Y are independent:

D_{\mathrm{KL}}(P_{X,Y} \parallel Q_{X,Y}) = D_{\mathrm{KL}}(P_X \parallel Q_X) + D_{\mathrm{KL}}(P_Y \parallel Q_Y)

**4. Invariance under Parameter Transformations: KL divergence remains the same under bijective transformations of the random variable.

**5. Expectation Form: It can be interpreted as the expected logarithmic difference between probabilities under P and Q.

D_{\mathrm{KL}}(P \parallel Q) = \mathbb{E}_{x \sim P} \Big[ \log \frac{P(x)}{Q(x)} \Big]

**Implementation

Suppose there are two boxes that contain 4 types of balls (green, blue, red, yellow). A ball is drawn from the box randomly having the given probabilities. Our task is to calculate the difference of distributions of two boxes i.e KL divergence.

Step 1: Probability Distributions

Defining the probability distributions:

box_1 and box_2 are two discrete probability distributions.
Each value represents the probability of picking a colored ball from a box: green, blue, red, yellow.
For example, in box_1, the probability of picking a green ball is 0.25. Python `

box =[P(green),P(blue),P(red),P(yellow)]

box_1 = [0.25, 0.33, 0.23, 0.19] box_2 = [0.21, 0.21, 0.32, 0.26]

Step 2: Import Libraries

Importing libraries like Numpy and rel_entr from Scipy.

Python `

import numpy as np from scipy.special import rel_entr

Step 3: Custom KL Divergence Function

Defining a custom KL divergence function:

**1. Formula used:

D_{\mathrm{KL}}(P \parallel Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}

**2. Step by step:

Loop through each probability in distributions a and b.
Compute the term: a[i]⋅log(a[i]/b[i]).
Sum all these terms. Python `

def kl_divergence(a, b): return sum(a[i] * np.log(a[i]/b[i]) for i in range(len(a)))

Step 4: Calculate KL Divergence Manually

Calculating KL divergence manually:

kl_divergence(box_1, box_2) calculates D_{\mathrm{KL}}(\text{box}_1 \parallel \text{box}_2)
kl_divergence(box_2, box_1) calculates D_{\mathrm{KL}}(\text{box}_2 \parallel \text{box}_1)
KL divergence is asymmetric so the two results are usually different. Python `

print('KL-divergence(box_1 || box_2): %.3f ' % kl_divergence(box_1,box_2)) print('KL-divergence(box_2 || box_1): %.3f ' % kl_divergence(box_2,box_1))

Step 5: KL Divergence of a Distribution with Itself

Here, D_{\mathrm{KL}}(P \parallel P) = 0 for any distribution P.
This is because the distribution is identical so there is no divergence. Python `

D( p || p) =0

print('KL-divergence(box_1 || box_1): %.3f ' % kl_divergence(box_1,box_1))

**Output:

KL-divergence(box_1 || box_2): 0.057
KL-divergence(box_2 || box_1): 0.056
KL-divergence(box_1 || box_1): 0.000

Step 6: Use Scipy's `rel_entr` Function

Using Scipy's module to compute KL Divergence.

rel_entr(a, b) computes the element wise KL divergence term:

\text{rel\_entr}(a_i, b_i) = a_i \log \frac{a_i}{b_i}

sum(rel_entr(box_1, box_2)) sums all the terms to get the total KL divergence.
Using rel_entr is more efficient and avoids manually looping over the array.
Results from rel_entr should match the manual calculation. Python `

print("Using Scipy rel_entr function") box_1 = np.array(box_1) box_2 = np.array(box_2)

print('KL-divergence(box_1 || box_2): %.3f ' % sum(rel_entr(box_1,box_2))) print('KL-divergence(box_2 || box_1): %.3f ' % sum(rel_entr(box_2,box_1))) print('KL-divergence(box_1 || box_1): %.3f ' % sum(rel_entr(box_1,box_1)))

**Output:

Using Scipy rel_entr function
KL-divergence(box_1 || box_2): 0.057
KL-divergence(box_2 || box_1): 0.056
KL-divergence(box_1 || box_1): 0.000

**Applications

Some of the applications of KL Divergence are:

**Information Theory: Quantifies how much information is lost when one probability distribution is used to approximate another.
**Machine Learning: Forms the basis of loss functions like cross entropy is used in variational autoencoders (VAEs) and improves classification accuracy.
**Natural Language Processing: Supports language modeling, word embedding comparisons and topic modeling approaches such as Latent Dirichlet Allocation (LDA).
**Computer Vision: Used in VAEs, GANs and recognition systems to align generated data with real world distributions.
**Anomaly Detection: Identifies unusual or suspicious patterns by measuring distribution shifts, helpful in fraud detection and cybersecurity.

Use of KL Divergence in AI

Some specific use cases of KL Divergence in AI are:

**Probabilistic Models: Aligns learned distributions with target distributions like in Variational Autoencoders (VAEs).
**Reinforcement Learning: Stabilizes policy updates in algorithms like PPO by limiting divergence from previous policies.
**Language Models: Guides token probability distributions during fine-tuning and model distillation.
**Generative Models: Measures how closely generated data matches real data distributions.

KL Divergence vs Other Distance Measures

Comparison table of KL Divergence with Other Distance Measures:

Measure	Symmetry	Range	Interpretation	Use Cases
KL Divergence	No	**[0, ∞)	Measures how much information is lost when one distribution approximates another	VAEs, PPO, NLP, anomaly detection
Jensen Shannon Divergence	Yes	[0, 1] (normalized)	Smoothed, symmetric version of KL	GANs, text similarity
Hellinger Distance	Yes	[0, 1]	Measures geometric similarity between distributions	Probability comparisons, clustering
Total Variation Distance	Yes	[0, 1]	Maximum difference between probabilities over all events	Robust statistics, hypothesis testing
Wasserstein Distance	Yes	[0, ∞)	Minimum “cost” of transforming one distribution into another	GANs (WGAN), image generation

Limitations

Some of the limitations of KL Divergence are:

**Asymmetry: KL divergence is not symmetric (KL(P∣∣Q) \neq (KL(Q∣∣P)) so the “distance” from P to Q is not the same as from Q to P. This makes interpretation harder compared to true metrics.
**Infinite Values: If distribution Q(x)=0 in places where P(x)>0, the divergence becomes infinite. This can cause issues in practice especially with sparse or imperfectly estimated distributions.
**Support Mismatch Sensitivity: KL requires Q to have nonzero probability wherever P does. If the supports don’t overlap well, the measure breaks down or becomes unstable.
**Not a True Distance Metric: KL doesn’t satisfy properties like symmetry and triangle inequality so it cannot be used directly as a “distance” in geometric sense.
**Mode Seeking Behavior: Minimizing KL(P∣∣Q) tends to make Q focus only on the most likely regions of P and ignore less probable regions which can cause problems in generative modeling.