KL Divergence measures how much one probability distribution \(P\) differs from a second, reference probability distribution \(Q\). It quantifies the "distance" or "divergence" of \(P\) from \(Q\).
Think back to the cross-entropy scenario: \(KL(P || Q)\) represents the extra average bits/nats required per message to encode events from distribution \(P\) using a code optimized for distribution \(Q\), compared to using the optimal code based on \(P\) itself. It's the "cost" or "inefficiency" incurred by using the wrong distribution \(Q\) as an approximation for \(P\).
For two discrete probability distributions \(P = \{p_1, ..., p_n\}\) and \(Q = \{q_1, ..., q_n\}\) defined over the same set of events, the Kullback-Leibler (KL) divergence of \(P\) from \(Q\) (also called the relative entropy of \(P\) with respect to \(Q\)) is:
$$ D_{KL}(P || Q) = \sum_{i=1}^n p_i \log\left(\frac{p_i}{q_i}\right) $$
Logarithm Base: The base determines the units (bits for \(\log_2\), nats for \(\ln\)). In ML contexts, \(\ln\) (nats) is standard.
Handling Zeros:
If \(p_i = 0\), the term is \(0 \log(0/q_i) = 0\).
If \(q_i = 0\) but \(p_i > 0\), \(D_{KL}\) is infinite (reflecting that \(Q\) assigns zero probability to an event that can actually happen under \(P\)). A finite divergence requires \(P\) to be absolutely continuous with respect to \(Q\): wherever \(Q\) assigns zero probability, \(P\) must too.
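The definition and the zero-handling conventions above translate directly into plain Python. A minimal sketch in nats (the distributions `P` and `Q` are made up for illustration):

```python
import math

def kl_divergence(p, q):
    """KL divergence D_KL(P || Q) in nats for discrete distributions.

    Terms with p_i = 0 contribute 0; q_i = 0 with p_i > 0 gives infinity.
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue              # 0 * log(0 / q_i) is taken as 0
        if q_i == 0:
            return float("inf")   # P has mass where Q assigns none
        total += p_i * math.log(p_i / q_i)
    return total

P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]
print(kl_divergence(P, Q))   # small positive value
print(kl_divergence(P, P))   # -> 0.0, identical distributions
```

In practice you would typically use a library routine (e.g. `scipy.special.rel_entr`), but the hand-rolled version makes the edge cases explicit.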
1. Interpretation: Information Gain / Inefficiency
\(D_{KL}(P || Q)\) measures the information lost when \(Q\) is used to approximate \(P\).
Equivalently, it's the extra information needed to specify outcomes from \(P\) when only knowledge of \(Q\) is available.
It's the average difference between the "surprise" under the approximate distribution (\(-\log q_i\)) and the "surprise" under the true distribution (\(-\log p_i\)), weighted by the true probabilities \(p_i\):
$$ D_{KL}(P || Q) = E_P[-\log Q(X)] - E_P[-\log P(X)] = H(P, Q) - H(P) $$
(Cross-Entropy minus Entropy)
Zero Divergence: \(D_{KL}(P || Q) = 0\) if and only if \(P = Q\) almost everywhere.
Asymmetry: Crucially, KL divergence is not symmetric:
$$ D_{KL}(P || Q) \neq D_{KL}(Q || P) $$
Therefore, it is not a true distance metric. Measuring the divergence of Q from P is different from measuring the divergence of P from Q. The choice matters depending on the application.
(Example Intuition: If \(P\) is sharp and \(Q\) is broad, \(D_{KL}(P || Q)\) is moderately large, because \(Q\) spreads probability away from \(P\)'s sharp peak. If \(Q\) is sharp and \(P\) is broad, \(D_{KL}(P || Q)\) is very large, or even infinite, because \(Q\) assigns near-zero probability to regions where \(P\) has real mass.)
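The asymmetry is easy to see numerically. A small sketch comparing both directions for a sharp \(P\) and a broad (uniform) \(Q\), both made up for illustration:

```python
import math

def kl(p, q):
    # D_KL(P || Q) in nats; assumes q_i > 0 wherever p_i > 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# P sharp (nearly all mass on one event), Q broad (uniform)
P = [0.98, 0.01, 0.01]
Q = [1/3, 1/3, 1/3]

print(kl(P, Q))  # forward KL
print(kl(Q, P))  # reverse KL -- a different number
```

Both directions are positive, but they disagree, which is exactly why KL divergence is not a distance metric.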
Relationship to Entropy and Cross-Entropy:
$$ D_{KL}(P || Q) = H(P, Q) - H(P) $$
This is why minimizing cross-entropy \(H(P, Q)\) when \(P\) is fixed (true distribution) is equivalent to minimizing \(D_{KL}(P || Q)\).
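The identity \(D_{KL}(P || Q) = H(P, Q) - H(P)\) can be checked numerically. A sketch with made-up distributions:

```python
import math

def entropy(p):
    # H(P) in nats
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # H(P, Q) in nats
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    # D_KL(P || Q) in nats
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.7, 0.2, 0.1]
Q = [0.5, 0.3, 0.2]

# The two sides of the identity agree up to floating-point error
print(kl(P, Q))
print(cross_entropy(P, Q) - entropy(P))
```

Since \(H(P)\) does not depend on \(Q\), any optimizer that drives \(H(P, Q)\) down is driving \(D_{KL}(P || Q)\) down by exactly the same amount.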
Variational Inference & Variational Autoencoders (VAEs): KL divergence is often used as a regularization term in the loss function of VAEs. It encourages the learned approximate posterior distribution \(Q\) (often Gaussian) of latent variables to stay close to a prior distribution \(P\) (often standard Gaussian \(N(0, I)\)).
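For the common VAE setup of a diagonal-Gaussian approximate posterior and a standard-normal prior, the KL term has a well-known closed form, \(D_{KL} = -\tfrac{1}{2}\sum_j \left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)\). A minimal sketch (the function name and the `log_var` parameterization are illustrative choices, not a specific library's API):

```python
import math

def gaussian_kl_to_standard_normal(mu, log_var):
    """Closed-form D_KL(N(mu, diag(sigma^2)) || N(0, I)) in nats.

    log_var is log(sigma^2) per latent dimension, the usual VAE
    parameterization of the encoder output.
    """
    return sum(
        -0.5 * (1.0 + lv - m * m - math.exp(lv))
        for m, lv in zip(mu, log_var)
    )

# A posterior that already equals the prior incurs zero KL penalty
print(gaussian_kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # -> 0.0
```

In a real VAE this quantity is computed per-sample on tensors and added to the reconstruction loss; the scalar version here just shows the formula.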
Generative Adversarial Networks (GANs): While not always explicit in the standard loss, related divergence measures (like Jensen-Shannon divergence, which is symmetric and derived from KL) are used to understand GAN training dynamics.
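The Jensen-Shannon divergence mentioned above is built directly from KL: both distributions are compared against their mixture \(M = \tfrac{1}{2}(P + Q)\), which makes the result symmetric and always finite. A sketch:

```python
import math

def kl(p, q):
    # D_KL(P || Q) in nats; p_i = 0 terms contribute 0
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    # JSD(P || Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M),  M = (P + Q) / 2
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

P = [0.9, 0.1]
Q = [0.1, 0.9]
print(js_divergence(P, Q))  # finite, symmetric, bounded by ln(2)
```

Because the mixture \(M\) is nonzero wherever either distribution has mass, JSD avoids the infinities that plague raw KL when supports don't overlap, one reason it appears in analyses of GAN training.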
Model Comparison/Selection: Can be used to quantify how much information is lost when approximating a complex model/distribution (\(P\)) with a simpler one (\(Q\)).