I know related questions have already been asked, for example this one.
I also know the following:
- KL divergence $D_{KL}(P\Vert Q)$ is given as:
$$\begin{align} D_{KL}(P\Vert Q) & = -\sum_xP(x)\log\left(\frac{Q(x)}{P(x)}\right) \\ & = \sum_xP(x)\log\left(\frac{P(x)}{Q(x)}\right) \\ & = \sum_xP(x)\log(P(x))-\sum_xP(x)\log(Q(x)) \\ & = -\underbrace{\sum_xP(x)\log\left(\frac{1}{P(x)}\right)}_{\text{Entropy } H(X)}\underbrace{-\sum_xP(x)\log(Q(x))}_{\text{Cross entropy } H(P,Q)} \\ & = -H(X)+H(P,Q) \qquad\qquad\ldots\text{equation (1)} \end{align}$$
- Cross Entropy is given as
$$H(P,Q)=-\sum_xP(x)\log Q(x)$$
(Please correct me if I am incorrect in above equations.)
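As a sanity check on equation (1), here is a quick numerical sketch (with $P$ and $Q$ as made-up example distributions, not from any source):

```python
import math

# Two hypothetical discrete distributions over the same support
P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

# Entropy H(P) = -sum_x P(x) log P(x)
H_P = -sum(p * math.log(p) for p in P)

# Cross entropy H(P, Q) = -sum_x P(x) log Q(x)
H_PQ = -sum(p * math.log(q) for p, q in zip(P, Q))

# KL divergence D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x))
D_KL = sum(p * math.log(p / q) for p, q in zip(P, Q))

# Equation (1): D_KL(P || Q) = H(P, Q) - H(P), up to floating-point error
print(abs(D_KL - (H_PQ - H_P)) < 1e-12)  # True
```

The same identity holds for any choice of $P$ and $Q$ with matching support, since the algebra above never uses the particular values.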
Knowing all this, I want to build more precise intuition behind the difference.
Wikipedia defines KL divergence as follows:
KL divergence of P from Q is the expected "excess" surprise from using Q as a model when the actual distribution is P
I believe the word "excess" refers to the $-H(X)$ term in equation (1), and that we can drop it (essentially dropping $-H(X)$) to get the definition of cross entropy:
Cross entropy of P from Q is the expected surprise from using Q as a model when the actual distribution is P.
Q1. Am I correct with this?
Also, this article defines cross entropy as follows:
If we consider a target or underlying probability distribution $P$ and an approximation of the target distribution $Q$, then the cross-entropy of $Q$ from $P$ is the number of additional bits required to represent an event using $Q$ instead of $P$.
I believe this is wrong. The above should be the definition of KL divergence, and the word "additional" refers to the $-H(X)$ term in equation (1). If we drop it (and hence $-H(X)$), we get the definition of cross entropy:
- Cross entropy of $Q$ from $P$ is the number of bits required to represent an event using $Q$ instead of $P$.
- KL divergence of $Q$ from $P$ is the number of "additional" bits required to represent an event using $Q$ instead of $P$.
Q2. Am I correct with these interpretations / definitions (in terms of bits requirements)?
Update
I came across this awesome answer, and after reading it, I believe I was correct in the above interpretations. The answer states:
$D_{KL}(p\Vert q)$ measures the average number of extra bits per message, whereas $H(p,q)$ measures the average number of total bits per message.
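This total-vs-extra-bits reading can be checked numerically. Below is a sketch using made-up distributions (a dyadic $P$ and a uniform $Q$ over four symbols, chosen so the numbers come out clean) with base-2 logarithms so the units are bits:

```python
import math

# Hypothetical true distribution P and model Q over four symbols
P = [0.5, 0.25, 0.125, 0.125]
Q = [0.25, 0.25, 0.25, 0.25]

# Total bits per symbol when coding with model Q (cross entropy, base 2)
H_PQ = -sum(p * math.log2(q) for p, q in zip(P, Q))

# Optimal bits per symbol under the true distribution (entropy of P)
H_P = -sum(p * math.log2(p) for p in P)

# Extra bits per symbol paid for using Q instead of P (KL divergence)
D_KL = sum(p * math.log2(p / q) for p, q in zip(P, Q))

print(H_PQ, H_P, D_KL)  # 2.0 total = 1.75 optimal + 0.25 extra
```

So here $H(P,Q)$ is the total cost (2.0 bits/symbol), $H(P)$ is the unavoidable cost (1.75 bits/symbol), and $D_{KL}(P\Vert Q)$ is exactly the excess (0.25 bits/symbol), matching the quoted answer.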