I know related questions have already been asked, for example this one.
I also know the following:
- KL divergence $D_{KL}(P\Vert Q)$ is given as:
$$\begin{align} D_{KL}(P\Vert Q) & = -\sum_xP(x)\log\left(\frac{Q(x)}{P(x)}\right) \\ & = \sum_xP(x)\log\left(\frac{P(x)}{Q(x)}\right) \\ & = \sum_xP(x)\log(P(x))-\sum_xP(x)\log(Q(x)) \\ & = -\underbrace{\sum_xP(x)\log\left(\frac{1}{P(x)}\right)}_{\text{Entropy } H(X)}\underbrace{-\sum_xP(x)\log(Q(x))}_{\text{Cross entropy } H(P,Q)} \\ & = -H(X)+H(P,Q) \qquad\qquad\ldots\text{equation (1)} \end{align}$$
- Cross Entropy is given as
$$H(P,Q)=-\sum_xP(x)\log Q(x)$$
(Please correct me if I am incorrect in above equations.)
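As a sanity check on equation (1), here is a quick numerical sketch (with $P$ and $Q$ as made-up example distributions, not from any source):

```python
import math

# Two hypothetical discrete distributions over the same support
P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]

# Entropy H(P) = -sum_x P(x) log P(x)
H_P = -sum(p * math.log(p) for p in P)

# Cross entropy H(P, Q) = -sum_x P(x) log Q(x)
H_PQ = -sum(p * math.log(q) for p, q in zip(P, Q))

# KL divergence D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x))
D_KL = sum(p * math.log(p / q) for p, q in zip(P, Q))

# Equation (1): D_KL(P || Q) = H(P, Q) - H(P), up to floating-point error
print(abs(D_KL - (H_PQ - H_P)) < 1e-12)  # True
```

The same identity holds for any choice of $P$ and $Q$ with matching support, since the algebra above never uses the particular values.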
Knowing all this, I want to build more precise intuition behind the difference.
Wikipedia defines KL divergence as follows:
KL divergence of P from Q is the expected "excess" surprise from using Q as a model when the actual distribution is P
I believe the word "excess" refers to the $-H(X)$ term in equation (1), and that we can drop it (essentially dropping $-H(X)$) to get the definition of cross entropy:
Cross entropy of P from Q is the expected surprise from using Q as a model when the actual distribution is P.
Q1. Am I correct with this?
Also, this article defines cross entropy as follows:
If we consider a target or underlying probability distribution $P$ and an approximation of the target distribution $Q$, then the cross-entropy of $Q$ from $P$ is the number of additional bits required to represent an event using $Q$ instead of $P$.
I believe this is wrong. The above should be the definition of KL divergence, and the word "additional" refers to the $-H(X)$ term in equation (1). If we drop it (and hence $-H(X)$), we get the definition of cross entropy:
- Cross entropy of $Q$ from $P$ is the number of bits required to represent an event using $Q$ instead of $P$.
- KL divergence of $Q$ from $P$ is the number of "additional" bits required to represent an event using $Q$ instead of $P$.
Q2. Am I correct with these interpretations / definitions (in terms of bits requirements)?
Update
I came across this awesome answer, and after reading it, I believe I was correct in the above interpretations. The answer states:
$D_{KL}(p\Vert q)$ measures the average number of extra bits per message, whereas $H(p,q)$ measures the average number of total bits per message.
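This total-vs-extra-bits reading can be checked numerically. Below is a sketch using made-up distributions (a dyadic $P$ and a uniform $Q$ over four symbols, chosen so the numbers come out clean) with base-2 logarithms so the units are bits:

```python
import math

# Hypothetical true distribution P and model Q over four symbols
P = [0.5, 0.25, 0.125, 0.125]
Q = [0.25, 0.25, 0.25, 0.25]

# Total bits per symbol when coding with model Q (cross entropy, base 2)
H_PQ = -sum(p * math.log2(q) for p, q in zip(P, Q))

# Optimal bits per symbol under the true distribution (entropy of P)
H_P = -sum(p * math.log2(p) for p in P)

# Extra bits per symbol paid for using Q instead of P (KL divergence)
D_KL = sum(p * math.log2(p / q) for p, q in zip(P, Q))

print(H_PQ, H_P, D_KL)  # 2.0 total = 1.75 optimal + 0.25 extra
```

So here $H(P,Q)$ is the total cost (2.0 bits/symbol), $H(P)$ is the unavoidable cost (1.75 bits/symbol), and $D_{KL}(P\Vert Q)$ is exactly the excess (0.25 bits/symbol), matching the quoted answer.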