I suppose it is because the models usually work with the samples packed in mini-batches. For KL divergence and Cross-Entropy, their relation can be written as $$H(p, q) = D_{KL}(p, q) + H(p) = -\sum_i{p_i\log(q_i)}$$ so we have $$D_{KL}(p, q) = H(p, q) - H(p)$$ From the equation, we can see that KL divergence can be split into a Cross-Entropy of p and q (the first part) and a global entropy of the ground truth p (the second part).
In many machine learning projects, mini-batches are used to expedite training, and the $p'$ of a mini-batch may differ from the global $p$. In such a case, Cross-Entropy is relatively more robust in practice, while KL divergence needs a more stable $H(p)$ to do its job.