I suppose it is because models usually work with samples packed in mini-batches. For KL divergence and cross-entropy, their relation can be written as $$H(p, q) = D_{KL}(p, q) + H(p) = -\sum_i{p_i \log(q_i)}$$ so we have $$D_{KL}(p, q) = H(p, q) - H(p)$$ From this equation, we can see that KL divergence splits into the cross-entropy of $p$ and $q$ (the first part) and the entropy of the ground truth $p$ (the second part).
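As a quick sanity check of the identity above, here is a small sketch (using NumPy, with made-up distributions $p$ and $q$) verifying numerically that $H(p, q) = D_{KL}(p, q) + H(p)$:

```python
import numpy as np

# Hypothetical discrete distributions: p is the ground truth, q is the model's prediction.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

cross_entropy = -np.sum(p * np.log(q))     # H(p, q)
entropy       = -np.sum(p * np.log(p))     # H(p)
kl_divergence = np.sum(p * np.log(p / q))  # D_KL(p, q)

# The identity from the answer: H(p, q) = D_KL(p, q) + H(p)
print(np.isclose(cross_entropy, kl_divergence + entropy))  # True
```

Since $H(p)$ does not depend on $q$, minimizing either quantity with respect to the model's output $q$ drives the same gradients.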

In many machine learning projects, mini-batches are used to expedite training, and the distribution $p'$ of a mini-batch may differ from the global $p$. In such a case, cross-entropy is relatively more robust in practice, while KL divergence needs a stable estimate of $H(p)$ to do its job.
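To illustrate the mini-batch point, here is a small sketch (hypothetical 3-class label distribution, NumPy) showing that the empirical entropy $H(p')$ of each mini-batch's label distribution fluctuates from batch to batch — this is the term KL divergence would need, whereas cross-entropy never has to estimate it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical global label distribution over 3 classes.
p_global = np.array([0.7, 0.2, 0.1])

def entropy(p):
    p = p[p > 0]  # skip classes absent from a batch to avoid log(0)
    return -np.sum(p * np.log(p))

# The empirical label distribution p' of each mini-batch fluctuates around
# p_global, so the H(p') term that KL divergence depends on is noisy.
batch_entropies = []
for _ in range(5):
    labels = rng.choice(3, size=32, p=p_global)
    p_batch = np.bincount(labels, minlength=3) / 32
    batch_entropies.append(entropy(p_batch))

print([round(h, 3) for h in batch_entropies])  # varies per batch
print(round(entropy(p_global), 3))             # the global H(p)
```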
