  • This answer is what I was looking for. In my own current experience, which involves learning target probabilities, BCE is far more robust than KL; KL was essentially unusable. KL and BCE aren't "equivalent" loss functions. Commented Nov 29, 2019 at 16:31
  • When you said "the first part" and "the second part", which one was which? Commented May 30, 2020 at 20:27
  • 1
    $\begingroup$ @zewen's answer can be misleading as he claims that in mini-batch training, CE can be more robust than KL. In most of standard mini-batch training, we use gradient-based approach, and the gradient of $H(p)$ with respect to $q$ (which is a function of our model parameter) would be zero. So in these cases, CE and KL as a loss function are identical. $\endgroup$ Commented Sep 23, 2021 at 13:41
  • 1
    $\begingroup$ Are you sure the 1st formula is correct? Seems the p,d are ordered wrong. $\endgroup$ Commented Sep 28, 2022 at 3:29
  • 1
    $\begingroup$ I don't understand why the $H(p)$ constant makes the training less robust. The gradient should still be exactly the same, no? So is it just that your loss curve may look a bit more jiggly, but you training is still unchanged? $\endgroup$ Commented Dec 9, 2023 at 19:33
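The identity the commenters are debating, $H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q)$, and the claim that both losses have identical gradients when the target $p$ is fixed, can be checked numerically. A minimal sketch in NumPy; the distribution `p` and logits `z` are made-up illustrative values, not from the original post:

```python
import numpy as np

# Made-up fixed target distribution p and model logits z (illustrative values).
p = np.array([0.7, 0.2, 0.1])
z = np.array([0.5, 0.1, -0.3])
q = np.exp(z) / np.exp(z).sum()        # model distribution: softmax(z)

H_p = -(p * np.log(p)).sum()           # entropy of the target: constant w.r.t. z
CE  = -(p * np.log(q)).sum()           # cross-entropy H(p, q)
KL  = (p * np.log(p / q)).sum()        # KL(p || q)

# The identity H(p, q) = H(p) + KL(p || q):
assert np.isclose(CE, H_p + KL)

# Both losses have the same gradient w.r.t. the logits z, namely q - p,
# because H(p) does not depend on the model. Check by finite differences:
softmax = lambda v: np.exp(v) / np.exp(v).sum()
ce = lambda v: -(p * np.log(softmax(v))).sum()
kl = lambda v: (p * np.log(p / softmax(v))).sum()

def loss_grad(loss, eps=1e-6):
    # Central-difference numerical gradient at z.
    return np.array([
        (loss(z + eps * e) - loss(z - eps * e)) / (2 * eps)
        for e in np.eye(len(z))
    ])

assert np.allclose(loss_grad(ce), q - p, atol=1e-6)
assert np.allclose(loss_grad(kl), q - p, atol=1e-6)
```

Since the two losses differ only by the constant $H(p)$, their gradients (and hence the optimization trajectory) coincide; only the reported loss values are shifted, which is consistent with the last comment above.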