
I want to know how the equation for binary cross entropy came about. My approach is the following:

Let's say we have two ground truths: $y_1$ and $y_2$. We also have two predictions $p_1$ and $p_2$. Now, $p_2$ can also be defined as $1 -p_1$ since we're dealing with a binary problem.

From this, how exactly do we arrive at this equation: $$−(y\log{p}+(1−y)\log{(1−p)})$$

And if we treat this as a loss function, why does it make sense to minimize it?

  • Hint: What's the log-likelihood of a Bernoulli probability model? Commented May 20, 2018 at 17:40

1 Answer


Suppose there's a random variable $Y$ where $Y \in \{0,1\}$ (for binary classification). If $p = P(Y = 1)$ is the predicted probability, the Bernoulli probability model gives the likelihood of an observed label $y$:

$$ L(p) = p^y (1-p)^{1-y} $$

$$ \log L(p) = y\log p + (1-y) \log (1-p) $$

It's often easier to work with derivatives when the objective is expressed in logs, and because the logarithm is monotonically increasing, the maximizer of the log-likelihood is the same as the maximizer of the likelihood. By convention, a cost or loss function is non-negative and grows as the model performs worse. Negating the log-likelihood gives exactly that: $-(y\log p + (1-y)\log(1-p))$ is zero for a perfect prediction and increases as the prediction deviates from the label. Minimizing this negated quantity is therefore equivalent to maximizing the log-likelihood.
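As a quick sanity check, here is a minimal sketch (plain Python, with made-up prediction values) showing that the negated Bernoulli log-likelihood matches the binary cross-entropy formula and behaves like a loss:

```python
import math

def binary_cross_entropy(y, p):
    """Negative Bernoulli log-likelihood: -(y*log(p) + (1-y)*log(1-p))."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Equivalent to negating the log of the likelihood L(p) = p^y * (1-p)^(1-y)
y, p = 1, 0.8
likelihood = p**y * (1 - p)**(1 - y)
assert abs(binary_cross_entropy(y, p) - (-math.log(likelihood))) < 1e-12

# A confident correct prediction yields a loss near 0 ...
print(binary_cross_entropy(1, 0.99))  # ~0.01
# ... while a confident wrong prediction is heavily penalized.
print(binary_cross_entropy(1, 0.01))  # ~4.61
```

Note that the loss blows up as $p \to 0$ with $y = 1$ (or $p \to 1$ with $y = 0$), which is why library implementations typically clamp $p$ away from 0 and 1 for numerical stability.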

  • So maximizing log(L(p)) is the same as minimizing cross entropy as I have defined it? Commented May 21, 2018 at 3:09
  • The above equation has a maximum at 0 and is negative for all other values. Thus in the ideal case (a perfect prediction), the value of log(L(p)) will be 0, its maximum. Conversely, the negative of log(L(p)) has its minimum at 0. Commented May 21, 2018 at 7:33
