
The equation for cross-entropy is: $H(p,q)=-\sum_x{p(x)\log{q(x)}}$

When working with a binary classification problem, the ground truth is often provided to us as binary (i.e. 1's and 0's).

If I assume $q$ is the ground truth, and $p$ are my predicted probabilities, I can get the following for examples where the true label is 0:

$\log 0 = -\infty$

How is this handled in practice in e.g. TensorFlow or PyTorch? (for both the forward pass and the backward pass)

  • You've reversed labels and predictions. Commented Jul 5, 2020 at 13:00
  • @shimao Are $p$ and $q$ both supposed to be probabilities or hard labels? And if so, which one is supposed to be predicted vs provided here and why? Commented Jul 5, 2020 at 14:24

1 Answer


Exponentials of very negative numbers can underflow to 0, leading to $\log(0)$. But this never happens if you work on the logit scale. So, use logits. The algebra is tedious, but you can rewrite the cross-entropy loss with a sigmoid/softmax output as an expression in the logits. Elements of Statistical Learning does this in its discussion of binary logistic regression (section 4.4.1, p. 120).

Suppose your network has one output neuron that gives any real number $z$ as an output. We can interpret this number as the logit of the probability that $y=1$. The probability that $y=1$ given the logit is $\Pr(y=1)=\frac{1}{1+\exp(-z)}$, and likewise $\Pr(y=0)=\frac{\exp(-z)}{1+\exp(-z)}$.
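As a quick sanity check, the two probabilities above can be computed from a logit in a couple of lines (a plain-NumPy sketch; the function names are my own, not from any library):

```python
import numpy as np

def pr_y1(z):
    # Pr(y = 1) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def pr_y0(z):
    # Pr(y = 0) = exp(-z) / (1 + exp(-z)) = 1 - Pr(y = 1)
    return np.exp(-z) / (1.0 + np.exp(-z))

# A logit of 0 corresponds to a 50/50 prediction.
print(pr_y1(0.0))  # 0.5

# For a strongly negative logit, Pr(y = 1) underflows to exactly 0.0,
# which is where the naive log(Pr(y = 1)) = log(0) problem comes from.
print(pr_y1(-800.0))  # 0.0
```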

Combining these expressions with the formula for binary cross-entropy and doing some tedious algebra, we find $$\begin{align} H&=-y\log(\Pr(y=1))-(1-y)\log(\Pr(y=0))\\ &=-yz+\log\left(1+\exp(z)\right). \end{align}$$

This means you'll never worry about $\log(0)$, because the logarithm's argument is always positive: $\exp(z)>0$ for every $z \in \mathbb{R}$, so $1+\exp(z) > 1 > 0$.

Numerically, we might be concerned about overflow from $\exp(z)$. This is easily avoided if we replace the softplus function $f(x)=\log(1+\exp(x))$ with the approximation $$ f(x) = \begin{cases}\log(1+\exp(x)) & x \le c \\ x & x > c\end{cases} $$ since $f$ is well-approximated by the identity function when $x$ is large. Choosing $c=20$ is typical, but it might need to be larger or smaller depending on the floating point precision.
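The piecewise approximation above takes only a few lines (a sketch; the threshold `c` is the tunable constant discussed above):

```python
import numpy as np

def softplus(x, c=20.0):
    # For x > c, log(1 + exp(x)) is numerically indistinguishable from x,
    # so we return x directly and never evaluate exp of a huge number.
    # np.minimum clamps the argument so the branch np.where discards
    # cannot overflow either.
    x = np.asarray(x, dtype=float)
    return np.where(x > c, x, np.log1p(np.exp(np.minimum(x, c))))

print(softplus(0.0))     # log(2) ~= 0.6931
print(softplus(1000.0))  # 1000.0, no overflow
```

Using `np.log1p` instead of `np.log(1 + ...)` also keeps the small-$x$ branch accurate when $\exp(x)$ is tiny.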

  • Thanks Syrocorax! Great suggestions. And just to complete everything here, in addition to what you wrote above, in a classification setting, when using $H(p,q)$ as your loss, you would define $q$ as the probability output of the network, and $p$ as the ground truth labels, and not the other way around, right? (Moreover, this should also fully avoid the $\log 0 = -\infty$ problem I mentioned, right?) Commented Jul 5, 2020 at 15:03
  • Yes, $p$ is the label and $q$ is the predicted probability of $y=1$. I've added clarification for the second part of your comment. Commented Jul 5, 2020 at 15:20
