
The equation for cross-entropy is: $H(p,q)=-\sum_x{p(x)\log{q(x)}}$

When working with a binary classification problem, the ground truth is often provided to us as binary (i.e. 1's and 0's).

If I assume $q$ is the ground truth, and $p$ are my predicted probabilities, I can get the following for examples where the true label is 0:

$\log 0 = -\infty$

How is this handled in practice in e.g. TensorFlow or PyTorch? (for both the forward pass and the backward pass)

  • You've reversed labels and predictions. Commented Jul 5, 2020 at 13:00
  • @shimao Are $p$ and $q$ both supposed to be probabilities or hard labels? And if so, which one is supposed to be predicted vs provided here and why? Commented Jul 5, 2020 at 14:24

1 Answer


Exponentials of very negative numbers can underflow to 0, leading to $\log(0)$. But this never happens if you work on the logit scale. So, use logits. The algebra is tedious, but you can rewrite the cross-entropy loss with a sigmoid/softmax output as an expression in the logits. Elements of Statistical Learning does this in its discussion of binary logistic regression (section 4.4.1, p. 120).

Suppose your network has one output neuron that gives any real number $z$ as an output. We can interpret this number as the logit of the probability that $y=1$. The probability that $y=1$ given the logit is $\Pr(y=1)=\frac{1}{1+\exp(-z)}$, and likewise $\Pr(y=0)=\frac{\exp(-z)}{1+\exp(-z)}$.
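As a quick sanity check, the two probabilities above can be computed from a logit in a couple of lines (a plain-NumPy sketch; the function names are my own, not from any library):

```python
import numpy as np

def pr_y1(z):
    # Pr(y = 1) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def pr_y0(z):
    # Pr(y = 0) = exp(-z) / (1 + exp(-z)) = 1 - Pr(y = 1)
    return np.exp(-z) / (1.0 + np.exp(-z))

# A logit of 0 corresponds to a 50/50 prediction.
print(pr_y1(0.0))  # 0.5

# For a strongly negative logit, Pr(y = 1) underflows to exactly 0.0,
# which is where the naive log(Pr(y = 1)) = log(0) problem comes from.
print(pr_y1(-800.0))  # 0.0
```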

Combining these expressions with the formula for binary cross-entropy and doing some tedious algebra, we find $$\begin{align} H&=-y\log(\Pr(y=1))-(1-y)\log(\Pr(y=0))\\ &=-yz+\log\left(1+\exp(z)\right). \end{align}$$

This means you'll never worry about $\log(0)$, because the logarithm's argument is always positive: $\exp(z)>0$ for every $z \in \mathbb{R}$, so $1+\exp(z) > 1 > 0$.

Numerically, we might be concerned about overflow from $\exp(z)$. This is easily avoided if we replace the softplus function $f(x)=\log(1+\exp(x))$ with the approximation $$ f(x) = \begin{cases}\log(1+\exp(x)) & x \le c \\ x & x > c\end{cases} $$ since $f$ is well-approximated by the identity function when $x$ is large. Choosing $c=20$ is typical, but it might need to be larger or smaller depending on the floating point precision.
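The piecewise approximation above takes only a few lines (a sketch; the threshold `c` is the tunable constant discussed above):

```python
import numpy as np

def softplus(x, c=20.0):
    # For x > c, log(1 + exp(x)) is numerically indistinguishable from x,
    # so we return x directly and never evaluate exp of a huge number.
    # np.minimum clamps the argument so the branch np.where discards
    # cannot overflow either.
    x = np.asarray(x, dtype=float)
    return np.where(x > c, x, np.log1p(np.exp(np.minimum(x, c))))

print(softplus(0.0))     # log(2) ~= 0.6931
print(softplus(1000.0))  # 1000.0, no overflow
```

Using `np.log1p` instead of `np.log(1 + ...)` also keeps the small-$x$ branch accurate when $\exp(x)$ is tiny.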

  • Thanks Syrocorax! Great suggestions. And just to complete everything here, in addition to what you wrote above, in a classification setting, when using $H(p,q)$ as your loss, you would define $q$ as the probability output of the network, and $p$ as the ground truth labels, and not the other way around, right? (Moreover, this should also fully avoid the $\log 0 = -\infty$ problem I mentioned, right?) Commented Jul 5, 2020 at 15:03
  • Yes, $p$ is the label and $q$ is the predicted probability of $y=1$. I've added clarification for the second part of your comment. Commented Jul 5, 2020 at 15:20
