Cross-entropy is commonly used to quantify the difference between two probability distributions. In the context of machine learning, it is a measure of error for categorical multi-class classification problems. Usually the "true" distribution (the one that your machine learning algorithm is trying to match) is expressed in terms of a one-hot distribution.
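The cross-entropy between the two distributions is

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$

where p(x) is the true probability distribution (one-hot) and q(x) is the predicted probability distribution. The sum is over the three classes A, B, and C. With B as the true class and predicted probabilities of 0.228, 0.619, and 0.153 for A, B, and C, the loss is 0.479:

$$H = -(0.0 \cdot \ln 0.228 + 1.0 \cdot \ln 0.619 + 0.0 \cdot \ln 0.153) = 0.479$$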
Note that it does not matter which logarithm base you use, as long as you use the same one consistently. As it happens, Python's math.log() and NumPy's np.log() both compute the natural log (log base e) by default.
Here is the above example expressed in Python using NumPy:
    import numpy as np

    p = np.array([0, 1, 0])              # True probability (one-hot)
    q = np.array([0.228, 0.619, 0.153])  # Predicted probability

    cross_entropy_loss = -np.sum(p * np.log(q))
    print(cross_entropy_loss)
    # 0.47965000629754095
So that is how "wrong" or "far away" your prediction is from the true distribution. A machine learning optimizer will attempt to minimize the loss (i.e. it will try to push the loss from 0.479 down toward 0.0).
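To illustrate the earlier note about logarithm bases, here is a minimal sketch that computes the same loss with the natural log and with log base 2. The variable names `loss_nats`, `loss_bits`, and `ratio` are introduced here for illustration only. Changing the base rescales the loss by a constant factor, which is why the choice does not matter as long as it is consistent.

```python
import numpy as np

p = np.array([0, 1, 0])              # True probability (one-hot)
q = np.array([0.228, 0.619, 0.153])  # Predicted probability

loss_nats = -np.sum(p * np.log(q))   # natural log (base e)
loss_bits = -np.sum(p * np.log2(q))  # log base 2

# The two values differ only by the constant factor ln(2).
ratio = loss_nats / loss_bits
print(loss_nats, loss_bits, ratio)
```

The base-2 loss comes out at roughly 0.692, and the ratio equals ln(2), about 0.693, whatever the prediction is.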
Extreme examples
To gain more intuition on what these loss values reflect, let's look at some extreme examples.
Again, let's suppose the true (one-hot) distribution is:
| Pr(Class A) | Pr(Class B) | Pr(Class C) |
|-------------|-------------|-------------|
| 0.0         | 1.0         | 0.0         |
Now suppose your machine learning algorithm did a really great job and predicted class B with very high probability:
| Pr(Class A) | Pr(Class B) | Pr(Class C) |
|-------------|-------------|-------------|
| 0.001       | 0.998       | 0.001       |
When we compute the cross entropy loss, we can see that the loss is tiny, only 0.002:
    p = np.array([0, 1, 0])
    q = np.array([0.001, 0.998, 0.001])
    print(-np.sum(p * np.log(q)))
    # 0.0020020026706730793
At the other extreme, suppose your ML algorithm did a terrible job and predicted class C with high probability instead. The resulting loss of 6.91 will reflect the larger error.
| Pr(Class A) | Pr(Class B) | Pr(Class C) |
|-------------|-------------|-------------|
| 0.001       | 0.001       | 0.998       |
    p = np.array([0, 1, 0])
    q = np.array([0.001, 0.001, 0.998])
    print(-np.sum(p * np.log(q)))
    # 6.907755278982137
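The loss keeps growing as the probability assigned to the true class shrinks, and at exactly zero the logarithm is negative infinity, so the plain expression returns an infinite loss (and a zero predicted for a class where p is also zero produces `0 * -inf = nan` in NumPy). One common workaround is to clip the predictions away from zero before taking the log. The sketch below assumes a hypothetical helper `cross_entropy` and an illustrative `eps` of 1e-12; neither comes from the text above.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy of prediction q against true distribution p.

    q is clipped to [eps, 1] so that log(0) never occurs.
    eps is an illustrative choice, not a recommended constant.
    """
    p = np.asarray(p, dtype=float)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return -np.sum(p * np.log(q))

p = np.array([0, 1, 0])
q_zero = np.array([0.5, 0.0, 0.5])   # true class predicted with probability 0

safe_loss = cross_entropy(p, q_zero)  # large but finite: -log(eps)
print(safe_loss)
```

With clipping, the worst possible loss is capped at -log(eps) instead of being infinite, and ordinary predictions such as the ones above are unaffected.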
Now, what happens in the middle of these two extremes? Suppose your ML algorithm can't make up its mind and predicts the three classes with nearly equal probability.
| Pr(Class A) | Pr(Class B) | Pr(Class C) |
|-------------|-------------|-------------|
| 0.333       | 0.333       | 0.334       |
The resulting loss is 1.10.
    p = np.array([0, 1, 0])
    q = np.array([0.333, 0.333, 0.334])
    print(-np.sum(p * np.log(q)))
    # 1.0996127890016931
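The value 1.10 is close to ln(3), about 1.0986, which is what an exactly uniform prediction over three classes would score, since -ln(1/3) = ln(3). More generally, a uniform guess over K classes gives a loss of ln(K), which makes a handy reference point: a loss above it on a given example means the true class received less than 1/K probability. The sketch below checks this numerically; the helper name `uniform_loss` is made up for this illustration.

```python
import numpy as np

def uniform_loss(num_classes):
    """Cross-entropy of a uniform prediction against a one-hot target."""
    p = np.zeros(num_classes)
    p[0] = 1.0                                   # which class is "true" does not matter here
    q = np.full(num_classes, 1.0 / num_classes)  # equal probability for every class
    return -np.sum(p * np.log(q))

for k in (2, 3, 10):
    # The two printed numbers should agree: loss of a uniform guess vs. ln(k).
    print(k, uniform_loss(k), np.log(k))
```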
Fitting into gradient descent