Revisions to What is cross-entropy? [closed]

added 85 characters in body

edited May 16, 2021 at 4:28

stackoverflowuser2010

41.5k
52
178
229

Where p(x) is the true probability distribution (one-hot), and q(x) is the predicted probability distribution. The sum is over the three classes A, B, and C. In this case the loss is 0.479 :

Logarithm base

Note that it does not matter what logarithm base you use as long as you consistently use the same one. As it happens, the Python and Numpy log() functions computefunction computes the natural log (log base e).

Python code

Where p(x) is the true probability distribution (one-hot), and q(x) the predicted probability distribution. The sum is over the three classes A, B, and C. In this case the loss is 0.479 :

Note that it does not matter what logarithm base you use as long as you consistently use the same one. As it happens, the Python and Numpy log() functions compute the natural log (log base e).

Where p(x) is the true probability distribution (one-hot) and q(x) is the predicted probability distribution. The sum is over the three classes A, B, and C. In this case the loss is 0.479 :

Logarithm base

Note that it does not matter what logarithm base you use as long as you consistently use the same one. As it happens, the Python Numpy log() function computes the natural log (log base e).

Python code

added 428 characters in body

Source Link

edited May 13, 2021 at 20:46

stackoverflowuser2010

41.5k
52
178
229

ExtremeLoss units

We see in the above example that the loss is 0.4797. Because we are using the natural log (log base e), the units are in nats, so we say that the loss is 0.4797 nats. If the log were instead log base 2, then the units are in bits. See this page for further explanation.

More examples

Added Python code and more examples.

Source Link

edited Apr 25, 2021 at 23:27

stackoverflowuser2010

41.5k
52
178
229

Cross-entropy is commonly used to quantify the difference between two probability distributions. In the context of machine learning, it is a measure of error for categorical multi-class classification problems. Usually the "true" distribution (the one that your machine learning algorithm is trying to match) is expressed in terms of a one-hot distribution.

Where p(x) is the true probability distribution (one-hot), and q(x) the predicted probability distribution. The sum is over the three classes A, B, and C. In this case the loss is 0.479 :

Note that it does not matter what logarithm base you use as long as you consistently use the same one. As it happens, the Python and Numpy log() functions compute the natural log (log base e).

Here is the above example expressed in Python using Numpy:

import numpy as np p = np.array([0, 1, 0]) # True probability (one-hot) q = np.array([0.228, 0.619, 0.153]) # Predicted probability cross_entropy_loss = -np.sum(p * np.log(q)) print(cross_entropy_loss) # 0.47965000629754095

So that is how "wrong" or "far away" your prediction is from the true distribution. A machine learning optimizer will attempt to minimize the loss (i.e. it will try to reduce the loss from 0.479 to 0.0).

Extreme examples

To gain more intuition on what these loss values reflect, let's look at some extreme examples.

Again, let's suppose the true (one-hot) distribution is:

Pr(Class A) Pr(Class B) Pr(Class C) 0.0 1.0 0.0

Now suppose your machine learning algorithm did a really great job and predicted class B with very high probability:

Pr(Class A) Pr(Class B) Pr(Class C) 0.001 0.998 0.001

When we compute the cross entropy loss, we can see that the loss is tiny, only 0.002:

p = np.array([0, 1, 0]) q = np.array([0.001, 0.998, 0.001]) print(-np.sum(p * np.log(q))) # 0.0020020026706730793

At the other extreme, suppose your ML algorithm did a terrible job and predicted class C with high probability instead. The resulting loss of 6.91 will reflect the larger error.

Pr(Class A) Pr(Class B) Pr(Class C) 0.001 0.001 0.998

p = np.array([0, 1, 0]) q = np.array([0.001, 0.001, 0.998]) print(-np.sum(p * np.log(q))) # 6.907755278982137

Now, what happens in the middle of these two extremes? Suppose your ML algorithm can't make up its mind and predicts the three classes with nearly equal probability.

Pr(Class A) Pr(Class B) Pr(Class C) 0.333 0.333 0.334

The resulting loss is 1.10.

p = np.array([0, 1, 0]) q = np.array([0.333, 0.333, 0.334]) print(-np.sum(p * np.log(q))) # 1.0996127890016931

Fitting into gradient descent

Cross-entropy is commonly used to quantify the difference between two probability distributions. Usually the "true" distribution (the one that your machine learning algorithm is trying to match) is expressed in terms of a one-hot distribution.

Where p(x) is the true probability distribution, and q(x) the predicted probability distribution. The sum is over the three classes A, B, and C. In this case the loss is 0.479 :

So that is how "wrong" or "far away" your prediction is from the true distribution.

Cross-entropy is commonly used to quantify the difference between two probability distributions. In the context of machine learning, it is a measure of error for categorical multi-class classification problems. Usually the "true" distribution (the one that your machine learning algorithm is trying to match) is expressed in terms of a one-hot distribution.

Where p(x) is the true probability distribution (one-hot), and q(x) the predicted probability distribution. The sum is over the three classes A, B, and C. In this case the loss is 0.479 :

Note that it does not matter what logarithm base you use as long as you consistently use the same one. As it happens, the Python and Numpy log() functions compute the natural log (log base e).

Here is the above example expressed in Python using Numpy:

import numpy as np p = np.array([0, 1, 0]) # True probability (one-hot) q = np.array([0.228, 0.619, 0.153]) # Predicted probability cross_entropy_loss = -np.sum(p * np.log(q)) print(cross_entropy_loss) # 0.47965000629754095

So that is how "wrong" or "far away" your prediction is from the true distribution. A machine learning optimizer will attempt to minimize the loss (i.e. it will try to reduce the loss from 0.479 to 0.0).

Extreme examples

To gain more intuition on what these loss values reflect, let's look at some extreme examples.

Again, let's suppose the true (one-hot) distribution is:

Pr(Class A) Pr(Class B) Pr(Class C) 0.0 1.0 0.0

Now suppose your machine learning algorithm did a really great job and predicted class B with very high probability:

Pr(Class A) Pr(Class B) Pr(Class C) 0.001 0.998 0.001

When we compute the cross entropy loss, we can see that the loss is tiny, only 0.002:

p = np.array([0, 1, 0]) q = np.array([0.001, 0.998, 0.001]) print(-np.sum(p * np.log(q))) # 0.0020020026706730793

At the other extreme, suppose your ML algorithm did a terrible job and predicted class C with high probability instead. The resulting loss of 6.91 will reflect the larger error.

Pr(Class A) Pr(Class B) Pr(Class C) 0.001 0.001 0.998

p = np.array([0, 1, 0]) q = np.array([0.001, 0.001, 0.998]) print(-np.sum(p * np.log(q))) # 6.907755278982137

Now, what happens in the middle of these two extremes? Suppose your ML algorithm can't make up its mind and predicts the three classes with nearly equal probability.

Pr(Class A) Pr(Class B) Pr(Class C) 0.333 0.333 0.334

The resulting loss is 1.10.

p = np.array([0, 1, 0]) q = np.array([0.333, 0.333, 0.334]) print(-np.sum(p * np.log(q))) # 1.0996127890016931

Fitting into gradient descent

added 30 characters in body

Source Link

edited Oct 30, 2020 at 19:57

stackoverflowuser2010

41.5k
52
178
229

Loading

deleted 2 characters in body

Source Link

edited Mar 29, 2019 at 18:52

nbro

16.1k
34
122
219

Loading

added details on formulas variables + applied on example

Source Link

edit approved Jan 29, 2018 at 21:28

Omar Aflak

3k
1
24
39

Loading

replaced http://stackoverflow.com/ with https://stackoverflow.com/

Source Link

edited May 23, 2017 at 12:26

URL Rewriter Bot

Loading

Source Link

answered Feb 1, 2017 at 22:21

stackoverflowuser2010

41.5k
52
178
229

Loading

Collectives™ on Stack Overflow

Return to Answer

Logarithm base

Python code

Logarithm base

Python code

ExtremeLoss units

More examples

Extreme examples

Loss units

More examples

Extreme examples

Fitting into gradient descent

Extreme examples

Fitting into gradient descent