In chapter 1 of Nielsen's *Neural Networks and Deep Learning* it says:

> To make gradient descent work correctly, we need to choose the learning rate $\eta$ to be small enough that Equation (9) is a good approximation. If we don't, we might end up with $\Delta C > 0$, which obviously would not be good! At the same time, we don't want $\eta$ to be too small, since that will make the changes $\Delta v$ tiny, and thus the gradient descent algorithm will work very slowly. In practical implementations, $\eta$ is often varied so that Equation (9) remains a good approximation, but the algorithm isn't too slow. We'll see later how this works.
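
For reference (reconstructing from the surrounding text, since I don't quote the equations themselves): Equation (9) is the first-order approximation and the update rule is the usual gradient-descent step,

$$\Delta C \approx \nabla C \cdot \Delta v, \qquad \Delta v = -\eta \nabla C,$$

and substituting the second into the first gives the expression I refer to below.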

But just a few paragraphs earlier we established that $\Delta C \approx -\eta \nabla C \cdot \nabla C = -\eta \|\nabla C\|^2$, which is obviously always negative (for positive $\eta$). So how can $\Delta C$ be positive if we don't choose a small enough learning rate? What is meant here?
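
To make my confusion concrete, here is a minimal numerical sketch (my own toy example, not from the book) with $C(v) = v^2$: the first-order formula always predicts a decrease, but with a large $\eta$ the step overshoots the minimum and the exact $\Delta C$ comes out positive.

```python
# Toy one-dimensional example (my own, not from the book):
# cost C(v) = v^2, with gradient C'(v) = 2v.

def C(v):
    return v ** 2

def grad_C(v):
    return 2.0 * v

v = 1.0
for eta in (0.1, 1.5):                     # small vs. deliberately too-large learning rate
    dv = -eta * grad_C(v)                  # gradient descent step: Delta v = -eta * grad C
    predicted = -eta * grad_C(v) ** 2      # first-order estimate: -eta * ||grad C||^2 <= 0
    actual = C(v + dv) - C(v)              # the true change in cost after the step
    print(f"eta={eta}: predicted dC={predicted:+.2f}, actual dC={actual:+.2f}")
```

With $\eta = 0.1$ the prediction and the actual change agree in sign, but with $\eta = 1.5$ the predicted change is negative while the actual change is positive, which seems to be exactly the $\Delta C > 0$ situation the quote warns about.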
