In chapter 1 of Nielsen's *Neural Networks and Deep Learning* it says:

> To make gradient descent work correctly, we need to choose the learning rate $\eta$ to be small enough that Equation (9) is a good approximation. If we don't, we might end up with $\Delta C > 0$, which obviously would not be good! At the same time, we don't want $\eta$ to be too small, since that will make the changes $\Delta v$ tiny, and thus the gradient descent algorithm will work very slowly. In practical implementations, $\eta$ is often varied so that Equation (9) remains a good approximation, but the algorithm isn't too slow. We'll see later how this works.
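
For reference (reconstructing from the surrounding text, since I don't quote the equations themselves): Equation (9) is the first-order approximation and the update rule is the usual gradient-descent step,

$$\Delta C \approx \nabla C \cdot \Delta v, \qquad \Delta v = -\eta \nabla C,$$

and substituting the second into the first gives the expression I refer to below.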

But just a few paragraphs earlier we established that $\Delta C \approx -\eta \nabla C \cdot \nabla C = -\eta \|\nabla C\|^2$, which is obviously always negative (for positive $\eta$). So how can $\Delta C$ be positive if we don't choose a small enough learning rate? What is meant here?
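
To make my confusion concrete, here is a minimal numerical sketch (my own toy example, not from the book) with $C(v) = v^2$: the first-order formula always predicts a decrease, but with a large $\eta$ the step overshoots the minimum and the exact $\Delta C$ comes out positive.

```python
# Toy one-dimensional example (my own, not from the book):
# cost C(v) = v^2, with gradient C'(v) = 2v.

def C(v):
    return v ** 2

def grad_C(v):
    return 2.0 * v

v = 1.0
for eta in (0.1, 1.5):                     # small vs. deliberately too-large learning rate
    dv = -eta * grad_C(v)                  # gradient descent step: Delta v = -eta * grad C
    predicted = -eta * grad_C(v) ** 2      # first-order estimate: -eta * ||grad C||^2 <= 0
    actual = C(v + dv) - C(v)              # the true change in cost after the step
    print(f"eta={eta}: predicted dC={predicted:+.2f}, actual dC={actual:+.2f}")
```

With $\eta = 0.1$ the prediction and the actual change agree in sign, but with $\eta = 1.5$ the predicted change is negative while the actual change is positive, which seems to be exactly the $\Delta C > 0$ situation the quote warns about.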
