Linked Questions

46 votes · 3 answers · 125k views

I am training a model (Recurrent Neural Network) to classify 4 types of sequences. As I run my training I see the training loss going down until the point where I correctly classify over 90% of the ...

1 vote · 1 answer · 2k views

Is the following statement true: "Gradient descent is guaranteed to always decrease a loss function"? I know that if the loss function is convex, then each iteration of gradient descent will result in ...
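
As a quick illustration of why the general answer is no (an added example, not part of the question): even for the convex function $f(x) = x^2$, gradient descent with step size $\eta$ gives
$$x_{k+1} = x_k - \eta f'(x_k) = (1 - 2\eta)\,x_k,$$
so for $\eta > 1$ we get $|x_{k+1}| > |x_k|$ and the loss increases at every step. Convexity alone is not enough; the guarantee also depends on the step size.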

0 votes · 1 answer · 1k views

When performing stochastic gradient descent, is it necessary for the training loss to decrease a) between iterations in an epoch? (I think the answer is no) b) between epochs? (I think the answer is ...
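
A minimal sketch of the distinction, on synthetic data rather than the asker's setup: the per-minibatch loss under SGD is noisy and can rise from one iteration to the next, while the average loss over an epoch typically trends down.

```python
# Minimal sketch on synthetic data (assumed, not the asker's setup).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
lr, batch_size = 0.05, 32

for epoch in range(5):
    perm = rng.permutation(len(X))
    batch_losses = []
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        err = X[idx] @ w - y[idx]
        batch_losses.append(np.mean(err ** 2))   # noisy: may rise between iterations
        w -= lr * 2 * X[idx].T @ err / len(idx)  # SGD step on the minibatch MSE
    print(f"epoch {epoch}: mean minibatch loss {np.mean(batch_losses):.4f}")
```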

0 votes · 1 answer · 551 views

I'm currently trying to get the basics of PyTorch, playing around with simple network topologies for the Fashion-MNIST dataset. However, when I record the loss of those models after each epoch, it ...
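
For context, a minimal sketch of how an averaged training loss is usually recorded once per epoch in PyTorch; the model, data, and hyperparameters below are placeholders (random tensors standing in for Fashion-MNIST), not the asker's code.

```python
# Minimal sketch (assumed setup, not the asker's code): record one averaged
# training-loss value per epoch.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(512, 1, 28, 28)          # stand-in for Fashion-MNIST images
y = torch.randint(0, 10, (512,))          # stand-in labels
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

epoch_losses = []
for epoch in range(3):
    running, n = 0.0, 0
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
        running += loss.item() * len(xb)
        n += len(xb)
    epoch_losses.append(running / n)      # one averaged value per epoch
    print(f"epoch {epoch}: loss {epoch_losses[-1]:.4f}")
```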

2 votes · 0 answers · 336 views

When learning about Neural Networks and Gradient Descent, we are often shown the following picture that illustrates the obstacles that can be encountered when trying to optimize the Loss Functions ...

1 vote · 0 answers · 263 views

I am trying to use Hugging Face Datasets for speech recognition with transformers, following this tutorial, with epochs=30, steps=400, train_batch_size=16. Training loss, validation loss and WER decrease, and ...

0 votes · 1 answer · 111 views

If a gradient points towards a max or a min, what stops gradient descent from maximizing error instead of minimizing it? Is it the nature of the update step that makes this process one way?
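
What makes it one-way is the sign in the update rule (a standard fact, added here for illustration): the gradient $\nabla L(\theta)$ points in the direction of steepest ascent, and gradient descent moves against it,
$$\theta_{k+1} = \theta_k - \eta\,\nabla L(\theta_k),$$
so for a sufficiently small step size $\eta$ the loss does not increase (for smooth losses). Using $+\eta\,\nabla L(\theta_k)$ instead would be gradient ascent and would drive the error up.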

0 votes · 0 answers · 105 views

I have created a neural network that feeds an image into a convolutional neural net, then feeds the flattened output of this network into an artificial neural network. I have a feeling that my ...

375 votes · 9 answers · 377k views

I'm training a neural network, but the training loss doesn't decrease. How can I fix this? I'm not asking about overfitting or regularization. I'm asking about how to solve the problem where my network'...

30 votes · 6 answers · 11k views

Given a convex cost function, using SGD for optimization, we will have a gradient (vector) at a certain point during the optimization process. My question is, given the point on the convex, does the ...

26 votes · 4 answers · 10k views

I guess this is a basic question and it has to do with the direction of the gradient itself, but I'm looking for examples where 2nd order methods (e.g. BFGS) are more effective than simple gradient ...
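
A standard illustration of the gap (added here, not taken from the question): on a strictly convex quadratic $f(x) = \tfrac{1}{2}x^\top H x$, gradient descent iterates $x_{k+1} = x_k - \eta H x_k$ and slows down as the condition number of $H$ grows, whereas the Newton step
$$x_{k+1} = x_k - H^{-1}\nabla f(x_k) = x_k - H^{-1}H x_k = 0$$
lands on the minimizer in one iteration. Quasi-Newton methods such as BFGS build an approximation to $H^{-1}$ from successive gradients, which is why they shine on ill-conditioned problems.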

13 votes · 4 answers · 3k views

I am trying to understand gradient descent optimization in ML (machine learning) algorithms. I understand that there's a cost function, where the aim is to minimize the error $\hat y-y$. In a ...
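
For concreteness, one common instantiation (an assumed example, not necessarily the asker's exact setup): with linear predictions $\hat y_i = \theta^\top x_i$, the squared-error cost is
$$J(\theta) = \frac{1}{2n}\sum_{i=1}^{n}(\hat y_i - y_i)^2, \qquad \nabla_\theta J(\theta) = \frac{1}{n}\sum_{i=1}^{n}(\hat y_i - y_i)\,x_i,$$
and gradient descent repeatedly applies $\theta \leftarrow \theta - \eta\,\nabla_\theta J(\theta)$.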

6 votes · 2 answers · 15k views

I've implemented my own gradient descent algorithm for an OLS, code below. It works; however, when the learning rate is too large (i.e. learn_rate >= .3), my approach is unstable. The coefficients ...
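
Since the asker's code isn't reproduced here, a minimal sketch of gradient descent for OLS on synthetic data (assumed names and setup): with the cost $\frac{1}{2n}\lVert X\beta - y\rVert^2$, the iteration is stable only when the learning rate is below $2/\lambda_{\max}(X^\top X/n)$, which is why a rate that is fine on one dataset can blow up on another.

```python
# Minimal sketch of gradient descent for OLS (assumed setup, not the asker's code).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=200)

def gd_ols(X, y, learn_rate, n_iter=500):
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / len(y)   # gradient of (1/2n)||X beta - y||^2
        beta -= learn_rate * grad
    return beta

print(gd_ols(X, y, learn_rate=0.1))                      # converges near beta_true
lam_max = np.linalg.eigvalsh(X.T @ X / len(y)).max()
print("diverges once learn_rate exceeds", 2 / lam_max)   # data-dependent threshold
```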

10 votes · 2 answers · 8k views

I've noticed in different papers that, after a certain number of epochs, there is sometimes a sudden drop in error rate when training a CNN. This example is taken from the "Densely Connected ...

1 vote · 0 answers · 3k views

I just wondered: are there cases where small or very small learning rates in gradient-descent-based optimization are useful? A large learning rate allows the model to explore a much larger portion ...
