Linked Questions
1 vote
0 answers
303 views
How does Feature Scaling help Gradient Descent? [duplicate]
I am following deeplearning.ai's videos on Coursera. I have a couple of questions about feature scaling using the formula: $$(x - \mu)/\sigma$$ Edit: There are similar questions which deal with ...
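As a quick illustration (not part of the question; the NumPy code and example feature values below are made up), standardizing each feature with $(x - \mu)/\sigma$ puts all columns on a comparable scale, which keeps gradient descent updates from being dominated by the largest-valued feature:

```python
import numpy as np

def standardize(X):
    """Scale each column to zero mean and unit variance: (x - mu) / sigma."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma

# Two features on very different scales, e.g. house size vs. number of rooms.
X = np.array([[2100.0, 3.0],
              [1600.0, 2.0],
              [2400.0, 4.0]])
X_scaled = standardize(X)
print(X_scaled.mean(axis=0))  # approximately [0, 0]
print(X_scaled.std(axis=0))   # approximately [1, 1]
```

After scaling, the loss surface is far less elongated, so a single learning rate works reasonably well in every direction.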
0 votes
0 answers
105 views
The result of back propagation for a neural network [duplicate]
I have created a neural network that feeds an image into a convolutional neural net, then feeds the flattened output of this network into an artificial neural network. I have a feeling that my ...
26 votes
4 answers
10k views
Why are second-order derivatives useful in convex optimization?
I guess this is a basic question and it has to do with the direction of the gradient itself, but I'm looking for examples where 2nd order methods (e.g. BFGS) are more effective than simple gradient ...
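One toy example of the kind being asked for (a sketch with a made-up ill-conditioned quadratic, not taken from any answer): on $f(x) = \tfrac{1}{2}x^\top A x$, plain gradient descent must use a step size small enough for the stiffest direction and therefore crawls along the shallow one, while a Newton step, which uses the Hessian, lands on the minimum immediately.

```python
import numpy as np

# f(x) = 0.5 * x^T A x with an ill-conditioned A (condition number 100).
A = np.diag([1.0, 100.0])
x0 = np.array([1.0, 1.0])

def grad(x):
    return A @ x

# Plain gradient descent: the step size is capped by the largest eigenvalue,
# so progress along the shallow first coordinate is very slow.
x = x0.copy()
eta = 1.0 / 100.0
for _ in range(500):
    x = x - eta * grad(x)
print("gradient descent:", x)   # roughly [0.0066, 0]: the first coordinate shrinks by only 1% per step

# Newton step: x0 - H^{-1} grad(x0) = x0 - A^{-1} A x0 = 0, the exact minimum.
x_newton = x0 - np.linalg.solve(A, grad(x0))
print("newton step:", x_newton)  # [0, 0]
```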
7 votes
1 answer
4k views
How can change in cost function be positive?
In chapter 1 of Nielsen's Neural Networks and Deep Learning it says: To make gradient descent work correctly, we need to choose the learning rate η to be small enough that Equation (9) is a good ...
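For context, the first-order approximation the excerpt refers to, and why a sufficiently small η makes the change in cost non-positive (a standard derivation, written out here rather than quoted from the book):

$$ \Delta C \approx \nabla C \cdot \Delta v, \qquad \Delta v = -\eta\,\nabla C \;\Rightarrow\; \Delta C \approx -\eta\,\|\nabla C\|^2 \le 0. $$

The guarantee $\Delta C \le 0$ only holds while η is small enough that the ignored higher-order terms are negligible.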
6 votes
3 answers
2k views
Basic preconditioned gradient descent example
I'm exploring preconditioned gradient descent using a similar toy problem described in the first part of Lecture 8: Accelerating SGD with preconditioning and adaptive learning rates. I have the ...
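A minimal sketch of the idea, using a made-up diagonal quadratic rather than the problem from the lecture: the preconditioner $P$ rescales the gradient so that every coordinate converges at the same rate; here $P$ is simply the inverse of the known diagonal Hessian.

```python
import numpy as np

# Toy objective f(x) = 0.5 * x^T H x with a badly scaled diagonal Hessian.
H = np.diag([1.0, 50.0])
P = np.diag(1.0 / np.diag(H))   # preconditioner: inverse of the diagonal Hessian
x = np.array([1.0, 1.0])
eta = 1.0

for _ in range(5):
    g = H @ x                   # gradient of the quadratic
    x = x - eta * (P @ g)       # preconditioned step rescales each coordinate
print(x)  # [0, 0] after the first step; plain GD would need eta < 2/50 to stay stable
```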
1 vote
3 answers
2k views
Is there a reason we need to make a logistic regression linear using the logit?
My understanding is that we use the logit function to convert the sigmoidal curve of a logistic regression to be linear. As a result, we go from a curve modeled as $P = e^{a+bX}/(1 + e^{a+bX})$ to one that ...
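Written out for clarity (a standard identity, not quoted from the question), the logit transform turns the sigmoidal model into one that is linear in the parameters:

$$ P = \frac{e^{a+bX}}{1+e^{a+bX}} \quad\Longleftrightarrow\quad \log\frac{P}{1-P} = a + bX. $$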
7 votes
1 answer
1k views
Is there any paper which summarizes the mathematical foundation of deep learning? [closed]
Is there any paper which summarizes the mathematical foundation of deep learning? Now, I am studying about the mathematical background of deep learning. However, unfortunately I cannot know to what ...
9 votes
1 answer
1k views
How to deal with unstable estimates during curve fitting?
First of all, I understand this isn't a strictly statistical question, but I've seen other questions involving optim() here. Please feel free to suggest a better SE ...
3 votes
2 answers
716 views
Problems that are difficult for SGD
I am doing some research on problems for which stochastic gradient descent doesn't perform well. SGD is often mentioned as the best method for training neural networks. However, I've also ...
1 vote
1 answer
209 views
Can we apply analyticity of a neural network to improve upon gradient descent? [duplicate]
Gradient descent uses the first-order derivative information of the objective function as a function of the parameters. Gradient descent therefore uses only “local” information about the objective ...
1 vote
1 answer
335 views
Why does gradient descent HAVE to find the minimum, as opposed to moving in the opposite direction?
I have a question about the gradient descent step in neural networks. I fully understand the derivative step and taking the steps required to move in the direction that reduces the loss (finding the ...
1 vote
1 answer
447 views
Alternating negative and positive values of slope and y-intercept in gradient descent
I'm working with the following code for gradient descent for simple linear regression: ...
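The question's own code is elided above. As a stand-in, here is a minimal gradient-descent loop for simple linear regression on made-up, noise-free data; it also shows the usual cause of sign-alternating estimates, namely a learning rate large enough that the iterates overshoot the minimum on every step:

```python
import numpy as np

# Made-up, noise-free data: y = 2x + 1.
x = np.arange(1.0, 11.0)
y = 2.0 * x + 1.0

def gd(learning_rate, steps):
    """Plain gradient descent on the mean squared error of y ~ m*x + b."""
    m, b = 0.0, 0.0
    for _ in range(steps):
        resid = m * x + b - y
        grad_m = 2.0 * np.mean(resid * x)
        grad_b = 2.0 * np.mean(resid)
        m -= learning_rate * grad_m
        b -= learning_rate * grad_b
    return m, b

print(gd(0.01, steps=2000))  # approaches m = 2, b = 1
print(gd(0.05, steps=20))    # too large: m and b flip sign on each step and blow up
```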
0 votes
1 answer
324 views
Interpreting cost change plot in a neural network for learning XOR
I tried to build a neural net for learning XOR. The design is as follows: 1st layer: compute a linear function of the 4:2 input with 2:2 weights and add a 1:2 bias. 2nd layer: apply sigmoid to all ...
3 votes
1 answer
163 views
How does a neural network with stochastic backpropagation make sure it doesn't "undo" previous learning?
Assume we have a neural network with stochastic gradient descent used for backpropagation, and therefore each element in the training set is used once to calculate the error, and then to adjust the ...
2 votes
1 answer
156 views
Gradient Descent Rule in feedforward ANN
I am having a hard time understanding the Gradient Descent Rule for learning in a feedforward ANN. In particular, how do we determine the initial weight vector, and how is this weight vector adjusted ...
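A hedged sketch of the two pieces the question asks about, using a hypothetical single sigmoid neuron: the initial weights are typically just small random numbers, and the gradient descent rule then nudges each weight against its error gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical single sigmoid neuron: 3 inputs, squared-error loss.
W = rng.normal(scale=0.1, size=3)   # initial weights: small random values
b = 0.0
eta = 0.5

x = np.array([0.2, -0.4, 0.7])      # one made-up training example
t = 1.0                             # its target output

y = 1.0 / (1.0 + np.exp(-(W @ x + b)))   # forward pass
delta = (y - t) * y * (1.0 - y)          # dL/dz for L = 0.5*(y - t)^2 through the sigmoid
W -= eta * delta * x                     # gradient descent rule: w <- w - eta * dL/dw
b -= eta * delta                         # same rule for the bias
```

A multi-layer network applies the same rule layer by layer, with the gradients obtained by backpropagation.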
0 votes
1 answer
111 views
What stops gradient descent from finding the largest error? [duplicate]
If a gradient points towards a max or a min, what stops gradient descent from maximizing the error instead of minimizing it? Is it the nature of the update step that makes this process one-way?
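A small numeric check of the point at issue (toy objective and step size chosen arbitrarily): the gradient points in the direction of steepest increase, so it is precisely the minus sign in the update $x \leftarrow x - \eta\,\nabla f(x)$ that makes the process one-way towards lower error.

```python
import numpy as np

# Toy objective f(x) = ||x||^2 and its gradient; the starting point is arbitrary.
def f(x):
    return float((x ** 2).sum())

def grad(x):
    return 2 * x

x = np.array([3.0, -2.0])
eta = 0.1
print(f(x))                    # 13.0  at the current point
print(f(x - eta * grad(x)))    # 8.32  stepping against the gradient decreases f
print(f(x + eta * grad(x)))    # 18.72 stepping along the gradient increases f
```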