In Andrew Ng's Machine Learning tutorial, he takes the first derivative of the error function and then takes small steps in the direction opposite the derivative to find the minimum (basically gradient descent).
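
For concreteness, here is a minimal sketch of that idea (not code from the tutorial): gradient descent on a one-parameter squared-error cost `J(theta) = mean((theta*x - y)^2)` with made-up data and an assumed learning rate `alpha`.

```python
import numpy as np

# toy data, roughly y = 2x (made up for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

theta = 0.0
alpha = 0.01                                   # step size (assumed; needs tuning)
for _ in range(1000):
    grad = 2 * np.mean((theta * x - y) * x)    # dJ/dtheta
    theta -= alpha * grad                      # small step against the derivative
print(theta)                                   # converges near the least-squares slope
```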
In The Elements of Statistical Learning, the first derivative of the error function is set to zero and the root of that equation is found with a numerical method (in that case Newton-Raphson).
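
Again as a sketch (not the book's code), the same toy cost solved the ESL way: apply Newton-Raphson to the root-finding problem `J'(theta) = 0`, i.e. update `theta <- theta - J'(theta) / J''(theta)`.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

theta = 0.0
for _ in range(10):
    grad = 2 * np.mean((theta * x - y) * x)    # J'(theta)
    hess = 2 * np.mean(x * x)                  # J''(theta), constant for this quadratic cost
    theta -= grad / hess                       # Newton-Raphson step on J'
print(theta)
```

For a quadratic cost like this, a single Newton step lands exactly on the minimum, whereas gradient descent approaches it gradually at a rate set by the learning rate.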
On paper, both methods should yield similar results. But are they numerically different in practice, or is one method better than the other?