
Can Saddle Points Provide "Better Solutions" to Machine Learning Models Compared to Local Minimums?

The solution to a Machine Learning model (i.e. the final model parameters) is selected by trying to optimize the Loss Function associated with that model. The "best" solution (i.e. the "best" choice of model parameters) is the one associated with the "Global Minimum" of this Loss Function (i.e. its smallest value) - thus, "relatively better" solutions can be considered as solutions located closer to the "Global Minimum". Optimization Algorithms (e.g. Gradient Descent) try to search for the "Global Minimum" of the Loss Function by repeatedly stepping in the direction opposite to the derivatives (gradient) of this Loss Function.
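For concreteness, here is a minimal Gradient Descent sketch on a toy one-dimensional loss. The loss function, learning rate, and number of iterations are my own illustrative choices, not taken from any particular model:

```python
# Minimal gradient descent sketch on a toy 1D loss (illustrative only).
def loss(w):
    return (w - 3.0) ** 2          # toy loss with its global minimum at w = 3

def grad(w):
    return 2.0 * (w - 3.0)         # derivative of the toy loss

w = 0.0                            # initial parameter
lr = 0.1                           # learning rate
for _ in range(100):
    w -= lr * grad(w)              # step against the derivative (steepest descent)

print(w, loss(w))                  # w approaches 3, loss approaches 0
```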

However, there are different obstacles that can occur during this search process. For instance:

  • The Optimization Algorithm can get stuck in a "Local Minimum"
  • The Optimization Algorithm can get stuck in a "Saddle Point"

I have heard "Saddle Points" as being considered "worse" than "Local Minimums" - this is because "Saddle Points" aren't actually a minimum of any sort, whereas "Local Minimums" are at least minimums at the local level. Thus, this would imply that model parameters chosen from a "Saddle Point" should be worse than model parameters chosen from a "Local Minimum". To further illustrate my question, I drew the following graph of a hypothetical Loss Function for some Machine Learning model:

[Figure: a hypothetical 1D Loss Function with a "Local Minimum" (P1), a "Saddle Point" (P2) at a lower loss than P1, and the "Global Minimum" (P3)]

In the above picture, we can see that the Loss Function has a smaller loss at the "Saddle Point" compared to the loss at the "Local Minimum". Thus, in this case - (assuming that we could not reach "P3") if we had to choose between model parameters from "P2" ("Saddle Point") and "P1" ("Local Minimum") - it would clearly make more sense to pick the model parameters from "P2".

My Question: In general, do we know if solutions corresponding to "Local Minimum" points on a Loss Function are considered "better" than solutions corresponding to "Saddle Points" (e.g. perhaps solutions from "Local Minimums" might be more "stable")? Or is this claim incorrect, and solutions corresponding to regions of the Loss Function with lower Loss values are generally "better" - regardless of whether they come from a "Saddle Point" or a "Local Minimum"?

Thanks!


1 Answer


Your interpretation of the graph is correct: in this simple example, the weights taken at the saddle point will perform better than those taken at the local minimum. That said, this 1D example is a bit of an outlier compared to what happens in more realistic scenarios.

If we keep in mind the second partial derivative test, used to check whether a point of a function is a maximum, minimum or saddle point, then it's easy to see that, especially in higher dimensions, the probability of a point being a saddle point is higher than the probability of it being a maximum or a minimum.

When analyzing a critical point, we first compute the determinant of the Hessian matrix of our loss function (I'm using the two-variable formulation for convenience):

$H(x, y)=f_{xx}(x, y)f_{yy}(x, y) - (f_{xy}(x,y))^{2}$

If $H(x,y)<0$ then the point is a saddle point. Otherwise, if $H(x,y)>0$, it is a maximum or a minimum (the sign of $f_{xx}$ then tells us which one). Notice that $f_{xx}(x, y)$ tells us the direction of concavity along the x axis, whereas $f_{yy}(x, y)$ tells us the direction of concavity along the y axis. This makes sense: if the concavities point in different directions for different dimensions (for example, concave up along the y axis and concave down along the x axis), we can't have a maximum or a minimum, and indeed the resulting determinant is negative, indicating a saddle point.
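As a concrete illustration (my own sketch, not part of the original argument), the test can be run symbolically on two toy functions whose critical points at the origin are a minimum and a saddle point respectively:

```python
# Minimal sketch of the second partial derivative test using SymPy.
# The functions x^2 + y^2 and x^2 - y^2 are illustrative choices.
import sympy as sp

x, y = sp.symbols("x y")

def classify_critical_point(f, point):
    """Classify a critical point of f(x, y) via the Hessian determinant."""
    fxx = sp.diff(f, x, 2)
    fyy = sp.diff(f, y, 2)
    fxy = sp.diff(f, x, y)
    H = (fxx * fyy - fxy**2).subs(point)   # determinant of the Hessian
    if H < 0:
        return "saddle point"
    if H > 0:
        return "minimum" if fxx.subs(point) > 0 else "maximum"
    return "inconclusive"

print(classify_critical_point(x**2 + y**2, {x: 0, y: 0}))  # minimum
print(classify_critical_point(x**2 - y**2, {x: 0, y: 0}))  # saddle point
```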

The take-home point is that a local minimum, unlike a saddle point, requires the concavity to point in the same (upward) direction along every dimension of the function, i.e. the Hessian must be positive definite. This implies that, as the number of dimensions increases (and with real models we can easily start from thousands of them), the probability of a critical point being a saddle point increases as well, while the probability of it being a local or global minimum decreases.
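A quick way to see this effect of dimensionality (again my own sketch, under the simplifying assumption that the Hessian behaves like a random symmetric matrix with Gaussian entries, which is only a caricature of a real loss surface) is to estimate how often all eigenvalues come out positive:

```python
# Minimal sketch: estimate how often a random symmetric "Hessian" is
# positive definite (all eigenvalues > 0, i.e. it looks like a minimum)
# as the number of dimensions grows. Assumes i.i.d. Gaussian entries.
import numpy as np

rng = np.random.default_rng(0)

def fraction_positive_definite(dim, trials=2000):
    count = 0
    for _ in range(trials):
        A = rng.standard_normal((dim, dim))
        H = (A + A.T) / 2                      # symmetrize
        if np.all(np.linalg.eigvalsh(H) > 0):  # all curvatures positive?
            count += 1
    return count / trials

for d in (1, 2, 5, 10):
    print(d, fraction_positive_definite(d))
# The fraction falls rapidly with d: critical points with mixed curvature
# signs (saddle points) dominate in higher dimensions.
```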

So in a more realistic case, purely as a matter of probability, a solution associated with a random local minimum is likely to be better than a solution associated with a random saddle point.

