Can Saddle Points Provide "Better Solutions" to Machine Learning Models Compared to Local Minimums?
The solution to a Machine Learning model (i.e. the final model parameters) is selected by trying to optimize the Loss Function associated with that model. The "best" solution (i.e. the "best" choice of model parameters) is the one associated with the "Global Minimum" of this Loss Function (i.e. the smallest value of the Loss Function) - thus, "relatively better" solutions can be thought of as solutions located closer to the "Global Minimum". Optimization Algorithms (e.g. Gradient Descent) search for the "Global Minimum" of the Loss Function by repeatedly stepping in the direction opposite to the gradient of this Loss Function.
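As a minimal sketch of this idea (the quadratic loss, starting point, and learning rate below are all hypothetical, chosen just to show the update rule):

```python
def loss(w):
    # Hypothetical quadratic loss with its minimum at w = 3
    return (w - 3) ** 2

def grad(w):
    # Derivative of the loss with respect to w
    return 2 * (w - 3)

w = 0.0    # hypothetical initial parameter
lr = 0.1   # hypothetical learning rate (step size)
for _ in range(100):
    w -= lr * grad(w)  # step opposite to the gradient

print(w)  # converges toward the minimizer w = 3
```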
However, different obstacles can arise during this search process. For instance:
- The Optimization Algorithm can get stuck in a "Local Minimum"
- The Optimization Algorithm can get stuck in a "Saddle Point"
I have heard "Saddle Points" described as being "worse" than "Local Minimums" - this is because "Saddle Points" aren't actually minimums of any sort, whereas "Local Minimums" are at least minimums at the local level. This would imply that model parameters chosen at a "Saddle Point" should be worse than model parameters chosen at a "Local Minimum". To further illustrate my question, I drew the following graph of a hypothetical Loss Function for some Machine Learning model:
In the above picture, we can see that the Loss Function has a smaller loss at the "Saddle Point" than at the "Local Minimum". Thus, in this case - assuming that we could not reach "P3" - if we had to choose model parameters from either "P2" (the "Saddle Point") or "P1" (the "Local Minimum"), it would clearly make more sense to pick the parameters from "P2".
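A concrete (entirely hypothetical) 1-D loss with this shape can be built by choosing the derivative to be (w+1)·w·(w−2)²·(w−3): this gives a local minimum at w = −1, a saddle point (an inflection with zero slope) at w = 2, and the global minimum at w = 3, where the loss at the saddle is lower than the loss at the local minimum:

```python
def loss(w):
    # Hypothetical loss: the antiderivative of (w+1)*w*(w-2)**2*(w-3)
    return w**6/6 - 6*w**5/5 + 9*w**4/4 + 4*w**3/3 - 6*w**2

def grad(w):
    # Derivative of the loss; zero at all four stationary points
    return (w + 1) * w * (w - 2)**2 * (w - 3)

print(loss(-1.0))  # local minimum ("P1"), loss ≈ -3.72
print(loss(2.0))   # saddle point ("P2"), loss ≈ -5.07 (lower than the local min)
print(loss(3.0))   # global minimum ("P3"), loss ≈ -5.85
```

Gradient descent started just left of w = 0 converges to the local minimum at w = −1, even though the saddle at w = 2 (and the global minimum beyond it) has strictly lower loss.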
My Question: In general, do we know if solutions corresponding to "Local Minimum" points on a Loss Function are considered "better" than solutions corresponding to "Saddle Points" (e.g. perhaps solutions from "Local Minimums" might be more "stable")? Or is this claim incorrect, and are solutions corresponding to regions of the Loss Function with lower Loss values generally "better" - regardless of whether they come from a "Saddle Point" or a "Local Minimum"?
Thanks!
