In gradient descent for neural networks, we optimize over a loss surface defined by the loss function L(W), where W denotes the network weights. However, because the weight space is continuous, there are uncountably many weight configurations, so we can never compute or store the complete geometric surface of this loss function.
This raises a question: What exactly are we optimizing over if we only ever compute point-wise evaluations of the loss? How can we meaningfully talk about descending a surface that we never fully construct?
I understand that at each step we can:
- Compute the loss at our current weights
- Compute the gradient at that point
- Take a step in the direction of steepest descent
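The three steps above can be sketched concretely. This is a minimal illustration on a toy least-squares loss (the data, step size, and iteration count are my own illustrative choices, not part of the question): the surface L(w) is never materialized, yet repeated point-wise queries of the loss and its gradient are enough to descend it.

```python
import numpy as np

# Toy loss: L(w) = ||Xw - y||^2. The surface over weight space is never
# constructed explicitly -- we only ever query it at individual points.
# (X, y, lr, and the iteration count are illustrative assumptions.)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

def loss(w):
    # Step 1: compute the loss at the current weights (a single point).
    r = X @ w - y
    return r @ r

def grad(w):
    # Step 2: compute the gradient at that point -- also a point-wise
    # evaluation, giving only local first-order information.
    return 2.0 * X.T @ (X @ w - y)

w = np.zeros(3)
lr = 0.01
for _ in range(500):
    # Step 3: move in the direction of steepest descent.
    w -= lr * grad(w)

print(loss(w))  # the loss has been driven near zero using only local queries
```

The point of the sketch: at no step does the algorithm need the surface as a global object; a sequence of local evaluations suffices, which is exactly the situation the question describes.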
But I'm struggling to understand the geometric/mathematical meaning of optimizing over an implicit surface that we never fully realize. What is the theoretical foundation for this?