In gradient descent for neural networks, we optimize over a loss surface defined by the loss function L(W), where W denotes the network weights. However, because the weight space is continuous, there are uncountably many weight configurations, so we can never compute or store the complete geometric surface of this loss function.
This raises a question: What exactly are we optimizing over if we only ever compute point-wise evaluations of the loss? How can we meaningfully talk about descending a surface that we never fully construct?
I understand that at each step we can:
- Compute the loss at our current weights
- Compute the gradient at that point
- Take a step in the direction of steepest descent
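The three steps above can be sketched concretely. This is a minimal illustration on a toy least-squares loss (the data, step size, and iteration count are my own illustrative choices, not part of the question): the surface L(w) is never materialized, yet repeated point-wise queries of the loss and its gradient are enough to descend it.

```python
import numpy as np

# Toy loss: L(w) = ||Xw - y||^2. The surface over weight space is never
# constructed explicitly -- we only ever query it at individual points.
# (X, y, lr, and the iteration count are illustrative assumptions.)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

def loss(w):
    # Step 1: compute the loss at the current weights (a single point).
    r = X @ w - y
    return r @ r

def grad(w):
    # Step 2: compute the gradient at that point -- also a point-wise
    # evaluation, giving only local first-order information.
    return 2.0 * X.T @ (X @ w - y)

w = np.zeros(3)
lr = 0.01
for _ in range(500):
    # Step 3: move in the direction of steepest descent.
    w -= lr * grad(w)

print(loss(w))  # the loss has been driven near zero using only local queries
```

The point of the sketch: at no step does the algorithm need the surface as a global object; a sequence of local evaluations suffices, which is exactly the situation the question describes.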
But I'm struggling to understand the geometric/mathematical meaning of optimizing over an implicit surface that we never fully realize. What is the theoretical foundation for this?