
I have read that batch gradient descent recomputes a sum over the whole training set at every update step, which makes it time consuming.

But if we have the following hypothesis function: $$h(x^i) = w_0 + w_1x^i$$

Then the gradient descent update rules (for the usual mean squared error cost) become:

$$w_1 = w_1 + \frac{\alpha}{N}\sum_{i=1}^N (y^i - w_0 - w_1x^i)\,x^i$$

$$w_0 = w_0 + \frac{\alpha}{N}\sum_{i=1}^N (y^i - w_0 - w_1x^i)$$

which can be represented as:

$$w_1 = w_1 + \frac{\alpha}{N}\left(\sum_{i=1}^N y^i x^i - w_0\sum_{i=1}^N x^i - w_1\sum_{i=1}^N (x^i)^2\right)$$

and

$$w_0 = w_0 + \frac{\alpha}{N}\left(\sum_{i=1}^N y^i - N\,w_0 - w_1\sum_{i=1}^N x^i\right)$$

In these rules, the following sums can be precomputed once:

$$\sum_{i=1}^N y^i x^i$$

$$\sum_{i=1}^N y^i$$

$$\sum_{i=1}^N x^i$$

$$\sum_{i=1}^N (x^i)^2$$

So we don't need to perform the summation at every individual update of the parameter.
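
To make the precomputation concrete, here is a minimal NumPy sketch of full-batch gradient descent for this model written entirely in terms of those four sums. The data, learning rate, and iteration count are illustrative assumptions, not values from the question.

```python
import numpy as np

# Toy data; the true intercept/slope (3.0, 2.0) are illustrative assumptions.
rng = np.random.default_rng(0)
N = 1000
x = rng.normal(size=N)
y = 3.0 + 2.0 * x + rng.normal(scale=0.1, size=N)

# Precompute the four sums once; they never change across iterations.
S_x  = x.sum()        # sum of x^i
S_y  = y.sum()        # sum of y^i
S_xy = (x * y).sum()  # sum of y^i * x^i
S_xx = (x * x).sum()  # sum of (x^i)^2

w0, w1, alpha = 0.0, 0.0, 0.1
for _ in range(2000):
    # Full-batch updates expressed through the precomputed sums (simultaneous update).
    w1_new = w1 + alpha / N * (S_xy - w0 * S_x - w1 * S_xx)
    w0_new = w0 + alpha / N * (S_y - N * w0 - w1 * S_x)
    w0, w1 = w0_new, w1_new

print(w0, w1)  # approaches the true intercept and slope (about 3.0 and 2.0)
```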

Then why do we need stochastic gradient descent?

Comments:

  • The model $h(x)$ is just linear regression. If you're fitting a linear regression, then why are you using iterative methods at all? Commented Nov 1, 2022 at 14:36
  • @Sycorax I used this example because this is what is used to explain gradient descent everywhere. Commented Nov 1, 2022 at 14:59
  • The typical application of SGD is in neural networks with 1 or more hidden layers, which do not reduce to these types of simple expressions due to the application of activation functions. While these examples can be useful to give a flavor for what SGD looks like, they are not useful for understanding realistic use-cases. Commented Nov 1, 2022 at 15:02

1 Answer


Turning my comments into an answer...

The model $h(x)$ is just linear regression. The gradient updates imply that this is an OLS regression. If you're fitting an OLS regression, then you have no need for iterative methods at all. If the design matrix $X$ is full-rank, then you can use the normal equations to estimate the coefficients. If $X$ is not full-rank, then a penalized regression strategy such as ridge regression could be used, or you can use a pseudo-inverse.
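
As a rough sketch of those closed-form alternatives (assuming NumPy; the toy data and the ridge penalty value are made up for illustration):

```python
import numpy as np

# Illustrative data only; X is a design matrix with an intercept column.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([3.0, 2.0]) + rng.normal(scale=0.1, size=n)

# Normal equations: valid when X has full column rank.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Pseudo-inverse: gives the minimum-norm solution even if X is rank-deficient.
beta_pinv = np.linalg.pinv(X) @ y

# Ridge regression: a penalized alternative when X is rank-deficient or ill-conditioned.
lam = 1e-3
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print(beta_normal, beta_pinv, beta_ridge)
```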

The typical application of SGD is in neural networks with 1 or more hidden layers, which do not reduce to these kinds of simple expressions because of the activation functions. While the question's example can be useful to give a flavor for what SGD looks like, it's not a complete picture, because it neglects the consequences of using a more powerful model.
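
For a rough idea of why the precomputation trick from the question breaks down, here is a minimal sketch (illustrative data, layer sizes, and hyperparameters, not taken from the question) of mini-batch SGD for a one-hidden-layer network with a tanh activation. The gradients mix the data with the current weights inside the nonlinearity, so there is no fixed set of data sums that can be reused across updates.

```python
import numpy as np

# Toy nonlinear target; all sizes and learning rates are illustrative assumptions.
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 1))
y = np.sin(2.0 * x) + rng.normal(scale=0.1, size=(1000, 1))

H = 16                                   # hidden width
W1 = rng.normal(scale=0.5, size=(1, H))
b1 = np.zeros(H)
W2 = rng.normal(scale=0.5, size=(H, 1))
b2 = np.zeros(1)
alpha, batch = 0.05, 32

for _ in range(2000):
    idx = rng.integers(0, len(x), size=batch)   # sample a mini-batch
    xb, yb = x[idx], y[idx]

    h = np.tanh(xb @ W1 + b1)                   # hidden activations depend on the current W1, b1
    pred = h @ W2 + b2
    err = pred - yb

    # Backprop through the nonlinearity: these gradients depend on the current
    # weights inside tanh, so they cannot be written as weight-independent data sums.
    gW2 = h.T @ err / batch
    gb2 = err.mean(axis=0)
    dh  = (err @ W2.T) * (1.0 - h**2)
    gW1 = xb.T @ dh / batch
    gb1 = dh.mean(axis=0)

    W2 -= alpha * gW2
    b2 -= alpha * gb2
    W1 -= alpha * gW1
    b1 -= alpha * gb1
```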

