Given the likelihood function $P(\mathcal{D};\mu,\sigma^2)$, where $\mathcal{D}$ is the dataset, $\mu$ is the mean and $\sigma^2$ is the variance, gradient descent first involves combining the parameters $\mu$ and $\sigma^2$ into a parameter vector $\theta=[\mu,\sigma^2]^T$ such that:
$$P(\mathcal{D};\mu,\sigma^2) = P(\mathcal{D};\theta)$$
Then, the parameter vector $\theta$ is updated as follows:
$$\theta_{t+1} \leftarrow \theta_t + \epsilon \nabla_{\theta} P(\mathcal{D};\theta)$$
where $\epsilon$ is the learning rate and $\nabla_{\theta} P(\mathcal{D};\theta)$ is the gradient of the likelihood with respect to $\theta$. Note that because we are *maximizing* the likelihood, the gradient is added rather than subtracted; strictly speaking this is gradient ascent, which is equivalent to gradient descent on the negative likelihood (in practice one uses the negative *log*-likelihood, for numerical stability). Coordinate descent and gradient descent coincide when there is a single parameter; with multiple parameters, updating each parameter individually in turn corresponds to coordinate descent, while updating all of them simultaneously as a single parameter vector is gradient descent. Please see this Jupyter notebook I made, which explains how to estimate these parameters both analytically and with gradient descent. Try to open the notebook locally, since GitHub does not render notebooks well.
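As a minimal sketch of the update above, here is gradient ascent on the Gaussian log-likelihood, compared against the closed-form (analytic) MLE. The synthetic dataset, learning rate, and iteration count are made up for illustration; your notebook's values will differ.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=1000)  # stand-in for the dataset D

def grad_log_likelihood(theta, x):
    """Gradient of the Gaussian log-likelihood w.r.t. theta = [mu, sigma^2]."""
    mu, var = theta
    n = x.size
    d_mu = np.sum(x - mu) / var
    d_var = -n / (2 * var) + np.sum((x - mu) ** 2) / (2 * var ** 2)
    return np.array([d_mu, d_var])

theta = np.array([0.0, 1.0])  # initial guess for [mu, sigma^2]
eps = 1e-4                    # learning rate (illustrative choice)
for _ in range(5000):
    # theta_{t+1} <- theta_t + eps * grad (ascent, since we maximize)
    theta = theta + eps * grad_log_likelihood(theta, data)

mu_hat, var_hat = theta

# Analytic MLE for comparison: sample mean and (biased) sample variance
mu_mle = data.mean()
var_mle = ((data - mu_mle) ** 2).mean()
```

After enough iterations `mu_hat` and `var_hat` should agree closely with `mu_mle` and `var_mle`; note that the gradient is taken with respect to the variance $\sigma^2$ itself, not the standard deviation $\sigma$.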
Hope this helps.