This is something I have been trying to understand.
Consider the following local level linear state space model:
- Observation equation: $y_t = \mu_t + \epsilon_t$ where $\epsilon_t \sim N(0, \sigma_\epsilon^2)$
- State equation: $\mu_t = \mu_{t-1} + \eta_t$ where $\eta_t \sim N(0, \sigma_\eta^2)$
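To make the setup concrete, here is a minimal Python sketch of simulating from this model. The function name `simulate_local_level`, the fixed starting level `mu1`, and the seed are my own illustrative choices, not part of the model definition.

```python
import random

def simulate_local_level(T, sigma_eps, sigma_eta, mu1=0.0, seed=0):
    """Draw (y, mu) from the local level model.

    mu1 is a fixed starting level (an assumption made for illustration).
    sigma_eps and sigma_eta are standard deviations, not variances.
    """
    rng = random.Random(seed)
    y, mu = [], []
    level = mu1
    for t in range(T):
        if t > 0:
            # State equation: mu_t = mu_{t-1} + eta_t
            level += rng.gauss(0.0, sigma_eta)
        mu.append(level)
        # Observation equation: y_t = mu_t + eps_t
        y.append(level + rng.gauss(0.0, sigma_eps))
    return y, mu

y, mu = simulate_local_level(T=100, sigma_eps=1.0, sigma_eta=0.5)
```

Only `y` would be available in practice; `mu` is returned here purely so the hidden path can be inspected.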
To estimate the parameters of this model, we need to maximize the following likelihood. The integrand is the probability of the observations conditional on the hidden states and parameters, multiplied by the probability of the states conditional on the parameters; integrating over the states removes them, since they are unobserved:
$$L(\sigma_\epsilon^2, \sigma_\eta^2) = \int p(y_1, ..., y_T | \mu_1, ..., \mu_T, \sigma_\epsilon^2) \cdot p(\mu_1, ..., \mu_T | \sigma_\eta^2) \, d\mu_1 ... d\mu_T$$
The first factor of the integrand is a product of normal densities:
$$p(y_1, ..., y_T | \mu_1, ..., \mu_T, \sigma_\epsilon^2) = \prod_{t=1}^T \frac{1}{\sqrt{2\pi\sigma_\epsilon^2}} \exp\left(-\frac{(y_t - \mu_t)^2}{2\sigma_\epsilon^2}\right)$$
The second factor is also a product of normal densities (a prior is placed on the initial state $\mu_1$ for initialization):
$$p(\mu_1, ..., \mu_T | \sigma_\eta^2) = p(\mu_1) \prod_{t=2}^T \frac{1}{\sqrt{2\pi\sigma_\eta^2}} \exp\left(-\frac{(\mu_t - \mu_{t-1})^2}{2\sigma_\eta^2}\right)$$
Together, taking the prior on the initial state to be $\mu_1 \sim N(0, \kappa^2)$, the full likelihood becomes:
$$\begin{aligned} L(\sigma_\epsilon^2, \sigma_\eta^2) = \int &\left[\prod_{t=1}^T \frac{1}{\sqrt{2\pi\sigma_\epsilon^2}} \exp\left(-\frac{(y_t - \mu_t)^2}{2\sigma_\epsilon^2}\right)\right] \\ &\times \left[\frac{1}{\sqrt{2\pi\kappa^2}} \exp\left(-\frac{\mu_1^2}{2\kappa^2}\right) \prod_{t=2}^T \frac{1}{\sqrt{2\pi\sigma_\eta^2}} \exp\left(-\frac{(\mu_t - \mu_{t-1})^2}{2\sigma_\eta^2}\right)\right] d\mu_1 ... d\mu_T \end{aligned}$$
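Since every factor in the integrand is Gaussian, my understanding is that the integral has a closed form: $y_1, ..., y_T$ are jointly normal with mean zero and covariance $\text{Cov}(y_s, y_t) = \kappa^2 + \sigma_\eta^2 (\min(s, t) - 1) + \sigma_\epsilon^2 \mathbf{1}\{s = t\}$. Below is a rough sketch of evaluating the log-likelihood that way. The function name `loglik_direct` is hypothetical, and the hand-written Cholesky factorization is just one way to do it; its cost grows like $T^3$.

```python
import math

def loglik_direct(y, sigma_eps2, sigma_eta2, kappa2):
    """Log of N(y; 0, Sigma), assuming the prior mu_1 ~ N(0, kappa2).

    All three parameters are variances. This is a sketch, not tuned code.
    """
    T = len(y)
    # With 0-based s, t: Cov(mu_s, mu_t) = kappa2 + sigma_eta2 * min(s, t);
    # the observation noise adds sigma_eps2 on the diagonal.
    S = [[kappa2 + sigma_eta2 * min(s, t) + (sigma_eps2 if s == t else 0.0)
          for t in range(T)] for s in range(T)]

    # Cholesky factorization S = L L^T (lower triangular L).
    L = [[0.0] * T for _ in range(T)]
    for i in range(T):
        for j in range(i + 1):
            acc = S[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(acc) if i == j else acc / L[j][j]

    # Forward substitution for L z = y, so that y' S^{-1} y = z' z.
    z = []
    for i in range(T):
        z.append((y[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])

    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(T))
    quad = sum(v * v for v in z)
    return -0.5 * (T * math.log(2.0 * math.pi) + log_det + quad)
```

A function like this could in principle be handed to a general-purpose numerical optimizer over $(\sigma_\epsilon^2, \sigma_\eta^2)$, with $\kappa^2$ held fixed.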
My Question: Why can't this likelihood function be numerically optimized to obtain the estimates of the state space model? Why is the Kalman Filter typically used for this instead?
My guess is that in the 1960s, evaluating this likelihood numerically was difficult given the limited computing power available. The Kalman filter has a simpler objective function, and its computation is simplified because it relies on recursion.
Is this correct?
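For reference, this is a minimal sketch of the recursion I have in mind: the scalar Kalman filter for the local level model, accumulating the log-likelihood from the one-step-ahead prediction errors. It assumes the same $N(0, \kappa^2)$ prior on $\mu_1$ as above; the function name `loglik_kalman` and the variable names are my own.

```python
import math

def loglik_kalman(y, sigma_eps2, sigma_eta2, kappa2):
    """Log-likelihood of the local level model via the Kalman filter.

    Assumes mu_1 ~ N(0, kappa2); all three parameters are variances.
    """
    a, P = 0.0, kappa2  # predicted mean and variance of mu_1
    loglik = 0.0
    for obs in y:
        v = obs - a                # one-step-ahead prediction error
        F = P + sigma_eps2         # variance of the prediction error
        loglik += -0.5 * (math.log(2.0 * math.pi * F) + v * v / F)
        K = P / F                  # Kalman gain
        a = a + K * v              # predicted mean of the next state
        P = P * (1.0 - K) + sigma_eta2  # predicted variance of the next state
    return loglik
```

Each observation costs a fixed number of scalar operations, so one evaluation is linear in $T$ and never forms a $T \times T$ matrix.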