Let us consider a regression problem where a scalar target variable $y$ must be predicted based on a vector of observables $x$.
We assume that the dynamics are nonlinear and, specifically, that
$$y = f(x, \theta) + \varepsilon$$
where $\theta \in \mathbb{R}^n$ is a vector of unknown real parameters, $f$ is a known deterministic function, nonlinear in $\theta$, and $\varepsilon$ is a random noise with distribution
$$\varepsilon \sim N(0, \sigma^2)$$
for some positive and unknown value of $\sigma$.
If we have $N$ independent observations $(x_i, y_i)$, $i = 1, \dots, N$, we can estimate the value of $\theta$ by maximizing the log-likelihood. We can optionally choose to weight some observations more or less than others by choosing weights $w_i > 0$
and assuming that
$$\varepsilon_i \sim N\!\left(0, \frac{\sigma^2}{w_i}\right)$$
for each $i$.
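For concreteness, the following is a minimal sketch of such a setup in Python/NumPy; the exponential-decay model, the parameter values and all the names (`f`, `theta_true`, `w`, ...) are illustrative assumptions, not part of the method itself.

```python
import numpy as np

# Illustrative model, nonlinear in theta: f(x, theta) = theta_0 * exp(-theta_1 * x)
def f(x, theta):
    return theta[0] * np.exp(-theta[1] * x)

rng = np.random.default_rng(0)
theta_true = np.array([2.0, 0.5])   # "true" parameters used only to simulate data
sigma = 0.1                         # unknown noise level in a real application

N = 50
x = np.linspace(0.0, 5.0, N)
w = np.ones(N)                      # observation weights w_i > 0
# eps_i ~ N(0, sigma^2 / w_i): a larger weight means a smaller noise variance
eps = rng.normal(0.0, sigma / np.sqrt(w))
y = f(x, theta_true) + eps
```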
Under these assumptions, the log-likelihood is given by
$$\log L(\theta, \sigma) = -\frac{N}{2}\log(2\pi\sigma^2) + \frac{1}{2}\sum_{i=1}^N \log w_i - \frac{1}{2\sigma^2}\sum_{i=1}^N w_i\,\bigl(y_i - f(x_i, \theta)\bigr)^2$$
Setting for simplicity of notation
$$r_i(\theta) = y_i - f(x_i, \theta)$$
we see that maximizing the log-likelihood is equivalent to minimizing the following objective function (weighted sum of squared residuals):
$$\mathrm{Obj}(\theta) = \sum_{i=1}^N w_i\, r_i(\theta)^2$$
Moreover, the maximum likelihood estimate for $\sigma$ is
$$\hat\sigma^2 = \frac{\mathrm{Obj}(\hat\theta)}{N} = \frac{1}{N}\sum_{i=1}^N w_i\, r_i(\hat\theta)^2$$
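As a sketch of these two formulas (the function and variable names are again purely illustrative):

```python
import numpy as np

def objective(theta, x, y, w, f):
    """Weighted sum of squared residuals: Obj(theta) = sum_i w_i * r_i(theta)^2."""
    r = y - f(x, theta)             # residuals r_i(theta) = y_i - f(x_i, theta)
    return np.sum(w * r**2)

def sigma2_mle(theta_hat, x, y, w, f):
    """Maximum likelihood estimate of sigma^2: Obj(theta_hat) / N."""
    return objective(theta_hat, x, y, w, f) / len(y)
```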
The Levenberg-Marquardt algorithm computes the minimum of Obj iteratively, through a series of local quadratic approximations.
The algorithm requires that an initial guess $\theta_0$ is provided for the unknown vector of parameters. The objective function Obj is then approximated locally in a neighbourhood of the current estimate $\theta$
with the following quadratic function (the approximation is only valid for small values of $\|\delta\|$):
$$\mathrm{Obj}(\theta + \delta) \;\approx\; \sum_{i=1}^N w_i\,\bigl(r_i(\theta) - \nabla_\theta f(x_i, \theta)^T \delta\bigr)^2$$
The peculiarity here is that, thanks to the objective function's special form, we can calculate a local quadratic approximation by taking the first order expansion of f instead of the second-order expansion of the objective function itself (as we would be forced to do in the general case).
Defining for simplicity of notation
$$J_{ij} = \frac{\partial f(x_i, \theta)}{\partial \theta_j}, \qquad W = \operatorname{diag}(w_1, \dots, w_N), \qquad r = \bigl(r_1(\theta), \dots, r_N(\theta)\bigr)^T,$$
we have that this quadratic has a unique global minimum, characterized by the stationarity condition
$$\nabla_\delta\,\mathrm{Obj}(\theta + \delta) = 0,$$
which is equivalent to requiring that the displacement $\delta$ solves the following linear system:
$$\bigl(J^T W J\bigr)\,\delta = J^T W r$$
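A minimal sketch of this computation, assuming a callable `jac(x, theta)` that returns the $N \times n$ Jacobian matrix $J$ (all names are illustrative):

```python
import numpy as np

def displacement(theta, x, y, w, f, jac):
    """Solve (J^T W J) delta = J^T W r for the displacement delta."""
    r = y - f(x, theta)             # residual vector r
    J = jac(x, theta)               # N x n Jacobian, J_ij = d f(x_i, theta) / d theta_j
    JTW = J.T * w                   # same as J^T @ diag(w), via broadcasting
    return np.linalg.solve(JTW @ J, JTW @ r)
```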
This picture illustrates the calculation of δ in a simple one-dimensional case:
In the picture, the displacement has been applied as it is. In practice, however, since the quadratic approximation is generally only valid locally, $\delta$ just provides the displacement direction, while its magnitude is re-scaled by a small positive number $h$ (the step) when updating $\theta$:
$$\theta \;\leftarrow\; \theta + h\,\delta$$
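Putting the pieces together, a naive version of the iteration could look like the sketch below, with a fixed step $h$ and a fixed number of iterations (stopping criteria are omitted; the names are illustrative):

```python
import numpy as np

def fit(theta0, x, y, w, f, jac, h=0.1, n_iter=200):
    """Iterate theta <- theta + h * delta starting from the initial guess theta0."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        r = y - f(x, theta)
        J = jac(x, theta)
        JTW = J.T * w
        delta = np.linalg.solve(JTW @ J, JTW @ r)
        theta = theta + h * delta   # delta gives the direction, h re-scales its size
    return theta
```

With the illustrative exponential model sketched earlier, `jac` would simply stack the two columns $\partial f/\partial\theta_0 = e^{-\theta_1 x}$ and $\partial f/\partial\theta_1 = -\theta_0\, x\, e^{-\theta_1 x}$.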
The Levenberg-Marquardt algorithm can be extended to incorporate regularization terms. Viewing the problem from a Bayesian perspective, we can decide to provide a prior distribution on $\theta$. For simplicity, we assume that the prior distributions on the different parameter components are independent, so that the global prior distribution can be factorized:
$$p(\theta) = \prod_{j=1}^{n} p_j(\theta_j)$$
We assume moreover that the one-dimensional priors on the single components are either Gaussian or lognormal:
$$\theta_j \sim N(\mu_j, \tau_j^2) \qquad\text{or}\qquad \log\theta_j \sim N(\mu_j, \tau_j^2)$$
Reasoning in Bayesian terms, this time we estimate $\theta$ via maximum posterior instead of maximum likelihood. The posterior distribution of $\theta$ is
$$p(\theta \mid y_1, \dots, y_N) \;\propto\; L(\theta, \sigma)\, p(\theta)$$
Keeping the same notation as before, the objective function to minimize is now
$$\phi(\theta) = \sum_{i=1}^N w_i\, r_i(\theta)^2 \;-\; 2\sigma^2 \sum_{j=1}^{n} \log p_j(\theta_j) \;+\; \mathrm{const}$$
(the term const incorporates terms that are independent of both $\theta$ and $\sigma$). This corresponds to minimizing the weighted sum of squared residuals plus a series of regularization terms.
The following contributions are added to $\phi$ and to the $j$-th component of its gradient due to the presence of the prior distributions:

Gaussian prior:
$$\frac{\sigma^2}{\tau_j^2}\,(\theta_j - \mu_j)^2 \;\text{ added to } \phi, \qquad \frac{2\sigma^2}{\tau_j^2}\,(\theta_j - \mu_j) \;\text{ added to } \frac{\partial\phi}{\partial\theta_j}$$

Lognormal prior:
$$\frac{\sigma^2}{\tau_j^2}\,(\log\theta_j - \mu_j)^2 \;\text{ added to } \phi, \qquad \frac{2\sigma^2}{\tau_j^2}\,\frac{\log\theta_j - \mu_j}{\theta_j} \;\text{ added to } \frac{\partial\phi}{\partial\theta_j}$$

(We have used the first-order expansion of $\log$ in the lognormal case.)
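A sketch of these per-component contributions; `mu`, `tau` and `kind` (one of `"gaussian"` or `"lognormal"` per component) are illustrative names for the prior parameters:

```python
import numpy as np

def prior_contributions(theta, mu, tau, kind, sigma2):
    """Contributions of the priors to phi and to the gradient of phi, per component."""
    phi_add = np.zeros(len(theta))
    grad_add = np.zeros(len(theta))
    for j, k in enumerate(kind):
        if k == "gaussian":
            phi_add[j] = sigma2 / tau[j] ** 2 * (theta[j] - mu[j]) ** 2
            grad_add[j] = 2.0 * sigma2 / tau[j] ** 2 * (theta[j] - mu[j])
        else:  # lognormal
            z = np.log(theta[j]) - mu[j]
            phi_add[j] = sigma2 / tau[j] ** 2 * z ** 2
            grad_add[j] = 2.0 * sigma2 / tau[j] ** 2 * z / theta[j]
    return phi_add, grad_add
```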
The linear system is now
$$\bigl(J^T W J + \sigma^2 A\bigr)\,\delta = J^T W r - \sigma^2 b$$
where $A = \operatorname{diag}(a_1, \dots, a_n)$ and $b = (b_1, \dots, b_n)^T$ collect the contributions of the regularization terms,
$$a_j = \frac{1}{\tau_j^2}, \quad b_j = \frac{\theta_j - \mu_j}{\tau_j^2} \quad\text{(Gaussian prior)}, \qquad a_j = \frac{1}{\tau_j^2\,\theta_j^2}, \quad b_j = \frac{\log\theta_j - \mu_j}{\tau_j^2\,\theta_j} \quad\text{(lognormal prior)},$$
and $\sigma^2$ is replaced by its maximum likelihood estimate (see above).
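Finally, a sketch of the regularized displacement under the same illustrative assumptions, with $\sigma^2$ replaced by its maximum likelihood estimate at the current $\theta$:

```python
import numpy as np

def regularized_displacement(theta, x, y, w, f, jac, mu, tau, kind):
    """Solve (J^T W J + sigma^2 A) delta = J^T W r - sigma^2 b for delta."""
    r = y - f(x, theta)
    J = jac(x, theta)
    JTW = J.T * w
    sigma2 = np.sum(w * r ** 2) / len(y)    # MLE of sigma^2 at the current theta
    a = np.zeros(len(theta))
    b = np.zeros(len(theta))
    for j, k in enumerate(kind):
        if k == "gaussian":
            a[j] = 1.0 / tau[j] ** 2
            b[j] = (theta[j] - mu[j]) / tau[j] ** 2
        else:  # lognormal: log(theta_j + delta_j) ~ log(theta_j) + delta_j / theta_j
            a[j] = 1.0 / (tau[j] ** 2 * theta[j] ** 2)
            b[j] = (np.log(theta[j]) - mu[j]) / (tau[j] ** 2 * theta[j])
    return np.linalg.solve(JTW @ J + sigma2 * np.diag(a), JTW @ r - sigma2 * b)
```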
