Let us consider a regression problem where a scalar target variable $y$ must be predicted based on a vector of observables $x$.
We assume that the dynamics are nonlinear and, specifically, that
$$y = f(x, \theta) + \varepsilon$$
where $\theta \in \mathbb{R}^n$ is a vector of unknown real parameters, $f$ is a known deterministic function, nonlinear in $\theta$, and $\varepsilon$ is a random noise with distribution
$$\varepsilon \sim N(0, \sigma^2)$$
for some positive and unknown value of $\sigma$.
If we have $N$ independent observations $(x_i, y_i)$, $i = 1, \dots, N$, we can estimate the value of $\theta$ by maximizing the log-likelihood. We can optionally choose to weight some observations more or less than others by choosing weights $w_i > 0$
and assuming that
$$\varepsilon_i \sim N\!\left(0, \frac{\sigma^2}{w_i}\right)$$
for each $i$.
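For concreteness, the following is a minimal sketch of such a setup in Python/NumPy; the exponential-decay model, the parameter values and all the names (`f`, `theta_true`, `w`, ...) are illustrative assumptions, not part of the method itself.

```python
import numpy as np

# Illustrative model, nonlinear in theta: f(x, theta) = theta_0 * exp(-theta_1 * x)
def f(x, theta):
    return theta[0] * np.exp(-theta[1] * x)

rng = np.random.default_rng(0)
theta_true = np.array([2.0, 0.5])   # "true" parameters used only to simulate data
sigma = 0.1                         # unknown noise level in a real application

N = 50
x = np.linspace(0.0, 5.0, N)
w = np.ones(N)                      # observation weights w_i > 0
# eps_i ~ N(0, sigma^2 / w_i): a larger weight means a smaller noise variance
eps = rng.normal(0.0, sigma / np.sqrt(w))
y = f(x, theta_true) + eps
```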
Under these assumptions, the log-likelihood is given by
$$\log L(\theta, \sigma) = -\frac{N}{2}\log(2\pi\sigma^2) + \frac{1}{2}\sum_{i=1}^N \log w_i - \frac{1}{2\sigma^2}\sum_{i=1}^N w_i\,\bigl(y_i - f(x_i, \theta)\bigr)^2$$
Setting for simplicity of notation
$$r_i(\theta) = y_i - f(x_i, \theta)$$
we see that maximizing the log-likelihood is equivalent to minimizing the following objective function (weighted sum of squared residuals):
$$\mathrm{Obj}(\theta) = \sum_{i=1}^N w_i\, r_i(\theta)^2$$
Moreover, the maximum likelihood estimate for $\sigma$ is
$$\hat\sigma^2 = \frac{\mathrm{Obj}(\hat\theta)}{N} = \frac{1}{N}\sum_{i=1}^N w_i\, r_i(\hat\theta)^2$$
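As a sketch of these two formulas (the function and variable names are again purely illustrative):

```python
import numpy as np

def objective(theta, x, y, w, f):
    """Weighted sum of squared residuals: Obj(theta) = sum_i w_i * r_i(theta)^2."""
    r = y - f(x, theta)             # residuals r_i(theta) = y_i - f(x_i, theta)
    return np.sum(w * r**2)

def sigma2_mle(theta_hat, x, y, w, f):
    """Maximum likelihood estimate of sigma^2: Obj(theta_hat) / N."""
    return objective(theta_hat, x, y, w, f) / len(y)
```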
The Levenberg-Marquardt algorithm computes the minimum of Obj iteratively, through a series of local quadratic approximations.
The algorithm requires that an initial guess $\theta_0$ is provided for the unknown vector of parameters. The objective function Obj is then approximated locally in a neighbourhood of the current estimate $\theta$
with the following quadratic function (the approximation is only valid for small values of $\|\delta\|$):
$$\mathrm{Obj}(\theta + \delta) \;\approx\; \sum_{i=1}^N w_i\,\bigl(r_i(\theta) - \nabla_\theta f(x_i, \theta)^T \delta\bigr)^2$$
The peculiarity here is that, thanks to the objective function's special form, we can calculate a local quadratic approximation by taking the first order expansion of f instead of the second-order expansion of the objective function itself (as we would be forced to do in the general case).
Defining for simplicity of notation
$$J_{ij} = \frac{\partial f(x_i, \theta)}{\partial \theta_j}, \qquad W = \operatorname{diag}(w_1, \dots, w_N), \qquad r = \bigl(r_1(\theta), \dots, r_N(\theta)\bigr)^T,$$
we have that this quadratic has a unique global minimum, characterized by the stationarity condition
$$\nabla_\delta\,\mathrm{Obj}(\theta + \delta) = 0,$$
which is equivalent to requiring that the displacement $\delta$ solves the following linear system:
$$\bigl(J^T W J\bigr)\,\delta = J^T W r$$
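A minimal sketch of this computation, assuming a callable `jac(x, theta)` that returns the $N \times n$ Jacobian matrix $J$ (all names are illustrative):

```python
import numpy as np

def displacement(theta, x, y, w, f, jac):
    """Solve (J^T W J) delta = J^T W r for the displacement delta."""
    r = y - f(x, theta)             # residual vector r
    J = jac(x, theta)               # N x n Jacobian, J_ij = d f(x_i, theta) / d theta_j
    JTW = J.T * w                   # same as J^T @ diag(w), via broadcasting
    return np.linalg.solve(JTW @ J, JTW @ r)
```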
This picture illustrates the calculation of δ in a simple one-dimensional case:
In the picture, the displacement has been applied as it is. In practice, however, since the quadratic approximation is generally only valid locally, $\delta$ just provides the displacement direction, while its magnitude is re-scaled by a small positive number $h$ (the step) when updating $\theta$:
$$\theta \;\leftarrow\; \theta + h\,\delta$$
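Putting the pieces together, a naive version of the iteration could look like the sketch below, with a fixed step $h$ and a fixed number of iterations (stopping criteria are omitted; the names are illustrative):

```python
import numpy as np

def fit(theta0, x, y, w, f, jac, h=0.1, n_iter=200):
    """Iterate theta <- theta + h * delta starting from the initial guess theta0."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        r = y - f(x, theta)
        J = jac(x, theta)
        JTW = J.T * w
        delta = np.linalg.solve(JTW @ J, JTW @ r)
        theta = theta + h * delta   # delta gives the direction, h re-scales its size
    return theta
```

With the illustrative exponential model sketched earlier, `jac` would simply stack the two columns $\partial f/\partial\theta_0 = e^{-\theta_1 x}$ and $\partial f/\partial\theta_1 = -\theta_0\, x\, e^{-\theta_1 x}$.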
The Levenberg-Marquardt algorithm can be extended to incorporate regularization terms. Viewing the problem from a Bayesian perspective, we can decide to provide a prior distribution on $\theta$. For simplicity, we assume that the prior distributions on the different parameter components are independent, so that the global prior distribution can be factorized:
$$p(\theta) = \prod_{j=1}^{n} p_j(\theta_j)$$
We assume moreover that the one-dimensional priors on the single components are either Gaussian or lognormal:
$$\theta_j \sim N(\mu_j, \tau_j^2) \qquad\text{or}\qquad \log\theta_j \sim N(\mu_j, \tau_j^2)$$
Reasoning in Bayesian terms, this time we estimate $\theta$ via maximum posterior instead of maximum likelihood. The posterior distribution of $\theta$ is
$$p(\theta \mid y_1, \dots, y_N) \;\propto\; L(\theta, \sigma)\, p(\theta)$$
Keeping the same notation as before, the objective function to minimize is now
$$\phi(\theta) = \sum_{i=1}^N w_i\, r_i(\theta)^2 \;-\; 2\sigma^2 \sum_{j=1}^{n} \log p_j(\theta_j) \;+\; \mathrm{const}$$
(the term const incorporates terms that are independent of both $\theta$ and $\sigma$). This corresponds to minimizing the weighted sum of squared residuals plus a series of regularization terms.
The following contributions are added to $\phi$ and to the $j$-th component of its gradient due to the presence of the prior distributions:

Gaussian prior:
$$\frac{\sigma^2}{\tau_j^2}\,(\theta_j - \mu_j)^2 \;\text{ added to } \phi, \qquad \frac{2\sigma^2}{\tau_j^2}\,(\theta_j - \mu_j) \;\text{ added to } \frac{\partial\phi}{\partial\theta_j}$$

Lognormal prior:
$$\frac{\sigma^2}{\tau_j^2}\,(\log\theta_j - \mu_j)^2 \;\text{ added to } \phi, \qquad \frac{2\sigma^2}{\tau_j^2}\,\frac{\log\theta_j - \mu_j}{\theta_j} \;\text{ added to } \frac{\partial\phi}{\partial\theta_j}$$

(We have used the first-order expansion of $\log$ in the lognormal case.)
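A sketch of these per-component contributions; `mu`, `tau` and `kind` (one of `"gaussian"` or `"lognormal"` per component) are illustrative names for the prior parameters:

```python
import numpy as np

def prior_contributions(theta, mu, tau, kind, sigma2):
    """Contributions of the priors to phi and to the gradient of phi, per component."""
    phi_add = np.zeros(len(theta))
    grad_add = np.zeros(len(theta))
    for j, k in enumerate(kind):
        if k == "gaussian":
            phi_add[j] = sigma2 / tau[j] ** 2 * (theta[j] - mu[j]) ** 2
            grad_add[j] = 2.0 * sigma2 / tau[j] ** 2 * (theta[j] - mu[j])
        else:  # lognormal
            z = np.log(theta[j]) - mu[j]
            phi_add[j] = sigma2 / tau[j] ** 2 * z ** 2
            grad_add[j] = 2.0 * sigma2 / tau[j] ** 2 * z / theta[j]
    return phi_add, grad_add
```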
The linear system is now
$$\bigl(J^T W J + \sigma^2 A\bigr)\,\delta = J^T W r - \sigma^2 b$$
where $A = \operatorname{diag}(a_1, \dots, a_n)$ and $b = (b_1, \dots, b_n)^T$ collect the contributions of the regularization terms,
$$a_j = \frac{1}{\tau_j^2}, \quad b_j = \frac{\theta_j - \mu_j}{\tau_j^2} \quad\text{(Gaussian prior)}, \qquad a_j = \frac{1}{\tau_j^2\,\theta_j^2}, \quad b_j = \frac{\log\theta_j - \mu_j}{\tau_j^2\,\theta_j} \quad\text{(lognormal prior)},$$
and $\sigma^2$ is replaced by its maximum likelihood estimate (see above).
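Finally, a sketch of the regularized displacement under the same illustrative assumptions, with $\sigma^2$ replaced by its maximum likelihood estimate at the current $\theta$:

```python
import numpy as np

def regularized_displacement(theta, x, y, w, f, jac, mu, tau, kind):
    """Solve (J^T W J + sigma^2 A) delta = J^T W r - sigma^2 b for delta."""
    r = y - f(x, theta)
    J = jac(x, theta)
    JTW = J.T * w
    sigma2 = np.sum(w * r ** 2) / len(y)    # MLE of sigma^2 at the current theta
    a = np.zeros(len(theta))
    b = np.zeros(len(theta))
    for j, k in enumerate(kind):
        if k == "gaussian":
            a[j] = 1.0 / tau[j] ** 2
            b[j] = (theta[j] - mu[j]) / tau[j] ** 2
        else:  # lognormal: log(theta_j + delta_j) ~ log(theta_j) + delta_j / theta_j
            a[j] = 1.0 / (tau[j] ** 2 * theta[j] ** 2)
            b[j] = (np.log(theta[j]) - mu[j]) / (tau[j] ** 2 * theta[j])
    return np.linalg.solve(JTW @ J + sigma2 * np.diag(a), JTW @ r - sigma2 * b)
```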
