$\begingroup$

I have been trying to understand non-parametric regression using Gaussian processes (GPs), which are used to represent prior distributions over a space of functions. The model considered is $$ \mathbf{y}_i = f(\mathbf{x}_i) + \mathbf{\epsilon} = f_i +\mathbf{\epsilon},$$ where $\mathbf{y}_i$ is the observation, $f_i$ is the output of the function at input location $\mathbf{x}_i$, and $\mathbf{\epsilon}$ is additive Gaussian noise. The following GP prior is chosen in this case: $$p(f|\mathbf{X},\mathbf{\theta}) = \mathcal{N}(\mathbf{0},k(\mathbf{X},\mathbf{X})),$$ where $k(\cdot,\cdot)$ is the covariance function and $\mathbf{\theta}$ denotes the hyperparameters.

I would appreciate it if someone could shed light on the choice of this prior and what it actually does. Also, what is the approach to formulating the joint likelihood $p(\mathbf{Y},\mathbf{X},f,\mathbf{\theta})$ of the entire model?

$\endgroup$

1 Answer

$\begingroup$

There are various ways to choose the covariance kernel $k(\cdot, \cdot; \theta)$ and the hyperparameters $\theta$, so the question as posed is probably too broad to answer fully. However, I can give you an example of a class of covariance kernels that is popular in Gaussian process regression and explain the influence of the hyperparameters for that class.

Matérn class covariance kernels are given by $$k(x_1,x_2;\ell, \nu, \sigma) = \sigma^2\frac{2^{1-\nu}}{\Gamma(\nu)}\Bigg(\sqrt{2\nu}\frac{|x_1-x_2|}{\ell}\Bigg)^\nu K_\nu\Bigg(\sqrt{2\nu}\frac{|x_1-x_2|}{\ell}\Bigg),$$ where $K_\nu$ is the modified Bessel function of the second kind. Hence, the hyperparameter vector is $\theta := (\ell, \nu, \sigma)$.
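As a rough illustration, the formula above can be evaluated with NumPy and SciPy. This is only a sketch for one-dimensional inputs; the function name `matern_kernel` and the example values are my own choices, not part of any particular GP library.

```python
import numpy as np
from scipy.special import gamma, kv


def matern_kernel(x1, x2, ell=1.0, nu=1.5, sigma=1.0):
    """Matern covariance matrix between 1-D input arrays x1 and x2."""
    x1 = np.atleast_1d(np.asarray(x1, dtype=float))
    x2 = np.atleast_1d(np.asarray(x2, dtype=float))
    r = np.abs(x1[:, None] - x2[None, :])   # pairwise distances |x1 - x2|
    scaled = np.sqrt(2.0 * nu) * r / ell
    # At r = 0 the formula is a 0 * inf limit whose value is sigma^2,
    # so those entries are filled in directly.
    k = np.full(r.shape, sigma**2)
    pos = scaled > 0
    k[pos] = (sigma**2 * 2.0 ** (1.0 - nu) / gamma(nu)
              * scaled[pos] ** nu * kv(nu, scaled[pos]))
    return k


# Hypothetical example inputs and hyperparameters.
X = np.linspace(0.0, 1.0, 5)
K = matern_kernel(X, X, ell=0.3, nu=1.5, sigma=2.0)
```

The resulting matrix `K` plays the role of $k(\mathbf{X},\mathbf{X})$ in the prior: it is symmetric and has $\sigma^2$ on its diagonal.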

$\ell$ is the correlation length. If $\ell$ is small, the values of the process at two points far apart are barely correlated. If $\ell$ is large, the values at two points far apart are still (highly) correlated. Hence, if you assume that the function varies a lot within a small area of the domain, the correlation length should be rather short; otherwise it can be longer.
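To make the effect of $\ell$ concrete, here is a small sketch that compares the prior correlation between two points at a fixed distance for a short and a long correlation length. It assumes $\nu = 3/2$, for which the Matérn correlation has the closed form $(1 + \sqrt{3}\,r/\ell)\exp(-\sqrt{3}\,r/\ell)$; the function name and the numbers are illustrative only.

```python
import numpy as np


def matern32_correlation(r, ell):
    """Matern correlation for nu = 3/2 (closed form), with sigma = 1."""
    z = np.sqrt(3.0) * np.abs(r) / ell
    return (1.0 + z) * np.exp(-z)


r = 1.0                                             # distance between the two points
corr_short = matern32_correlation(r, ell=0.1)       # short correlation length
corr_long = matern32_correlation(r, ell=10.0)       # long correlation length
print(f"ell = 0.1:  correlation = {corr_short:.2e}")
print(f"ell = 10.0: correlation = {corr_long:.4f}")
```

With the short correlation length the two values are essentially uncorrelated, while with the long one the correlation is close to one.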

$\nu$ is the smoothness parameter. Samples from the prior will be $\lceil \nu \rceil - 1$ times differentiable. Hence, if you know that the function you are trying to determine is only continuous but not differentiable, you should choose $\nu \le 1$, and a larger value if you assume more regularity. $\nu = 1/2$ corresponds to the exponential covariance; the limit $\nu \to \infty$ gives the Gaussian (squared exponential) covariance and produces samples that are infinitely often differentiable.
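The two special cases can be checked numerically. The sketch below assumes $\sigma = 1$, evaluates the general Matérn formula at strictly positive distances, and compares it with $\exp(-r/\ell)$ for $\nu = 1/2$ and with $\exp(-r^2/(2\ell^2))$ for increasing $\nu$. The helper name `matern_correlation` and the grid are my own choices.

```python
import numpy as np
from scipy.special import gamma, kv


def matern_correlation(r, ell, nu):
    """General Matern correlation (sigma = 1), valid for r > 0."""
    z = np.sqrt(2.0 * nu) * r / ell
    return 2.0 ** (1.0 - nu) / gamma(nu) * z**nu * kv(nu, z)


ell = 1.0
r = np.linspace(0.05, 3.0, 60)   # avoid r = 0, where the formula is a 0 * inf limit

# nu = 1/2 should reproduce the exponential covariance.
err_exponential = np.max(np.abs(matern_correlation(r, ell, 0.5) - np.exp(-r / ell)))

# Larger nu should move the kernel towards the Gaussian covariance.
gaussian = np.exp(-r**2 / (2.0 * ell**2))
err_nu_2_5 = np.max(np.abs(matern_correlation(r, ell, 2.5) - gaussian))
err_nu_50 = np.max(np.abs(matern_correlation(r, ell, 50.0) - gaussian))

print(err_exponential, err_nu_2_5, err_nu_50)
```

The first discrepancy should be at the level of floating-point round-off, and the gap to the Gaussian covariance should shrink as $\nu$ grows.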

The value of the Gaussian process at each point is a normally distributed random variable. Under the prior, it has distribution $\mathrm{N}(0, \sigma^2)$, since $k(x,x;\ell,\nu,\sigma) = \sigma^2$. Hence, $\sigma$ is the pointwise standard deviation and $\sigma^2$ the pointwise variance.
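One way to see this is to draw samples from the prior on a grid and look at the empirical variance at each grid point. The sketch below assumes a Matérn kernel with $\nu = 3/2$ (closed form), a small diagonal jitter for numerical stability of the Cholesky factorisation, and arbitrary example values for $\sigma$ and $\ell$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, ell = 2.0, 0.3                      # hypothetical hyperparameters
x = np.linspace(0.0, 1.0, 20)              # grid of input locations

# Matern covariance matrix for nu = 3/2 (closed form).
r = np.abs(x[:, None] - x[None, :])
z = np.sqrt(3.0) * r / ell
K = sigma**2 * (1.0 + z) * np.exp(-z)

# Sample f ~ N(0, K) via a Cholesky factor; the jitter is an assumption
# added only to guard against round-off in the factorisation.
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(x)))
samples = L @ rng.standard_normal((len(x), 5000))   # one prior draw per column

pointwise_var = samples.var(axis=1)        # empirical variance at each grid point
print(pointwise_var.round(2))
```

Up to sampling error, every entry of `pointwise_var` should be close to $\sigma^2 = 4$, regardless of $\ell$.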

Concerning the joint likelihood: I am not really sure what you mean. In my understanding, the likelihood would be something like $p(Y|f)$. This is a function that relates the probability of the data to the parameter in the model that is supposed to be estimated. In this case, the parameter is the function $f$. If $\epsilon \sim \mathrm{N}(0, \gamma^2)$ independently for each of the $I$ observations, the likelihood is $$ p(Y|f) \propto \exp\left(-\frac{1}{2 \gamma^2} \sum_{i=1}^I (y_i - f(x_i))^2 \right).$$ The other parameters are hyperparameters of the prior and do not need to appear in the likelihood in this particular case.
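For reference, a minimal sketch of the corresponding log-likelihood, including the normalising constant that the proportionality sign hides. It assumes independent noise with known standard deviation $\gamma$; the function name, argument names and data values are made up for illustration.

```python
import numpy as np


def gaussian_log_likelihood(y, f_values, noise_std):
    """log p(Y | f) for i.i.d. N(0, noise_std^2) noise (noise_std plays the role of gamma)."""
    y = np.asarray(y, dtype=float)
    f_values = np.asarray(f_values, dtype=float)
    residuals = y - f_values
    n = residuals.size
    return (-0.5 * np.sum(residuals**2) / noise_std**2
            - n * np.log(noise_std)
            - 0.5 * n * np.log(2.0 * np.pi))


# Hypothetical data and two candidate sets of function values f(x_i).
y_obs = np.array([0.9, 2.1, 2.9])
good_fit = np.array([1.0, 2.0, 3.0])
poor_fit = np.array([0.0, 0.0, 0.0])

ll_good = gaussian_log_likelihood(y_obs, good_fit, noise_std=0.5)
ll_poor = gaussian_log_likelihood(y_obs, poor_fit, noise_std=0.5)
```

Function values that pass closer to the observations receive the higher log-likelihood, and the kernel hyperparameters do not enter the computation.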

$\endgroup$
