
In a linear regression model ${\bf y} = {\bf x}^T {\bf w} + \epsilon$, assuming Gaussian noise ($\epsilon\sim N(0,\sigma_n^2)$) and a Gaussian prior on the weights (${\bf w}\sim N({\bf 0}, \Sigma_p)$), one can show that the posterior distribution of the weights given data $X,{\bf y}$ is also Gaussian: $$ {\bf w}\mid X,{\bf y}\sim N\left(\bar{\bf w}=\frac{1}{\sigma_n^2}A^{-1}X{\bf y},\; A^{-1}\right), $$ where $A = \sigma_n^{-2}X X^T + \Sigma_p^{-1}$. To use this posterior to obtain the predictive distribution $p(f({\bf x_*}))$ at a new point ${\bf x_*}$, we average over all possible weights (sometimes termed the Bayesian estimator): $$p(f({\bf x_*})\mid{\bf x_*},X,{\bf y})=\int p(f({\bf x_*})\mid{\bf x_*},{\bf w})\,p({\bf w}\mid X,{\bf y})\,d{\bf w}.$$ I'd like to prove that this is just the Gaussian $$N\left(\frac{1}{\sigma_n^2}{\bf x_*}^T A^{-1}X {\bf y},\; {\bf x_*}^T A^{-1}{\bf x_*}\right).$$ However, something in the integration doesn't seem to work out.
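For what it's worth, here is a quick numerical sanity check of the closed form I'm trying to derive (a minimal sketch; the dimensions and toy data below are arbitrary choices for illustration only): drawing weights from the posterior and averaging $f({\bf x_*})={\bf x_*}^T{\bf w}$ should reproduce the stated mean and variance.

```python
import numpy as np

# Toy setup, purely illustrative
rng = np.random.default_rng(0)
D, N = 3, 50                          # weight dimension, number of observations
sigma_n = 0.5                         # noise standard deviation
Sigma_p = np.eye(D)                   # prior covariance of w

X = rng.normal(size=(D, N))           # columns are the inputs x_i
w_true = rng.normal(size=D)
y = X.T @ w_true + sigma_n * rng.normal(size=N)

# Posterior over weights: w | X, y ~ N(w_bar, A^{-1})
A = X @ X.T / sigma_n**2 + np.linalg.inv(Sigma_p)
A_inv = np.linalg.inv(A)
w_bar = A_inv @ X @ y / sigma_n**2

# Closed form I want to derive for f(x_*) = x_*^T w
x_star = rng.normal(size=D)
mean_closed = x_star @ w_bar
var_closed = x_star @ A_inv @ x_star

# Monte Carlo: average f(x_*) over posterior weight samples
w_samples = rng.multivariate_normal(w_bar, A_inv, size=200_000)
f_samples = w_samples @ x_star
print(mean_closed, f_samples.mean())  # these agree closely
print(var_closed, f_samples.var())    # and so do these
```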

Can someone show how this is done?


1 Answer


Unless I am missing something, it should just be a straightforward application of the following.

If $\mathbf{P}$ is a linear transformation and $$ \begin{align} p(\mathbf{w}) &=\mathcal{N}(\mathbf{w}|\boldsymbol{\mu},\boldsymbol{\Lambda}^{-1}) \\ p(\mathbf{y}|\mathbf{w}) &= \mathcal{N}(\mathbf{y}|\mathbf{Pw}+\mathbf{b}, \mathbf{L}^{-1}), \end{align} $$ then $$ p(\mathbf{y})=\mathcal{N}\left(\mathbf{y}|\mathbf{P}\boldsymbol{\mu}+\mathbf{b}, \mathbf{L}^{-1}+\mathbf{P}\boldsymbol{\Lambda}^{-1}\mathbf{P}^T\right). $$
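The identity is also easy to verify numerically; here is a minimal sketch (the dimensions, $\mathbf{P}$, $\mathbf{b}$, and the covariances are arbitrary choices for illustration) that simulates the two-stage process and compares the empirical moments of $\mathbf{y}$ with the stated marginal.

```python
import numpy as np

rng = np.random.default_rng(1)
dw, dy = 3, 2                                 # dim(w), dim(y)

mu = rng.normal(size=dw)
Lam_inv = np.diag([0.5, 1.0, 2.0])            # Lambda^{-1}, prior covariance of w
P = rng.normal(size=(dy, dw))
b = rng.normal(size=dy)
L_inv = 0.1 * np.eye(dy)                      # L^{-1}, conditional covariance of y | w

# Simulate w ~ N(mu, Lambda^{-1}), then y | w ~ N(Pw + b, L^{-1})
n = 500_000
w = rng.multivariate_normal(mu, Lam_inv, size=n)
noise = rng.multivariate_normal(np.zeros(dy), L_inv, size=n)
y = w @ P.T + b + noise

# Empirical moments of y versus the stated marginal N(P mu + b, L^{-1} + P Lambda^{-1} P^T)
print(y.mean(axis=0), P @ mu + b)
print(np.cov(y.T))
print(L_inv + P @ Lam_inv @ P.T)
```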

Now apply the above to the case that $$ p(\mathbf{w}) = \mathcal{N}(\mathbf{w}|\sigma_n^{-2}\mathbf{A}^{-1}\mathbf{Xy}, \mathbf{A}^{-1}), $$ so for this particular instance we have $$ \boldsymbol{\mu}:=\frac{1}{\sigma_n^2}\mathbf{A}^{-1}\mathbf{Xy}, \qquad \boldsymbol{\Lambda}:=\mathbf{A}, $$ and $$ \mathbf{P}:=(\mathbf{x}_*)^T, \qquad \mathbf{b}=\mathbf{0}, \qquad \mathbf{L}^{-1}:= \sigma_n^2\mathbf{I}. $$ Putting this together you get $$ p(\mathbf{f}^*)=\mathcal{N}\left(\mathbf{f}^* \; \bigg| \; \frac{1}{\sigma_n^2}\mathbf{x_*}^T\mathbf{A}^{-1}\mathbf{Xy}, \; \sigma_n^2\mathbf{I}+\mathbf{x}_*^T\mathbf{A}^{-1}\mathbf{x}_* \right); $$ note the presence of the independent noise term, $\sigma_n^2\mathbf{I}$, in the predictive distribution. If you instead want the distribution of the noise-free quantity $f(\mathbf{x}_*)=\mathbf{x}_*^T\mathbf{w}$ itself, take $\mathbf{L}^{-1}\to\mathbf{0}$; the $\sigma_n^2\mathbf{I}$ term then disappears and you recover exactly the Gaussian stated in the question.
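As a quick illustration of that last point, the following sketch (again with toy, purely illustrative data) computes both variances: the one for the latent $f_* = \mathbf{x}_*^T\mathbf{w}$, which matches the expression in the question, and the one for a noisy observation, which picks up the extra $\sigma_n^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
D, N, sigma_n = 3, 50, 0.5
Sigma_p = np.eye(D)

X = rng.normal(size=(D, N))
y = X.T @ rng.normal(size=D) + sigma_n * rng.normal(size=N)

# Posterior over weights
A_inv = np.linalg.inv(X @ X.T / sigma_n**2 + np.linalg.inv(Sigma_p))
w_bar = A_inv @ X @ y / sigma_n**2
x_star = rng.normal(size=D)

mean_star = x_star @ w_bar              # same predictive mean in both cases
var_f = x_star @ A_inv @ x_star         # latent f_* = x_*^T w   (the quantity in the question)
var_y = sigma_n**2 + var_f              # noisy observation y_* = f_* + eps  (L^{-1} = sigma_n^2)
print(mean_star, var_f, var_y)
```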

I have left the conditioning on $\mathbf{x}_*, \mathbf{X}, \mathbf{y}$ unstated but you can then go back and put that in by writing $p(\cdot) = p(\cdot | \mathbf{x}_*, \mathbf{X}, \mathbf{y})$ if you like.

As an aside, I am a little confused in this instance as to whether $\mathbf{y}=\mathbf{y}(\mathbf{x})$, and therefore whether the noise $\epsilon$, is intended to be vector- or scalar-valued, but you can clean that up and make everything conform to the model you have in mind.

  • Where does one find the theorem you quoted? Do you know if there is a direct proof, or only one that goes through Bayes' theorem? Commented May 13, 2018 at 17:54
  • You can write $y=Pw + z$ with $z$ an independent Gaussian r.v., then argue that $(y, w, z)$ are jointly Gaussian and calculate the moments. Commented May 13, 2018 at 17:59
  • Or you could go directly with the integration and rearrange the argument of the exponential term. Commented May 13, 2018 at 18:02
  • Do you perhaps have a good reference book / paper on this and similar theorems? Commented May 13, 2018 at 18:40
  • What sort of level would you want the reference to be? There is a discussion of this in Bishop's Pattern Recognition and Machine Learning, but maybe you want something a bit more in the lemma-proof style? Commented May 13, 2018 at 19:28
