$\begingroup$

I am trying to find the hyperparameters of a Gaussian process regression algorithm using sklearn. The book (Rasmussen & Williams) says I should maximize the log marginal likelihood, given by $$\log p(\mathbf{y}|X,\mathbf{\theta})=-\frac{1}{2} \mathbf{y}^TK_y^{-1}\mathbf{y}-\frac{1}{2}\log\det(K_y)-\frac{n}{2}\log(2\pi)$$ So I start from an RBF kernel in sklearn with some initial parameters (can they be arbitrary, say both just 1.0?) and then try to find the optimal $\theta$? I don't understand this approach: should I do this for all the labels in my dataset at once? Or consider one point of my training set at a time and update the parameters at each iteration? I apologise for the confused question, but can somebody explain how to start implementing this method?

$\endgroup$

1 Answer

$\begingroup$

The Gaussian process is a Bayesian model. It uses Bayesian updating, so it doesn't matter whether you process the data one sample at a time or all at once: the result is the same. There is no reason to tune the hyperparameters on a subsample of your data, other than keeping a held-out test set for validation.
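Concretely, in sklearn you don't loop over labels at all: `GaussianProcessRegressor.fit` maximizes the log marginal likelihood over $\theta$ internally, starting from whatever initial kernel parameters you give it (so `length_scale=1.0` is a perfectly reasonable starting point). A minimal sketch, with synthetic data made up for illustration:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy dataset (purely illustrative): noisy samples of a sine curve.
rng = np.random.RandomState(0)
X = rng.uniform(0, 5, 20).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.randn(20)

# Initial hyperparameter values; the optimizer refines them,
# so starting both at 1.0 is fine.
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)

# fit() maximizes log p(y | X, theta) over the kernel hyperparameters,
# restarting the optimizer a few times to avoid local optima.
gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5)
gpr.fit(X, y)

print(gpr.kernel_)                         # optimized hyperparameters
print(gpr.log_marginal_likelihood_value_)  # LML at the optimum
```

Note that `fit` receives the whole training set in one call; the likelihood is a single number computed from all $n$ labels jointly, not per label.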

$\endgroup$
  • $\begingroup$ Thanks for your reply. My misunderstanding is perhaps at an even more basic level: that optimization formula depends on a single label, so how is the hyperparameter tuning to happen in practice? I implement that formula for a single $y$ and then? How would I even do it for all my training labels "at once"? $\endgroup$ Commented Oct 18, 2021 at 18:26
  • $\begingroup$ @noesis you are optimizing some metric, for example mean squared error calculated over the whole data. This is the same no matter what ML model you use. $\endgroup$ Commented Oct 18, 2021 at 18:30
  • $\begingroup$ I am completely lost. I supposedly need to find the $\theta$ that minimizes that expression. Are you saying there is something else I should be minimizing? $\endgroup$ Commented Oct 18, 2021 at 18:34
  • $\begingroup$ @noesis in case of Gaussian process you could be maximizing marginal log-likelihood, sure. Still, this is a single number metric, aggregated over the whole dataset. $\endgroup$ Commented Oct 18, 2021 at 18:55
  • $\begingroup$ Ok, thank you. That formula depends on a single label $y$, so how do I implement this maximization in practice? Do it wrt one $y$, obtain new parameters for the kernel, and do it again on the next $y$ with the new kernel? How would I do it "in bulk"? $\endgroup$ Commented Oct 18, 2021 at 19:02
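To make the comment thread concrete: in that formula $\mathbf{y}$ is not a single label but the entire vector of $n$ training labels, and $K_y$ is the full $n \times n$ kernel matrix, so the expression is evaluated once over the whole dataset. A sketch verifying this against sklearn, with synthetic data and an arbitrarily chosen noise level `alpha=1e-2` (both assumptions for illustration):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy data, invented for the check.
rng = np.random.RandomState(1)
X = rng.uniform(0, 5, 15).reshape(-1, 1)
y = np.sin(X).ravel()

# optimizer=None keeps the kernel at its initial theta, so we can
# compare like with like.
gpr = GaussianProcessRegressor(kernel=RBF(1.0), alpha=1e-2, optimizer=None)
gpr.fit(X, y)

# Recompute log p(y | X, theta) by hand from the book's formula,
# using ALL n labels at once: K_y = K + sigma_n^2 I.
K_y = gpr.kernel_(X) + 1e-2 * np.eye(len(X))
n = len(y)
lml = (-0.5 * y @ np.linalg.solve(K_y, y)
       - 0.5 * np.linalg.slogdet(K_y)[1]
       - 0.5 * n * np.log(2 * np.pi))

print(lml, gpr.log_marginal_likelihood_value_)  # the two agree
```

Maximizing this single scalar with respect to $\theta$ (which sklearn does via gradient-based optimization when you leave the default optimizer on) is the whole hyperparameter-tuning procedure; there is no per-label loop.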
