I am trying to find the hyperparameters of a Gaussian process regression algorithm using sklearn. The book (Rasmussen) says I should maximize the log marginal likelihood given by $$\log p(\mathbf{y}|X,\mathbf{\theta})=-\frac{1}{2} \mathbf{y}^T K_y^{-1}\mathbf{y}-\frac{1}{2}\log\det(K_y)-\frac{n}{2}\log(2\pi)$$ So I start from an RBF kernel in sklearn with some parameters (can they be simple and arbitrary, say just both 1.0?) and then try to find the correct $\theta$? I don't understand this approach: should I do this for each label in my dataset in bulk? Or consider one point of my training set at a time and update the parameters at each iteration? I apologise for the confused question, but can somebody explain how to start implementing this method?
1 Answer
The Gaussian process is a Bayesian model. It uses Bayesian updating, so it doesn't matter whether you process the data one sample at a time or all at once; the result would be the same. There is no reason to tune the hyperparameters on a subsample of your data, other than using a held-out test set for validation.
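To make this concrete, here is a minimal sketch of how sklearn does the tuning for you: `fit()` maximizes the log marginal likelihood over the entire training set in one go, starting from whatever initial kernel parameters you supply (the toy data and initial values of 1.0 are just illustrative assumptions):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy 1-D regression data, purely for illustration.
rng = np.random.RandomState(0)
X = rng.uniform(0, 5, 30)[:, None]
y = np.sin(X).ravel() + 0.1 * rng.randn(30)

# The starting values (here 1.0) are only an initial guess for the optimizer;
# fit() then maximizes log p(y | X, theta) computed on the WHOLE dataset.
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=5).fit(X, y)

print(gpr.kernel_)  # kernel with optimized hyperparameters
print(gpr.log_marginal_likelihood(gpr.kernel_.theta))  # value at the optimum
```

Note that there is no per-sample loop anywhere: the optimizer (L-BFGS-B by default) evaluates the single scalar log marginal likelihood, which already aggregates over all $n$ training points.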
- Thanks for your reply. My misunderstanding is perhaps at an even more basic level: that optimization formula depends on a single label, so how is the hyperparameter tuning to happen in practice? I implement that formula for a single $y$ and then? How would I even do it for all my training labels "at once"? – noesis, Oct 18, 2021 at 18:26
- @noesis you are optimizing some metric, for example mean squared error calculated over the whole data. This is the same no matter what ML model you use. – Tim, Oct 18, 2021 at 18:30
- I am completely lost. I supposedly need to find the $\theta$ that minimizes that expression. Are you saying there is something else I should be minimizing? – noesis, Oct 18, 2021 at 18:34
- @noesis in the case of a Gaussian process you could be maximizing the marginal log-likelihood, sure. Still, this is a single-number metric, aggregated over the whole dataset. – Tim, Oct 18, 2021 at 18:55
- Ok, thank you. That formula depends on a single label $y$, so how do I implement this maximization in practice? Do it wrt one $y$, obtain new parameters for the kernel, and do it again on the next $y$ with the new kernel? How would I do it "in bulk"? – noesis, Oct 18, 2021 at 19:02