$\begingroup$

In Bayesian Optimization, the function we are trying to optimize (i.e. the objective function) is modelled using a surrogate - and this surrogate usually turns out to be a Gaussian Process.

I am trying to better understand - in Bayesian Optimization, why exactly do we decide to model the objective function using a Gaussian Process? Could we have based the surrogate on some other class of functions instead?

My guess is that perhaps Gaussian Processes have a certain type of "mathematical flexibility" which makes them an ideal choice for a surrogate model in the context of optimization. But in the end - is there some specific reason why Gaussian Processes are used in Bayesian Optimization (compared to some other class of surrogate function)?

Thanks!

$\endgroup$
  • $\begingroup$ Not a direct answer to your question, but fwiw this lecture gives a lot of nice intuition about gaussian processes: youtu.be/92-98SYOdlY $\endgroup$ Commented May 4, 2022 at 3:08
  • $\begingroup$ Thank you! I will check this out! $\endgroup$ Commented May 4, 2022 at 3:17
  • $\begingroup$ GPs are nice because the model is simple and well understood. But they aren't scalable enough for large datasets - $O(n^3)$ cost! $\endgroup$ Commented May 14, 2022 at 20:45

2 Answers

$\begingroup$

Gaussian distributions are maximum entropy distributions (among distributions with a given mean and variance) and therefore minimize the amount of prior knowledge being put into the assumption. In other words, they are a good choice when modeling a high degree of uncertainty. To my understanding there is nothing beyond that. And yes, it would be theoretically valid to base the surrogate on some other class of functions, provided that those functions fit some notion of prior knowledge.

There is a little bit of information about this in Goodfellow's Deep Learning.

EDIT: There are many kinds of probability distributions. Depending on the context, you will model an experiment (or some data) with a discrete model (a coin toss, for example), a continuous model (if you are evaluating weight, for example) or some more general probability measure. (If curious about probability measures, the first chapters in Jun Shao's Mathematical Statistics are a good choice of study.)

$A)$ We define the self-information of an event $\text{x}=x$ as $I(x)=-\log P(x)$. This is a formalization of the general notion that an event is more informative if it is unlikely, and its information tends to $0$ as its probability tends to $1$.

Generally speaking, we could measure the amount of uncertainty in an entire distribution $p(x)$ with the expected self-information of its elements:

$$H(X)=\mathbb{E}_{x\sim p}[I(x)] = -\mathbb{E}_{x\sim p}[\log p(x)] = -\int_{-\infty}^\infty p(x)\log p(x)\,dx$$

This is defined as the Differential Entropy of the distribution. Distributions that are nearly deterministic have very low entropy; distributions that are closer to uniform have high entropy.
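As a quick numerical illustration of this definition (not part of the original argument; the value of $\sigma$ and the integration bounds are arbitrary choices of mine), the differential entropy of a Gaussian can be approximated by quadrature and checked against the known closed form $\frac{1}{2}\log(2\pi e\sigma^2)$:

```python
import numpy as np

def differential_entropy(pdf, lo, hi, n=200001):
    # Approximate H = -integral of p(x) log p(x) dx by the trapezoidal rule.
    x = np.linspace(lo, hi, n)
    p = pdf(x)
    return np.trapz(-p * np.log(p), x)

sigma = 2.0
gauss = lambda x: np.exp(-x**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

h_numeric = differential_entropy(gauss, -30.0, 30.0)
h_closed = 0.5 * np.log(2 * np.pi * np.e * sigma**2)  # known closed form
print(h_numeric, h_closed)  # the two values agree closely
```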

$B)$ We define a maximum entropy distribution as any distribution whose entropy is at least as great as that of any other distribution in the same family.

$C)$ There is a theorem, whose proof is presented clearly here, stating that for a given variance, a Gaussian random variable has the largest entropy among all random variables with that variance.
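This theorem can be sanity-checked with the standard closed-form entropies of a few common distributions (the formulas below are textbook results; the unit-variance normalization is my choice, made so the three are comparable):

```python
import math

# Closed-form differential entropies of three unit-variance distributions.
sigma2 = 1.0
h_gauss = 0.5 * math.log(2 * math.pi * math.e * sigma2)  # N(0, 1)
h_uniform = math.log(2 * math.sqrt(3 * sigma2))          # U(-sqrt(3), sqrt(3)), Var = 1
h_laplace = 1 + math.log(2 * math.sqrt(sigma2 / 2))      # Laplace(b = 1/sqrt(2)), Var = 2b^2 = 1

print(h_gauss, h_uniform, h_laplace)
# h_gauss is the largest of the three, as the theorem asserts
```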

$D)$ Lastly, if you are taking the Bayesian perspective, it is reasonable to pick as your prior whatever distribution has maximum entropy, because this means you are making the fewest possible assumptions. This can be seen as an application of Occam's razor, which you may find interesting if more philosophical ideas appeal to you.

$\endgroup$
  • $\begingroup$ Thank you for your answer! Can you please go over maximum entropy a bit more? Thanks! $\endgroup$ Commented May 10, 2022 at 13:26
  • $\begingroup$ I have expanded my answer to suit your request $\endgroup$ Commented May 10, 2022 at 15:11
  • $\begingroup$ Thank you so much for your updated answer! I have started reading about maximum entropy! Just a question - do you know anything about Bellman Optimality? I posted a question here: math.stackexchange.com/questions/4447435/… . Do you have any comments about this? $\endgroup$ Commented May 10, 2022 at 16:59
  • $\begingroup$ No problem, I hope my answer was able to clarify at least some of your doubts. If it was helpful, please don't forget to mark the answer. I do not know much about Bellman Optimality but I will revise your other question and see if I can help. $\endgroup$ Commented May 10, 2022 at 17:08
  • $\begingroup$ Thank you! I will check your answer! $\endgroup$ Commented May 10, 2022 at 17:21
$\begingroup$

I think that if the maximal entropy property pointed out in another answer is relevant at all, it is for a different reason than "modeling a high degree of uncertainty". Note that a GP is a stochastic process - a probability distribution over functions - such that each marginal value is a 1D Gaussian; the maximal entropy property of the 1D Gaussian a priori has little to do with the choice of the distribution over functions that you make for your model.

The main thing that makes everything work is that conditioning multivariate Gaussian gives a multivariate Gaussian.

The reason this is key for BO is that, together with Kolmogorov's extension theorem, it means the class of Gaussian processes is closed under Bayesian updating. Now, add the fact that a Gaussian is determined by its mean and covariance on top of that, and you are in business.
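A minimal sketch of that closure property in action (the RBF kernel, the training points, and the sine target below are illustrative assumptions, not anything canonical): conditioning the joint Gaussian over training and test values yields the GP posterior mean and covariance in closed form, and the posterior is again Gaussian.

```python
import numpy as np

def rbf(a, b, ls=1.0):
    # Squared-exponential kernel matrix between 1D point sets a and b.
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / ls**2)

# Noise-free GP regression is just Gaussian conditioning: [f, f*] is
# jointly Gaussian, so f* | f = y is Gaussian with
#   mean = K*^T K^-1 y,   cov = K** - K*^T K^-1 K*
x_train = np.array([-1.0, 0.0, 1.0])
y_train = np.sin(x_train)
x_test = np.array([0.5])

K = rbf(x_train, x_train) + 1e-9 * np.eye(len(x_train))  # jitter for stability
Ks = rbf(x_train, x_test)
Kss = rbf(x_test, x_test)

mean = Ks.T @ np.linalg.solve(K, y_train)
cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
print(mean, cov)  # posterior is again Gaussian, with these parameters
```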

The reason that it is at all connected to the maximal entropy property is that, properly understood, the multivariate maximal entropy property implies the key property "conditioning a multivariate Gaussian gives a multivariate Gaussian". Namely, a maximum entropy distribution constrained by a collection of expectations is a member of an exponential family of an appropriate kind. Adding an observation of one of the components simply adds another expectation constraint (of a "delta function of the component"), which happens to preserve the Gaussianness of the corresponding exponential family. (Another explanation of this property can be given via the central limit theorem.)

$\endgroup$
