Bayesian optimization with Gaussian processes (GPs) is an effective minimization strategy when the function to minimize, say $f(a)$, is computationally expensive to evaluate.
Loosely speaking, the strategy builds a GP surrogate that emulates $f$ but is much cheaper to compute, and uses it to decide where to evaluate $f$ next.
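For concreteness, here is roughly the loop I have in mind, as a minimal Python sketch (the Matérn kernel, the grid-based maximization of expected improvement, and names like `bayes_opt` are just illustrative choices on my part, not part of the question):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(grid, gp, f_best):
    # EI for minimization: expected amount by which a candidate beats f_best.
    mu, sd = gp.predict(grid, return_std=True)
    sd = np.maximum(sd, 1e-12)                 # avoid division by zero at sampled points
    z = (f_best - mu) / sd
    return (f_best - mu) * norm.cdf(z) + sd * norm.pdf(z)

def bayes_opt(f, bounds=(0.0, 1.0), n_init=4, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.uniform(bounds[0], bounds[1], size=(n_init, 1))  # initial design
    F = np.array([f(a[0]) for a in A])         # expensive evaluations of f
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5),
                                  alpha=1e-8, normalize_y=True)
    grid = np.linspace(bounds[0], bounds[1], 500).reshape(-1, 1)
    for _ in range(n_iter):
        gp.fit(A, F)                           # refit the surrogate to all evaluations
        a_next = grid[np.argmax(expected_improvement(grid, gp, F.min()))]
        A = np.vstack([A, a_next])
        F = np.append(F, f(a_next[0]))
    return A[F.argmin(), 0], F.min()
```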
Let us assume that $f(a) = \lVert g(a) - y \rVert^2$, where $y$ is a set of data points and $g(a)$ is the output of a compute-intensive simulation model. We wish to find the value $a_0$ of $a$ at which the compute-intensive model $g(a)$ "reproduces" the data $y$.
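As a toy stand-in for this setup (again only a sketch that reuses `bayes_opt` from above; the sine model for $g$ and the value $0.3$ are made up for illustration):

```python
import numpy as np

a_true = 0.3
x = np.linspace(0.0, 1.0, 50)              # observation locations
y = np.sin(2 * np.pi * a_true * x)         # "data" generated at a = a_true

def g(a):
    return np.sin(2 * np.pi * a * x)       # cheap stand-in for the expensive simulator

def f(a):
    return np.sum((g(a) - y) ** 2)         # squared misfit; zero exactly at a = a_true

a_hat, f_hat = bayes_opt(f)                # reuses the sketch above
print(a_hat, f_hat)                        # a_hat should land near 0.3
```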
$f(a)$ now has some properties that $g(a)$ may lack: for example, it is non-negative everywhere and convex in a neighbourhood of $a = a_0$, while $g(a)$ need not be either. A typical zero-mean Gaussian process does not encode any of this, so it seems a poor surrogate for $f$. Does the literature provide alternatives to GPs that incorporate these properties? Or have I missed something else?
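To show what I mean by the mismatch (continuing the snippets above, with the same illustrative GP choices):

```python
# Fit a zero-mean GP directly to a few evaluations of f and inspect the posterior.
A = np.linspace(0.0, 1.0, 6).reshape(-1, 1)
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(A, np.array([f(a[0]) for a in A]))
grid = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
mu, sd = gp.predict(grid, return_std=True)
# The posterior at every point is a Gaussian, so it assigns positive
# probability to f < 0, even though f >= 0 by construction.
print(norm.cdf(-mu / sd).max())
```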
