Gaussian distributions are maximum entropy distributions (for a fixed mean and variance) and therefore encode the least possible prior knowledge into the modeling assumption. In other words, they are a good choice when modeling a high degree of uncertainty. To my understanding there is nothing beyond that. And yes, it can be theoretically sound to base the surrogate on other distributions, provided those distributions reflect whatever prior knowledge you actually have.
There is a little bit of information about this in Goodfellow et al.'s Deep Learning.
EDIT: There are many kinds of probability distributions. Depending on the context, you will model an experiment (or some data) with a discrete model (a coin toss, for example), a continuous model (if you are measuring weight, for example) or some more general probability measure. (If you are curious about probability measures, the first chapters of Jun Shao's Mathematical Statistics are a good choice of study.)
$A)$ We define the self-information of an event $\text{x}=x$ as $I(x)=-\log P(x)$. This is a formalization of the general notion that an event is more informative if it is unlikely, and its information tends to $0$ as its probability tends to $1$.
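To see this behavior numerically, here is a minimal Python sketch of my own (the function name is mine, not from any library) that computes $I(x)=-\log P(x)$ in bits:

```python
import math

def self_information(p, base=2):
    """Self-information I(x) = -log P(x); base=2 gives the result in bits."""
    return -math.log(p, base)

# A fair coin toss carries exactly 1 bit of information:
print(self_information(0.5))    # 1.0
# A rarer event carries more information:
print(self_information(0.25))   # 2.0
# A certain event carries none:
print(self_information(1.0))    # 0.0
```

As the probability approaches 1, the output approaches 0, matching the intuition above.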
Generally speaking, we could measure the amount of uncertainty in an entire distribution $p(x)$ with the expected self-information of its elements:
$$H(X)=\mathbb{E}_{x\sim p}[I(x)] = -\mathbb{E}_{x\sim p}[\log p(x)] = -\int_{-\infty}^\infty p(x)\log p(x)\,dx$$
This is defined as the Differential Entropy of the distribution. Distributions that are nearly deterministic have very low entropy; distributions that are closer to uniform have high entropy.
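To make the integral concrete, here is a small Python sketch (my own illustration, not from the book) that approximates the differential entropy of a standard Gaussian with a Riemann sum and compares it to the known closed form $\frac{1}{2}\log(2\pi e\sigma^2)$:

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def differential_entropy(pdf, lo=-10.0, hi=10.0, n=100_000):
    """Approximate -∫ p(x) log p(x) dx with a midpoint Riemann sum."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        p = pdf(x)
        if p > 0:
            total -= p * math.log(p) * dx
    return total

numeric = differential_entropy(gaussian_pdf)
closed_form = 0.5 * math.log(2 * math.pi * math.e)  # entropy of N(0, 1), ≈ 1.419 nats
print(numeric, closed_form)
```

The two values agree to several decimal places, since the Gaussian tails beyond $\pm 10$ contribute essentially nothing.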
$B)$ We define a maximum entropy distribution as any distribution whose entropy is at least as great as that of every other distribution satisfying the same constraints (for example, having the same mean and variance).
$C)$ There is a theorem, whose proof is clearly laid out here, stating that for a fixed variance, the Gaussian has the largest entropy among all random variables with that variance.
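As a quick sanity check of that theorem (again my own sketch, using standard closed-form entropies rather than the proof itself), we can fix a common variance $\sigma^2$ and compare the Gaussian against a uniform and a Laplace distribution rescaled to match it:

```python
import math

sigma2 = 1.0  # common variance shared by all three distributions

# Gaussian: h = (1/2) log(2*pi*e*sigma^2)
h_gauss = 0.5 * math.log(2 * math.pi * math.e * sigma2)

# Uniform on [a, b]: variance (b-a)^2 / 12, entropy log(b-a)
h_uniform = math.log(math.sqrt(12 * sigma2))

# Laplace with scale b: variance 2*b^2, entropy 1 + log(2*b)
b = math.sqrt(sigma2 / 2)
h_laplace = 1 + math.log(2 * b)

print(h_gauss, h_uniform, h_laplace)
assert h_gauss > h_uniform and h_gauss > h_laplace  # Gaussian wins, as the theorem predicts
```

With $\sigma^2 = 1$ the entropies come out to roughly 1.42, 1.24 and 1.35 nats respectively, so the Gaussian is indeed on top.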
$D)$ Lastly, if you are taking the Bayesian perspective, it is a reasonable choice to pick as your prior whatever distribution has maximum entropy under your constraints, because it means you are making the fewest possible assumptions. This can be seen as an application of Occam's razor, which you may find interesting if more philosophical ideas appeal to you.