1
$\begingroup$

I am learning about Assumed Density Filtering (ADF) and Expectation Propagation in the context of Bayesian deep neural networks. I have seen in some textbooks and papers that ADF is also called moment matching. Is it called moment matching because, when we minimize the KL divergence, the solution turns out to make the moments of the two distributions match? Is that the reason behind the name?

$\endgroup$

3 Answers

2
$\begingroup$

Hope it's not too late...
I also wondered about this when working on my thesis.
In a nutshell, this is true for any approximating distribution that belongs to the exponential family of distributions.

Recall that in EP we are repeatedly performing the assumed density filtering algorithm (ADF) until convergence. The moment matching happens at each ADF iteration, i.e. when solving:

$$ \min_{\xi} D_{KL}\left( \hat{p}_{i}(\theta|\xi_{i-1})||q(\theta|\xi) \right)$$

where $\hat{p}_{i}$ is what Minka calls the exact posterior, and $q$ is the approximating posterior, which is in general assumed to belong to the (non-curved) exponential family of distributions. Hence: $$ q(\theta|\xi) = \exp\left( \xi^{T}T(\theta)+S(\theta)-A(\xi) \right)$$
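To make this concrete (a standard example, not specific to Minka's setup): a univariate Gaussian $\mathcal{N}(\mu,\sigma^{2})$ fits this form with sufficient statistic $T(\theta)=(\theta,\theta^{2})^{T}$ and natural parameter $\xi=\left(\mu/\sigma^{2},\,-1/(2\sigma^{2})\right)^{T}$, so matching $E\left[T(\theta)\right]$ means matching $E[\theta]$ and $E[\theta^{2}]$, i.e. the mean and the variance.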

We can easily show that $D_{KL}$ in the optimization problem is convex w.r.t. $\xi$, so the first-order condition (i.e. equating the derivative to zero) is sufficient for optimality. If you do the math (we also need to assume we can interchange differentiation and integration), the first-order condition is:

$$ {E}_{\hat{p}_i}\left[ T(\theta) \right] = \frac{\partial}{\partial \xi}A(\xi)$$
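(Sketch of that math, assuming differentiation under the integral sign is allowed: plugging the exponential-family form of $q$ into the KL divergence gives
$$ D_{KL}\left(\hat{p}_{i}\,\|\,q\right) = E_{\hat{p}_{i}}\left[\log\hat{p}_{i}(\theta)\right] - \xi^{T}E_{\hat{p}_{i}}\left[T(\theta)\right] - E_{\hat{p}_{i}}\left[S(\theta)\right] + A(\xi), $$
and only the terms $-\xi^{T}E_{\hat{p}_{i}}\left[T(\theta)\right] + A(\xi)$ depend on $\xi$; setting the gradient with respect to $\xi$ to zero yields exactly the condition above.)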

To see why the last equation is interpreted as moment matching, notice that for a distribution from the non-curved exponential family, the first derivative of the log-partition function $A(\xi)$ is the expectation of the sufficient statistic $T(\theta)$. So the first-order condition can be written as:

$$ {E}_{\hat{p}_i}\left[ T(\theta) \right] = E_{q}\left[T(\theta) \right]$$

This is interpreted as matching the expectation of the sufficient statistic (i.e. the moments) between the exact posterior and the approximating posterior.
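If you want to see this numerically, here is a small sketch (my own illustration; the probit factor and the NumPy/SciPy quadrature are just convenient assumptions, not part of the original argument). It builds a non-Gaussian one-step posterior, computes its first two moments, and checks that the moment-matched Gaussian has a smaller $D_{KL}(\hat{p}\,\|\,q)$ than nearby Gaussians whose moments do not match:

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

# Unnormalised "exact" one-step posterior: Gaussian prior N(0, 1) times a probit
# likelihood factor Phi(2 * theta).  This is the kind of non-Gaussian tilted
# distribution a single ADF/EP step has to approximate.
def unnorm_p(theta):
    return norm.pdf(theta, 0.0, 1.0) * norm.cdf(2.0 * theta)

# Normalising constant and first two moments of the tilted distribution (by quadrature).
Z, _ = integrate.quad(unnorm_p, -10, 10)
mean, _ = integrate.quad(lambda t: t * unnorm_p(t) / Z, -10, 10)
second, _ = integrate.quad(lambda t: t ** 2 * unnorm_p(t) / Z, -10, 10)
var = second - mean ** 2

def kl_p_q(mu, sigma):
    """KL( p || N(mu, sigma^2) ), computed numerically."""
    def integrand(t):
        p = unnorm_p(t) / Z
        return p * (np.log(p) - norm.logpdf(t, mu, sigma))
    val, _ = integrate.quad(integrand, -10, 10)
    return val

# The moment-matched Gaussian ...
print("matched  :", (mean, np.sqrt(var)), "KL =", kl_p_q(mean, np.sqrt(var)))

# ... versus nearby Gaussians whose moments do not match.
for dmu, dsig in [(0.2, 0.0), (-0.2, 0.0), (0.0, 0.1), (0.0, -0.1)]:
    mu_c, sig_c = mean + dmu, np.sqrt(var) + dsig
    print("perturbed:", (mu_c, sig_c), "KL =", kl_p_q(mu_c, sig_c))
```

The matched Gaussian should come out with the smallest KL of the candidates printed, which is what the first-order condition above predicts.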

$\endgroup$
0
$\begingroup$

Assumed density filtering approximates the Bayesian posterior by a distribution of a form we choose in advance. If we assume the posterior to be Gaussian, the method is called Gaussian density filtering.

So to approximate it, we only need to compute the mean and variance of the resulting exact one-step posterior and use them as the parameters of the Gaussian posterior; hence the name moment matching.
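As a rough sketch of what that looks like in code (an illustration under assumed names and an assumed probit likelihood, not the procedure from any particular paper): process observations one at a time, and after each observation replace the non-Gaussian one-step posterior by the Gaussian with the same mean and variance.

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

def gaussian_adf_step(mu, var, loglik):
    """One Gaussian ADF step: form the tilted posterior
    p(theta) proportional to N(theta; mu, var) * exp(loglik(theta)),
    then return its mean and variance as the new Gaussian parameters."""
    sd = np.sqrt(var)

    def tilted(theta):
        return norm.pdf(theta, mu, sd) * np.exp(loglik(theta))

    lo, hi = mu - 10 * sd, mu + 10 * sd
    Z, _ = integrate.quad(tilted, lo, hi)
    m1, _ = integrate.quad(lambda t: t * tilted(t) / Z, lo, hi)
    m2, _ = integrate.quad(lambda t: t ** 2 * tilted(t) / Z, lo, hi)
    return m1, m2 - m1 ** 2   # moment matching: new mean and variance

# Example: absorb Bernoulli observations one by one with a probit likelihood
# p(y | theta) = Phi(y * theta), y in {-1, +1}  (an illustrative choice).
mu, var = 0.0, 4.0   # prior N(0, 4)
for y in (+1, +1, -1, +1):
    mu, var = gaussian_adf_step(mu, var, lambda t, y=y: norm.logcdf(y * t))
    print(f"after y={y:+d}:  mu={mu:.3f}  var={var:.3f}")
```

Each call performs one such Gaussian projection, so after the loop `mu` and `var` parametrize the assumed Gaussian posterior given all observations seen so far (approximately, since each projection discards the non-Gaussian detail).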


Gaussian density filtering computes the exact one-step posterior $p(\theta)$ and then approximates it by a Gaussian distribution $q^{new}(\theta)$. So essentially we want the $q^{new}(\theta)$ that is closest to $p(\theta)$ in KL divergence, which for a Gaussian amounts to matching its mean and variance to those of $p(\theta)$.

A more detailed explanation of this can be found in T. Minka's thesis: https://tminka.github.io/papers/ep/minka-thesis.pdf

To see one application of assumed density filtering, you can read the article Understanding the Working of Multi-Armed Bandit in VWO.

$\endgroup$
2
  • $\begingroup$ Thank you for your answer. So you mean that the matching of the moments only occurs when the assumed density is Gaussian? What if the assumed density is a Laplace, for example? In that case, after minimizing the KL divergence, will we again get a matching of the moments? $\endgroup$ Commented Jul 2, 2021 at 6:37
  • 1
    $\begingroup$ So the broad idea is: because the true posterior is intractable, we assume a posterior distribution and try to estimate its parameters such that our assumed density is close to the true posterior. In the Gaussian case, this is done simply by matching the mean and variance. I believe the same idea should work for other distributions such as the Laplace, provided we can recover their parameters from those matched moments. $\endgroup$ Commented Jul 3, 2021 at 6:12
0
$\begingroup$

The reason this is called moment matching is that the prior and approximate posterior are generally taken to be in the exponential family, the parameters of which can be estimated by matching the appropriate moments. See Zhou, E., Fu, M. C., & Marcus, S. I. (2010). Solving continuous-state POMDPs via density projection. IEEE Transactions on Automatic Control, 55(5), 1101–1116. There are many names for this approach, which has been rediscovered in numerous fields.

$\endgroup$
