The Bayesian bootstrap was introduced by Rubin (1981) as a Bayesian analog of the original bootstrap. Given a dataset $X=\{x_1, \dots, x_N\}$, instead of drawing weights $\pi_{n}$ from the discrete set $\left\{0, \frac{1}{N}, \ldots, \frac{N}{N}\right\}$, the Bayesian approach treats the vector of weights $\boldsymbol{\pi}$ as an unknown parameter and derives a posterior distribution for it. Rubin (1981) used the improper, non-informative prior $\prod_{i=1}^{N} \pi_{i}^{-1}$, which, when combined with the multinomial likelihood, leads to a $\text{Dirichlet}(1,\dots,1)$ posterior distribution for $\boldsymbol{\pi}$. In other words, our prior is
\begin{equation} p(\boldsymbol{\pi}) = \text{Dirichlet}(\boldsymbol{\alpha}), \quad \text{with}\ \boldsymbol{\alpha} = [0,\dots,0], \end{equation}
and the posterior is
\begin{equation} p(\boldsymbol{\pi}\mid\boldsymbol{x}) = \text{Dirichlet}(\boldsymbol{\alpha}), \quad \text{with}\ \boldsymbol{\alpha} = [1,\dots,1]. \end{equation}
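To make sure I have the mechanics right, here is a minimal sketch of how I understand one Bayesian-bootstrap replication is drawn (my own illustration in Python/NumPy with made-up data, not code from Rubin 1981): draw $\boldsymbol{\pi} \sim \text{Dirichlet}(1,\dots,1)$ and evaluate the statistic of interest under those weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data (any sample of size N would do here).
x = np.array([2.1, 3.4, 0.7, 5.2, 1.9])
N = len(x)

B = 1000                       # number of Bayesian-bootstrap replications
boot_means = np.empty(B)
for b in range(B):
    # One posterior draw of the weights: pi ~ Dirichlet(1, ..., 1).
    pi = rng.dirichlet(np.ones(N))
    # The statistic of interest (here the mean) evaluated under these weights.
    boot_means[b] = np.sum(pi * x)

# boot_means approximates the posterior distribution of the mean.
print(boot_means.mean(), boot_means.std())
```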
I was asked the following questions, which I was not able to answer:

1. How can you have a posterior distribution that (a) does not depend on the data and (b) is a uniform distribution?
2. Are both the prior and the posterior non-informative? I understand that the posterior is a uniform distribution, which is non-informative. I also see the prior referred to as a non-informative prior. Does that mean it is flat?
I believe that Section 5 in Rubin (1981) addresses these questions, but I do not fully understand that discussion. Any clarification, or pointers to what I may be misunderstanding, would be appreciated.
EDIT: I just noticed one more issue when computing the posterior. Let $d=\left(d_{1}, \ldots, d_{K}\right)$ be the vector of all possible distinct values of $X$, and let $\pi=\left(\pi_{1}, \ldots, \pi_{K}\right)$ be the associated vector of probabilities, $$ P\left(X=d_{i} \mid \pi\right)=\pi_{i}, \quad \sum_i \pi_{i}=1. $$ Let $x_{1}, \ldots, x_{n}$ be an i.i.d. sample from the distribution above and let $n_{i}$ be the number of $x_{j}$ equal to $d_{i}$. If we use the improper prior above over the sampling weights $\pi$, we can compute the posterior over $\pi$:
\begin{align*} p(\boldsymbol{\pi}|X) &\propto p(X|\boldsymbol{\pi})p(\boldsymbol{\pi})\\ & \propto \prod_{i}\pi_i^{n_i}\prod_{i}\pi_{i}^{\alpha_i-1}\\ & \propto \prod_{i}\pi_i^{n_i}\prod_{i}\pi_{i}^{-1} \quad (\text{using } \alpha_i=0)\\ & \propto \prod_i\pi_i^{n_i-1}, \end{align*} i.e., a $\text{Dirichlet}(n_1,\dots,n_K)$ distribution. How does this yield a flat Dirichlet posterior? Are we assuming $n_i=1$ for $i=1,\dots,K$? In that case, is our observation just the vector of all possible values $d=\left(d_{1}, \ldots, d_{K}\right)$, i.e., the original sample that we resample from?
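To make my confusion concrete, here is a small sketch (again my own illustration, with made-up data) of the counting involved: the $n_i$ are obtained by tallying the sample, and under the improper $\text{Dirichlet}(0,\dots,0)$ prior the posterior parameters are $\alpha_i + n_i = n_i$.

```python
import numpy as np
from collections import Counter

# Illustrative sample; d_1, ..., d_K are its distinct values.
x = [2.1, 3.4, 0.7, 3.4, 5.2]

counts = Counter(x)                    # n_i = number of x_j equal to d_i
d = sorted(counts)                     # distinct values d_1, ..., d_K
n = np.array([counts[v] for v in d])   # counts (n_1, ..., n_K)

alpha_prior = np.zeros(len(d))         # improper Dirichlet(0, ..., 0) prior
alpha_post = alpha_prior + n           # posterior parameters (n_1, ..., n_K)

print(d)           # [0.7, 2.1, 3.4, 5.2]
print(alpha_post)  # [1. 1. 2. 1.] -- flat (all ones) only if every x_j is distinct
```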