
Suppose $X,Y\in L^{2}(\Omega,\mathcal{F},\mathbb{P})$ are real-valued square-integrable random variables and that the joint density of $(X,Y)$, denoted by $f_{XY}(x,y)$, exists.

Then as is well-known, the conditional expectation of $Y$ given $X$ can be computed explicitly by the following formula (where we have used $X(\omega)$ instead of $X=x$ to emphasize that the conditional expectation is a fully-fledged random variable that is merely $\sigma(X)$-measurable):

$$\begin{align*} (1)\;\;\;\;\mathbb{E}[Y|X](\omega) &=\int_{-\infty}^{\infty}y\;f_{XY}(X(\omega),y)\;dy\;\Bigg/\;\int_{-\infty}^{\infty}f_{XY}(X(\omega),y)\;dy \\&=\frac{1}{f_{X}(X(\omega))}\int_{-\infty}^{\infty}y\;f_{XY}(X(\omega),y)\;dy. \end{align*}$$
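For concreteness, here is a quick numerical sketch of how (1) can be evaluated; the bivariate standard normal density, the correlation $\rho$, and the integration grid are just choices made for the illustration, in which case the exact answer is $\mathbb{E}[Y|X=x]=\rho x$.

```python
# A numerical sketch of formula (1), assuming a standard bivariate normal
# density with correlation rho, so that the exact answer is E[Y|X=x] = rho*x.
import numpy as np

rho = 0.6
def f_XY(x, y):
    return np.exp(-(x**2 - 2*rho*x*y + y**2) / (2*(1 - rho**2))) \
           / (2*np.pi*np.sqrt(1 - rho**2))

y = np.linspace(-10.0, 10.0, 4001)       # integration grid for the y-variable
dy = y[1] - y[0]

def cond_exp(x):
    fx = f_XY(x, y)                      # f_XY(x, .) on the grid
    return np.sum(y * fx) * dy / (np.sum(fx) * dy)   # numerator / f_X(x)

for x in (-1.0, 0.5, 2.0):
    print(x, cond_exp(x), rho * x)       # numerical value vs. the exact rho*x
```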

It is straightforward to prove that $$||Y-\mathbb{E}[Y|X]||^{2}_{L^{2}(\mathcal{F})}=\inf_{Z\in L^{2}(\sigma(X))}||Y-Z||^{2}_{L^{2}(\mathcal{F})},$$ which in turn proves (because conditional expectation is linear) that $$\mathbb{E}[Y|X]=\operatorname{proj}_{L^{2}(\sigma(X))}Y,$$ i.e. the conditional expectation operator is really an orthogonal projection from $L^{2}(\mathcal{F})$ onto the closed subspace $L^{2}(\sigma(X))$.
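Here is a small Monte Carlo sketch of this minimization property, under the assumed model $Y=X^{2}+\varepsilon$ with $X,\varepsilon$ independent standard normal, so that $\mathbb{E}[Y|X]=X^{2}$; the model and the competing predictors are just choices for the illustration.

```python
# Monte Carlo sketch of the L2-minimization property, under the assumed model
# Y = X**2 + eps (X, eps independent standard normal), so that E[Y|X] = X**2.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=1_000_000)
Y = X**2 + rng.normal(size=X.size)

# A few sigma(X)-measurable competitors; the squared distance to Y is smallest
# for E[Y|X] = X**2 (approximately 1, the variance of the noise).
candidates = {"X**2 (= E[Y|X])": X**2, "X": X, "|X|": np.abs(X), "1 + X": 1 + X}
for name, Z in candidates.items():
    print(name, np.mean((Y - Z)**2))
```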

When I first learned about this fact I was tempted to assert that, given $X$, we have \begin{align*} (2)\;\;\;\;\mathbb{E}[Y|X](\omega) &\stackrel{?}{=}\frac{(X,Y)_{L^{2}}}{||X||_{L^{2}}^{2}}X(\omega) \\&=\frac{\mathbb{E}[XY]}{\mathbb{E}[X^{2}]}X(\omega) \\&=X(\omega)\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}xy\;f_{XY}(x,y)\;dxdy\;\Bigg/\;\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}x^{2}f_{XY}(x,y)\;dxdy, \end{align*} because in a Hilbert space this can be thought of as the orthogonal projection of $Y$ onto the direction of $X$. However, the conditional expectation is the orthogonal projection onto the entire subspace $L^{2}(\sigma(X))$, not onto the one-dimensional span of $X$, so the formula above holds essentially only when $X$ is itself (a scalar multiple of) the conditional expectation, e.g. $$X(\omega)=\mathbb{E}[Y|X](\omega),$$ which defeats the purpose of using this formulation as a computational device for $\mathbb{E}[Y|X](\omega)$. Of particular note, if $\sigma(Z)=\sigma(X)$ (and $Z\in L^{2}(\sigma(X))$), then $$\mathbb{E}[Y|X](\omega)=\mathbb{E}[Y|Z](\omega).$$
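The failure of (2) is easy to see numerically. Under the same assumed model as in the sketch above, $Y=X^{2}+\varepsilon$ with $X$ standard normal, formula (2) gives a coefficient $\mathbb{E}[XY]/\mathbb{E}[X^{2}]\approx 0$, i.e. it predicts $Y$ by roughly the zero function, while $\mathbb{E}[Y|X]=X^{2}$:

```python
# Monte Carlo sketch: formula (2) vs. the true conditional expectation, under
# the assumed model Y = X**2 + eps with X, eps independent standard normal.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=1_000_000)
Y = X**2 + rng.normal(size=X.size)

c = np.mean(X*Y) / np.mean(X**2)            # coefficient in formula (2)
print("coefficient E[XY]/E[X^2]:", c)       # ~ 0, since E[X^3] = 0
print("MSE of formula (2):      ", np.mean((Y - c*X)**2))   # ~ E[Y^2] = 4
print("MSE of E[Y|X] = X^2:     ", np.mean((Y - X**2)**2))  # ~ 1 (noise variance)
```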

Still, I wonder whether (2), even though it is clearly wrong in general, is known to be useful in any way. So I have a couple of questions:

  1. Given $X$, can we easily find an orthonormal basis $\{X_{n}\}_{n>0}$ that spans $L^{2}(\sigma(X))$? By easily, I mean in terms of the marginal and joint distributions of $X$ and $Y$, without explicit reference to whatever the underlying sample space looks like. If so, then in principle (2) could be made valid by summing the formula over each $X_{n}$. Of course, I think this would likely destroy the original motivation of finding an explicit way to represent the conditional expectation using the intuition of orthogonal projections.
  2. Can we relate (1) to such an orthonormal set $\{X_{n}\}_{n>0}$? In particular, show that it arises from the sum of orthogonal projections.
  3. Are there any useful applications of orthogonal projection onto a particular random variable $X$ (as opposed to orthogonal projection onto the entire subspace $L^{2}(\sigma(X))$, which is what the conditional expectation represents)?

1 Answer


This question actually helped me clarify the meaning of linear regression, and in particular distinguish it from the conditional expectation. So I'm going to go ahead and answer it myself, but hopefully someone else can chime in with a different answer (possibly with a different application). Also, since I'm answering this for myself, it's a bit more detailed/reflective than it probably needs to be, and may not be 100% error-free.


Since $X\in L^{2}(\sigma(X))$, we can consider the linear span $\mathcal{L}(S_{X})$ of the set $S_{X}=\{1_{\Omega},X\}$ (note that $1_{\Omega}\equiv1$ is $\sigma(X)$-measurable and that $S_{X}$ is linearly independent if $X$ is not a constant): $$\mathcal{L}(S_{X})=\{Z\in L^{2}(\sigma(X))\;|\;Z=a1_{\Omega}+bX=a+bX\;\text{ for some }a,b\in\mathbb{R}\}.$$

Since $|S_{X}|=2<\infty$, $\mathcal{L}(S_{X})$ is a closed linear subspace of $L^{2}(\sigma(X))\subset L^{2}(\Omega)$, and so we can apply Hilbert space theory (actually, in this case finite-dimensional linear algebra) to find the unique $\hat{Y}\in\mathcal{L}(S_{X})$ such that $$||Y-\hat{Y}||_{2}=\inf_{Z\in\mathcal{L}(S_{X})}||Y-Z||_{2},$$ which we know is $$\hat{Y}(\omega)=(\operatorname{proj}_{\mathcal{L}(S_{X})}Y)(\omega).$$

To compute this in the usual way, we first apply the Gram-Schmidt procedure to $S_{X}$, since $(1_{\Omega},X)_{2}=\mathbb{E}X\neq0$ in general (i.e., $S_{X}$ is not an orthogonal set). Since $\operatorname{proj}_{1_{\Omega}}X=\mathbb{E}[X]\,1_{\Omega}:=\mu_{X}1_{\Omega}$, Gram-Schmidt replaces $X$ by $X-\operatorname{proj}_{1_{\Omega}}X=X-\mu_{X}$, i.e. we transform $S_{X}$ into the set $$S_{X}\mapsto\{1_{\Omega}, X-\mu_{X}\}.$$

Now $S_{X}$ is orthogonal and we have (using the usual notation $\operatorname{Var}(X)=\mathbb{E}[(X-\mu_{X})^{2}]=\sigma^{2}_{X}$ and $\operatorname{Cov}(X,Y)=\mathbb{E}[(X-\mu_{X})(Y-\mu_{Y})]=\sigma_{XY}$): $$\begin{align*} \hat{Y}(\omega) &=(\operatorname{proj}_{1_{\Omega}}Y)(\omega)+(\operatorname{proj}_{X-\mu_{X}}Y)(\omega) \\&=\frac{(1_{\Omega},Y)_{2}}{||1_{\Omega}||_{2}^{2}}1_{\Omega}(\omega)+\frac{(X-\mu_{X},Y)_{2}}{||X-\mu_{X}||_{2}^{2}}(X(\omega)-\mu_{X}) \\&=\mu_{Y}-\frac{\mathbb{E}[Y(X-\mu_{X})]}{\sigma^{2}_{X}}\mu_{X}+\frac{\mathbb{E}[Y(X-\mu_{X})]}{\sigma^{2}_{X}}X(\omega) \\&=\mu_{Y}-\frac{\sigma_{XY}}{\sigma^{2}_{X}}\mu_{X}+\frac{\sigma_{XY}}{\sigma^{2}_{X}}X(\omega) \\&:=\mu_{Y}-b\mu_{X}+bX(\omega) \\&:=a+bX(\omega) \\&\in\mathcal{L}(S_{X}). \end{align*}$$ This is of course the usual linear regression formula. Note that at the end we used the initially surprising identity $$\operatorname{Cov}(X,Y)=\mathbb{E}[(X-\mu_{X})(Y-\mu_{Y})]=\mathbb{E}[X(Y-\mu_{Y})]=\mathbb{E}[(X-\mu_{X})Y].$$
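As a sanity check, here is a short numerical sketch, with an arbitrarily chosen data-generating model, confirming that the coefficients $b=\sigma_{XY}/\sigma_{X}^{2}$ and $a=\mu_{Y}-b\mu_{X}$ derived above coincide with an ordinary least-squares fit of $Y$ on $\{1_{\Omega},X\}$:

```python
# Numerical sketch of the projection onto span{1_Omega, X}: the coefficients
# a = mu_Y - b*mu_X and b = sigma_XY / sigma_X^2 should match an ordinary
# least-squares fit of Y on (1, X). The data-generating model is an assumption.
import numpy as np

rng = np.random.default_rng(2)
X = rng.gamma(shape=2.0, scale=1.0, size=500_000)
Y = np.sin(X) + 0.5*X + rng.normal(size=X.size)

b = np.cov(X, Y, ddof=0)[0, 1] / np.var(X)   # sigma_XY / sigma_X^2
a = Y.mean() - b*X.mean()                    # mu_Y - b*mu_X

A = np.column_stack([np.ones_like(X), X])    # design matrix (1_Omega, X)
a_ls, b_ls = np.linalg.lstsq(A, Y, rcond=None)[0]
print(a, b)
print(a_ls, b_ls)                            # should agree up to rounding
```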

Note that the addition of $1_{\Omega}$ to $S_{X}$ is analogous to the addition of the constant $e^{i0}=1$ in Fourier analysis (and in dealing with any other orthonormal function basis). In particular, it frees us of the requirement that $\hat{Y}$ "pass through the origin", so that $\hat{Y}$ can be an affine linear transformation of $X$. Note that we get formula (2) in my question when we don't include $1_{\Omega}$ and just project (regress) onto $X$: $$\hat{Y}(\omega)=\frac{(X,Y)_{2}}{||X||^{2}_{2}}X(\omega),$$ or, if we want to project onto $X-\mu_{X}$ instead: $$\hat{Y}(\omega)=\frac{\sigma_{XY}}{\sigma^{2}_{X}}(X(\omega)-\mu_{X})=-b\mu_{X}+bX(\omega).$$
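Here is the no-intercept version numerically: projecting onto $X$ alone reproduces least squares through the origin, i.e. formula (2), whose coefficient in general differs from the slope of the affine fit. Again the data-generating model is just an assumption for the illustration.

```python
# Sketch of the no-intercept projections: projecting onto X alone gives
# (X,Y)_2 / ||X||_2^2, i.e. least squares through the origin, while the
# centered version gives sigma_XY / sigma_X^2. Model below is an assumption.
import numpy as np

rng = np.random.default_rng(3)
X = 2.0 + rng.normal(size=500_000)           # non-zero mean on purpose
Y = 1.0 + 0.7*X + rng.normal(size=X.size)

c = np.mean(X*Y) / np.mean(X**2)             # formula (2)
c_ls = np.linalg.lstsq(X[:, None], Y, rcond=None)[0][0]   # regression through the origin
print(c, c_ls)                               # agree; c is about 1.1 here

Xc = X - X.mean()
b = np.mean(Xc*Y) / np.mean(Xc**2)           # sigma_XY / sigma_X^2
print(b)                                     # the affine fit's slope, about 0.7, so c != b
```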

Now, in general we don't have $\mathcal{L}(S_{X})=L^{2}(\sigma(X))$, and so we don't generally expect that $\hat{Y}=\mathbb{E}[Y|X]$ for any given $X$. In particular, we expect to have $$||Y-\hat{Y}||_{2}\geq||Y-\mathbb{E}[Y|X]||_{2}$$ because the selection of $\hat{Y}$ is from a smaller subspace than that of $\mathbb{E}[Y|X]$, even though the same estimator $X$ is being used.

It is a well-known fact that if $Z$ and $X$ generate the same information ($\sigma(Z)=\sigma(X)$), then there exists an invertible Borel function $f$ such that $Z=f(X)$ and $X=f^{-1}(Z)$. Furthermore, for any such $\sigma(X)$-measurable $Z$ there exists a Borel function $g$ such that $\mathbb{E}[Y|X]=g(Z)$ (these facts help clarify that conditional expectation is projection onto a subspace defined by a $\sigma$-algebra, not by a particular random variable). Therefore, we can always express $\mathbb{E}[Y|X]$ in terms of any such $Z$ in $L^{2}(\sigma(X))$ by finding the functions $f$ and $g$ (which exist basically as a consequence of how measure-theoretic probability models information through $\sigma$-algebras): $$\mathbb{E}[Y|\sigma(X)](\omega)=\mathbb{E}[Y|X](\omega)=g(Z(\omega))=g(f(X(\omega))):=h(X(\omega)).$$

A natural question is then: under what circumstances do we have $h(x)=a+bx$, so that $$\mathbb{E}[Y|X](\omega)=a+bX(\omega)?$$ It is obvious that $h$ is affine for some choices of predictor in $L^{2}(\sigma(X))$, namely for $\mathbb{E}[Y|X]$ itself or for any $Z$ that equals $\mathbb{E}[Y|X]$ up to an affine transformation, and in that case $a$ and $b$ must coincide with the coefficients above, due to the optimization constraint defining $\mathbb{E}[Y|X]$.

However, in the real world it is unlikely we would be so lucky as to happen upon such a nice predictor variable $X$ (or have sample data available for it if we knew what it was). Nevertheless, it is quite common to see people claim that $\mathbb{E}[Y|X]=\hat{Y}$ in practice. The justification derives from a number of assumptions in a variety of different contexts, and it is almost always an approximation in the realm of statistical inference (for instance, one might start out by forcing some structure or model on $Y$, e.g., $Y=a+bX+\epsilon$, and then obtain, under some further assumptions, $\hat{Y}=\mathbb{E}[Y|X]$ as a consequence). But at least from a probability theory perspective (where we have the full population data, i.e. the random variables themselves and their distributions), we can say something definitive.

If $(X,Y)$ are jointly normally distributed, then $\mathbb{E}[Y|X]=\hat{Y}$.

The proof follows from considering the minimal-variance "hedge" $W:=Y-\frac{\sigma_{XY}}{\sigma_{X}^{2}}X$ and observing that $W$ and $X$ are independent (this follows from $\operatorname{Cov}(X,W)=0$, since $(X,W)$ are jointly normal as well), and then calculating $\mathbb{E}[Y|X]$ with the usual rules (e.g., pulling out the independent part): $$\mathbb{E}[Y|X]=\mathbb{E}\Big[W+\tfrac{\sigma_{XY}}{\sigma_{X}^{2}}X\;\Big|\;X\Big]=\mathbb{E}[W]+\tfrac{\sigma_{XY}}{\sigma_{X}^{2}}X=\mu_{Y}+\tfrac{\sigma_{XY}}{\sigma_{X}^{2}}(X-\mu_{X})=\hat{Y}.$$
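A quick Monte Carlo sketch of this argument, with an arbitrarily chosen mean vector and covariance matrix: the hedge $W$ is uncorrelated with $X$, and the empirical conditional mean of $Y$ given $X\approx x_{0}$ tracks $a+bx_{0}$.

```python
# Monte Carlo sketch of the jointly normal case (parameters chosen for the
# example): W = Y - (sigma_XY/sigma_X^2) X is uncorrelated with X, and the
# binned conditional mean of Y tracks a + b*X.
import numpy as np

rng = np.random.default_rng(4)
mean = [1.0, -2.0]
cov = [[2.0, 1.2],
       [1.2, 3.0]]
X, Y = rng.multivariate_normal(mean, cov, size=1_000_000).T

b = cov[0][1] / cov[0][0]                    # sigma_XY / sigma_X^2
a = mean[1] - b*mean[0]                      # mu_Y - b*mu_X
W = Y - b*X
print("Cov(X, W):", np.cov(X, W, ddof=0)[0, 1])            # ~ 0

# Compare E[Y | X in a small bin around x0] with a + b*x0 at a few points.
for x0 in (-1.0, 1.0, 3.0):
    mask = np.abs(X - x0) < 0.05
    print(x0, Y[mask].mean(), a + b*x0)
```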

Thus, if $Y$ is normally distributed, we can find $\mathbb{E}[Y|\mathcal{G}]$ by finding an $X\in L^{2}(\mathcal{G})$ such that $(X,Y)$ are jointly normal. But in general we still face the same problem: we don't generally have in hand the special predictor $Z\in L^{2}(\sigma(X))$ for which $\hat{Y}=\mathbb{E}[Y|X]$. Nevertheless, we can summarize the above as follows.

Given $X$ (or a $\sigma$-algebra $\mathcal{G}$), we want to find $\mathbb{E}[Y|X]\in L^{2}(\sigma(X))$ such that the $L^{2}$-norm optimization constraint holds. If we are given $\sigma(X)$-measurable variables $X_{1},X_{2},\ldots,X_{n}$, then we can estimate $\mathbb{E}[Y|X]$ (and by extension $Y$ itself) using these random variables by projecting onto $\mathcal{L}(S_{X_{1},\ldots,X_{n}})$ (after performing an orthogonalization procedure and eliminating any linearly dependent random variables). This has been illustrated for the cases $n=0$ and $n=1$. In particular, $$\hat{Y}_{0}(\omega)\equiv\mu_{Y}$$ and $$\hat{Y}_{1}(\omega)=\mu_{Y}+\frac{\sigma_{XY}}{\sigma^{2}_{X}}(X(\omega)-\mu_{X}),$$ and so on for $\hat{Y}_{2},\ldots,\hat{Y}_{n}$. If $\dim(L^{2}(\sigma(X)))=n<\infty$, then we will have $\hat{Y}_{n}=\mathbb{E}[Y|X]$. Although, given the difficulty of visualizing these $\sigma$-algebras and the resulting $L^{2}$ spaces, it's hard to say for sure when the dimension is finite and whether an orthonormal basis can be easily found in either the finite-dimensional or infinite-dimensional case. One probably has to eventually fall back on the measure-theoretic definition of $\mathbb{E}[Y|X]$ as a Radon-Nikodym derivative and the guess-and-check strategy (at least in probability theory; in statistics you do things like introduce additional regressors to span a bigger subspace of $L^{2}(\sigma(X))$, search for a better predictor function $g=g(X_{1},\ldots,X_{n})$ of the regressors already in play, or make extra modelling assumptions).
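To illustrate the successive projections $\hat{Y}_{0},\hat{Y}_{1},\ldots$ numerically, here is a sketch that uses powers of $X$ as the $\sigma(X)$-measurable regressors, under the assumed model $Y=\sin(X)+\varepsilon$, so that $\mathbb{E}[Y|X]=\sin(X)$; the choice of polynomial regressors and of the model is just for illustration.

```python
# Sketch of the successive projections Y_hat_0, Y_hat_1, ..., using powers of X
# as the sigma(X)-measurable regressors. The model Y = sin(X) + eps is assumed,
# so E[Y|X] = sin(X) and its MSE equals the noise variance (= 1 here).
import numpy as np

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=200_000)
Y = np.sin(X) + rng.normal(size=X.size)
mse_best = np.mean((Y - np.sin(X))**2)       # attained by E[Y|X] itself

for n in range(0, 8):
    A = np.column_stack([X**k for k in range(n + 1)])   # regressors 1, X, ..., X**n
    coef = np.linalg.lstsq(A, Y, rcond=None)[0]
    mse = np.mean((Y - A @ coef)**2)
    print(n, mse)                            # decreases toward mse_best as n grows
print("E[Y|X]:", mse_best)
```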

