
I've been using "conditional random variables" as a notation aid with some good success in problem solving. But I've heard people claim that one shouldn't define conditional random variables.

By a conditional random variable for $X$ given $Y$, I mean a "pseudo" random variable $(X \mid Y)$ with the density function $f_{X \mid Y = y}(x) = \frac{f_{X,Y}(x,y)}{f_Y(y)}$.
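To make that definition concrete, here is a minimal numerical sketch (the bivariate normal model, the correlation $\rho = 0.6$, and the point $y = 0.5$ are my own illustrative choices, not part of the question): for a standard bivariate normal pair, the ratio $f_{X,Y}(x,y)/f_Y(y)$ matches the known closed-form conditional density $N(\rho y,\, 1-\rho^2)$.

```python
import numpy as np
from scipy import stats

# Sketch: for a standard bivariate normal (X, Y) with correlation rho,
# f_{X|Y=y}(x) = f_{X,Y}(x, y) / f_Y(y) is again a normal density,
# N(rho * y, 1 - rho**2). We check the ratio against that known form.
rho = 0.6   # illustrative correlation
y = 0.5     # illustrative conditioning point
xs = np.linspace(-3, 3, 7)

joint = stats.multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])
f_joint = joint.pdf(np.column_stack([xs, np.full_like(xs, y)]))
f_Y = stats.norm.pdf(y)          # marginal density of Y at y
f_cond = f_joint / f_Y           # the ratio in the definition above

f_known = stats.norm.pdf(xs, loc=rho * y, scale=np.sqrt(1 - rho**2))
assert np.allclose(f_cond, f_known)
```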

Does this path lead to ambiguity or contradiction? It seems pretty straightforward to interpret $(X \mid Y)$ as a function from the sample space of $Y$ to the random variable $X$, so that $X$ is a random random variable. But is this abuse of notation sound?

More generally, what kinds of functions can be composed to make random variables while remaining consistent with "the" axioms of probability (i.e., some sensible foundation)?

Perhaps tangentially, is there a categorical interpretation? In particular, it would be nice if $(X|Y)$ and $Y$ are an adjoint pair.


This question has received some attention recently, so I thought I'd try to clarify it again:

I guess my question is "how can we define choosing a random variable randomly?" After all, we can pick a random matrix, random people, random heights, etc. So why not arbitrary real functions?

Presumably, this would require a probability distribution to assign densities to real functions. This may not even be possible in the "general" case, and this might be a reason why the construction I'm trying to get at is unsound.
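One setting where a distribution over real functions does exist (at least when restricted to finite grids of evaluation points) is a Gaussian process. The sketch below is only an illustration of that idea; the RBF kernel, length scale, and grid are assumptions of mine, not anything from the question.

```python
import numpy as np

# Sketch: a Gaussian process puts a distribution on real functions.
# Sampling it on a finite grid draws a "random function" (its values at
# the grid points) from a multivariate normal whose covariance comes
# from a kernel. Kernel choice (RBF, length scale 1.0) is illustrative.
rng = np.random.default_rng(0)
grid = np.linspace(0, 1, 50)

def rbf_kernel(s, t, length_scale=1.0):
    return np.exp(-0.5 * ((s[:, None] - t[None, :]) / length_scale) ** 2)

cov = rbf_kernel(grid, grid) + 1e-9 * np.eye(len(grid))  # jitter for stability
random_function = rng.multivariate_normal(np.zeros(len(grid)), cov)
# random_function[i] is the sampled function's value at grid[i].
```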

But it certainly seems that we can define conditional random variables for "classes" of random variables, for example by treating a parameter of a probability distribution as a random variable.
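A minimal sketch of that idea, assuming a Gamma-distributed Poisson rate (a standard hierarchical example of my own choosing, not something from the question itself):

```python
import numpy as np

# Sketch: "treating a parameter of a distribution as a random variable".
# Here the Poisson rate lambda is itself random (Gamma-distributed), so
# X | lambda ~ Poisson(lambda) is a conditional random variable in the
# informal sense above: first draw the distribution, then draw X.
rng = np.random.default_rng(0)
lam = rng.gamma(shape=3.0, scale=2.0, size=100_000)  # random parameter
x = rng.poisson(lam)                                 # X drawn from Poisson(lam)

# Marginally X is negative binomial; the hierarchy shows up as
# overdispersion: Var(X) > E(X), unlike a fixed-parameter Poisson.
print(x.mean(), x.var())
```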

Conditional expectation seems to be another instance of the idea.

So there seems to be a tension between these instances and the "fact" that it can't be done in general. I am hoping someone can speak to that tension. :-)

  • "It seems pretty straightforward to interpret (X|Y) as a function from the sample space of Y to the random variable X, so that X is a random random variable." I would be VERY curious to see that. Actually, what most people object to about the notation $(X\mid Y)$ is that it does not correspond to any random variable. So, if you have an idea in this direction, please share! (Commented Dec 19, 2013 at 7:47)
  • If $\{ Y = y \}$ doesn't have measure zero and can therefore be rescaled into a probability space, you're just talking about the restriction of $X$ to that space, which is a random variable whose density is the conditional density. (Commented Dec 19, 2013 at 11:34)
  • Could you expand your comment "defining" $X\mid Y$ as a random variable? I am not sure I follow. Let $Z=(X\mid Y)$. You are saying that for each $\omega$ in $\Omega$, $Z(\omega)$ is... what, exactly? Your idea seems to be different from Michael's (which runs into its own problems, by the way), but let us stick to your version. (Commented Dec 19, 2013 at 18:37)
  • The "support" of $X$? Why on Earth should the support be involved? Whether $X(\omega)=0$ or $X(\omega)=42$ should make no difference. // Next: are you aware that restricting $X$ to (a subset of) $\{\omega\}$ yields a function defined on (at most) a singleton? Thus your suggestion is that $(X\mid Y)(\omega)(\omega)=X(\omega)$ and that $(X\mid Y)(\omega)(\omega')$ is undefined when $\omega'\ne\omega$... So $Y$ disappeared? // To sum up, I am sorry, but all this is absurd and definitely not how the conditioning of random variables is defined. (Unrelated: please use @.) (Commented Dec 20, 2013 at 8:08)
  • Then this collapses from the other side, which is that one wants $(X\mid Y)$ to be a random variable, that is, to be defined on $\Omega$, not on some collection of subsets of $\Omega$. (But, frankly, to use $\omega$ to denote subsets of $\Omega$ is pushing the idiosyncrasy a little too far...) (Commented Dec 20, 2013 at 21:33)

4 Answers


But is this abuse of notation sound?

As others have noted in the comments, the answer is not quite. But to inform your understanding of why not, it may be helpful for you to read about the concept of conditional expectation, which may be the closest formal approximation of what you're trying to get at.

The setup for the definition requires you to brush up on your measure-theoretic probability, and consists of:

  • A probability space $(\Omega, \mathcal{F}, P)$.
  • A random variable $X : \Omega \to \mathbb{R}^n$.
  • Another random variable $Y : \Omega \to U$ (where $(U, \Sigma)$ is some other measure space).

The conditional expectation $\mathbb{E}( X \mid Y )$ is, in a precise sense, the $L_2$-closest $Y^{-1}(\Sigma)$-measurable approximation of $X$. That is, it answers the question: what is the most that we can know about $X$ given the information we can glean from observing $Y$?

More formally, letting $\mathcal{H} = Y^{-1}(\Sigma)$, $\mathbb{E}(X \mid Y)$ is an $\mathcal{H}$-measurable random variable (i.e. it is as "coarse" as $Y$) which is guaranteed to agree with $X$ on average over any event $H \in \mathcal{H}$:

$$ \int_H \mathbb{E}(X \mid Y) \, dP = \int_H X \, dP. $$ Its existence is proved via the Radon-Nikodym theorem.
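A small simulation can make the defining property tangible. The setup below (a discrete $Y$ and a Gaussian noise model) is my own illustrative example, not part of the answer: $\mathbb{E}(X \mid Y)$ is the group mean on each event $\{Y = k\}$, and its integral over each such event matches that of $X$.

```python
import numpy as np

# Sketch verifying the defining property on a discrete example: for each
# event H = {Y = k} in sigma(Y), the averages of E(X|Y) and X over H agree.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=100_000)   # Y takes values 0, 1, 2
x = y + rng.normal(size=y.size)        # X depends on Y plus noise

# E(X|Y) is sigma(Y)-measurable: constant on each event {Y = k}.
cond_exp = np.zeros_like(x)
for k in range(3):
    cond_exp[y == k] = x[y == k].mean()

# Check: the integral of E(X|Y) over H equals that of X (Monte Carlo).
for k in range(3):
    H = (y == k)
    assert np.isclose(cond_exp[H].mean(), x[H].mean())
```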

And furthermore,

Perhaps tangentially, is there a categorical interpretation?

while I don't have a strong grasp of category theory and so won't try to explain it in categorical terms, conditional expectation does have a nice interpretation in terms of factorization / commutative diagrams, as can be seen on the Wikipedia page :)


In the paper (Jardine et al., 2006), the following notation was used:

It is defined as the conditional random variable: $$T - t|T>t, Z(t),$$ where $T$ denotes the random variable of time to failure, $t$ is the current age and $Z(t)$ is the past condition profile up to the current time.

Reference: Jardine, A. K. S., Lin, D., and Banjevic, D. (2006). A review on machinery diagnostics and prognostics implementing condition-based maintenance. Mechanical Systems and Signal Processing, 20, 1483–1510.
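As an aside, the conditional variable in that definition is easy to approximate by simulation if one ignores the condition profile $Z(t)$ (which a marginal simulation cannot capture). The Weibull lifetime model and all constants below are illustrative assumptions of mine, not from the paper.

```python
import numpy as np

# Sketch of the residual-life variable (T - t | T > t) by simulation,
# with an illustrative Weibull time-to-failure T.
rng = np.random.default_rng(0)
T = rng.weibull(a=1.5, size=1_000_000) * 10.0  # illustrative lifetimes
t = 5.0                                        # current age

residual = T[T > t] - t   # samples of (T - t) conditional on T > t
print(residual.mean())    # estimated mean residual life at age t
```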

$\endgroup$

The standard construction of $X$ as a function of $Y$ and another uniform random variable $E$ on $[0,1]$ independent of $Y$, such that $X=f_X(Y,E)$, is shown in the paper "Distinguishing cause from effect using observational data: methods and benchmarks", within the proof of Proposition 4.

The idea is to use the conditional cumulative distribution function

$$F_{X|y}(x)=P(X \leq x|Y=y):=\lim _{h \to 0} \frac{P(X\leq x, \ y-h<Y\leq y+h)}{P(y-h<Y\leq y+h)}$$

to transform the conditional random variable into a uniform random variable on $[0,1]$ under an absolute continuity assumption: $E:=F_{X|Y}(X)$.

Then $f_X$ can be defined by

$$f_X(y,e)=F^{-1}_{X|y}(e):=\inf\{x\in \mathbb{R} \mid e\leq F_{X|y}(x) \}, \quad e\in [0,1].$$

This should be the way to interpret $X$ as a "random random variable." More explicitly, given $y$, you can transform a uniform distribution into any distribution through the generalized inverse of its cumulative distribution function (the quantile function) and attach it to $y$ as the conditional distribution of $X$; together these form the full random variable $X$.
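A minimal sketch of this construction, assuming a standard bivariate normal with correlation $\rho$ so that the conditional quantile function is available in closed form (the distributional choice is mine, not from the cited paper):

```python
import numpy as np
from scipy import stats

# Sketch of X = f_X(Y, E) = F_{X|Y}^{-1}(E) for a bivariate normal with
# correlation rho, where X | Y = y ~ N(rho*y, 1 - rho^2) and E ~ U[0,1]
# is independent of Y.
rho = 0.6
rng = np.random.default_rng(0)
y = rng.normal(size=100_000)      # Y ~ N(0, 1)
e = rng.uniform(size=y.size)      # E ~ U[0, 1], independent of Y

def f_X(y, e):
    # generalized inverse of the conditional CDF F_{X|y} (here in closed form)
    return stats.norm.ppf(e, loc=rho * y, scale=np.sqrt(1 - rho**2))

x = f_X(y, e)
# (X, Y) should now be approximately standard bivariate normal with corr rho.
print(np.corrcoef(x, y)[0, 1])
```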

This construction is also used in the linkage concept for copulas and has applications even in finance.


If you have studied information theory, you should be familiar with information diagrams like this one: [figure: mutual information diagram]

In some sense, there is a kind of "algebra of sets" for (conditional or not) random variables. In fact, entropy acts as a "functor" with values in $(\mathbb{R}^+, \leq)$. The question is then: what is the domain of this functor? As explained by Adam, conditional random variables aren't really random variables.

Shannon's channel coding theorem hints at what the right objects could be. There is the concept of a source, represented by discrete random variables with values in an alphabet. These sources can be compressed, encoded, transmitted, and decoded to form other sources.

For example, given some source, you can encode it in blocks of size $n$ as a prefix-free binary code using Huffman's algorithm.
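A minimal sketch of that block-coding step (the alphabet, probabilities, and block size below are illustrative assumptions of mine): Huffman-coding i.i.d. blocks of size $n$ gives an expected code length per symbol between $H$ and $H + 1/n$, where $H$ is the source entropy.

```python
import heapq
from itertools import product
from math import prod, log2

probs = {"a": 0.7, "b": 0.2, "c": 0.1}  # illustrative source distribution
n = 3                                    # block size

def huffman(dist):
    """Return a prefix-free binary code (symbol -> bitstring)."""
    # Heap entries carry a counter to break probability ties without
    # comparing the code dictionaries themselves.
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(dist.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, count, merged))
        count += 1
    return heap[0][2]

# Product distribution over blocks of size n for an i.i.d. source.
blocks = {"".join(b): prod(probs[s] for s in b) for b in product(probs, repeat=n)}
code = huffman(blocks)
avg_len = sum(p * len(code[b]) for b, p in blocks.items()) / n
H = -sum(p * log2(p) for p in probs.values())
print(avg_len, H)  # H <= avg_len <= H + 1/n
```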

But you can also encode the conditional realization of a source $Y$ relative to another source $X$. Using the asymptotic equipartition property (AEP), you can show that these encodings $B_{Y^n|X^n=x^n}$ have length $nH(Y|X)\pm o(n)$ (using Landau notation here).

Using some binary-tree rebalancing, you can ensure that all your encodings have a minimal length of $nH(Y|X) - |o(n)|$, so that their prefixes of length $nH(Y|X) - |o(n)|$ form a family of surjections indexed by $x^n$. By choosing right inverses for these surjections, one can encode $B_{Y^n|X^n=x^n}$ with a prefix $B_{Y^n|X^n}$ which is a function of $y^n$ alone, and a suffix which depends on $x^n$ (and $y^n$) whose entropy is $o(n)$.

Once $B_{Y^n|X^n}$ is defined, you can, for example, speak of $B_{Y^n|B_{X^n|Y^n}}$, whose entropy is $nI(X;Y)+o(n)$. You get a real category with products and coproducts.
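For readers who want the quantities above made concrete, here is a small sketch (the joint distribution is made up) verifying the identities $H(Y\mid X) = H(X,Y) - H(X)$ and $I(X;Y) = H(Y) - H(Y\mid X)$ that these length estimates rely on.

```python
import numpy as np

p = np.array([[0.3, 0.1],
              [0.1, 0.5]])   # p[x, y] = P(X=x, Y=y), illustrative

def H(dist):
    """Shannon entropy in bits of a probability vector."""
    dist = dist[dist > 0]
    return -np.sum(dist * np.log2(dist))

px, py = p.sum(axis=1), p.sum(axis=0)     # marginals of X and Y
H_Y_given_X = H(p.ravel()) - H(px)        # chain rule: H(X,Y) = H(X) + H(Y|X)
I_XY = H(py) - H_Y_given_X                # I(X;Y) = H(Y) - H(Y|X)
print(H(px), H_Y_given_X, I_XY)
```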

Disclaimer: I have no math degree and no academic experience, but the channel coding theorem made quite an impression on me, so I tried to define this category and found a new constructive proof of the theorem as a result.

I've written down everything more clearly here, feel free to have a look: https://matovitch.github.io/itccc/.

This hasn't been reviewed, so any comments/feedback are welcome.

