$\begingroup$

I am reading The Elements of Statistical Learning (page 44) and came across this statement:

"From a statistical point of view, this criterion is reasonable if the training observations $(x_i, y_i)$ represent independent random draws from their population. Even if the $x_i$’s were not drawn randomly, the criterion is still valid if the $y_i$’s are conditionally independent given the inputs $x_i$."

To explore this, I considered the joint likelihood in two scenarios:

  1. Case 1: Full Independence of $(x_i, y_i)$
    If the $(x_i, y_i)$ pairs are fully independent, the joint likelihood is:
    $$ P(x_1, y_1, \dots, x_n, y_n \mid \theta) = \prod_{i=1}^n P(y_i \mid x_i, \theta) P(x_i \mid \theta). $$

  2. Case 2: Conditional Independence of $y_i$ Given $x_i$
    If the $y_i$’s are conditionally independent given $x_i$, the joint likelihood is:
    $$ P(x_1, y_1, \dots, x_n, y_n \mid \theta) = \left( \prod_{i=1}^n P(y_i \mid x_i, \theta) \right) P(x_1, x_2, \dots, x_n \mid \theta). $$


Observations:

  • Marginals:
    In Case 1, the marginal distribution of the inputs factorizes as $\prod_{i=1}^n P(x_i \mid \theta)$ due to the independence assumption.
    In Case 2, the marginal distribution of the inputs is written as a single term $P(x_1, x_2, \dots, x_n \mid \theta)$ since no independence is assumed for $x_i$.

  • Conditionals:
    In both cases, after factoring out the marginal distribution of the inputs, the conditional part involving the $y_i$’s is identical:
    $$ \prod_{i=1}^n P(y_i \mid x_i, \theta). $$

This suggests that the structure of the conditionals does not differ, even though the assumptions about the marginals differ.
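To make the comparison concrete, here is a small numeric check (my own sketch, not from the book) on a toy discrete distribution with $n=2$: the inputs are made dependent on purpose, yet the Case 2 structure still produces exactly the factorized conditional $\prod_i P(y_i \mid x_i)$.

```python
# Numeric sanity check: build a tiny discrete joint in which x1, x2 are
# dependent but y1, y2 are conditionally independent given the x's, then
# verify the Case 2 factorization directly.
import numpy as np

# Dependent input distribution P(x1, x2): deliberately NOT a product of
# its marginals.
P_x = np.array([[0.40, 0.10],
                [0.05, 0.45]])

# Shared conditional P(y | x): rows indexed by x, columns by y.
P_y_given_x = np.array([[0.7, 0.3],
                        [0.2, 0.8]])

# Joint P(x1, y1, x2, y2) built from the Case 2 structure.
joint = np.einsum('ac,ab,cd->abcd', P_x, P_y_given_x, P_y_given_x)

# 1) The inputs are dependent: P(x1, x2) != P(x1) P(x2).
marg_x = joint.sum(axis=(1, 3))           # recover P(x1, x2)
prod_of_marginals = np.outer(marg_x.sum(axis=1), marg_x.sum(axis=0))
print(np.allclose(marg_x, prod_of_marginals))   # False

# 2) Yet the conditional of (y1, y2) given (x1, x2) factorizes as
#    P(y1 | x1) P(y2 | x2), exactly as in Case 2.
cond = joint / marg_x[:, None, :, None]   # P(y1, y2 | x1, x2)
factorized = np.einsum('ab,cd->abcd', P_y_given_x, P_y_given_x)
print(np.allclose(cond, factorized))      # True
```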


Questions:

  1. Are the equations for the joint likelihood in both cases accurate, and do they correctly reflect the respective independence or conditional independence assumptions?

  2. If so, does the assumption of full independence of $(x_i, y_i)$ (Case 1) imply the conditional independence of $y_i$ given $x_i$, as shown in the observations above?

$\endgroup$
  • $\begingroup$ You seem to be implicitly assuming the $x_i$ are independent of $\theta$. If the $x_i$ are independent of each other as implied by the $(x_i,y_i)$ being independent of each other, then yes, since that means $\prod P(x_i) = P(x_1,x_2, \ldots x_n)$ $\endgroup$ Commented Dec 20, 2024 at 13:32
  • $\begingroup$ @Henry I made that assumption since for discriminative models like regression we only need to estimate conditional distribution. $\endgroup$ Commented Dec 20, 2024 at 15:13
  • $\begingroup$ Neither likelihood expression looks correct. Consider expanding both out, in full detail, for the case $n=2,$ which involves four random variables $x_1,y_1,x_2,y_2.$ You do not have to reference $\theta$ in your notation for this purpose, because the question only concerns a single distribution (cc @Henry). $\endgroup$ Commented Dec 20, 2024 at 15:54
  • $\begingroup$ Doesn't stats.stackexchange.com/questions/436525 answer your question? You might also find stats.stackexchange.com/questions/116355 useful. $\endgroup$ Commented Dec 23, 2024 at 19:07
  • $\begingroup$ Here is a Meta.CV discussion about the answers to this question. $\endgroup$ Commented Dec 31, 2024 at 12:11

3 Answers

$\begingroup$

Your book's claim is rooted in the statistical assumptions underlying the regression model and in how they affect point estimation of its parameters. The randomness, or sampling mechanism, of the $x_i$ only affects how representative the data are of the underlying population. For the referenced OLS criterion to be consistent with the principles of statistical inference, all that is needed is that the full joint distribution factors through the same conditional probabilities $P(y_i \mid x_i)$ for every $i$, as you correctly showed for both hypothetical cases; the randomness of the $x_i$ themselves is not required. As to your first question: the joint-likelihood equations in both cases are accurate and correctly reflect the respective independence or conditional-independence assumptions.

Note that the Case 1 assumption that all pairs $(x_i,y_i)$ are independent implies the conditional independence of $y_i$ given $x_i$ for every $i$, consistent with your referenced claim. Case 1 is therefore a special case of Case 2, and it is only their shared ingredient, the existence of a common conditional probability $P(y_i \mid x_i)$, that makes the OLS criterion valid from the MLE point of view for estimating the regression model's parameters. In a linear regression model, for example, this conditional probability is typically assumed to be Gaussian because i.i.d. white measurement noise is the only random source, with the inputs $x_i$ treated as fixed and the parameters $\beta$ as unknown constants. Maximizing the resulting product of Gaussian densities then yields exactly the OLS criterion.
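As a small illustration of the last point, here is a sketch (my own, on made-up data) showing that with i.i.d. Gaussian noise, maximizing the conditional likelihood $\prod_i N(y_i;\, x_i^\top\beta, \sigma^2)$ is the same as minimizing the OLS criterion, so perturbing the OLS solution can only increase the negative log-likelihood:

```python
# Sketch: the OLS estimate is also the conditional Gaussian MLE.
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one input
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)       # i.i.d. Gaussian noise

# OLS via the normal equations.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

def neg_log_lik(beta, sigma=0.5):
    """Negative Gaussian conditional log-likelihood (up to a constant)."""
    resid = y - X @ beta
    return 0.5 * np.sum(resid**2) / sigma**2 + n * np.log(sigma)

# The OLS solution attains a strictly lower NLL than nearby perturbations,
# as expected when the two criteria coincide.
for delta in [np.array([0.1, 0.0]), np.array([0.0, -0.1])]:
    print(neg_log_lik(beta_ols) < neg_log_lik(beta_ols + delta))  # True
```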

$\endgroup$
  • $\begingroup$ I have a slightly unrelated followup question - How does the conditional independence assumption in regression, where $ y_i $'s are conditionally independent given $x_i $'s, relate to the traditional notion of conditional independence (as mentioned in the question in stats.stackexchange.com/questions/436525/…) ? Are they equivalent in meaning, or is the regression context a special case? $\endgroup$ Commented Jan 22 at 11:15
  • $\begingroup$ 1. Start with the product rule: $$ P(y_i, y_j | x_i, x_j) = P(y_i | y_j, x_i, x_j)P(y_j | x_i, x_j) $$ 2. Assume conditional independence of $y_i$ and $y_j$ given their predictors: $$ P(y_i | y_j, x_i, x_j) = P(y_i | x_i) $$ 3. Substitute this into the product rule: $$ P(y_i, y_j | x_i, x_j) = P(y_i | x_i)P(y_j | x_i, x_j) $$ 4. Assume independence of $y_j$ given its predictor $x_j$: $$ P(y_j | x_i, x_j) = P(y_j | x_j) $$ 5. Combine the results to get: $$ P(y_i, y_j | x_i, x_j) = P(y_i | x_i)P(y_j | x_j) $$ $\endgroup$ Commented Jan 22 at 11:39
  • $\begingroup$ I believe your derivation makes sense: it explicitly proves the existence of the factored common conditional probabilities mentioned in my answer. Since each pair $(x_i,y_i)$ comes from the same joint data-generating process, as regression models also assume, your derivation is the most general and abstract, starting from the full joint data-generating process. $\endgroup$ Commented Jan 23 at 21:04
  • $\begingroup$ Yes; since every step in your derivation is a standard application of conditional probability, there is no special proof-theoretic version, regardless of the Bayesian or frequentist interpretation. $\endgroup$ Commented Jan 23 at 21:25
  • $\begingroup$ Agreed. As I mentioned in my answer, Case 1 is really a special case of Case 2, and only their common part is needed for the book's claim. From the model theory of mathematical logic, a subset always 'implies' its superset semantically. $\endgroup$ Commented Jan 25 at 6:16
$\begingroup$

Consider a single underlying probability space with a measure $\mu_\theta$ that governs all possible realizations of $(X_i, Y_i)_{i=1}^n$. Full independence of each pair $(X_i, Y_i)$ means that for any measurable sets $A_i \subseteq \mathcal{X}$ and $B_i \subseteq \mathcal{Y}$, the measure satisfies:

$$ \mu_\theta\Bigl(\prod_{i=1}^n (A_i \times B_i)\Bigr) = \prod_{i=1}^n \mu_\theta\bigl(A_i \times B_i\bigr) $$

From the chain rule applied to each factor $P(x_i,y_i\mid\theta)$, one obtains:

$$ P(x_1,y_1,\dots,x_n,y_n\mid\theta) = \prod_{i=1}^n P(y_i\mid x_i,\theta)P(x_i\mid\theta) $$

Conditional independence of $Y_1,\dots,Y_n$ given $X_1,\dots,X_n$ requires that for almost every configuration of $(x_1,\dots,x_n)$, the conditional distribution factorizes as:

$$ P(y_1,\dots,y_n\mid x_1,\dots,x_n,\theta) = \prod_{i=1}^n P(y_i\mid x_i,\theta) $$

This imposes no factorization on $P(x_1,\dots,x_n\mid\theta)$, so the joint distribution becomes:

$$ P(x_1,y_1,\dots,x_n,y_n\mid\theta) = \Bigl[\prod_{i=1}^n P(y_i\mid x_i,\theta)\Bigr] P(x_1,\dots,x_n\mid\theta) $$

These factorizations match exactly the two scenarios discussed: full independence treats $(X_i,Y_i)$ as mutually independent pairs, while conditional independence concerns only the conditional structure of $Y_i$ given $X_i$. The former implies the latter, because if each pair is mutually independent, then:

$$ P(x_i,y_i\mid\theta) = P(y_i\mid x_i,\theta)P(x_i\mid \theta) $$

and the product over all $i$ yields:

$$ \prod_{i=1}^n P(x_i,y_i\mid\theta) = \Bigl[\prod_{i=1}^n P(y_i\mid x_i,\theta)\Bigr]\Bigl[\prod_{i=1}^n P(x_i\mid\theta)\Bigr] $$

which directly recovers the conditional factorization of the outputs.
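This implication can also be checked numerically. The following sketch (my own, not part of the answer) builds a fully independent-pairs joint for $n=2$ from a single table $P(x,y)$ and confirms both that the input marginal factorizes and that the conditional of the outputs is $\prod_i P(y_i \mid x_i)$:

```python
# Numeric check: full independence of the pairs (Case 1) implies the
# conditional factorization of Case 2.
import numpy as np

# P(x, y) for a single pair; the same table is reused for every i
# (not required, but it keeps the example short).
P_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])   # rows: x, cols: y

# Case 1 joint for n = 2: independent pairs.
joint = np.einsum('ab,cd->abcd', P_xy, P_xy)   # axes: [x1, y1, x2, y2]

P_x = P_xy.sum(axis=1)                 # marginal P(x)
P_y_given_x = P_xy / P_x[:, None]      # conditional P(y | x)

# Conditional of (y1, y2) given (x1, x2), computed from the joint.
marg_x = joint.sum(axis=(1, 3))        # P(x1, x2)
cond = joint / marg_x[:, None, :, None]

# It factorizes as prod_i P(y_i | x_i), and the input marginal
# factorizes too, as full independence requires.
factorized = np.einsum('ab,cd->abcd', P_y_given_x, P_y_given_x)
print(np.allclose(cond, factorized))              # True
print(np.allclose(marg_x, np.outer(P_x, P_x)))    # True
```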

We can also view this from the perspective of filtrations and martingale theory. Define two filtrations:

$$ \mathcal{F}_k = \sigma\bigl((X_1,Y_1),\dots,(X_k,Y_k)\bigr), \quad \mathcal{G}_k = \sigma\bigl(X_1,\dots,X_n,Y_1,\dots,Y_k\bigr) $$

In both independence regimes, one can show that:

$$ \mathbb{E}[Y_k\mid \mathcal{F}_{k-1}] = \mathbb{E}[Y_k\mid X_k], \quad \mathbb{E}[Y_k\mid \mathcal{G}_{k-1}] = \mathbb{E}[Y_k\mid X_k] $$

Hence both full independence and conditional independence imply that the sequence:

$$ M_k = \sum_{i=1}^k \bigl(Y_i - \mathbb{E}[Y_i\mid X_i]\bigr) $$

is a martingale with respect to its respective filtration. The difference emerges in the variability of this martingale: under full independence,

$$ \langle M\rangle_k = \sum_{i=1}^k \mathrm{Var}(Y_i\mid X_i) $$

whereas conditional independence allows:

$$ \langle M\rangle_k = \sum_{i=1}^k \mathrm{Var}(Y_i\mid X_i,\{X_j\}_{j\neq i}) $$

which reduces to the same sum as above if and only if:

$$ \mathrm{Var}(Y_i\mid X_i) = \mathrm{Var}(Y_i\mid X_i,\{X_j\}_{j\neq i}) $$

Full independence guarantees that no additional knowledge of other $(X_j,Y_j)$ can further reduce $\mathrm{Var}(Y_i\mid X_i)$. Under mere conditional independence, the first-order (martingale) structure is preserved but higher-order moment relationships may differ, revealing potential correlations among the residuals. Letting:

$$ \eta_k = Y_k - \mathbb{E}[Y_k\mid X_k] $$

yields a sequence of martingale differences under both structures, yet full independence enforces:

$$ \mathrm{Cov}(\eta_i,\eta_j\mid X_i,X_j)=0 \quad (i\neq j) $$

whereas conditional independence alone does not. The innovation process $\{\eta_k\}$ thus remains uncorrelated across $i,j$ only in the fully independent regime.

$\endgroup$
$\begingroup$

In the case of independent pairs

if the training observations $(x_i,y_i)$ represent independent random draws

then for a specific $k$, the distribution of $(y_k, x_k)$ is independent of the other $y_i$ and $x_i$. Conditioning on the other $x_i$ doesn't change anything about that.

You get that in both cases the distribution is just a product of individual terms

Unconditional

$$f(\mathbf{y},\mathbf{x}) = \prod_i f_i(y_i,x_i) = \prod_i g_i(y_i|x_i) h_i(x_i)$$

Conditional

$$f(\mathbf{y}|\mathbf{x}) = \frac{f(\mathbf{y},\mathbf{x})}{h(\mathbf{x})} = \frac{ \prod_i f_i(y_i,x_i)}{\prod_i h_i(x_i)} = \prod_i g_i(y_i|x_i)$$


In terms of your cases

For iid $x_i$ we have

$$P(x_1, x_2, \dots, x_n \mid \theta) = \prod_{i=1}^n P(x_i \mid \theta)$$

and Case 1 can always be written in the form of Case 2.


More generally

One way that conditioning can turn independence of the $y_i$ into dependence is when the conditioning variable $x_i$ is not independent of the $y_k, x_k$ with $k\neq i$. Collider bias is an example (see also here: Can spurious correlations exist in the (theoretical) population?).
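A quick simulation of the collider mechanism (my own construction, not from the linked post): $y_1$ and $y_2$ are independent, but both feed into $x$, so once we condition on $x$ they become strongly negatively dependent.

```python
# Collider bias: conditioning on a common effect induces dependence.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
y1 = rng.normal(size=n)
y2 = rng.normal(size=n)
x = y1 + y2 + 0.1 * rng.normal(size=n)   # collider: a child of both y's

# Unconditionally, y1 and y2 are independent (correlation ~ 0).
print(abs(np.corrcoef(y1, y2)[0, 1]) < 0.02)   # True

# "Condition on x" by restricting to a thin slice around x = 0:
# within the slice, y1 ~ -y2, a strong negative dependence.
mask = np.abs(x) < 0.1
print(np.corrcoef(y1[mask], y2[mask])[0, 1] < -0.9)   # True
```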

$\endgroup$
  • $\begingroup$ Thank you. Could you please clarify the answers to both of my questions? $\endgroup$ Commented Dec 30, 2024 at 16:11
  • $\begingroup$ If you don’t mind, could you add your answers to both the questions in my post explicitly to your answer? I’m happy to accept your answer and give the bounty post that. $\endgroup$ Commented Dec 30, 2024 at 20:48
  • $\begingroup$ Let us continue this discussion in chat. $\endgroup$ Commented Dec 30, 2024 at 22:48
  • $\begingroup$ Could you please explicitly answer my two questions? I still cannot figure out based on your existing answer unfortunately. $\endgroup$ Commented Dec 31, 2024 at 12:36
