I am reading The Elements of Statistical Learning (page 44) and came across this statement:
"From a statistical point of view, this criterion is reasonable if the training observations $(x_i, y_i)$ represent independent random draws from their population. Even if the $x_i$’s were not drawn randomly, the criterion is still valid if the $y_i$’s are conditionally independent given the inputs $x_i$."
To explore this, I considered the joint likelihood in two scenarios:
Case 1: Full Independence of $(x_i, y_i)$
If the $(x_i, y_i)$ pairs are fully independent, the joint likelihood is:
$$ P(x_1, y_1, \dots, x_n, y_n \mid \theta) = \prod_{i=1}^n P(y_i \mid x_i, \theta) P(x_i \mid \theta). $$

Case 2: Conditional Independence of $y_i$ Given $x_i$
If the $y_i$’s are conditionally independent given $x_i$, the joint likelihood is:
$$ P(x_1, y_1, \dots, x_n, y_n \mid \theta) = \left( \prod_{i=1}^n P(y_i \mid x_i, \theta) \right) P(x_1, x_2, \dots, x_n \mid \theta). $$
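One way I convinced myself of both factorizations is via the chain rule (writing $x_{1:n}$ for $(x_1, \dots, x_n)$, which is my own shorthand):

$$ P(x_{1:n}, y_{1:n} \mid \theta) = P(y_{1:n} \mid x_{1:n}, \theta)\, P(x_{1:n} \mid \theta). $$

In Case 2, conditional independence gives $P(y_{1:n} \mid x_{1:n}, \theta) = \prod_{i=1}^n P(y_i \mid x_{1:n}, \theta)$, and the further (often implicit) assumption that each $y_i$ depends on the inputs only through its own $x_i$ reduces this to $\prod_{i=1}^n P(y_i \mid x_i, \theta)$. In Case 1, the input marginal additionally factorizes, $P(x_{1:n} \mid \theta) = \prod_{i=1}^n P(x_i \mid \theta)$.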
Observations:
Marginals:
In Case 1, the marginal distribution of the inputs factorizes as $\prod_{i=1}^n P(x_i \mid \theta)$ due to the independence assumption.
In Case 2, the marginal distribution of the inputs is written as a single term $P(x_1, x_2, \dots, x_n \mid \theta)$, since no independence is assumed among the $x_i$'s.

Conditionals:
In both cases, once the input marginal is factored out, the conditional part of the likelihood is identical:
$$ \prod_{i=1}^n P(y_i \mid x_i, \theta). $$
This suggests that the structure of the conditionals does not differ, even though the assumptions about the marginals differ.
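To see this concretely, here is a small numerical sketch (the model, the names `theta_true` and `cond_loglik`, and the grid design are my own choices, not from the book): the inputs are a fixed grid, so they are certainly not i.i.d. draws, yet the $y_i$'s are conditionally independent given the $x_i$'s, and maximizing the conditional log-likelihood is exactly least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed (non-random) design: x_i on a deterministic grid, so the x_i
# are not random draws, yet y_i | x_i are independent Gaussians.
theta_true = 2.0
sigma = 0.5
x = np.linspace(0.0, 1.0, 200)
y = theta_true * x + rng.normal(0.0, sigma, size=x.size)

def cond_loglik(theta):
    # sum_i log N(y_i; theta * x_i, sigma^2) -- the only theta-dependent
    # factor of the joint likelihood in BOTH Case 1 and Case 2.
    resid = y - theta * x
    return (-0.5 * np.sum(resid**2) / sigma**2
            - x.size * np.log(sigma * np.sqrt(2.0 * np.pi)))

# Maximizing the conditional log-likelihood over theta is exactly
# least squares through the origin:
theta_hat = np.sum(x * y) / np.sum(x * x)

# theta_hat should beat nearby values of theta under cond_loglik
# and land near theta_true despite the non-random inputs.
print(theta_hat)
```

The marginal of the inputs, whether it factorizes (Case 1) or not (Case 2), contributes a constant factor in $\theta$ here, so it drops out of the maximization either way.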
Questions:
Are the equations for the joint likelihood in both cases accurate, and do they correctly reflect the respective independence or conditional independence assumptions?
If so, does the assumption of full independence of the $(x_i, y_i)$ pairs (Case 1) imply the conditional independence of the $y_i$'s given the $x_i$'s (Case 2), as the observations above suggest?