I came across the article “MSE is Cross Entropy at Heart: Maximum Likelihood Estimation Explained”, which states:
"When training a neural network, we are trying to find the parameters of a probability distribution which is as close as possible to the distribution of the training set."
This makes sense when the model is learning the unconditional distribution of the data, assuming the true data-generating process is IID. In that case, we can write the average log likelihood as the expectation of the model's log probability under the empirical distribution of the data:
$$ \frac{1}{N} \sum_{i=1}^{N} \log p_{\theta}(x_i) \quad \text{or equivalently} \quad \mathbb{E}_{\hat{p}_{\text{data}}}[\log p_{\theta}(x)] $$
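As a sanity check on this equivalence, here is a minimal NumPy sketch of my own (a toy categorical model, not taken from the article) showing that the sample average and the expectation under the empirical distribution $\hat{p}_{\text{data}}$ give the same number:

```python
import numpy as np

# Toy categorical model over 3 symbols; this fixed probability vector
# simply stands in for p_theta(x).
p_theta = np.array([0.2, 0.5, 0.3])

# A small "training set" of symbol indices.
data = np.array([0, 1, 1, 2, 1, 0, 2, 1])
N = len(data)

# Left-hand form: average log-likelihood over the samples.
avg_loglik = np.mean(np.log(p_theta[data]))

# Right-hand form: expectation under the empirical distribution p_hat(x),
# i.e. the negative cross-entropy between p_hat and p_theta.
p_hat = np.bincount(data, minlength=3) / N
expect_form = np.sum(p_hat * np.log(p_theta))

print(avg_loglik, expect_form)  # the two numbers coincide
```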
For conditional models, we typically write a similar expression using conditional probabilities:
$$ \frac{1}{N} \sum_{i=1}^{N} \log p_{\theta}(y_i \mid x_i) \quad \text{or equivalently} \quad \mathbb{E}_{\hat{p}_{\text{data}}}[\log p_{\theta}(y \mid x)] $$
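And the conditional analogue, again just a toy construction of mine: a fixed conditional probability table standing in for $p_{\theta}(y \mid x)$, where the average over pairs $(x_i, y_i)$ matches the expectation under the empirical joint $\hat{p}_{\text{data}}(x, y)$:

```python
import numpy as np

# Toy conditional model p_theta(y | x): rows indexed by x, columns by y.
# The table is arbitrary and only stands in for a trained model.
p_theta_y_given_x = np.array([
    [0.7, 0.3],   # p(y | x = 0)
    [0.1, 0.9],   # p(y | x = 1)
])

# Paired observations (x_i, y_i).
x = np.array([0, 0, 1, 1, 1, 0])
y = np.array([0, 1, 1, 1, 0, 0])
N = len(x)

# Average conditional log-likelihood: (1/N) * sum_i log p_theta(y_i | x_i).
avg_cond_loglik = np.mean(np.log(p_theta_y_given_x[x, y]))

# Same quantity as an expectation under the empirical *joint* p_hat(x, y).
p_hat_xy = np.zeros((2, 2))
np.add.at(p_hat_xy, (x, y), 1.0 / N)
expect_form = np.sum(p_hat_xy * np.log(p_theta_y_given_x))

print(avg_cond_loglik, expect_form)  # the two numbers coincide
```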
However, I have a couple of questions regarding this formulation:
1. Conditional Independence and Cross-Entropy Equivalence:
   For conditional models, we often only assume conditional independence (see this discussion). Does this imply that, in the conditional case, the log likelihood would not always be equivalent to the cross-entropy with the empirical data distribution unless the data-generating process is IID? Is my understanding correct?

2. Log Likelihood and Conditional Empirical Distributions:
   In general, why is the log likelihood not calculated with respect to a conditional empirical data distribution for conditional models? In other words, why do we directly use the expectation
   $$ \mathbb{E}_{\hat{p}_{\text{data}}(x,y)}[\log p_{\theta}(y \mid x)] $$
   rather than formulating it in terms of a conditional empirical distribution $\hat{p}_{\text{data}}(y \mid x)$ (see the expression sketched right after this list)?
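To make question 2 concrete, what I would have expected is a formulation that factors the empirical joint as $\hat{p}_{\text{data}}(x, y) = \hat{p}_{\text{data}}(x)\,\hat{p}_{\text{data}}(y \mid x)$ and nests the expectations, something like:

$$ \mathbb{E}_{\hat{p}_{\text{data}}(x)} \Big[ \mathbb{E}_{\hat{p}_{\text{data}}(y \mid x)} \big[ \log p_{\theta}(y \mid x) \big] \Big] $$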
Any insights or references that could help clarify these points would be much appreciated!