
I came across this article: “MSE is Cross Entropy at Heart: Maximum Likelihood Estimation Explained”, which states:

"When training a neural network, we are trying to find the parameters of a probability distribution which is as close as possible to the distribution of the training set."

This makes sense when the model is learning the unconditional distribution of the data, assuming that the true data-generating process is IID. In that case, we can write the average log-likelihood as the expectation of the model's log-probability with respect to the empirical distribution of the data:

$$ \frac{1}{N} \sum_{i=1}^{N} \log p_{\theta}(x_i) \quad \text{or equivalently} \quad \mathbb{E}_{\hat{p}_{\text{data}}}[\log p_{\theta}(x)] $$
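For concreteness, here is a minimal numerical sketch of this identity with a hypothetical categorical model and toy data (not taken from the article): the sample average of $\log p_{\theta}(x_i)$ coincides with the expectation of $\log p_{\theta}(x)$ under the empirical distribution, i.e. the negative cross-entropy $-H(\hat{p}_{\text{data}}, p_{\theta})$.

    # Minimal sketch: the average log-likelihood over an iid sample equals the
    # expectation of log p_theta under the empirical distribution.
    # The categorical data and model below are hypothetical toy values.
    import numpy as np

    data = np.array([0, 1, 1, 2, 1, 0])             # observed samples x_i
    p_theta = np.array([0.2, 0.5, 0.3])             # model pmf over {0, 1, 2}

    # Sample form: (1/N) * sum_i log p_theta(x_i)
    avg_loglik = np.mean(np.log(p_theta[data]))

    # Expectation form: sum_x p_hat(x) * log p_theta(x)
    values, counts = np.unique(data, return_counts=True)
    p_hat = counts / counts.sum()                   # empirical distribution
    expected_loglik = np.sum(p_hat * np.log(p_theta[values]))

    assert np.isclose(avg_loglik, expected_loglik)  # the two forms coincide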

For conditional models, we typically write a similar expression using conditional probabilities:

$$ \frac{1}{N} \sum_{i=1}^{N} \log p_{\theta}(y_i \mid x_i) \quad \text{or equivalently} \quad \mathbb{E}_{\hat{p}_{\text{data}}}[\log p_{\theta}(y \mid x)] $$
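To connect this to the article's title, here is a minimal sketch assuming a Gaussian conditional model $y \mid x \sim \mathcal{N}(f_{\theta}(x), \sigma^2)$ with fixed $\sigma$ (the toy data and the linear $f_{\theta}$ below are hypothetical): the average conditional log-likelihood equals $-\tfrac{1}{2}\log(2\pi\sigma^2) - \text{MSE}/(2\sigma^2)$, so maximizing it is the same as minimizing the MSE.

    # Sketch under an assumed Gaussian conditional model with fixed sigma:
    # maximizing the average conditional log-likelihood is equivalent to
    # minimizing the MSE. The data and f_theta below are hypothetical.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    y = 2.0 * x + rng.normal(scale=0.5, size=100)   # toy data

    def f_theta(x, w=1.8):                          # hypothetical model mean
        return w * x

    sigma = 0.5
    avg_loglik = np.mean(-0.5 * np.log(2 * np.pi * sigma**2)
                         - (y - f_theta(x))**2 / (2 * sigma**2))
    mse = np.mean((y - f_theta(x))**2)

    # Identical up to an additive constant that does not depend on theta:
    assert np.isclose(avg_loglik,
                      -0.5 * np.log(2 * np.pi * sigma**2) - mse / (2 * sigma**2))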

However, I have a couple of questions regarding this formulation:

  1. Conditional Independence and Cross-Entropy Equivalence:
    For conditional models, we often only assume conditional independence (see this discussion). Does this imply that the log likelihood in the conditional case would not always be equivalent to the cross-entropy with the empirical data distribution unless the data-generating process is IID? Is my understanding correct?

  2. Log Likelihood and Conditional Empirical Distributions:
    In general, why is the log likelihood not calculated with respect to a conditional empirical data distribution for conditional models? In other words, why do we directly use the expectation:

    $$ \mathbb{E}_{\hat{p}_{\text{data}}(x,y)}[\log p_{\theta}(y \mid x)] $$

    rather than formulating it in terms of a conditional empirical distribution $\hat{p}_{\text{data}}(y \mid x)$?

Any insights or references that could help clarify these points would be much appreciated!

  • The main common thing is that you can derive both the cross-entropy loss and MSE as forms of the log-likelihood function. That is it, not more, not less. So stop hand-waving and start with the likelihood function. Commented Feb 17 at 18:32

1 Answer


When we use the empirical data distribution $\hat{p}_{\text{data}}(x,y)$ to approximate the true data-generating process, the iid assumption guarantees that each sample is an independent draw from the true distribution. This is what underwrites the consistency of MLE: the estimated parameter values converge to the true parameter values as more data are collected. If the data are instead dependent (non-iid), the effective sample size (ESS) can be much smaller than the actual number of observations, meaning that the bias or variance of your MLE/cross-entropy estimate will be much higher than what you'd expect under the iid assumption, and the simple average over log-likelihoods may no longer be a good approximation of the true expected log-likelihood. In that case you may need to model the dependency structure explicitly, for example with autoregressive or state-space models (Markov transition kernels), and use the autocorrelation function (ACF) to account for the reduced effective sample size and to adjust the confidence intervals of the parameter estimates.

Hence, while the MLE for $\theta$ remains consistent under the conditional independence assumption, dependencies among the $x_i$ reduce the effective sample size in finite samples, and the confidence intervals for $\theta$ may be misestimated if such dependencies are ignored. In practice, one needs to account for these dependencies, for example with an autoregressive model.
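To illustrate the reduced-ESS point numerically, here is a small sketch with a hypothetical AR(1) series with autocorrelation $\phi$, for which $\text{ESS} \approx N(1-\phi)/(1+\phi)$; the iid-based standard error of the sample mean is then far too optimistic compared with the ESS-based one.

    # Sketch: effective sample size of an AR(1) series with autocorrelation phi.
    # The simulation is illustrative only, not part of the derivation above.
    import numpy as np

    rng = np.random.default_rng(1)
    N, phi = 5000, 0.9
    x = np.zeros(N)
    for t in range(1, N):
        x[t] = phi * x[t - 1] + rng.normal()

    ess = N * (1 - phi) / (1 + phi)                 # roughly 263 here

    # The iid formula var(x)/N understates the variance of the sample mean;
    # var(x)/ESS is the appropriate correction for this dependent series.
    print(f"N = {N}, approx ESS = {ess:.0f}")
    print("iid-based SE of mean:", np.sqrt(x.var() / N))
    print("ESS-based SE of mean:", np.sqrt(x.var() / ess))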

Note - When you have multiple independent sequences, each representing a separate realization of a non-iid process, treating each sequence as an independent sample can mitigate the issue of low ESS, allowing for more accurate estimation of model parameters.

For your second sub-question: in theory, the joint empirical data distribution $\hat{p}_{\text{data}}(x,y)$ already contains all the information about the marginal $\hat{p}_{\text{data}}(x)$ and the conditional $\hat{p}_{\text{data}}(y \mid x)$ via the chain rule, so
$$ \mathbb{E}_{(x,y) \sim \hat{p}_{\text{data}}(x,y)}[\log p_{\theta}(y \mid x)] = \mathbb{E}_{x \sim \hat{p}_{\text{data}}(x)}\big[\mathbb{E}_{y \sim \hat{p}_{\text{data}}(y \mid x)}[\log p_{\theta}(y \mid x)]\big]. $$
In supervised learning we only have access to the observed pairs $(x_i, y_i)$, and the training objective is computed directly on these pairs without explicitly constructing the conditional distribution $\hat{p}_{\text{data}}(y \mid x)$ for every $x$. Moreover, with continuous or high-dimensional inputs almost every observed $x_i$ is distinct, so the empirical conditional $\hat{p}_{\text{data}}(y \mid x)$ degenerates to a point mass at the paired $y_i$, and there is nothing useful to sum over to form separate empirical marginal and conditional distributions.
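As a minimal sketch of that chain-rule identity, consider a hypothetical discrete dataset in which the $x$ values repeat, so the empirical conditional $\hat{p}_{\text{data}}(y \mid x)$ actually places mass on more than one $y$; the direct average over pairs and the nested expectation give the same number.

    # Sketch: the expectation over the empirical joint equals the nested
    # expectation E_{x ~ p_hat(x)} E_{y ~ p_hat(y|x)}. The pairs and the
    # conditional model p_theta(y|x) below are hypothetical toy values.
    import numpy as np
    from collections import Counter

    pairs = [(0, 0), (0, 1), (0, 1), (1, 1), (1, 0), (1, 1)]        # (x_i, y_i)
    p_theta = {(0, 0): 0.6, (0, 1): 0.4, (1, 0): 0.3, (1, 1): 0.7}  # p_theta(y|x)

    # Direct form: average of log p_theta(y_i | x_i) over the observed pairs,
    # i.e. the expectation under the empirical joint p_hat(x, y).
    direct = np.mean([np.log(p_theta[(x, y)]) for x, y in pairs])

    # Nested form: empirical marginal over x, then empirical conditional over y.
    joint = Counter(pairs)
    marginal = Counter(x for x, _ in pairs)
    N = len(pairs)
    nested = 0.0
    for (x, y), n_xy in joint.items():
        p_x = marginal[x] / N                       # empirical p_hat(x)
        p_y_given_x = n_xy / marginal[x]            # empirical p_hat(y|x)
        nested += p_x * p_y_given_x * np.log(p_theta[(x, y)])

    assert np.isclose(direct, nested)               # same training objective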

Note - In supervised learning the data are observed jointly as $(x_i, y_i)$ pairs, so the joint empirical distribution defined by their observed frequencies is correct by definition, and the chain rule applied to it holds exactly; intuitively, the data themselves can always be trusted.

  • Thanks! For my first sub-question: what would it mean for conditional models that only assume conditional independence (with no assumptions made on the joint distribution)? In such cases we know that the MLE converges and is consistent, as discussed at stats.stackexchange.com/q/659012/408276. What would it actually converge to, given the data could be dependent or non-identically distributed? Would the chain rule of expectation still hold, since the joint empirical distribution would be inaccurate in such a case? Are we relying on the inner expectation in such cases? Please elaborate :) Commented Feb 19 at 14:23
  • Usually the empirical data are observed jointly in supervised learning, so the joint empirical distribution given by their observed frequencies is correct by definition, and so is the chain rule applied to it; intuitively, the data themselves can always be trusted. As mentioned in my first paragraph above, while the MLE for $θ$ is consistent under the conditional independence assumption, dependencies among the $x_i$ reduce the effective sample size in finite samples, and the confidence intervals of $θ$ may be misestimated if such dependencies are ignored. In practice one needs to account for these dependencies, for example with an autoregressive model. Commented Feb 20 at 7:52
  • I had some additional thoughts after reading your answer here: stats.stackexchange.com/a/651121/408276. If we have $M$ independent sequential samples from the same non-IID process and we consider each sample as one entire independent sequence, there should not be an issue of low ESS, right? Only if each sample (sequence or otherwise) is taken from a completely different DGP would we have the problem of low ESS? Of course I'm referring to conditional models assuming conditional independence and the same model $p(y_i|x_i)$. Commented Feb 21 at 9:36
  • Sorry, another question comes to mind: you mentioned that the joint empirical data distribution $\hat{p}_{\text{data}}(x,y)$ already contains all the information about the marginal distribution $\hat{p}_{\text{data}}(x)$ and the conditional $\hat{p}_{\text{data}}(y \mid x)$ via the chain rule. Let's say we only use the inner expectation; would we get the same parameter estimate? @cinch please let me know if you think we should open a separate question for this, as I appreciate these are a few more questions you would need time on. Commented Feb 21 at 15:17
  • For your first sub-question above, you're right. When you have multiple independent sequences, each representing a separate realization of a non-iid process, treating each sequence as an independent sample can mitigate the issue of low ESS, allowing for more accurate estimation of model parameters. However, if each sequence is drawn from a different DGP, the assumption of conditional independence may not hold. In that case the data points within each sequence might exhibit dependencies that are not captured by the model, leading to a smaller ESS. Commented Feb 22 at 8:16
