
There is a lot of material showing the relationship between MLE and cross-entropy. Typically, these are the steps taken to show the relationship for an i.i.d. data-generating process $D = (X, Y)$:

$$ L(D) = \prod_{i=1}^N p(x_i, y_i; \theta) $$

Take the $\log$ of the likelihood and divide by the number of samples $N$; neither operation affects the optimal model parameter estimate $\theta^*$:

$$ \frac{1}{N} \log L(D) = \frac{1}{N} \sum_{i=1}^N \log p(x_i, y_i; \theta) $$

And finally, this is the expectation of the log model probability under the empirical distribution $\hat{p}_{data}$, so maximizing it is equivalent to minimizing the cross-entropy between the empirical distribution and the model distribution.

$$ \frac{1}{N} \sum_{i=1}^N \log p(x_i, y_i; \theta) = \mathbb{E}_{\hat{p}_{data}}[\log p_{model}(x, y; \theta)] $$
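
To convince myself numerically, here is a small toy check (a made-up categorical model over three labels, names chosen just for illustration) that the average log-likelihood matches the expectation under the empirical distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up categorical model over 3 labels: p_model(y; theta) via a softmax over logits.
logits = rng.normal(size=3)
p_model = np.exp(logits) / np.exp(logits).sum()

# i.i.d. samples y_1, ..., y_N from some data-generating process.
y = rng.integers(0, 3, size=1000)
N = len(y)

# Left-hand side: (1/N) * sum_i log p_model(y_i)
avg_log_lik = np.mean(np.log(p_model[y]))

# Right-hand side: E_{p_hat}[log p_model], with p_hat the empirical label frequencies.
p_hat = np.bincount(y, minlength=3) / N
expectation = np.sum(p_hat * np.log(p_model))

print(avg_log_lik, expectation)  # agree up to floating-point error
```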

I have a few questions:

  1. What if the data-generating process is not i.i.d.? Does this relationship still hold?

  2. Why is this relationship special, and how does it help with parameter estimation, given that both MLE and cross-entropy give exactly the same optimal model parameters $\theta^*$?

  • Related. (Commented Jul 13, 2024 at 19:11)
  • @Dave thank you. It gives some clarity, but I still do not think it completely answers 1. or 2. above. (Commented Jul 14, 2024 at 13:10)

1 Answer


As you correctly note, both MLE and CE give exactly the same optimal model parameters (at least in the iid case), so neither objective has a theoretical advantage for learning the usual point estimate of those parameters, beyond providing a connection between statistical inference and information theory. This theoretical connection is not surprising, since both are based on a common probability model. In practice, the CE formulation is more convenient for classification problems, and the MLE formulation for regression is often replaced by an MSE loss function.
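
To spell out that last practical remark (a standard derivation, under the added assumption of a Gaussian observation model with fixed variance $\sigma^2$ and mean $f_\theta(x)$): the per-sample NLL is

$$ -\log p(y \mid x; \theta) = \frac{\big(y - f_\theta(x)\big)^2}{2\sigma^2} + \frac{1}{2}\log\!\big(2\pi\sigma^2\big), $$

so minimizing the average NLL over $\theta$ (with $\sigma$ held fixed) is the same as minimizing the MSE, up to an additive constant and a positive scale.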

If the data-generating process (DGP) is not iid, nothing prevents us from obtaining the joint likelihood function in principle. For example, for non-iid but infinitely exchangeable sequences, such as consecutively drawing cards from an unknown deck without replacement, de Finetti's theorem tells us we can always use some hierarchical (fully) Bayesian model to calculate the joint likelihood via global and/or local latent parameters. Even for a non-exchangeable sequence of data, such as a Markov chain, we can still calculate the likelihood of any trajectory from the initial state distribution and the transition probabilities.
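
To make the Markov-chain case concrete, here is a minimal sketch (the states, $\pi_0$, $P$, and the trajectory are made up purely for illustration) of computing the joint log-likelihood of one observed, dependent sequence:

```python
import numpy as np

# Hypothetical 3-state Markov chain; pi0 and P are illustrative parameters.
pi0 = np.array([0.5, 0.3, 0.2])          # initial state distribution
P = np.array([[0.9, 0.05, 0.05],
              [0.1, 0.8,  0.1 ],
              [0.2, 0.2,  0.6 ]])         # transition probabilities

trajectory = [0, 0, 1, 2, 2]              # one observed (non-iid) sequence of states

# Joint log-likelihood: log pi0(s_1) + sum_t log P(s_t -> s_{t+1})
log_lik = np.log(pi0[trajectory[0]])
for s, s_next in zip(trajectory[:-1], trajectory[1:]):
    log_lik += np.log(P[s, s_next])

print(log_lik)
```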

Although we can often obtain a tractable joint likelihood function in non-iid cases, or approximate it via the ELBO or MCMC when it is intractable, you cannot express its negative log-likelihood (NLL) as a sum of logs of the same per-sample probability, as in the iid derivation in your question. In general you can only treat the non-iid data block as a single random variable, so it is not meaningful to form a CE: it would be a product of just two terms (one is the empirical probability of the non-iid block of observed data, the other is the above single-term NLL).
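
In symbols (my compact restatement of the point above), for a single observed non-iid block $\mathbf{x} = (x_1, \dots, x_N)$ the would-be CE collapses to a single product rather than a sum over a shared support:

$$ H(\hat{p}, q) \;=\; -\sum_{\mathbf{x}'} \hat{p}(\mathbf{x}') \log q(\mathbf{x}'; \theta) \;=\; -\,\hat{p}(\mathbf{x}) \, \log q(x_1, \dots, x_N; \theta), $$

i.e. the block's empirical probability times its single-term NLL.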

But we can sample many other, say $M$, similar cross-sectional blocks of non-iid data independently from the same non-iid DGP (e.g. MDP trajectories or a set of similar images). In such a cross-sectionally iid setting we can again form a meaningful CE by weighting each data block's NLL by the cross-sectional empirical probability $1/M$.
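
Written out (my paraphrase of the construction just described), with $M$ independently sampled blocks $\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(M)}$ the cross-sectional CE is just the iid formula applied at the level of whole blocks:

$$ H(\hat{p}, q) \;=\; -\frac{1}{M} \sum_{m=1}^{M} \log q\big(x^{(m)}_1, \dots, x^{(m)}_{N_m}; \theta\big), $$

where each observed block carries empirical probability $1/M$ and the inner joint likelihood may itself be non-iid (e.g. a Markov-chain trajectory as above).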

  • Thank you @cinch. I now understand part 1 of my questions. You mention "Then you can always define CE as the scaled negative log joint-likelihood via the empirical data distribution 1/N." For non-i.i.d. data, would computing the empirical distribution using a Dirac delta distribution (divided by the number of samples $N$) make sense and give an accurate estimate of the empirical distribution? I am not clear on this part. (Commented Jul 17, 2024 at 12:57)
  • Given any empirical data, CE is the expectation of the model NLL with respect to another, usually different, distribution, and the only meaningful one here is the empirical distribution. Yet you seem to be trying to use the already assumed true non-iid data distribution to manipulate the empirical distribution solely for the purpose of CE or KL. It is meaningless to have a CE crossed with the same assumed distribution, and in statistical inference any manipulation of the empirical data is discouraged or cautioned against. (Commented Jul 17, 2024 at 23:06)
  • As I explained in my answer, the only special thing about the non-iid case's CE is that there is only one single sequence of all the given data, with its negative log joint-likelihood (NLL), and the corresponding empirical probability is $(1/N)^N$ as the scaling constant I mentioned; therefore such a CE is essentially a product of those two terms, not the usual sum of terms over their common support. Hope this completely clarifies any of your remaining concerns. (Commented Jul 17, 2024 at 23:13)
  • "the only special thing about non-iid case's CE is that since there's only one single sequence of all given data with its negative log joint-likelihood (NLL) and the corresponding empirical probability" @cinch, if you wouldn't mind, could you derive the formula for this case, as it might make it a bit easier to grasp? In particular, I would like to understand what you mean by one single sequence of all given data with its NLL. (Commented Jul 18, 2024 at 20:11)
  • My comments were upvoted and I thought that clarified things for you. CE is $H(p,q) = -E_p[\log q]$, and you have understood that for non-iid cases we can still get the NLL $-\log q$ for any dependent sequence of observed data, at least theoretically. Also, since this joint likelihood is only about one single observed sequence in non-iid cases, the distribution $q$'s support has only one single possible value. Then to form the above meaningful CE you obviously need to sample many other similar sequences of data, say the total number of such sequences is $M$, with each sequence's empirical probability $1/M$. OTOH, (Commented Jul 18, 2024 at 23:58)
