As you correctly claimed, MLE and CE yield exactly the same optimal model parameters (at least in the iid case), so there is no theoretical advantage of either objective for learning the usual point estimate of those parameters, beyond the connection it provides between statistical inference and information theory. This theoretical connection is not surprising, since both are based on a common probability model. In practice, the CE formulation is more convenient for classification problems, while for regression the MLE formulation is often replaced by the MSE loss (which coincides with MLE under a Gaussian noise assumption).
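To make the connection explicit, here is the usual identity in the iid classification setting (using $\hat p$ for the empirical distribution and $q_\theta$ for the model, notation not in your question):

$$
-\frac{1}{N}\sum_{i=1}^{N}\log q_\theta(y_i \mid x_i)
\;=\; -\sum_{(x,y)} \hat p(x,y)\,\log q_\theta(y \mid x)
\;=\; H(\hat p, q_\theta),
$$

so minimizing the average NLL over $\theta$ is exactly minimizing the cross-entropy between the empirical distribution and the model.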
If the data generating process (DGP) is not iid, nothing prevents us, in principle, from obtaining the joint likelihood function. For example, for non-iid but exchangeable sequences, such as consecutively drawing cards from an unknown deck without replacement, De Finetti's theorem (which strictly applies to infinitely exchangeable sequences) tells us we can always use a hierarchical (fully) Bayesian model to compute the joint likelihood via global and/or local latent parameters. Even for a non-exchangeable sequence such as a Markov chain, we can still compute the likelihood of any trajectory from the initial state distribution and the transition probabilities, as sketched below.
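A minimal sketch of the Markov-chain case, assuming a made-up 3-state chain (`init` and `trans` are hypothetical parameters, not from your question); the trajectory log-likelihood factorizes as $\log p(s_0) + \sum_t \log p(s_{t+1}\mid s_t)$:

```python
import numpy as np

# Hypothetical 3-state Markov chain (illustrative numbers only).
init = np.array([0.5, 0.3, 0.2])           # initial state distribution p(s_0)
trans = np.array([[0.7, 0.2, 0.1],          # transition matrix p(s_{t+1} | s_t)
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])

def trajectory_log_likelihood(states, init, trans):
    """log p(s_0, ..., s_T) = log p(s_0) + sum_t log p(s_{t+1} | s_t)."""
    ll = np.log(init[states[0]])
    for s_prev, s_next in zip(states[:-1], states[1:]):
        ll += np.log(trans[s_prev, s_next])
    return ll

# One observed trajectory (sequence of state indices).
trajectory = [0, 0, 1, 1, 2, 1]
print(trajectory_log_likelihood(trajectory, init, trans))
```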
Although in non-iid cases we can often obtain a tractable joint likelihood function, or approximate it via the ELBO or MCMC when it is intractable, we cannot express its negative log-likelihood (NLL) as a sum of logs of one and the same per-observation probability, as in the iid case in your question. In general we can only treat the non-iid DGP's data block as a single random variable, so forming a CE is not meaningful: it would collapse to a single product of two factors, the empirical probability of the one observed non-iid block (which is 1) and that block's single-term NLL.
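Concretely (writing $x_{1:n}$ for the observed non-iid block and $q_\theta$ for its joint model, my own notation): in the iid case the NLL factorizes, whereas with a single non-iid block the "cross-entropy" degenerates to one term,

$$
\text{iid: } -\log q_\theta(x_{1:n}) = -\sum_{i=1}^n \log q_\theta(x_i),
\qquad
\text{non-iid: } H(\hat p, q_\theta) = -\sum_{\text{blocks}} \hat p(\text{block})\,\log q_\theta(\text{block}) = 1\cdot\bigl(-\log q_\theta(x_{1:n})\bigr),
$$

because the empirical distribution puts all its mass on the single observed block.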
But we can independently sample many other, say $M$, similar cross-sectional blocks of non-iid data from the same non-iid DGP (e.g., trajectories of an MDP, or a set of similar images). In this cross-sectionally iid setting we can again form a meaningful CE by weighting each data block's NLL by the cross-sectional empirical probability $1/M$ and summing, as the sketch below illustrates.
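A minimal sketch of this cross-sectional construction, reusing the same hypothetical 3-state chain as above: $M$ independently sampled trajectories play the role of iid "observations", and the CE is just their $1/M$-weighted sum of block NLLs (the helper names `sample_trajectory` and `block_nll` are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

# Same hypothetical 3-state chain as above (illustrative numbers only).
init = np.array([0.5, 0.3, 0.2])
trans = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])

def sample_trajectory(T, init, trans, rng):
    """Draw one length-T trajectory from the chain (one non-iid data block)."""
    s = rng.choice(3, p=init)
    states = [s]
    for _ in range(T - 1):
        s = rng.choice(3, p=trans[s])
        states.append(s)
    return states

def block_nll(states, init, trans):
    """NLL of one whole block: -log p(s_0) - sum_t log p(s_{t+1} | s_t)."""
    nll = -np.log(init[states[0]])
    for s_prev, s_next in zip(states[:-1], states[1:]):
        nll -= np.log(trans[s_prev, s_next])
    return nll

# M independently sampled blocks are iid cross-sectionally, so the CE is the
# 1/M-weighted sum of the per-block NLLs.
M = 100
blocks = [sample_trajectory(T=6, init=init, trans=trans, rng=rng) for _ in range(M)]
cross_entropy = sum(block_nll(b, init, trans) for b in blocks) / M
print(cross_entropy)
```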