I suppose it is because the models usually work with the samples packed in mini-batches. For KL divergence and Cross-Entropy, their relation can be written as $$H(p, q) = D_{KL}(p, q) + H(p) = -\sum_i{p_i\log(q_i)}$$ so we have $$D_{KL}(p, q) = H(p, q) - H(p)$$ From the equation, we can see that KL divergence can be split into a Cross-Entropy of p and q (the first part) and a global entropy of the ground truth p (the second part).
In many machine learning projects, mini-batches are used to expedite training, and the $p'$ of a mini-batch may differ from the global $p$. In such a case, Cross-Entropy is relatively more robust in practice, while KL divergence needs a more stable $H(p)$ to do its job.