
I'm a data science student, and while learning to derive the logistic regression loss function (cross-entropy loss), I found that its gradient has exactly the same form as the least-squares gradient for linear regression, even though the two loss functions look very different. Can someone explain why this is the case, or is it mere coincidence?


1 Answer


The fact that the gradient of the Cross-Entropy (CE) loss in logistic regression has the same form as the gradient of the Mean Squared Error (MSE) loss in linear regression is not a coincidence. It follows from a property of the logistic function: its derivative exactly cancels the denominators that the CE loss introduces, so both gradients reduce to the same $(y_i - \hat{y}_i)$ form. This produces an elegant connection between two seemingly very different loss functions.


1. MSE Loss

Mean Squared Error is commonly used in regression tasks and is defined as:

$$ \text{MSE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2, \qquad \nabla_w \text{MSE} = -\frac{2}{n} \sum_{i=1}^n (y_i - \hat{y}_i)\, x_i \quad \text{for a linear model } \hat{y}_i = x_i^\top w. $$
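
The $x_i$ factor comes from applying the chain rule to the linear model:

$$ \nabla_w \text{MSE} = \frac{1}{n} \sum_{i=1}^n 2\,(y_i - \hat{y}_i)\, \nabla_w (y_i - x_i^\top w) = -\frac{2}{n} \sum_{i=1}^n (y_i - \hat{y}_i)\, x_i. $$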


2. CE Loss

Cross-Entropy loss is typically used for classification tasks and is defined as:

$$ \text{CE}(y, \hat{y}) = -\frac{1}{n} \sum_{i=1}^n \left( y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right), $$

with its gradient with respect to the predictions $\hat{y}_i$ (before composing with the model):

$$ \frac{\partial\, \text{CE}}{\partial \hat{y}_i} = -\frac{1}{n} \left( \frac{y_i}{\hat{y}_i} - \frac{1 - y_i}{1 - \hat{y}_i} \right). $$
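
The two fractions come directly from differentiating the log terms of a single example:

$$ \frac{\partial}{\partial \hat{y}_i}\Bigl[-\bigl(y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\bigr)\Bigr] = -\frac{y_i}{\hat{y}_i} + \frac{1 - y_i}{1 - \hat{y}_i}. $$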


Logistic Regression

Logistic regression predicts probabilities via the logistic function, which is a specific type of sigmoid function:

$$ \hat{y} = \sigma(Xw) = \frac{1}{1 + e^{-Xw}}. $$

The derivative of the logistic function is:

$$ \sigma'(z) = \sigma(z)(1 - \sigma(z)). $$

This property plays an essential role in simplifying the gradient computation.
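
The identity follows by differentiating $\sigma(z) = (1 + e^{-z})^{-1}$ directly:

$$ \sigma'(z) = \frac{e^{-z}}{(1 + e^{-z})^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z)\bigl(1 - \sigma(z)\bigr). $$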


Why the Gradients Match

1. Gradient Derivation

The logistic function transforms the linear predictor $Xw$ into probabilities, linking $\hat{y}$ to the weights $w$. When the gradient of the CE loss is taken with respect to $w$, with $\hat{y}_i = \sigma(x_i^\top w)$, the chain rule multiplies the derivative with respect to $\hat{y}_i$ by the derivative of the logistic function, $\sigma'(z_i) = \hat{y}_i(1 - \hat{y}_i)$, and this factor exactly cancels the denominators introduced by the CE loss. The gradient therefore simplifies to

$$ \nabla_w \text{CE} = -\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)\, x_i, $$

which has exactly the same form as $\nabla_w \text{MSE}$ for linear regression (identical up to the constant factor 2, which disappears when the MSE is defined with a $\tfrac{1}{2}$ in front).
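
Writing out the chain rule for a single example, with $z_i = x_i^\top w$ and $\hat{y}_i = \sigma(z_i)$, shows the cancellation explicitly:

$$ \frac{\partial\, \text{CE}}{\partial z_i} = \underbrace{-\frac{1}{n}\left( \frac{y_i}{\hat{y}_i} - \frac{1 - y_i}{1 - \hat{y}_i} \right)}_{\partial\, \text{CE} / \partial \hat{y}_i} \cdot \underbrace{\hat{y}_i (1 - \hat{y}_i)}_{\sigma'(z_i)} = -\frac{1}{n}\bigl( y_i (1 - \hat{y}_i) - (1 - y_i)\hat{y}_i \bigr) = -\frac{1}{n}(y_i - \hat{y}_i), $$

and multiplying by $\partial z_i / \partial w = x_i$ and summing over $i$ gives the expression above.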


2. Error Penalization

While MSE and CE penalize errors differently as loss values, the logistic transformation makes their gradients take the same form in logistic regression: the factor $\hat{y}_i(1 - \hat{y}_i)$ contributed by the logistic function's derivative is exactly what is needed to cancel the $\hat{y}_i$ and $1 - \hat{y}_i$ denominators in the CE gradient.


3. Probabilistic Foundation

CE loss is generally preferred for classification tasks because it directly models the likelihood of the data. CE penalizes confident incorrect predictions heavily, which encourages predicted probabilities that are well calibrated to the true labels and is especially beneficial for imbalanced datasets. MSE, in contrast, treats errors symmetrically and only by their magnitude; applied to sigmoid outputs it is non-convex in the weights and its gradient can become very small even for confidently wrong predictions, which tends to slow convergence.
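
The likelihood connection is easy to make explicit: for labels $y_i \in \{0, 1\}$ modelled as Bernoulli with success probability $\hat{y}_i$, minimizing CE is the same as maximizing the (scaled) log-likelihood:

$$ -\frac{1}{n} \log \prod_{i=1}^n \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{1 - y_i} = -\frac{1}{n} \sum_{i=1}^n \bigl( y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \bigr) = \text{CE}(y, \hat{y}). $$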


Applicability to Logistic Regression and Related Models

The gradient equivalence between MSE and CE is a feature of logistic regression in binary classification. This behavior is driven by the logistic function’s properties and its interaction with the loss functions. While this equivalence is often associated with logistic regression, it can extend to other models with similar structures, such as:

  • Generalized linear models fit with their canonical link function (for example, Poisson regression with the log link), whose score equations take the same $\sum_{i} (y_i - \hat{y}_i)\, x_i$ form.
  • Neural networks with sigmoid output activations trained with binary CE loss, where the same cancellation applies to the gradient with respect to the output pre-activation.

The same cancellation in fact carries over to multiclass classification: the softmax function is the canonical link for the categorical distribution, so softmax combined with categorical CE again yields a gradient of the simple form $\hat{y} - y$ with respect to the logits. For generalized linear models, however, the behavior depends on the link: with a non-canonical link such as the probit, the derivative of the link no longer cancels the terms introduced by the loss, and the gradient does not reduce to this form.
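
As an illustration, here is a minimal NumPy sketch (the data, sizes, and weights are invented for illustration) that checks numerically that the analytic gradient $\frac{1}{n} X^\top (\hat{y} - y)$ matches a finite-difference gradient of the binary CE loss through a sigmoid:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 4
X = rng.normal(size=(n, d))                      # made-up feature matrix
w_true = rng.normal(size=d)                      # made-up "true" weights used to simulate labels
y = (rng.random(n) < 1 / (1 + np.exp(-X @ w_true))).astype(float)
w = rng.normal(size=d)                           # weights at which the gradient is checked

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def ce_loss(w):
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Analytic gradient after the sigmoid-derivative cancellation:
# (1/n) * X^T (y_hat - y), the same form as the least-squares gradient.
p = sigmoid(X @ w)
grad_analytic = X.T @ (p - y) / n

# Central finite-difference gradient of the CE loss for comparison.
eps = 1e-6
grad_fd = np.array([(ce_loss(w + eps * e) - ce_loss(w - eps * e)) / (2 * eps)
                    for e in np.eye(d)])

print(np.allclose(grad_analytic, grad_fd, atol=1e-6))   # expected: True
```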

