I have cross-sectional data and am using logistic regression. How do I check my data for heteroskedasticity and, if it is present, how do I deal with it in Stata?

I have come across a lot of information on testing for heteroskedasticity in linear regression with the Breusch-Pagan test (command: hettest) or White's test (command: imtest), where heteroskedasticity is then dealt with by computing robust standard errors. However, there is much less information on this issue in the case of logistic regression.
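For reference, this is the linear-regression workflow I am describing (a sketch; y, x1, and x2 are placeholder variable names, and I am using the estat forms of the tests):

    * Linear model on cross-sectional data
    regress y x1 x2

    * Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
    estat hettest

    * White's test, reported as part of the information matrix test
    estat imtest, white

    * If heteroskedasticity is detected, refit with robust standard errors
    regress y x1 x2, vce(robust)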


3 Answers


Except in a very technical sense (which @BigBendRegion's answer gets at), heteroskedasticity isn't a "thing" in a logistic regression model.

Heteroskedasticity is when the standard deviation of the errors around the regression line (that is, the average distance between the predicted Y value at a given X value and the actual Y values in your dataset for cases with those X values) gets bigger or smaller as X increases. Now, many people (myself included) would argue that heteroskedasticity isn't even that big of a problem for LINEAR regression, except when it's caused by other, more serious issues (like nonlinearity or omitted-variable bias).
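In symbols, for a linear model $Y = X\beta + \varepsilon$, the contrast is (a one-line restatement of the definition above):

$$\text{homoskedasticity: } \operatorname{Var}(\varepsilon \mid X = x) = \sigma^2 \text{ for all } x, \qquad \text{heteroskedasticity: } \operatorname{Var}(\varepsilon \mid X = x) = \sigma^2(x).$$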

But this whole concept doesn't make sense in logit because logit models don't even HAVE error terms, or rather they don't have error terms that come from the data.

To oversimplify greatly, what a logit model actually "does" is run an OLS model on an unobserved latent variable (call it y*) that represents the "propensity" to do whatever it is your binary variable Y is measuring (we assume that people with a y* over some arbitrary threshold get a Y of 1 and everyone else gets a zero). Obviously we don't know what y* looks like, so in order to specify this model we assume that the errors in this OLS model have a logistic distribution (hence the name of the model) with a standard deviation of $\pi/\sqrt{3}$ (the probit model assumes they are normally distributed with a standard deviation of 1). Through some calculus we use this assumption about the distribution of the errors in y* to get us to the logit model of Y itself. This means that the logit model doesn't have an error term, because the distribution of the errors is built into the assumptions of the model itself. So it doesn't make sense to talk about whether the errors get bigger or smaller as X increases, which is what heteroskedasticity is.
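To put the latent-variable story in symbols (a sketch of the standard derivation, with the threshold absorbed into the intercept):

$$y^* = x\beta + \varepsilon, \qquad \varepsilon \sim \text{standard logistic (mean } 0, \text{ SD } \pi/\sqrt{3}), \qquad Y = \mathbf{1}\{y^* > 0\},$$

so that

$$\Pr(Y = 1 \mid x) = \Pr(\varepsilon > -x\beta) = \frac{1}{1 + e^{-x\beta}}.$$

Because the error distribution (including its spread) is fixed by assumption rather than estimated, there is no residual variance parameter left over whose constancy could be checked against the data.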

  • I disagree that there is no error term. The observed binary response minus the conditional expectation is the error term, and its variance is as I stated in my answer. Commented Jan 1, 2021 at 15:30
  • Any comments on the part of the question asking about why you would need Robust Standard Errors for logistic regression? Commented Apr 20, 2021 at 14:10
  • Logistic regression need not be thought of as having an error term. And robust standard errors are sometimes less precise than ordinary estimates. Commented Sep 16, 2023 at 11:12
  • Heteroscedasticity is to linear regression what scale effects are to logit models. As in the linear regression case, it depends on which predictors and functional forms go into the regular formula (the mean or location structure). However, testing scale effects in logit models can be fragile due to the "complete separation" problem: it is not uncommon to get a huge coefficient on the scale effect, a singular Hessian, all standard errors reported as NA, or an algorithm that fails to converge. But they are worth examining. Williams wrote Stata packages for these methods. See my answer. Commented Apr 8 at 11:42
  • @BigBendRegion that is an artificial construct in this setting and doesn't help the discussion. Commented Apr 8 at 14:57

With the logistic regression model, heteroscedasticity is automatically assumed to exist. The conditional distribution of $Y$ given $X=x$ is assumed to be Bernoulli with parameter $\pi(x)$, a probability. The variance of this distribution is $\pi(x)\times (1-\pi(x))$, a nonconstant function of $x$. Likewise, you do not need to worry about normality. You still need to consider the linearity (in the logits) and independence assumptions, however.
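A quick way to see this built-in, nonconstant variance after fitting a logit in Stata (a sketch; y, x1, and x2 are placeholder variable names):

    logit y x1 x2

    * Fitted probabilities pi(x) and the implied Bernoulli variance
    predict pihat, pr
    generate condvar = pihat * (1 - pihat)

    * The conditional variance differs across observations by construction
    summarize condvar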

  • Any comments on the part of the question asking about why you would need Robust Standard Errors for logistic regression? Commented Apr 20, 2021 at 14:10
  • Why would you need them? The standard errors automatically account for heteroscedasticity correctly. Using robust standard errors would just add noise. Commented Apr 21, 2021 at 19:11
  • I don't think the "heteroscedasticity" you describe here is the same thing the poster asked about. Yours is the parametric variance formula for a Bernoulli-distributed random variable: the Y you gave is 0/1 binary, and its variance varies with X if we plot Y over X. The heteroscedasticity the questioner mentions is about whether the assumed error term in a latent-response formulation of logistic regression has constant variance across observations, an assumption needed to derive the model specification and likelihood function. See my answer. Commented Sep 24, 2023 at 8:03

To test whether the error term in a latent-response formulation of logistic regression has different variances across observations, a researcher has to assume particular heteroskedasticity patterns, expressed through specific predictors of the scale effect in a cumulative link model (of which binary logistic regression is a special case), and then test them, for example with likelihood ratio tests. In contrast, what @BigBendRegion describes is the variance of the binary response $Y \in \{0, 1\}$ around its assumed mean $\Pr(Y = 1 \mid X = x)$, which should not be compared to the constant error variance of linear regression.
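As a sketch in Stata, using Richard Williams' user-written oglm command (here z1 is a placeholder for a predictor suspected of driving the scale effect; check the oglm help file for the exact options in your version):

    ssc install oglm

    * Location-only model (for a binary outcome this matches a plain logit)
    oglm y x1 x2
    estimates store loc

    * Location-scale model: the latent error variance may vary with z1
    oglm y x1 x2, hetero(z1)
    estimates store locscale

    * Likelihood ratio test of the scale effect
    lrtest loc locscale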

There are two specification tests for binary and ordinal regression, the Hosmer-Lemeshow test and the Lipsitz test, that test for bias in the predicted probabilities. There is no need to use robust standard errors in discrete choice models: if the model specification (predictor inclusion and the functional forms of the location and scale structures) is correct, the robust SE is less efficient than the regular SE; if the specification is incorrect, the robust SE is attached to inconsistent point estimates, which does not correct the more substantive problem. Therefore, using robust SEs does not remedy heteroscedasticity in the error term of logistic regression.
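In Stata, the Hosmer-Lemeshow test is available after logit through estat gof (a sketch; the Lipsitz test for ordinal models is implemented in user-written packages rather than in official Stata):

    logit y x1 x2

    * Hosmer-Lemeshow goodness-of-fit test with 10 groups
    estat gof, group(10)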

Instead of using robust SEs, a researcher should perform specification tests after fitting a logistic regression model to screen for lack of fit, and should examine potential nonlinear functions of the predictors, such as interaction, squared, cubic, and logarithmic terms, in both the location and scale equations.
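For example, Stata's factor-variable notation makes such terms easy to screen in the location equation (a sketch; which terms belong in the model is a substantive question, and analogous terms can be tried in the scale equation of a location-scale model):

    * Add a squared term and an interaction to the location equation
    logit y c.x1##c.x1 c.x1#c.x2 x2
    estimates store expanded

    * The simpler specification, nested in the one above
    logit y x1 x2
    estimates store simple

    * Likelihood ratio test of the added terms
    lrtest simple expanded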

See tutorials

