Tripartio

I offer a practical answer to the question that focuses only on "when to use logistic regression, and when to use probit". It does not go into statistical details, but concentrates on decisions based on statistics. The answer depends on two main things: do you have a disciplinary preference, and do you care only about which model better fits your data?

Basic difference

Both logit and probit are statistical models that give the probability that a binary dependent (response) variable is 0 or 1. They are very similar and often give practically identical results, but because they use different functions to calculate the probabilities, their results are sometimes slightly different.
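To make the "different functions" concrete, here is a minimal sketch in plain Python comparing the two curves: the logistic function used by logit and the standard normal CDF used by probit. The rescaling constant 1.702 is not from the sources cited in this answer; it is a commonly quoted approximation constant, used here only so the two curves can be compared on the same scale.

```python
import math

def logistic_cdf(z):
    """Link used by logit: the logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-z))

def normal_cdf(z):
    """Link used by probit: the standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def max_gap(scale=1.702, lo=-6.0, hi=6.0, steps=1201):
    """Largest absolute gap between logistic_cdf(scale * z) and normal_cdf(z)
    on an evenly spaced grid. `scale` is an assumed rescaling constant."""
    gap = 0.0
    for i in range(steps):
        z = lo + (hi - lo) * i / (steps - 1)
        gap = max(gap, abs(logistic_cdf(scale * z) - normal_cdf(z)))
    return gap

if __name__ == "__main__":
    print("largest gap on the grid:", max_gap())
    # Upper-tail probabilities at z = 4: the logistic curve keeps more mass in the tail.
    print("logistic tail:", 1.0 - logistic_cdf(1.702 * 4.0))
    print("normal tail:  ", 1.0 - normal_cdf(4.0))
```

After rescaling, the two curves stay within about one percentage point of each other everywhere, which is why the fitted probabilities are usually so close. The logistic curve has heavier tails, so the two differ most, in relative terms, at very small or very large probabilities.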

Disciplinary preference

Some academic disciplines generally prefer one or the other. If you are going to publish or present your results to an academic discipline with a specific traditional preference, then let that dictate your choice so that your findings would be more readily acceptable. For example (from Methods Consultants),

Logit – also known as logistic regression – is more popular in health sciences like epidemiology partly because coefficients can be interpreted in terms of odds ratios. Probit models can be generalized to account for non-constant error variances in more advanced econometric settings (known as heteroskedastic probit models) and hence are used in some contexts by economists and political scientists.

The point is that the differences in results are so minor that your audience's ability to understand your results outweighs the small differences between the two approaches.

If all you care about is better fit...

If your research is in a discipline that does not prefer one or the other, then my study of this question (which is better, logit or probit) has led me to conclude that it is generally better to use probit, since it will almost always give a statistical fit to the data that is equal or superior to that of the logit model. The most notable exception, where logit models give a better fit, is the case of "extreme independent variables" (which I explain below).

My conclusion is based almost entirely (after searching numerous other sources) on Hahn, E.D. & Soyer, R., 2005. Probit and logit models: Differences in the multivariate realm. Available at: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.329.4866&rep=rep1&type=pdf. Here is my summary of the article's practical conclusions on whether logit or probit multivariate models provide a better fit to the data (these conclusions also apply to univariate models, but the authors only simulated effects for two independent variables):

  • In most scenarios, the logit and probit models fit the data equally well, with the following two exceptions.

  • Logit is definitely better in the case of "extreme independent variables". These are independent variables where one particularly large or small value will overwhelmingly often determine whether the dependent variable is a 0 or a 1, overriding the effects of most other variables. Hahn and Soyer formally define it thus (p. 4):

An extreme independent variable level involves the confluence of three events. First, an extreme independent variable level occurs at the upper or lower extreme of an independent variable. For example, say the independent variable x were to take on the values 1, 2, and 3.2. The extreme independent variable level would involve the values at x = 3.2 (or x = 1). Second, a substantial proportion (e.g., 60%) of the total n must be at this level. Third, the probability of success at this level should itself be extreme (e.g., greater than 99%).

  • Probit is better in the case of "random effects models" with moderate or large sample sizes (it is equal to logit for small sample sizes). For fixed effects models, probit and logit are equally good. I don't really understand what Hahn and Soyer mean by "random effects models" in their article. Although many definitions are offered (as in this Stack Exchange question), the term is in fact used ambiguously and inconsistently. But since logit is never superior to probit in this regard, the point is rendered moot by simply choosing probit.
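The three-part definition of an extreme independent variable level quoted above can be checked mechanically. The following is only a rough sketch, not Hahn and Soyer's procedure: the function name and the default thresholds (60% and 99%, which the authors give merely as examples) are assumptions, as is treating a success rate near 0 as just as "extreme" as one near 1.

```python
def extreme_levels(x, y, min_share=0.60, extreme_prob=0.99):
    """Return the levels of x that look like 'extreme independent variable
    levels' under the three quoted conditions.

    x: values of one independent variable
    y: 0/1 outcomes, same length as x
    min_share, extreme_prob: illustrative thresholds (assumptions).
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    found = []
    # Condition 1: only the lowest and highest observed levels qualify.
    for level in sorted({min(x), max(x)}):
        outcomes = [yi for xi, yi in zip(x, y) if xi == level]
        # Condition 2: a substantial share of all observations sits at this level.
        share = len(outcomes) / n
        # Condition 3: the success rate at this level is itself extreme
        # (assumption: near 0 counts as well as near 1).
        rate = sum(outcomes) / len(outcomes)
        rate_is_extreme = rate >= extreme_prob or rate <= 1.0 - extreme_prob
        if share >= min_share and rate_is_extreme:
            found.append(level)
    return found

if __name__ == "__main__":
    # 70% of observations at x = 3.2, all successes; mixed outcomes elsewhere.
    x = [3.2] * 70 + [1] * 15 + [2] * 15
    y = [1] * 70 + [0, 1, 0] * 10
    print(extreme_levels(x, y))
```

This only makes sense for an independent variable that takes a small number of distinct values, as in the quoted example (1, 2, and 3.2). For a continuous variable you would need to bin the values first, which is a further judgment call.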

Based on Hahn and Soyer's analysis, my conclusion is to always use probit models except in the case of extreme independent variables, in which case logit should be chosen. Extreme independent variables are not all that common, and should be quite easy to recognize. With this rule of thumb, it doesn't matter whether the model is a random effects model or not. Hahn and Soyer did not comment on the case where a model is a random effects model (where probit is preferred) but also has extreme independent variables (where logit is preferred). My impression from their article is that the effect of extreme independent variables is more dominant, and so logit would be preferred.
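The whole rule of thumb from this answer fits in a few lines. This is just my summary written as code; the function and argument names are made up for illustration.

```python
from typing import Optional

def choose_model(discipline_preference: Optional[str] = None,
                 has_extreme_iv: bool = False) -> str:
    """Return "logit" or "probit" following the rule of thumb in this answer.

    discipline_preference: "logit", "probit", or None if the field has no
        traditional preference (hypothetical argument name).
    has_extreme_iv: True if the data contain an extreme independent variable.
    """
    # A disciplinary preference comes first, since the differences in fit are minor.
    if discipline_preference is not None:
        preference = discipline_preference.strip().lower()
        if preference not in ("logit", "probit"):
            raise ValueError("discipline_preference must be 'logit' or 'probit'")
        return preference
    # Otherwise: logit for extreme independent variables, probit for everything else.
    return "logit" if has_extreme_iv else "probit"

if __name__ == "__main__":
    print(choose_model("logit"))               # field prefers logit
    print(choose_model(has_extreme_iv=True))   # no preference, extreme variable
    print(choose_model())                      # no preference, ordinary data
```

Whether the model is a random effects model is deliberately not an input, because under this rule it never changes the choice.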