248

324

What is the difference between Logit and Probit model?

I'm more interested here in knowing when to use logistic regression, and when to use Probit.

If there is any literature which defines it using R, that would be helpful as well.


123

The difference lies mainly in the link function.

In Logit: $\Pr(Y=1 \mid X) = [1 + e^{-X'\beta}]^{-1} $

In Probit: $\Pr(Y=1 \mid X) = \Phi(X'\beta)$ (cumulative distribution function of the standard normal)

Put another way, the logistic distribution has slightly flatter tails, i.e. the probit curve approaches the axes more quickly than the logit curve does.
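As a quick numerical sketch of that tail behavior (Python used here purely for illustration; the helper function names are my own):

```python
from math import erf, exp, sqrt

def probit_cdf(z):
    # standard normal CDF, Phi(z)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def logit_cdf(z):
    # standard logistic CDF
    return 1.0 / (1.0 + exp(-z))

# In the tails, the logistic curve stays noticeably further from 0 and 1:
for z in (-3.0, -2.0, 2.0, 3.0):
    print(z, round(probit_cdf(z), 5), round(logit_cdf(z), 5))
```

At $z=-3$ the normal CDF is already down to about 0.001, while the logistic CDF is still near 0.047.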

Logit has a more convenient interpretation than probit: logistic regression can be interpreted as modeling log odds. Usually people start the modeling with logit. You can compare the likelihood values of the fitted models to decide between logit and probit.

Well, could you give examples in which logit fails compared to probit? I cannot find the ones you have in mind. – Wok – 2014-08-13T13:44:35.017

What is the relationship between X and X' ?? – flies – 2015-10-08T17:23:27.770

6Thanks for your answer Vinux. But I also want to know when to use logit, and to use probit. I know logit is more popular than probit, and majority of the cases we use logit regression. But there are some cases where Probit models are more useful. Can you please tell me what are those cases. And how to distinguish those cases from regular cases. – Beta – 2012-01-03T09:13:06.393

5When you are concerned with tail part of the curve, sometime the selection of logit or probit matters. There is no exact rule to select probit or logit. You can select model by looking at likelihood (or log likelihood) or AIC. – vinux – 2012-01-03T09:16:18.090

11Thanks for the advice! Can you elaborate on how to select between logit vs probit? In particular: (1) How do I tell when you are concerned with the tail part of the curve? (2) How do I select a model by looking at likelihood, log likelihood, or AIC? What specifically should I look at, and how should this influence my decision about which model to use? – D.W. – 2012-03-10T01:09:34.350

Yes but, is there any theory behind which one to fits better binomial data? – skan – 2016-10-25T18:38:43.750

1@flies Here $X'$ denotes the transpose of the matrix $X$. – Mathemanic – 2016-12-02T22:30:57.047

383

A standard linear model (e.g., a simple regression model) can be thought of as having two 'parts'. These are called the *structural component* and the *random component*. For example:

$$
Y=\beta_0+\beta_1X+\varepsilon \\
\text{where } \varepsilon\sim\mathcal{N}(0,\sigma^2)
$$
The first two terms (that is, $\beta_0+\beta_1X$) constitute the structural component, and the $\varepsilon$ (which indicates a normally distributed error term) is the random component. When the response variable is not normally distributed (for example, if your response variable is binary) this approach may no longer be valid. The generalized linear model (GLiM) was developed to address such cases, and logit and probit models are special cases of GLiMs that are appropriate for binary variables (or multi-category response variables with some adaptations to the process). A GLiM has three parts, a *structural component*, a *link function*, and a *response distribution*. For example:

$$
g(\mu)=\beta_0+\beta_1X
$$
Here $\beta_0+\beta_1X$ is again the structural component, $g()$ is the link function, and $\mu$ is a mean of a conditional response distribution at a given point in the covariate space. The way we think about the structural component here doesn't really differ from how we think about it with standard linear models; in fact, that's one of the great advantages of GLiMs. Because for many distributions the variance is a function of the mean, having fit a conditional mean (and given that you stipulated a response distribution), you have automatically accounted for the analog of the random component in a linear model (N.B.: this can be more complicated in practice).

The link function is the key to GLiMs: since the distribution of the response variable is non-normal, it's what lets us connect the structural component to the response--it 'links' them (hence the name). It's also the key to your question, since the logit and probit are links (as @vinux explained), and understanding link functions will allow us to intelligently choose when to use which one. Although there can be many link functions that can be acceptable, often there is one that is special. Without wanting to get too far into the weeds (this can get very technical) the predicted mean, $\mu$, will not necessarily be mathematically the same as the response distribution's *canonical location parameter*; the link function that does equate them is the *canonical link function*. The advantage of this "is that a minimal sufficient statistic for $\beta$ exists" (German Rodriguez). The canonical link for binary response data (more specifically, the binomial distribution) is the logit. However, there are lots of functions that can map the structural component onto the interval $(0,1)$, and thus be acceptable; the probit is also popular, but there are yet other options that are sometimes used (such as the complementary log log, $\ln(-\ln(1-\mu))$, often called 'cloglog'). Thus, there are lots of possible link functions and the choice of link function can be very important. The choice should be made based on some combination of:

- Knowledge of the response distribution,
- Theoretical considerations, and
- Empirical fit to the data.

Having covered a little of conceptual background needed to understand these ideas more clearly (forgive me), I will explain how these considerations can be used to guide your choice of link. (Let me note that I think @David's comment accurately captures why different links are chosen *in practice*.) To start with, if your response variable is the outcome of a Bernoulli trial (that is, $0$ or $1$), your response distribution will be binomial, and what you are actually modeling is the probability of an observation being a $1$ (that is, $\pi(Y=1)$). As a result, any function that maps the real number line, $(-\infty,+\infty)$, to the interval $(0,1)$ will work.

From the point of view of your substantive theory, if you are thinking of your covariates as *directly* connected to the probability of success, then you would typically choose logistic regression because it is the canonical link. However, consider the following example: you are asked to model `high_Blood_Pressure` as a function of some covariates. Blood pressure itself is normally distributed in the population (I don't actually know that, but it seems reasonable prima facie); nonetheless, clinicians dichotomized it during the study (that is, they only recorded 'high-BP' or 'normal'). In this case, probit would be preferable a priori for theoretical reasons. This is what @Elvis meant by "your binary outcome depends on a hidden Gaussian variable". Another consideration is that both logit and probit are *symmetrical*; if you believe that the probability of success rises slowly from zero, but then tapers off more quickly as it approaches one, the cloglog is called for, etc.

Lastly, note that the empirical fit of the model to the data is unlikely to be of assistance in selecting a link, unless the shapes of the link functions in question differ substantially (which the logit and probit do not). For instance, consider the following simulation:

```
set.seed(1)
probLower = vector(length=1000)

for (i in 1:1000) {
    # generate data from a true probit model
    x = rnorm(1000)
    y = rbinom(n=1000, size=1, prob=pnorm(x))
    # fit both links & record whether probit yields the lower deviance
    logitModel  = glm(y~x, family=binomial(link="logit"))
    probitModel = glm(y~x, family=binomial(link="probit"))
    probLower[i] = deviance(probitModel) < deviance(logitModel)
}
sum(probLower)/1000    # proportion of runs where probit fit better
[1] 0.695
```

Even when we know the data were generated by a probit model, and we have 1000 data points, the probit model only yields a better fit 70% of the time, and even then, often by only a trivial amount. Consider the last iteration:

```
deviance(probitModel)
[1] 1025.759
deviance(logitModel)
[1] 1026.366
deviance(logitModel)-deviance(probitModel)
[1] 0.6076806
```

The reason for this is simply that the logit and probit link functions yield very similar outputs when given the same inputs.

The logit and probit functions are practically identical, except that the logit is slightly further from the bounds when they 'turn the corner', as @vinux stated. (Note that to get the logit and the probit to align optimally, the logit's $\beta_1$ must be $\approx 1.7$ times the corresponding slope value for the probit. In addition, I could have shifted the cloglog over slightly so that they would lay on top of each other more, but I left it to the side to keep the figure more readable.) Notice that the cloglog is asymmetrical whereas the others are not; it starts pulling away from 0 earlier, but more slowly, and approaches close to 1 and then turns sharply.
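The $\approx 1.7$ rescaling is easy to verify numerically; a small Python sketch (standard library only, used here for illustration):

```python
from math import erf, exp, sqrt

def phi(z):
    # standard normal CDF
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def logistic(z):
    # standard logistic CDF
    return 1.0 / (1.0 + exp(-z))

# maximum gap between the probit curve and a (rescaled) logit curve on a grid
grid = [i / 100.0 for i in range(-500, 501)]
gap_scaled   = max(abs(phi(z) - logistic(1.70 * z)) for z in grid)
gap_unscaled = max(abs(phi(z) - logistic(z)) for z in grid)
print(gap_scaled, gap_unscaled)
```

With the logit slope scaled by about 1.70, the maximum vertical gap between the two curves drops to roughly 0.01, which is one reason empirical fit so rarely distinguishes them.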

A couple more things can be said about link functions. First, considering the *identity function* ($g(\eta)=\eta$) as a link function allows us to understand the standard linear model as a special case of the generalized linear model (that is, the response distribution is normal, and the link is the identity function). It's also important to recognize that whatever transformation the link instantiates is properly applied to the *parameter* governing the response distribution (that is, $\mu$), not the actual response *data*. Finally, because in practice we never have the underlying parameter to transform, in discussions of these models, often what is considered to be the actual link is left implicit and the model is represented by the *inverse* of the link function applied to the structural component instead. That is:

$$
\mu=g^{-1}(\beta_0+\beta_1X)
$$
For instance, logistic regression is usually represented:
$$
\pi(Y)=\frac{\exp(\beta_0+\beta_1X)}{1+\exp(\beta_0+\beta_1X)}
$$
instead of:
$$
\ln\left(\frac{\pi(Y)}{1-\pi(Y)}\right)=\beta_0+\beta_1X
$$

For a quick and clear, but solid, overview of the generalized linear model, see chapter 10 of Fitzmaurice, Laird, & Ware (2004) (on which I leaned for parts of this answer, although since this is my own adaptation of that--and other--material, any mistakes would be my own). For how to fit these models in R, check out the documentation for the `glm()` function (`?glm`) in the **stats** package, which is loaded by default.

*(One final note added later:)* I occasionally hear people say that you shouldn't use the probit, because it can't be interpreted. This is not true, although the interpretation of the betas is less intuitive. With logistic regression, a one unit change in $X_1$ is associated with a $\beta_1$ change in the log odds of 'success' (alternatively, an $\exp(\beta_1)$-fold change in the odds), all else being equal. With a probit, this would be a change of $\beta_1\text{ }z$'s. (Think of two observations in a dataset with $z$-scores of 1 and 2, for example.) To convert these into predicted *probabilities*, you can pass them through the normal CDF, or look them up on a $z$-table.
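As a sketch of that conversion (the coefficient values below are hypothetical; Python for illustration):

```python
from math import erf, sqrt

def phi(z):
    # standard normal CDF (the programmatic equivalent of a z-table lookup)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# hypothetical probit fit: linear predictor = b0 + b1 * x
b0, b1 = -1.0, 0.5

for x in (0.0, 1.0, 2.0):
    z = b0 + b1 * x                  # each unit of x shifts the z-score by b1
    print(x, z, round(phi(z), 4))    # pass through the normal CDF for a probability
```

Each one-unit step in $x$ adds $\beta_1 = 0.5$ to the $z$-score, and the predicted probability is read off the normal CDF.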

(+1 to both @vinux and @Elvis. Here I have tried to provide a broader framework within which to think about these things and then using that to address the choice between logit and probit.)

@whuber "When the response variable is not normally distributed (for example, if your response variable is binary) this approach [standard OLS] may no longer be valid." I'm sorry to bother you (again!) with this, but I find this bit confusing. I understand that there are no *unconditional* distributional assumptions on the dependent variable in OLS. Does this quote mean to imply that since the response is so wildly non-normal (i.e. a binary variable) that its *conditional* distribution given $X$ (and hence the distribution of the residuals) cannot possibly approach normality? – landroni – 2014-03-27T09:45:50.023

7@landroni, you may want to ask a new question for this. In short, if your response is binary, the conditional distribution of Y given X=xi cannot possibly approach normality; it will always be binomial. The distribution of the raw residuals will also never approach normality. They will always be pi & (1-pi). The *sampling distribution* of the conditional mean of Y given X=xi (ie, pi) will approach normality, though. – gung – 2014-03-27T13:41:57.517

2I share somewhat of landroni's concern: after all, a normally distributed outcome may have non-normally distributed residuals, and a non-normally distributed outcome may have normally distributed residuals. The issue with the outcome seems to be less about its distribution *per se* than its range. – Alexis – 2014-07-23T20:47:40.990

61Thanks, guys. I'm glad this came together well; this is actually a good example of how you can learn things on CV by *answering* questions, as well as asking & reading others' answers: I knew this information beforehand, but not quite well enough that I could just write it out cold. So I actually spent some time going through my old texts to figure out how to organize the material & put it forward clearly, & in the process solidified these ideas for myself. – gung – 2012-06-22T18:18:10.800

6@gung Thanks for this explanation, it is one of the clearest descriptions of GLMs in general that I have come across. – fmark – 2012-09-27T23:35:23.457

45

In addition to vinux's answer, which already covers the most important points:

the coefficients $\beta$ in the logit regression have natural interpretations in terms of odds ratio;

the probit regression is the natural model when you think that your binary outcome depends on a hidden Gaussian variable $Z = X' \beta + \epsilon\ $ [eq. 1] with $\epsilon \sim \mathcal N(0,1)$ in a deterministic manner: $Y = 1$ exactly when $Z > 0$.

More generally, and more naturally, probit regression is the more natural model if you think that the outcome is $1$ exactly when some $Z_0 = X' \beta_0 + \epsilon_0$ exceeds a threshold $c$, with $\epsilon_0 \sim \mathcal N(0,\sigma^2)$. It is easy to see that this can be reduced to the aforementioned case: just rescale $Z_0$ as $Z = {1\over \sigma}(Z_0-c)$; it's easy to check that equation [eq. 1] still holds (rescale the coefficients and translate the intercept). These models have been defended, for example, in medical contexts, where $Z_0$ would be an unobserved continuous variable, and $Y$ e.g. a disease which appears when $Z_0$ exceeds some "pathological threshold".
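The rescaling argument is easy to check by simulation. In the Python sketch below (all parameter values are made up), the Monte Carlo frequency of $Z_0 > c$ agrees with the rescaled probit form $\Phi\big((X'\beta_0 - c)/\sigma\big)$:

```python
import random
from math import erf, sqrt

random.seed(1)

def phi(z):
    # standard normal CDF
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# latent model: Z0 = x*beta0 + eps0, eps0 ~ N(0, sigma^2); Y = 1 when Z0 > c
beta0, sigma, c, x = 2.0, 3.0, 1.0, 1.5

n = 200_000
hits = sum(x * beta0 + random.gauss(0.0, sigma) > c for _ in range(n))
mc_prob = hits / n

# rescaled probit form: P(Y = 1) = Phi((x*beta0 - c) / sigma)
exact = phi((x * beta0 - c) / sigma)
print(mc_prob, exact)   # the two agree up to Monte Carlo error
```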

Both logit and probit models are only *models*. "All models are wrong, some are useful", as Box once said! Both models will allow you to *detect* the existence of an effect of $X$ on the outcome $Y$; except in some very special cases, none of them will be "really true", and their *interpretation* should be done with cautiousness.

16It is also worth noting that the usage of probit versus logit models is heavily influenced by disciplinary tradition. For instance, economist seem far more used to probit analysis while researchers in psychometrics rely mostly on logit models. – David – 2012-01-03T17:03:20.613

What is the model behind flipping a coin? – skan – 2016-10-25T18:58:21.573

27

An important point that has not been addressed in the previous (excellent) answers is the actual estimation step. Multinomial logit models have a PDF that is easy to integrate, leading to a closed-form expression of the choice probability. The density function of the normal distribution is not so easily integrated, so probit models typically require simulation. So while both models are abstractions of real world situations, logit is usually faster to use on larger problems (multiple alternatives or large datasets).

To see this more clearly, the probability of a particular outcome being selected is a function of the $x$ predictor variables and the $\varepsilon$ error terms (following Train)

$$ P = \int I[\varepsilon > -\beta'x] f(\varepsilon)\,d\varepsilon $$ where $I$ is an indicator function, 1 if selected and zero otherwise. Evaluating this integral depends heavily on the assumption made about $f(\varepsilon)$: in a logit model it is the logistic density, and in a probit model the normal density. For a logit model, this becomes

$$ P=\int_{\varepsilon=-\beta'x}^{\infty} f(\varepsilon)\,d\varepsilon = 1- F(-\beta'x) = \dfrac{\exp(\beta'x)}{1+\exp(\beta'x)} $$

No such convenient form exists for probit models.
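To illustrate the contrast (the value of $\beta'x$ below is arbitrary), the logit probability comes straight from the closed form above, while the probit probability has to be obtained by numerically integrating the normal density:

```python
from math import erf, exp, pi, sqrt

bx = 0.8   # an arbitrary value of the linear index beta'x

# logit: the choice probability has a closed form
p_logit = exp(bx) / (1.0 + exp(bx))

# probit: no closed form; approximate the normal integral by the midpoint rule
def normal_pdf(t):
    return exp(-0.5 * t * t) / sqrt(2.0 * pi)

steps, lo = 100_000, -10.0
h = (bx - lo) / steps
p_probit = h * sum(normal_pdf(lo + (i + 0.5) * h) for i in range(steps))

print(p_logit, p_probit)
```

(In practice one would use an accurate normal-CDF routine such as `pnorm` in R; the explicit quadrature is only meant to show that an integral must be evaluated.)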

But, in the choice situation, probit is more flexible, so it is more used today! Multinomial logit implies the assumption of independence of irrelevant alternatives, which is not always empirically justified. – kjetil b halvorsen – 2015-01-10T21:13:42.360

You're right that the IIA isn't always justified, and you're also right that with modern estimators probit models can be estimated reasonably quickly. But GEV models resolve the IIA problem and might better represent the choice structure in certain situations. I'm also not sure that probit is "more used today;" in my field (transportation modeling), probit models remain a novelty. – gregmacfarlane – 2015-01-14T18:00:25.043

4This is why multinomial logit functions are classically used to estimate spatial discrete choice problems, even though the actual phenomenon is better modelled by a probit. – fmark – 2012-09-27T23:37:37.413

How would you incorporate spatial elements into a DC model? I'm very interested. – gregmacfarlane – 2012-09-28T02:18:09.073

26

Regarding your statement

*I'm more interested here in knowing when to use logistic regression, and when to use probit*

There are already many answers here that bring up things to consider when choosing between the two but there is one important consideration that hasn't been stated yet: **When your interest is in looking at within-cluster associations in binary data using mixed effects logistic or probit models, there is a theoretical grounding for preferring the probit model.** This is, of course, assuming that there is no *a priori* reason for preferring the logistic model (e.g. if you're doing a simulation and know it to be the true model).

**First**, to see why this is true, note that both of these models can be viewed as thresholded continuous regression models. As an example, consider the simple linear mixed effects model for observation $i$ within cluster $j$:

$$ y^{\star}_{ij} = \mu + \eta_{j} + \varepsilon_{ij} $$

where $\eta_j \sim N(0,\sigma^2)$ is the cluster $j$ random effect and $\varepsilon_{ij}$ is the error term. Then both the logistic and probit regression models are equivalently formulated as being generated from this model and thresholding at 0:

$$ y_{ij} = \begin{cases} 1 & \text{if} \ \ \ y^{\star}_{ij}≥0\\ \\ 0 &\text{if} \ \ \ y^{\star}_{ij}<0 \end{cases} $$

If the $\varepsilon_{ij}$ term is normally distributed, you have a probit regression, and if it is logistically distributed you have a logistic regression model. Since the scale is not identified, these residual errors are specified as standard normal and standard logistic, respectively.

**Pearson (1900)** showed that if multivariate normal data were generated and thresholded to be categorical, the correlations between the underlying variables are still statistically identified - these correlations are termed *polychoric correlations* and, specific to the binary case, *tetrachoric correlations*. This means that, in a probit model, the intraclass correlation coefficient of the underlying normally distributed variables:

$$ {\rm ICC} = \frac{ \hat{\sigma}^{2} }{\hat{\sigma}^{2} + 1 } $$

is identified which means that **in the probit case you can fully characterize the joint distribution of the underlying latent variables**.

In the logistic model the random effect variance is still identified, but it does not fully characterize the dependence structure (and therefore the joint distribution), since the latent variable is **a mixture of a normal and a logistic random variable**, which does not have the property of being fully specified by its mean and covariance matrix. This odd parametric assumption for the underlying latent variables makes interpretation of the random effects in the logistic model less clear in general.

5There are other situations in which one would prefer probit as well. Econometric selection models (i.e. Heckman) are only proven using the probit model. I'm less sure of this, but I also believe some SEM models where binary variables are endogenous also utilize the probit model because of the assumption of multivariate normality needed for maximum likelihood estimation. – Andy W – 2012-06-22T15:25:27.950

@AndyW, you're right about binary SEMs - and that is closely related to the point I've made here - the estimation (and subsequent interpretation) there is supported by the fact that the underlying correlations are identified and fully characterize the joint distribution. – Macro – 2012-06-22T15:33:46.260

10

What I am going to say in no way invalidates what has been said thus far. I just want to point out that probit models do not suffer from IIA (Independence of Irrelevant alternatives) assumptions, and the logit model does.

To use an example from Train's excellent book: if I have a logit that predicts whether I am going to ride the blue bus or drive my car, adding a red bus would draw from both car and blue bus proportionally. But using a probit model you can avoid this problem. In essence, instead of drawing from both proportionally, you may draw more from the blue bus, as the two buses are closer substitutes.

The sacrifice you make is that there is no closed form solutions, as pointed out above. Probit tends to be my goto when I am worried about IIA issues. That's not to say that there aren't ways to get around IIA in a logit framework (GEV distributions). But I've always looked at these sorts of models as a clunky way around the problem. With the computational speeds that you can get, I would say go with probit.

Could you explain the "Independence of Irrelevant alternatives", please? – skan – 2016-10-25T18:53:34.003

Note that it is still possible to estimate a multinomial probit model that enforces a variant of the IIA assumption (like in the mprobit command in Stata). In order to do away with IIA in multinomial probit you must model the variance-covariance matrix of the latent variable errors for each alternative in the response variable. – Kenji – 2017-03-22T15:22:49.433

7

One of the best-known differences between logit and probit is the (theoretical) distribution of the regression residuals: normal for probit, logistic for logit (see: Koop G., An Introduction to Econometrics. Chichester, Wiley: 2008: 280).

2but how do we know whether our data should have a theoretical normal or logistic residual distribution?, for example when I flip a coin. – skan – 2016-10-25T19:01:40.553

6

I offer a practical answer to the question, that only focuses on "when to use logistic regression, and when to use probit", without getting into statistical details, but rather focusing on decisions based on statistics. The answer depends on two main things: do you have a disciplinary preference, and do you only care for which model better fits your data?

**Basic difference**

Both logit and probit models provide statistical models that give the probability that a dependent response variable would be 0 or 1. They are very similar and often give practically identical results, but because they use different functions to calculate the probabilities, their results are sometimes slightly different.

**Disciplinary preference**

Some academic disciplines generally prefer one or the other. If you are going to publish or present your results to an academic discipline with a specific traditional preference, then let that dictate your choice so that your findings would be more readily acceptable. For example (from Methods Consultants),

Logit – also known as logistic regression – is more popular in health sciences like epidemiology partly because coefficients can be interpreted in terms of odds ratios. Probit models can be generalized to account for non-constant error variances in more advanced econometric settings (known as heteroskedastic probit models) and hence are used in some contexts by economists and political scientists.

The point is that the differences in results are so minor that the ability of your general audience to understand your results outweighs the minor differences between the two approaches.

**If all you care about is better fit...**

If your research is in a discipline that does not prefer one or the other, then my study of this question (which is better, logit or probit) has led me to conclude that it is generally better to use **probit**, since it almost always will give a statistical fit to data that is equal or superior to that of the logit model. The most notable exception when logit models give a better fit is in the case of "extreme independent variables" (which I explain below).

My conclusion is based almost entirely (after searching numerous other sources) on Hahn, E.D. & Soyer, R., 2005. Probit and logit models: Differences in the multivariate realm. Available at: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.329.4866&rep=rep1&type=pdf. Here is my summary of the practical decision conclusions from this article concerning whether logit versus probit multivariate models provide a better fit to the data (these conclusions also apply to univariate models, but they only simulated effects for two independent variables):

In most scenarios, the logit and probit models fit the data equally well, with the following two exceptions.

**Logit is definitely better in the case of "extreme independent variables"**. These are independent variables where one particularly large or small value will overwhelmingly often determine whether the dependent variable is a 0 or a 1, overriding the effects of most other variables. Hahn and Soyer formally define it thus (p. 4):

An extreme independent variable level involves the confluence of three events. First, an extreme independent variable level occurs at the upper or lower extreme of an independent variable. For example, say the independent variable x were to take on the values 1, 2, and 3.2. The extreme independent variable level would involve the values at x = 3.2 (or x = 1). Second, a substantial proportion (e.g., 60%) of the total n must be at this level. Third, the probability of success at this level should itself be extreme (e.g., greater than 99%).

**Probit is better in the case of "random effects models"** with moderate or large sample sizes (it is equal to logit for small sample sizes). For fixed effects models, probit and logit are equally good. I don't really understand what Hahn and Soyer mean by "random effects models" in their article. Although many definitions are offered (as in this Stack Exchange question), the definition of the term is in fact ambiguous and inconsistent. But since logit is never superior to probit in this regard, the point is rendered moot by simply choosing probit.

Based on Hahn and Soyer's analysis, my conclusion is to **always use probit models except in the case of extreme independent variables, in which case logit should be chosen**. Extreme independent variables are not all that common, and should be quite easy to recognize. With this rule of thumb, it doesn't matter whether the model is a random effects model or not. In cases where a model is a random effects model (where probit is preferred) but there are extreme independent variables (where logit is preferred), although Hahn and Soyer didn't comment on this, my impression from their article is that the effect of extreme independent variables is more dominant, and so logit would be preferred.

4

Below, I explain an estimator that nests probit and logit as special cases and where one can test which is more appropriate.

Both probit and logit can be nested in a latent variable model,

$$ y_i^* = x_i \beta + \varepsilon_i,\quad \varepsilon_i \sim G(\cdot), $$

where the observed component is

$$ y_i = \mathbb{1}(y_i^* > 0). $$

If you choose $G$ to be the normal cdf, you get probit, if you choose the logistic cdf, you get logit. Either way, the likelihood function takes the form

$$ \ell(\beta) = \sum_i \big\{ y_i \log G(x_i\beta) + (1-y_i) \log[1-G(x_i\beta)] \big\}.$$
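As a concrete sketch, the likelihood above can be evaluated under either choice of $G$; the toy data below are made up, and Python is used purely for illustration:

```python
from math import erf, exp, log, sqrt

def logistic_cdf(z):
    return 1.0 / (1.0 + exp(-z))

def normal_cdf(z):
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def log_lik(beta, xs, ys, G):
    # sum_i { y_i log G(x_i beta) + (1 - y_i) log(1 - G(x_i beta)) }
    total = 0.0
    for x, y in zip(xs, ys):
        p = G(x * beta)
        total += y * log(p) + (1 - y) * log(1.0 - p)
    return total

# toy data with a single covariate
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

beta = 1.0
print(log_lik(beta, xs, ys, logistic_cdf), log_lik(beta, xs, ys, normal_cdf))
```

Maximizing this in $\beta$ for each $G$ and comparing the maximized log-likelihoods is exactly the likelihood-based comparison mentioned elsewhere in this thread.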

However, if you are concerned about which assumption you have made, you can use the Klein & Spady (1993, Econometrica) estimator. This estimator allows you to be fully flexible in your specification of the cdf, $G$, and you could then even subsequently test the validity of the normality or logistic assumption.

In Klein & Spady, the criterion function is instead

$$ \ell(\beta) = \sum_i \big\{ y_i \log \hat{G}(x_i\beta) + (1-y_i) \log[1-\hat{G}(x_i\beta)] \big\},$$

where $\hat{G}(\cdot)$ is a nonparametric estimate of the cdf, for example estimated using a Nadaraya-Watson kernel regression estimator,

$$ \hat{G}(z) = \sum_{i=1}^N y_i \frac{ K\left( \frac{z - x_i\beta}{h} \right)}{\sum_{j=1}^N K\left( \frac{z - x_j\beta}{h} \right)}, $$

where $K$ is called the "Kernel" (typically, the Gaussian cdf or a triangular kernel is chosen), and $h$ is a "bandwidth". There are plugin values to pick for the latter but it can be a lot more complicated and it can make the outer optimization over $\beta$ more complicated if $h$ changes in every step ($h$ balances the so-called *bias-variance tradeoff*).
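A minimal sketch of that kernel estimate (Python for illustration; the index values $x_i\beta$, outcomes, and bandwidth below are made up):

```python
from math import exp, pi, sqrt

def gauss_kernel(u):
    # Gaussian kernel K(u)
    return exp(-0.5 * u * u) / sqrt(2.0 * pi)

def nw_cdf_estimate(z, xb, ys, h):
    # hat-G(z) = sum_i y_i K((z - x_i b)/h) / sum_j K((z - x_j b)/h)
    weights = [gauss_kernel((z - v) / h) for v in xb]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

# made-up index values x_i * beta and binary outcomes
xb = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [0, 0, 1, 1, 1]

for z in (-1.5, 0.0, 1.5):
    print(z, round(nw_cdf_estimate(z, xb, ys, h=0.8), 3))
```

The estimate rises from near 0 to near 1 as $z$ increases, playing the role of the cdf $\hat{G}$ without committing to either the normal or the logistic shape.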

**Improvements:** Ichimura has suggested that the kernel regression, $\hat{G}$, should leave out the $i$th observation; otherwise, the choice of $h$ may be complicated by a problem with over-fitting in sample (too high variance).

**Discussion:** One drawback with the Klein-Spady estimator is that it may get stuck in local minima. This is because the $G$ cdf adapts to the given $\beta$-parameters. I know of several students who have tried implementing it and have had problems achieving convergence and avoiding numerical issues. Hence, it is not an easy estimator to work with. Moreover, inference on the estimated parameters is complicated by the semi-parametric specification for $G$.

2

They are very similar.

In both models, the probability that $Y=1$ given $X$ can be seen as the probability that a **random hidden variable $S$** (with a certain fixed distribution) **is below a certain threshold** that depends linearly on $X$ :

$$P(Y=1|X)=P(S<\beta X)$$

Or equivalently :

$$P(Y=1|X)=P(\beta X-S>0)$$

Then it's all a matter of what you choose for the distribution of $S$ :

- in logistic regression, $S$ has a logistic distribution.
- in probit regression, $S$ has a normal distribution.

The variance is unimportant, since it is automatically compensated for by multiplying $\beta$ by a constant. The mean is unimportant as well if you use an intercept.

This can be seen as a threshold effect. Some invisible outcome $E=\beta X-S$ is a linear function of $X$ with some noise $-S$ added like in linear regression, and we get a 0/1 outcome by saying:

- when $E>0$, outcome is $Y=1$
- when $E<0$, outcome is $Y=0$

The differences between logistic and probit regression lie in the difference between the logistic and the normal distributions, and there isn't much of one. Once rescaled, the two distributions look very similar.

The logistic has heavier tails. This may slightly affect how events of small (<1%) or high (>99%) probability are fitted. Practically, the difference is not even noticeable in most situations: logit and probit predict essentially the same thing. See http://scholarworks.rit.edu/cgi/viewcontent.cgi?article=2237&context=article

"Philosophically", logistic regression can be justified by being equivalent to the principle of maximum entropy : http://www.win-vector.com/blog/2011/09/the-equivalence-of-logistic-regression-and-maximum-entropy-models/

In terms of calculation: logistic is simpler, since the cumulative distribution function of the logistic distribution has a closed formula, unlike that of the normal distribution. But the normal distribution has good properties when you move to multiple dimensions, which is why probit is often preferred in advanced cases.

4There exists hardly any difference between the results of the two (see Paap&Franses 2000) – None – 2013-12-12T11:25:45.827

1I once had an extensive (bioassay) dataset where we could see probit fitted marginally better, but it made no difference for conclusions. – kjetil b halvorsen – 2015-04-08T07:16:08.480

the graph of the logit model approaches 0 slowly, while that of the probit model does so quickly...... – None – 2015-06-04T12:55:48.490

1@Alyas Shah: and that is the explanation why with my data probit fitted (marginally) better---because above a certain dose, mortality is 100%, and below some threshold, mortality is 0%, so we don't see the slow approach of the logit! – kjetil b halvorsen – 2015-06-04T13:58:39.430

3For real data,by opposition with data generated from either logit or probit, a considerate approach to the issue would be to run a model comparison. In my experience, the data rarely leans towards one of the two models. – Xi'an – 2015-11-11T09:49:48.903

1I've heard that the practical use of the logistic distribution originates from its similarity to the normal CDF and its much simpler cumulative distribution function. Indeed the normal CDF contains an integral that must be evaluated - which I guess was computationally costly back in the days. – dv_bn – 2016-05-02T14:49:59.643

@kjetilbhalvorsen: Maybe a cardinal logistic with three or more levels would be a best fit? – MSIS – 2016-08-26T18:21:26.007