I think that tests for normality can be useful as companions to graphical examinations. They have to be used in the right way, though. In my opinion, this means that **many popular tests, such as the Shapiro-Wilk, Anderson-Darling and Jarque-Bera tests, should never be used.**

Before I explain my standpoint, let me make a few remarks:

- In an interesting recent paper, Rochon et al. studied the impact of the Shapiro-Wilk test on the two-sample t-test. **The two-step procedure of testing for normality before carrying out, for instance, a t-test is not without problems.** Then again, **neither is the two-step procedure of graphically investigating normality** before carrying out a t-test. The difference is that the impact of the latter is much more difficult to investigate (as it would require a statistician to graphically investigate normality $100,000$ or so times...).
- It is useful to **quantify non-normality**, for instance by computing the sample skewness, even if you don't want to perform a formal test.
- **Multivariate normality can be difficult to assess graphically** and convergence to asymptotic distributions can be slow for multivariate statistics. Tests for normality are therefore more useful in a multivariate setting.
- Tests for normality are perhaps **especially useful for practitioners who use statistics as a set of black-box methods**. When normality is rejected, the practitioner should be alarmed and, rather than carrying out a standard procedure based on the assumption of normality, consider using a nonparametric procedure, applying a transformation or consulting a more experienced statistician.
- As has been pointed out by others, if $n$ is large enough, the CLT usually saves the day. However, what is "large enough" differs for different classes of distributions.
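Quantifying non-normality does not require any formal test. A minimal sketch, assuming plain Python and the simple moment-based estimators $g_1 = m_3/m_2^{3/2}$ and $g_2 = m_4/m_2^2 - 3$, where $m_k$ is the $k$-th central sample moment with divisor $n$ (bias-adjusted variants also exist); the function names are hypothetical:

```python
import math
import random


def _central_moment(xs, k):
    """k-th central sample moment with divisor n (plug-in estimator)."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** k for x in xs) / len(xs)


def sample_skewness(xs):
    """Moment-based sample skewness g1 = m3 / m2^(3/2); estimates gamma."""
    m2 = _central_moment(xs, 2)
    return _central_moment(xs, 3) / (m2 * math.sqrt(m2))


def sample_excess_kurtosis(xs):
    """Moment-based sample excess kurtosis g2 = m4 / m2^2 - 3; estimates kappa."""
    m2 = _central_moment(xs, 2)
    return _central_moment(xs, 4) / m2 ** 2 - 3


if __name__ == "__main__":
    rng = random.Random(1)
    samples = {
        "normal": [rng.gauss(0.0, 1.0) for _ in range(5000)],
        "exponential": [rng.expovariate(1.0) for _ in range(5000)],
    }
    for name, xs in samples.items():
        print(f"{name:12s} g1 = {sample_skewness(xs):+.2f}   g2 = {sample_excess_kurtosis(xs):+.2f}")
```

Reporting $g_1$ and $g_2$ alongside a plot gives a numerical sense of *how* non-normal the data are, which a p-value alone does not.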

(In my definition) a test for normality is directed against a class of alternatives if it is sensitive to alternatives from that class but not to alternatives from other classes. Typical examples are tests that are directed towards skewed or kurtotic alternatives. The simplest examples use the sample skewness and kurtosis as test statistics.
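A minimal sketch of the two simplest directed tests, assuming the classical large-sample approximations $\sqrt{n/6}\,g_1 \approx N(0,1)$ and $\sqrt{n/24}\,g_2 \approx N(0,1)$ under normality. These approximations can be poor for small $n$ (especially for the kurtosis statistic), and refined transformations such as D'Agostino's exist; the function names here are hypothetical:

```python
import math
import random
from statistics import NormalDist


def _central_moments(xs):
    """Return (m2, m3, m4), the central sample moments of xs with divisor n."""
    n = len(xs)
    mean = sum(xs) / n
    devs = [x - mean for x in xs]
    return (sum(d ** 2 for d in devs) / n,
            sum(d ** 3 for d in devs) / n,
            sum(d ** 4 for d in devs) / n)


def skewness_test(xs):
    """Directed test against skewed alternatives: returns (z, two-sided p-value).

    Uses the large-sample approximation sqrt(n/6) * g1 ~ N(0, 1) under normality.
    """
    m2, m3, _ = _central_moments(xs)
    z = m3 / m2 ** 1.5 * math.sqrt(len(xs) / 6)
    return z, 2 * (1 - NormalDist().cdf(abs(z)))


def kurtosis_test(xs):
    """Directed test against light- or heavy-tailed alternatives: returns (z, p).

    Uses the large-sample approximation sqrt(n/24) * g2 ~ N(0, 1) under normality,
    which is known to converge slowly.
    """
    m2, _, m4 = _central_moments(xs)
    z = (m4 / m2 ** 2 - 3) * math.sqrt(len(xs) / 24)
    return z, 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    rng = random.Random(2)
    # Difference of two Exp(1) draws is Laplace: symmetric but heavy-tailed.
    laplace = [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(500)]
    print("skewness test:", skewness_test(laplace))
    print("kurtosis test:", kurtosis_test(laplace))
```

A symmetric heavy-tailed sample like this one should typically leave the skewness test quiet while the kurtosis test reacts, which is exactly the "directed" behaviour described above.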

Directed tests of normality are arguably often preferable to omnibus tests (such as the Shapiro-Wilk and Jarque-Bera tests) since **it is common that only some types of non-normality are of concern for a particular inferential procedure**.

Let's consider Student's t-test as an example. Assume that we have an i.i.d. sample from a distribution with skewness $\gamma=\frac{E(X-\mu)^3}{\sigma^3}$ and (excess) kurtosis $\kappa=\frac{E(X-\mu)^4}{\sigma^4}-3.$ If $X$ is symmetric about its mean, $\gamma=0$. Both $\gamma$ and $\kappa$ are 0 for the normal distribution.
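For concreteness, two standard distributions (added here as an illustration) show how these quantities separate skewness from heavy tails: the exponential distribution is skewed, while the Laplace distribution is symmetric but heavy-tailed.

$$X\sim\text{Exp}(\lambda):\quad \mu=\sigma=\lambda^{-1},\ E(X-\mu)^3=2\lambda^{-3},\ E(X-\mu)^4=9\lambda^{-4}\ \Rightarrow\ \gamma=2,\ \kappa=9-3=6.$$

$$X\sim\text{Laplace}(0,b):\quad \mu=0,\ \sigma^2=2b^2,\ E X^3=0,\ E X^4=24b^4\ \Rightarrow\ \gamma=0,\ \kappa=\frac{24b^4}{(2b^2)^2}-3=3.$$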

Under regularity conditions, we obtain the following asymptotic (Edgeworth) expansion for the cdf of the test statistic $T_n=\sqrt{n}(\bar{X}-\mu)/\hat{\sigma}$, where $\hat{\sigma}^2=n^{-1}\sum_{i=1}^n(X_i-\bar{X})^2$:
$$P(T_n\leq x)=\Phi(x)+n^{-1/2}\frac{1}{6}\gamma(2x^2+1)\phi(x)+n^{-1}x\Big(\frac{1}{12}\kappa (x^2-3)-\frac{1}{18}\gamma^2(x^4+2x^2-3)-\frac{1}{4}(x^2+3)\Big)\phi(x)+o(n^{-1}),$$

where $\Phi(\cdot)$ is the cdf and $\phi(\cdot)$ is the pdf of the standard normal distribution.
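The expansion is easy to evaluate numerically. A minimal sketch, assuming Python's `statistics.NormalDist` for $\Phi$ and $\phi$, and assuming the $n^{-1}$ term enters with a plus sign, $+n^{-1}x\big(\tfrac{1}{12}\kappa(x^2-3)-\tfrac{1}{18}\gamma^2(x^4+2x^2-3)-\tfrac{1}{4}(x^2+3)\big)\phi(x)$. With $\gamma=\kappa=0$ this sign reduces to $-x(x^2+3)\phi(x)/(4n)$, the familiar heavier-than-normal tails of the t statistic for normal data. The remainder $o(n^{-1})$ is simply dropped, and the function name is hypothetical:

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def edgeworth_cdf(x, n, gamma, kappa):
    """Approximate P(T_n <= x) by the expansion truncated after the n^{-1} term.

    gamma is the skewness and kappa the excess kurtosis of the sampled distribution.
    """
    phi = _STD_NORMAL.pdf(x)
    # n^{-1/2} term: only the skewness enters here.
    first = gamma * (2 * x ** 2 + 1) / 6 * phi / math.sqrt(n)
    # n^{-1} term: kurtosis, squared skewness and the pure t-distribution correction.
    second = x * (kappa * (x ** 2 - 3) / 12
                  - gamma ** 2 * (x ** 4 + 2 * x ** 2 - 3) / 18
                  - (x ** 2 + 3) / 4) * phi / n
    return _STD_NORMAL.cdf(x) + first + second


if __name__ == "__main__":
    n, x = 20, -2.0
    cases = [("normal", 0.0, 0.0), ("Laplace", 0.0, 3.0), ("exponential", 2.0, 6.0)]
    for name, gamma, kappa in cases:
        print(f"{name:12s} P(T_n <= {x}) approx {edgeworth_cdf(x, n, gamma, kappa):.4f}")
```

Comparing the `first` and `second` terms for the same $x$ makes it visible that skewness shifts the tail probabilities at a lower order in $n$ than kurtosis does.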

$\gamma$ first appears in the $n^{-1/2}$ term, whereas $\kappa$ appears only in the $n^{-1}$ term. Since the $n^{-1/2}$ term dominates for large $n$, the *asymptotic* performance of $T_n$ is much more sensitive to deviations from normality in the form of skewness than in the form of kurtosis.

It can be verified using simulations that this is true for small $n$ as well. Thus Student's t-test is sensitive to skewness but relatively robust against heavy tails, and **it is reasonable to use a test for normality that is directed towards skew alternatives before applying the t-test**.
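A small Monte Carlo sketch of such a simulation, assuming only Python's standard library. It estimates both tail probabilities of the one-sample t statistic at $n=20$ for a normal, a skewed (exponential) and a symmetric heavy-tailed (Laplace) distribution, using $2.093$ as the approximate 97.5% quantile of Student's t with 19 degrees of freedom, so each tail has nominal probability 0.025. The sampler table and function names are hypothetical:

```python
import math
import random


def t_statistic(xs, mu):
    """One-sample t statistic sqrt(n) * (mean - mu) / s, where s uses divisor n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return math.sqrt(n) * (mean - mu) / s


def tail_rates(sampler, mu, n, crit, reps=20000, seed=0):
    """Monte Carlo estimates of P(T_n <= -crit) and P(T_n >= crit) when the true mean is mu."""
    rng = random.Random(seed)
    lower = upper = 0
    for _ in range(reps):
        t = t_statistic([sampler(rng) for _ in range(n)], mu)
        if t <= -crit:
            lower += 1
        elif t >= crit:
            upper += 1
    return lower / reps, upper / reps


# name -> (sampler taking a random.Random instance, true mean of that distribution)
SAMPLERS = {
    "normal": (lambda r: r.gauss(0.0, 1.0), 0.0),
    "exponential (skewed)": (lambda r: r.expovariate(1.0), 1.0),
    # Difference of two independent Exp(1) draws is Laplace(0, 1): symmetric, excess kurtosis 3.
    "Laplace (heavy tails)": (lambda r: r.expovariate(1.0) - r.expovariate(1.0), 0.0),
}

if __name__ == "__main__":
    n = 20
    crit = 2.093  # approximate 97.5% quantile of Student's t with n - 1 = 19 df
    for name, (sampler, mu) in SAMPLERS.items():
        lo, hi = tail_rates(sampler, mu, n, crit)
        print(f"{name:22s} P(T <= -crit) = {lo:.4f}   P(T >= crit) = {hi:.4f}")
```

If the claim holds, the exponential row should show clearly unequal tails, while the Laplace row should stay close to the normal row. Exact numbers depend on the seed and the number of replications.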

As a *rule of thumb* (*not* a law of nature), inference about means is sensitive to skewness and inference about variances is sensitive to kurtosis.
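One way to see the variance half of this rule of thumb: for an i.i.d. sample, the usual unbiased sample variance satisfies $\operatorname{Var}(s^2)=\sigma^4\big(\tfrac{2}{n-1}+\tfrac{\kappa}{n}\big)$, whereas $\operatorname{Var}(\bar X)=\sigma^2/n$ does not involve $\kappa$ at all. A minimal sketch (the function name is hypothetical) computing how far normal-theory inference for $\sigma^2$ understates the true variability of $s^2$:

```python
def var_s2_ratio(n, kappa):
    """Var(s^2) relative to its value under normality.

    Based on Var(s^2) = sigma^4 * (2/(n-1) + kappa/n), where kappa is the excess
    kurtosis and s^2 the unbiased sample variance; sigma^4 cancels in the ratio.
    """
    normal_theory = 2 / (n - 1)
    return (normal_theory + kappa / n) / normal_theory


if __name__ == "__main__":
    for kappa, name in [(3.0, "Laplace"), (6.0, "exponential")]:
        for n in (20, 200, 2000):
            print(f"{name:12s} n = {n:5d}   Var(s^2) ratio = {var_s2_ratio(n, kappa):.3f}")
```

The ratio tends to $1+\kappa/2$ rather than to 1 as $n$ grows, so, unlike the effect of skewness on inference about means, the effect of kurtosis on inference about variances does not fade with sample size.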

Using a directed test for normality has the benefit of getting higher power against "dangerous" alternatives and lower power against alternatives that are less "dangerous", meaning that we are less likely to reject normality because of deviations from normality that won't affect the performance of our inferential procedure. **The non-normality is quantified in a way that is relevant to the problem at hand.** This is not always easy to do graphically.

As $n$ gets larger, skewness and kurtosis become less important, and directed tests are likely to detect deviations of these quantities from 0 even when the deviations are small. In such cases, it seems reasonable to, for instance, test whether $|\gamma|\leq 1$ or (looking at the first term of the expansion above) $$|n^{-1/2}\frac{1}{6}\gamma(2z_{\alpha/2}^2+1)\phi(z_{\alpha/2})|\leq 0.01$$ rather than whether $\gamma=0$. This takes care of some of the problems that we otherwise face as $n$ gets larger.
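Solving the second criterion for $|\gamma|$ gives the largest skewness whose first-order effect on the tail probability stays below the tolerance. A minimal sketch, assuming a two-sided level-$\alpha$ test so that $z_{\alpha/2}$ is the standard normal quantile at $1-\alpha/2$, and taking the tolerance $0.01$ from the text; the function name is hypothetical:

```python
import math
from statistics import NormalDist


def max_tolerable_skewness(n, alpha=0.05, tol=0.01):
    """Largest |gamma| with |n^{-1/2} * gamma/6 * (2z^2 + 1) * phi(z)| <= tol.

    z is the two-sided standard normal critical value for level alpha.
    """
    std = NormalDist()
    z = std.inv_cdf(1 - alpha / 2)
    leading = (2 * z ** 2 + 1) * std.pdf(z) / 6  # first-order coefficient of gamma at n = 1
    return tol * math.sqrt(n) / leading


if __name__ == "__main__":
    for n in (25, 100, 1000, 10000):
        print(f"n = {n:6d}   |gamma| may be up to {max_tolerable_skewness(n):.2f}")
```

Because the bound grows like $\sqrt{n}$, the null hypothesis being tested becomes more permissive as the sample grows, which is the point of replacing $\gamma=0$ with a tolerance.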

For reference: I don't think that this needed to be community wiki. – Shane – 2010-09-08T17:57:46.223

I wasn't sure there was a 'right answer'... – shabbychef – 2010-09-08T18:01:40.223

See http://meta.stats.stackexchange.com/questions/290/what-is-community-wiki – Shane – 2010-09-08T18:03:57.193

In a certain sense, this is true of all tests of a finite number of parameters. With $k$ fixed (the number of parameters on which the test is carried out) and $n$ growing without bound, any difference between the two groups (no matter how small) will always break the null at some point. Actually, this is an argument in favor of Bayesian tests. – user603 – 2010-09-08T18:07:28.977

For me, it is not a valid argument. Anyway, before giving any answer you need to formalize things a little bit. You may be wrong and you may not be, but right now what you have is nothing more than an intuition: for me the sentence "In the era of cheap memory, big data, and fast processors, normality tests should always reject the null of normal" needs clarification :) I think that if you try giving more formal precision the answer will be simple. – robin girard – 2010-09-08T19:01:08.107


The thread at "Are large datasets inappropriate for hypothesis testing" discusses a generalization of this question. (http://stats.stackexchange.com/questions/2516/are-large-data-sets-inappropriate-for-hypothesis-testing ) – whuber – 2010-09-09T20:17:48.403