Is normality testing 'essentially useless'?

237

222

A former colleague once argued to me as follows:

We usually apply normality tests to the results of processes that, under the null, generate random variables that are only asymptotically or nearly normal (with the 'asymptotically' part dependent on some quantity which we cannot make large). In the era of cheap memory, big data, and fast processors, normality tests should always reject the null of normal distribution for large (though not insanely large) samples. And so, perversely, normality tests should only be used for small samples, when they presumably have lower power and less control over type I rate.

Is this a valid argument? Is this a well-known argument? Are there well known tests for a 'fuzzier' null hypothesis than normality?

shabbychef

Posted 2010-09-08T17:47:21.820

Reputation: 6 785

22For reference: I don't think that this needed to be community wiki. – Shane – 2010-09-08T17:57:46.223

2I wasn't sure there was a 'right answer'... – shabbychef – 2010-09-08T18:01:40.223

5In a certain sense, this is true of all tests of a finite number of parameters. With $k$ fixed (the number of parameters on which the test is carried out) and $n$ growing without bound, any difference between the two groups (no matter how small) will always break the null at some point. Actually, this is an argument in favor of Bayesian tests. – user603 – 2010-09-08T18:07:28.977

1For me, it is not a valid argument. Anyway, before giving any answer you need to formalize things a little bit. You may be wrong and you may not be but now what you have is nothing more than an intuition: for me the sentence "In the era of cheap memory, big data, and fast processors, normality tests should always reject the null of normal " needs clarifications :) I think that if you try giving more formal precision the answer will be simple. – robin girard – 2010-09-08T19:01:08.107

4

The thread at "Are large datasets inappropriate for hypothesis testing" discusses a generalization of this question. (http://stats.stackexchange.com/questions/2516/are-large-data-sets-inappropriate-for-hypothesis-testing )

– whuber – 2010-09-09T20:17:48.403

Answers

187

It's not an argument. It is a (a bit strongly stated) fact that formal normality tests always reject on the huge sample sizes we work with today. It's even easy to prove that when n gets large, even the smallest deviation from perfect normality will lead to a significant result. And as every dataset has some degree of randomness, no single dataset will be a perfectly normally distributed sample. But in applied statistics the question is not whether the data/residuals ... are perfectly normal, but normal enough for the assumptions to hold.

Let me illustrate with the Shapiro-Wilk test. The code below constructs a set of distributions that approach normality but aren't completely normal. Next, we test with shapiro.test whether a sample from these almost-normal distributions deviates from normality. In R:

# Each sample is standard normal noise (rnorm) plus the recycled offsets
# c(1,0,2,0,1), i.e. a 2:2:1 blend of normals centred at 0, 1 and 2.
x <- replicate(100, {  # 100 simulated data sets for each sample size
  c(shapiro.test(rnorm(10)   + c(1,0,2,0,1))$p.value,
    shapiro.test(rnorm(100)  + c(1,0,2,0,1))$p.value,
    shapiro.test(rnorm(1000) + c(1,0,2,0,1))$p.value,
    shapiro.test(rnorm(5000) + c(1,0,2,0,1))$p.value)
})
rownames(x) <- c("n10","n100","n1000","n5000")

rowMeans(x<0.05) # the proportion of significant deviations
  n10  n100 n1000 n5000 
 0.04  0.04  0.20  0.87 

The last line checks which fraction of the simulations for every sample size deviate significantly from normality. So in 87% of the cases, a sample of 5000 observations deviates significantly from normality according to Shapiro-Wilk. Yet if you look at the Q-Q plots, you would never conclude that there is a deviation from normality. Below, as an example, are the Q-Q plots for one set of random samples

[Figure: Q-Q plots for one set of random samples]

with p-values

  n10  n100 n1000 n5000 
0.760 0.681 0.164 0.007 
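
Plots and p-values of this kind can be regenerated with something like the sketch below. It is only an illustration: `near_normal` is a helper name I made up for the construction used above, the seed is arbitrary, and a different seed will give different p-values than the ones listed.

```r
# Hypothetical helper: standard normal noise plus the recycled offsets
# c(1, 0, 2, 0, 1), the same "almost normal" construction as above.
near_normal <- function(n) rnorm(n) + rep_len(c(1, 0, 2, 0, 1), n)

set.seed(1)  # arbitrary seed; other seeds give other p-values
sizes <- c(n10 = 10, n100 = 100, n1000 = 1000, n5000 = 5000)
samples <- lapply(sizes, near_normal)

# One normal Q-Q plot per sample size, with a reference line.
op <- par(mfrow = c(2, 2))
for (nm in names(samples)) {
  qqnorm(samples[[nm]], main = nm)
  qqline(samples[[nm]])
}
par(op)

# Shapiro-Wilk p-values for the same four samples.
p_values <- sapply(samples, function(s) shapiro.test(s)$p.value)
round(p_values, 3)
```

Putting the plots and the p-values side by side for the same draws is what makes the contrast visible.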

Joris Meys

Posted 2010-09-08T17:47:21.820

Reputation: 4 195

Wow thx for your answer! How did you draw the qqplots? – Le Max – 2013-03-17T10:03:40.477

1@maximus with the function qqnorm in R – Joris Meys – 2013-03-19T17:08:13.440

+1: great answer, very intuitive. Perhaps a bit off-topic but how would one go about implementing the second method without qq-plots (due to lack of visualization)? What logical steps are taken here to get the p-values? – posdef – 2011-02-10T13:04:34.003

@posdef : those are just the p-values of the shapiro-wilks test, to indicate that they contradict the qq-plots. – Joris Meys – 2011-02-10T13:31:37.780

1@joris: I think there might have been a misunderstanding; Shapiro-Wilks give p{n5000} = 0.87 while the second calculation yields p{n5000} = 0.007. Or have I misunderstood something? – posdef – 2011-02-10T14:58:06.390

1Indeed. 0.87 is the proportion of datasets that give a deviation from normality, meaning that in 87% of the datasets from an almost normal distribution, Shapiro-Wilks will have a p-value smaller than 0.05. The second part is just an example of some datasets that illustrate this. – Joris Meys – 2011-02-10T15:02:03.170

@joris: I see, thanks for straightening it out for me :) – posdef – 2011-02-11T08:46:10.490

7@joris-meys the central limit theorem does not help unless the population standard deviation is known. Very tiny disturbances in the random variable can distort the sample variance and make the distribution of a test statistic very far from the $t$ distribution, as shown by Rand Wilcox. – Frank Harrell – 2013-08-01T11:42:45.433

40This answer appears not to address the question: it merely demonstrates that the S-W test does not achieve its nominal confidence level, and so it identifies a flaw in that test (or at least in the R implementation of it). But that's all--it has no bearing on the scope of usefulness of normality testing in general. The initial assertion that normality tests always reject on large sample sizes is simply incorrect. – whuber – 2013-10-24T21:16:49.317

12@whuber This answer addresses the question. The whole point of the question is the "near" in "near-normality". S-W tests what is the chance that the sample is drawn from a normal distribution. As the distributions I constructed are deliberately not normal, you'd expect the S-W test to do what it promises: reject the null. The whole point is that this rejection is meaningless in large samples, as the deviation from normality does not result in a loss of power there. So the test is correct, but meaningless, as shown by the QQplots – Joris Meys – 2013-10-25T09:36:20.783

1@FrankHarrell I fail to see your point. Rand Wilcox was talking about sample sizes of 30 and more. The question is about very large samples. 30 isn't even large. 5000, that's large (and not that large actually). Doing the math Rand Wilcox did, the variance of the mean follows the chi-squared distribution pretty well for a sample of 5000, even when originating from a pretty skewed distribution. – Joris Meys – 2013-10-25T09:45:31.457

3The fact that often we can't tell from a sample whether that sample can adequately be analyzed by a normality-assuming method is enough for me. And Wilcox gives examples where the non-normality (contamination of a normal distribution with another normal distribution with higher variance) is so imperceptible that you cannot see it in the density function, yet the tiny bit of non-normality causes a major distortion in tests' operating characteristics. Another issue that most statisticians have not really addressed is that the standard deviation may not be meaningful with asymmetry. – Frank Harrell – 2013-10-25T12:05:42.900

2That fact is true, but has no bearing on the CLT. The CLT is pretty specific about the conditions under which the approximation holds. You are throwing different things on the same heap. Yes, Wilcox gives those examples. No, he isn't talking about large sample sizes or dismissing the CLT, far from it. He rightly points out that people forget about the conditions under which the CLT holds. I agree with you that testing differences with a sample size of 5000 doesn't make sense without stating what the minimal relevant difference is. But that's a whole other issue. – Joris Meys – 2013-10-25T12:32:47.493

8I had relied on what you wrote and misunderstood what you meant by an "almost-Normal" distribution. I now see--but only by reading the code and carefully testing it--that you are simulating from three standard Normal distributions with means at $0,$ $1,$ and $2$ and combining the results in a $2:2:1$ ratio. Wouldn't you hope that a good test of Normality would reject the null in this case? What you have effectively demonstrated is that QQ plots are not very good at detecting such mixtures, that's all! – whuber – 2013-10-25T14:17:25.217

7Not one real-life distribution is perfectly normal. So with large enough samples, all normality tests should reject the null. So yes, SW does what it needs to do. But it is worthless for applied statistics. There's no point in going to e.g. a Wilcoxon test when you have a sample size of 5000 and an almost normal distribution. And that's what OP's remark was all about: does it make sense to test for normality when having large sample sizes? Answer: no. Why? Because you detect (correctly) a deviation that doesn't matter for your analysis, as pointed out by the QQ plots. – Joris Meys – 2013-10-25T16:03:14.130

2Btw, QQ plots are not meant to detect such mixtures. They're graphical tools that give you a fair idea about whether or not you'll lose power and even get biased estimates when using specific tests. That's all there is to them. For 99% of the statistical questions in practical science, that's more than enough. – Joris Meys – 2013-10-25T16:03:32.380

2I don't disagree with you; I am only (mildly) objecting that the important points you have recently made in these comments did not appear in your answer. – whuber – 2013-10-29T18:22:49.890

@whuber You're free to update :) otherwise I'll update it when I find a bit more time. Cheers. – Joris Meys – 2013-11-06T14:36:54.173

this is great! I'm slapping myself for not doing the experiments myself... – shabbychef – 2010-09-08T22:35:17.087

29On a side note, the central limit theorem makes the formal normality check unnecessary in many cases when n is large. – Joris Meys – 2010-09-08T23:19:31.450

26yes, the real question is not whether the data are actually distributed normally but are they sufficiently normal for the underlying assumption of normality to be reasonable for the practical purpose of the analysis, and I would have thought the CLT based argument is normally [sic] sufficient for that. – Dikran Marsupial – 2010-09-09T09:37:22.440

3This is another example of why p-values need to move down as the sample size goes up. 0.05 is not stringent enough in big data world. Just my curiosity - what happens if you set the pvalue to depend on sample size? – probabilityislogic – 2012-02-04T14:24:50.770

@JorisMeys Could You point me to a paper or a proof that "when n gets large, even the smallest deviation from perfect normality will lead to a significant result"? :) – Milos – 2016-09-03T14:06:24.107

1@Milos Even in the original paper the author already referred to the statistic as sensitive, even with small samples (n < 20). It is also sensitive to outliers, according to the same 1965 paper. Also remember that the W statistic has a maximum of 1 (indicating perfect normality) and look at the critical values of W for rejecting the null. At n=10, this is 0.84. At n=50, this is 0.947. So at n=50, a far smaller deviation will be significant. At n=5000, even a W value of 0.999 is highly significant. That's basic statistics. – Joris Meys – 2016-09-04T16:47:01.207

This example could be used as an argument that failing such a "normality test" should be an argument for applying regression or other classification methods (rather than immediately applying a transformation). – DWin – 2017-06-16T18:08:48.050

@JorisMeys Thanks for your illustrative answer. Your post clearly illustrates the problem, but what is the solution? Is there an "almost normal" test? Something conceptually like a TOST equivalence test? I am facing this exact issue where a reviewer that is asking for justification of normality assumption - the QQ plots look good, but the test is significant due to large sample size. – thc – 2017-12-13T20:02:42.047

@thc Just use the QQ plot to justify it. And if the sample size is large enough, the central limit theorem provides you with the normality assumption already in many cases. – Joris Meys – 2017-12-14T10:22:49.013

145

When thinking about whether normality testing is 'essentially useless', one first has to think about what it is supposed to be useful for. Many people (well... at least, many scientists) misunderstand the question the normality test answers.

The question normality tests answer: Is there convincing evidence of any deviation from the Gaussian ideal? With moderately large real data sets, the answer is almost always yes.

The question scientists often expect the normality test to answer: Do the data deviate enough from the Gaussian ideal to "forbid" use of a test that assumes a Gaussian distribution? Scientists often want the normality test to be the referee that decides when to abandon conventional (ANOVA, etc.) tests and instead analyze transformed data or use a rank-based nonparametric test or a resampling or bootstrap approach. For this purpose, normality tests are not very useful.
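
The gap between the two questions can be made concrete with a small simulation. This is only a sketch under my own assumptions: the population is a made-up 2:2:1 mixture of unit-variance normals centred at 0, 1 and 2 (true mean 0.8), `r_near_normal` is a hypothetical helper, and the sample size, seed and number of replications are arbitrary. It records how often Shapiro-Wilk rejects normality and how often a one-sample t-test wrongly rejects the true mean.

```r
set.seed(42)  # arbitrary seed

# Hypothetical near-normal population: 2:2:1 mixture of N(0,1), N(1,1), N(2,1).
r_near_normal <- function(n) {
  rnorm(n) + sample(c(0, 1, 2), n, replace = TRUE, prob = c(0.4, 0.4, 0.2))
}
true_mean <- 0.8   # 0.4 * 0 + 0.4 * 1 + 0.2 * 2
n <- 5000
reps <- 200

res <- replicate(reps, {
  x <- r_near_normal(n)
  c(normality_rejected = shapiro.test(x)$p.value < 0.05,          # "any deviation?"
    true_mean_rejected = t.test(x, mu = true_mean)$p.value < 0.05) # "does it hurt the t-test?"
})
rates <- rowMeans(res)
rates
```

The first rate answers the question the normality test actually asks; the second is closer to the question scientists care about, namely whether the t-test keeps roughly its nominal error rate.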

Harvey Motulsky

Posted 2010-09-08T17:47:21.820

Reputation: 9 229

6There is no substitute for the (common) sense of the analyst (or, well, the researcher/scientist). And experience (learnt by trying and seeing: what conclusions do I get if I assume it is normal? What is the difference if not?). Graphics are your best friends. – FairMiles – 2013-04-05T15:33:15.650

12

+1 for a good and informative answer. I find it useful to see a good explanation for a common misunderstanding (which I have incidentally been experiencing myself: http://stats.stackexchange.com/questions/7022/parameter-estimation-for-normal-distribution-in-java). What I miss though, is an alternative solution to this common misunderstanding. I mean, if normality tests are the wrong way to go, how does one go about checking if a normal approximation is acceptable/justified?

– posdef – 2011-02-10T12:45:49.370

2I like this paper, which makes the point you made: Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105(1), 156-166. – Jeremy Miles – 2014-08-20T20:18:18.530

1

Looking at graphics is great, but what if there are too many to examine manually? Can we formulate reasonable statistical procedures to point out possible trouble spots? I'm thinking of situations like A/B experimenters at large scale: http://www.exp-platform.com/Pages/SevenRulesofThumbforWebSiteExperimenters.aspx.

– dfrankow – 2014-12-29T17:41:33.073

93

I think that tests for normality can be useful as companions to graphical examinations. They have to be used in the right way, though. In my opinion, this means that many popular tests, such as the Shapiro-Wilk, Anderson-Darling and Jarque-Bera tests never should be used.

Before I explain my standpoint, let me make a few remarks:

  • In an interesting recent paper Rochon et al. studied the impact of the Shapiro-Wilk test on the two-sample t-test. The two-step procedure of testing for normality before carrying out for instance a t-test is not without problems. Then again, neither is the two-step procedure of graphically investigating normality before carrying out a t-test. The difference is that the impact of the latter is much more difficult to investigate (as it would require a statistician to graphically investigate normality $100,000$ or so times...).
  • It is useful to quantify non-normality, for instance by computing the sample skewness, even if you don't want to perform a formal test.
  • Multivariate normality can be difficult to assess graphically and convergence to asymptotic distributions can be slow for multivariate statistics. Tests for normality are therefore more useful in a multivariate setting.
  • Tests for normality are perhaps especially useful for practitioners who use statistics as a set of black-box methods. When normality is rejected, the practitioner should be alarmed and, rather than carrying out a standard procedure based on the assumption of normality, consider using a nonparametric procedure, applying a transformation or consulting a more experienced statistician.
  • As has been pointed out by others, if $n$ is large enough, the CLT usually saves the day. However, what is "large enough" differs for different classes of distributions.

(In my definition) a test for normality is directed against a class of alternatives if it is sensitive to alternatives from that class, but not sensitive to alternatives from other classes. Typical examples are tests that are directed towards skew or kurtotic alternatives. The simplest examples use the sample skewness and kurtosis as test statistics.

Directed tests of normality are arguably often preferable to omnibus tests (such as the Shapiro-Wilk and Jarque-Bera tests) since it is common that only some types of non-normality are of concern for a particular inferential procedure.

Let's consider Student's t-test as an example. Assume that we have an i.i.d. sample from a distribution with skewness $\gamma=\frac{E(X-\mu)^3}{\sigma^3}$ and (excess) kurtosis $\kappa=\frac{E(X-\mu)^4}{\sigma^4}-3.$ If $X$ is symmetric about its mean, $\gamma=0$. Both $\gamma$ and $\kappa$ are 0 for the normal distribution.

Under regularity assumptions, we obtain the following asymptotic expansion for the cdf of the test statistic $T_n$: $$P(T_n\leq x)=\Phi(x)+n^{-1/2}\frac{1}{6}\gamma(2x^2+1)\phi(x)-n^{-1}x\Big(\frac{1}{12}\kappa (x^2-3)-\frac{1}{18}\gamma^2(x^4+2x^2-3)-\frac{1}{4}(x^2+3)\Big)\phi(x)+o(n^{-1}),$$

where $\Phi(\cdot)$ is the cdf and $\phi(\cdot)$ is the pdf of the standard normal distribution.

$\gamma$ appears for the first time in the $n^{-1/2}$ term, whereas $\kappa$ appears in the $n^{-1}$ term. The asymptotic performance of $T_n$ is much more sensitive to deviations from normality in the form of skewness than in the form of kurtosis.

It can be verified using simulations that this is true for small $n$ as well. Thus Student's t-test is sensitive to skewness but relatively robust against heavy tails, and it is reasonable to use a test for normality that is directed towards skew alternatives before applying the t-test.
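
A minimal sketch of such a simulation, under assumptions of my own choosing: one-sample two-sided t-tests at the nominal 5% level with $n=10$, comparing a skewed population (exponential, skewness 2, excess kurtosis 6) with a symmetric heavy-tailed one ($t$ with 5 degrees of freedom, skewness 0, excess kurtosis 6). The helper name `t_type1`, the seed and the replication count are arbitrary.

```r
set.seed(123)  # arbitrary seed

# Fraction of two-sided one-sample t-tests (nominal level alpha) that reject
# a true null hypothesis, for samples of size n drawn by `rdist` whose true
# mean is `mu`.
t_type1 <- function(rdist, mu, n = 10, reps = 20000, alpha = 0.05) {
  x <- matrix(rdist(n * reps), nrow = n)                 # one sample per column
  t_stat <- (colMeans(x) - mu) / (apply(x, 2, sd) / sqrt(n))
  mean(abs(t_stat) > qt(1 - alpha / 2, df = n - 1))
}

rates <- c(
  normal       = t_type1(rnorm, mu = 0),
  skewed       = t_type1(rexp, mu = 1),                      # skewness 2
  heavy_tailed = t_type1(function(m) rt(m, df = 5), mu = 0)  # symmetric
)
round(rates, 3)
```

Because the two non-normal populations share the same excess kurtosis and differ only in skewness, any difference in how far their rejection rates drift from 0.05 can be attributed to skewness.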

As a rule of thumb (not a law of nature), inference about means is sensitive to skewness and inference about variances is sensitive to kurtosis.

Using a directed test for normality has the benefit of getting higher power against ''dangerous'' alternatives and lower power against alternatives that are less ''dangerous'', meaning that we are less likely to reject normality because of deviations from normality that won't affect the performance of our inferential procedure. The non-normality is quantified in a way that is relevant to the problem at hand. This is not always easy to do graphically.

As $n$ gets larger, skewness and kurtosis become less important - and directed tests are likely to detect if these quantities deviate from 0 even by a small amount. In such cases, it seems reasonable to, for instance, test whether $|\gamma|\leq 1$ or (looking at the first term of the expansion above) $$|n^{-1/2}\frac{1}{6}\gamma(2z_{\alpha/2}^2+1)\phi(z_{\alpha/2})|\leq 0.01$$ rather than whether $\gamma=0$. This takes care of some of the problems that we otherwise face as $n$ gets larger.
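
As a rough sketch of the second criterion, one can plug the sample skewness into the $n^{-1/2}$ term of the expansion and compare it with the tolerance. This is a plug-in illustration only, not a calibrated test: the function names are made up, the tolerance 0.01 is the one used above, and the sampling variability of the skewness estimate is ignored.

```r
# Sample skewness (plain moment estimator).
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Magnitude of the n^(-1/2) term of the expansion, evaluated at x = z_{alpha/2}.
skew_term <- function(gamma, n, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)
  abs(n^(-1/2) * (1/6) * gamma * (2 * z^2 + 1) * dnorm(z))
}

set.seed(7)          # arbitrary seed
x <- rexp(200)       # hypothetical skewed data
g <- sample_skewness(x)
c(skewness = g, term = skew_term(g, length(x)))
skew_term(g, length(x)) <= 0.01   # TRUE would mean "skewness is tolerable here"
```

Since the term shrinks like $n^{-1/2}$, the same skewness that fails the criterion for a small sample can pass it for a large one, which is the behaviour one wants from a check that is supposed to reflect the impact on the t-test.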

MånsT

Posted 2010-09-08T17:47:21.820

Reputation: 7 792

1Now this is a great answer! – user603 – 2014-04-04T10:45:39.423

8Yea this should be the accepted, really fantastic answer – jenesaisquoi – 2014-04-14T19:24:30.373

1"it is common that only some types of non-normality are of concern for a particular inferential procedure." - of course one should then use a test directed towards that type of non-normality. But the fact that one is using a normality test implies that he cares about all aspects of normality. The question is: is a normality test in that case a good option. – rbm – 2015-07-04T11:12:00.340

Test for the sufficiency of assumptions for particular tests are becoming common, which thankfully removes some of the guesswork. – Carl – 2017-01-07T21:27:54.613

51

IMHO normality tests are absolutely useless for the following reasons:

  1. On small samples, there's a good chance that the true distribution of the population is substantially non-normal, but the normality test isn't powerful enough to pick it up.

  2. On large samples, things like the T-test and ANOVA are pretty robust to non-normality.

  3. The whole idea of a normally distributed population is just a convenient mathematical approximation anyhow. None of the quantities typically dealt with statistically could plausibly have distributions with a support of all real numbers. For example, people can't have a negative height. Something can't have negative mass or more mass than there is in the universe. Therefore, it's safe to say that nothing is exactly normally distributed in the real world.
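
The first point is easy to check by simulation. A sketch, assuming a $t$ distribution with 3 degrees of freedom as a stand-in for a "substantially non-normal" population (my choice, as are the helper name `sw_power`, the seed and the sample sizes):

```r
set.seed(2010)  # arbitrary seed

# Share of Shapiro-Wilk rejections at the 5% level for samples of size n
# drawn by `rdist`.
sw_power <- function(rdist, n, reps = 2000) {
  mean(replicate(reps, shapiro.test(rdist(n))$p.value < 0.05))
}

heavy_tailed <- function(n) rt(n, df = 3)   # hypothetical non-normal population
sizes <- c(10, 30, 100)
power <- sapply(sizes, function(n) sw_power(heavy_tailed, n))
names(power) <- paste0("n", sizes)
round(power, 2)
```

Swapping in another generator for `heavy_tailed` shows how the power at small $n$ depends on the kind of non-normality.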

dsimcha

Posted 2010-09-08T17:47:21.820

Reputation: 3 839

4@dsimcha, the $t$-test and ANOVA are not robust to non-normality. See papers by Rand Wilcox. – Frank Harrell – 2013-08-01T11:45:56.090

2Electrical potential difference is an example of a real-world quantity that can be negative. – nico – 2010-09-19T13:03:21.280

13@nico: Sure it can be negative, but there's some finite limit to it because there are only so many protons and electrons in the Universe. Of course this is irrelevant in practice, but that's my point. Nothing is exactly normally distributed (the model is wrong), but there are lots of things that are close enough (the model is useful). Basically, you already knew the model was wrong, and rejecting or not rejecting the null gives essentially no information about whether it's nonetheless useful. – dsimcha – 2010-09-22T19:39:17.780

@dsimcha - I find that a really insightful, useful response. – rolando2 – 2012-05-04T21:34:22.340

@dsimcha "the model is wrong". Aren't ALL models "wrong" though? – Atirag – 2017-12-19T21:09:47.983

20

I think that pre-testing for normality (which includes informal assessments using graphics) misses the point.

  1. Users of this approach assume that the normality assessment has in effect a power near 1.0.
  2. Nonparametric tests such as the Wilcoxon, Spearman, and Kruskal-Wallis have efficiency of 0.95 if normality holds.
  3. In view of 2. one can pre-specify the use of a nonparametric test if one even entertains the possibility that the data may not arise from a normal distribution.
  4. Ordinal cumulative probability models (the proportional odds model being a member of this class) generalize standard nonparametric tests. Ordinal models are completely transformation-invariant with respect to $Y$, are robust, powerful, and allow estimation of quantiles and mean of $Y$.
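
Point 2 can be illustrated with a quick power comparison when normality actually holds. This is a sketch with arbitrary settings of mine (two groups of 30, a mean shift of 0.7, 2000 replications); for the Wilcoxon test the asymptotic relative efficiency against the t-test under normality is $3/\pi$.

```r
set.seed(99)  # arbitrary seed
n <- 30
shift <- 0.7
reps <- 2000

rejections <- replicate(reps, {
  x <- rnorm(n)
  y <- rnorm(n, mean = shift)   # normal data, true difference in means
  c(t_test   = t.test(x, y)$p.value < 0.05,
    wilcoxon = wilcox.test(x, y)$p.value < 0.05)
})
power <- rowMeans(rejections)
power

3 / pi   # asymptotic relative efficiency of Wilcoxon vs. t under normality
```

If the two estimated powers come out close, little is lost by pre-specifying the rank-based test even when the data happen to be normal.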

Frank Harrell

Posted 2010-09-08T17:47:21.820

Reputation: 49 422

10

Before asking whether a test or any sort of rough check for normality is "useful" you have to answer the question behind the question: "Why are you asking?"

For example, if you only want to put a confidence limit around the mean of a set of data, departures from normality may or may not be important, depending on how much data you have and how big the departures are. However, departures from normality are apt to be crucial if you want to predict what the most extreme value will be in future observations or in the population you have sampled from.
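
To make the point about extremes concrete, here is a small sketch comparing tail quantiles. The heavy-tailed population (a $t$ distribution with 5 degrees of freedom) is purely a hypothetical example; the normal approximation is given the same mean and standard deviation.

```r
df <- 5
sd_t <- sqrt(df / (df - 2))   # standard deviation of a t(5) variable
p <- c(0.975, 0.999, 0.9999)

comparison <- rbind(
  normal_approx = qnorm(p, mean = 0, sd = sd_t),  # normal with matching mean and sd
  actual_t5     = qt(p, df = df)                  # true quantiles
)
colnames(comparison) <- paste0("p=", p)
round(comparison, 2)
```

Near the centre the two rows stay close, but the further out in the tail you go, the more the normal approximation understates the true quantile.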

Emil Friedman

Posted 2010-09-08T17:47:21.820

Reputation: 631

7

I used to think that tests of normality were completely useless.

However, now I do consulting for other researchers. Often, obtaining samples is extremely expensive, and so they will want to do inference with n = 8, say.

In such a case, it is very difficult to find statistical significance with non-parametric tests, but t-tests with n = 8 are sensitive to deviations from normality. So what we get is that we can say "well, conditional on the assumption of normality, we find a statistically significant difference" (don't worry, these are usually pilot studies...).

Then we need some way of evaluating that assumption. I'm half-way in the camp that looking at plots is a better way to go, but truth be told there can be a lot of disagreement about that, which can be very problematic if one of the people who disagrees with you is the reviewer of your manuscript.

In many ways, I still think there are plenty of flaws in tests of normality: for example, we should be thinking about the type II error more than the type I. But there is a need for them.

Cliff AB

Posted 2010-09-08T17:47:21.820

Reputation: 10 748

Note that the argument here is that the tests are only useless in theory. In theory, we can always get as many samples as we want... You'll still need the tests to prove that your data is at least somehow close to normality. – SmallChess – 2015-05-20T02:43:53.860

1Good point. I think what you're implying, and certainly what I believe, is that a measure of deviation from normality is more important than a hypothesis test. – Cliff AB – 2015-05-20T03:50:21.237

7

For what it's worth, I once developed a fast sampler for the truncated normal distribution, and normality testing (KS) was very useful in debugging the function. This sampler passes the test with huge sample sizes but, interestingly, the GSL's ziggurat sampler didn't.
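
To show the kind of check I mean, here is a sketch in R. It is not the sampler from my project: the inverse-CDF sampler, the deliberately broken variant and all names below are made up for illustration, and the truncation bounds are arbitrary.

```r
set.seed(314)  # arbitrary seed
lower <- -1
upper <- 1
n <- 1e5

# CDF of a standard normal truncated to [lower, upper].
ptrunc <- function(q) {
  p <- (pnorm(q) - pnorm(lower)) / (pnorm(upper) - pnorm(lower))
  pmin(pmax(p, 0), 1)
}

# Inverse-CDF sampler for the truncated normal (hypothetical reference version).
rtrunc_ok <- function(n) qnorm(runif(n, pnorm(lower), pnorm(upper)))

# Deliberately buggy variant: it ignores the upper truncation bound.
rtrunc_bug <- function(n) qnorm(runif(n, pnorm(lower), 1))

# One-sample Kolmogorov-Smirnov tests against the target CDF.
p_ok  <- ks.test(rtrunc_ok(n), ptrunc)$p.value
p_bug <- ks.test(rtrunc_bug(n), ptrunc)$p.value
c(correct = p_ok, buggy = p_bug)
```

Here the huge sample is an advantage: a correct sampler really does draw from the null distribution, so the test should keep passing however large $n$ gets, while even a subtle bug eventually shows up.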

Arthur B.

Posted 2010-09-08T17:47:21.820

Reputation: 1 781

6

Let me add one small thing:
Performing a normality test without taking its alpha-error into account heightens your overall probability of performing an alpha-error.

You shall never forget that each additional test does this as long as you don't control for alpha-error accumulation. Hence, another good reason to dismiss normality testing.
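
How much a preliminary normality test changes the overall error rate can be estimated by simulation rather than argued about. The sketch below is one hypothetical two-stage procedure of my own choosing (Shapiro-Wilk on each group, then a t-test if neither rejects and a Wilcoxon test otherwise), run on normal data with no true difference; group size, seed and replication count are arbitrary.

```r
set.seed(2013)  # arbitrary seed
n <- 20
reps <- 5000

two_stage_reject <- replicate(reps, {
  x <- rnorm(n)
  y <- rnorm(n)   # both groups normal, no true difference
  looks_normal <- shapiro.test(x)$p.value >= 0.05 &&
    shapiro.test(y)$p.value >= 0.05
  # The pre-test decides which comparison is carried out.
  p <- if (looks_normal) t.test(x, y)$p.value else wilcox.test(x, y)$p.value
  p < 0.05
})
overall_rate <- mean(two_stage_reject)
overall_rate   # estimated type I error rate of the whole two-stage procedure
```

Comparing `overall_rate` with the nominal 0.05, and repeating the exercise with non-normal populations, shows for a given setting whether the pre-test costs anything.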

Henrik

Posted 2010-09-08T17:47:21.820

Reputation: 8 918

3This does not make sense to me. Even if you decide between, say, an ANOVA or a rank-based method based on a test of normality (a bad idea of course), at the end of the day you would still only perform one test of the comparison of interest. If you reject normality erroneously, you still haven't reached a wrong conclusion regarding this particular comparison. You might be performing two tests but the only case in which you can conclude that factor such-and-such has an effect is when the second test also rejects $H_0$, not when only the first one does. Hence, no alpha-error accumulation… – Gala – 2013-06-08T11:24:27.730

In a way, this brings us back to common criticisms of null-hypothesis significance testing (Why not adjust for all the tests you will perform in your career? And if yes, how can the conclusions afforded by a body of data be different depending on the intent/future career of the researcher?) but really those two tests are about as unrelated as they come. For example, the case for correcting a test because you published something on the same topic years ago seems a lot stronger. – Gala – 2013-06-08T11:26:40.990

Of course, if you use some inappropriate test, the error rate can be far from its nominal level but this would also be the case if you performed the test directly. The only way a normality test could increase type I errors is if the test you use when normality is rejected is in fact less robust to the particular issue with your data than the regular test. In any case, this seems all unrelated to the notion of alpha-error accumulation. – Gala – 2013-06-08T11:33:08.933

2Another way a normality test could increase type I errors is if we're talking about "overall probability of performing an alpha-error." The test itself has an error rate, so overall, our probability of committing an error increases. Emphasis on one small thing too I suppose... – Nick Stauner – 2013-11-08T15:49:21.767

2@NickStauner That is exactly what I wanted to convey. Thanks for making this point even clearer. – Henrik – 2013-11-09T12:25:24.160

I presume you are referring to a situation where one first does a normality test, and then uses the result of that test to decide which test to perform next. – Harvey Motulsky – 2010-09-09T16:07:23.570

2I refer to the general utility of normality tests when used as a method to determine whether or not it is appropriate to use a certain method. If you apply them in these cases, it is, in terms of the probability of committing an alpha error, better to perform a more robust test to avoid the alpha error accumulation. – Henrik – 2010-09-10T10:42:59.330

Hello Henrik, you bring an interesting case of multiple comparisons which I never thought of in this case - thanks. (+1) – Tal Galili – 2010-09-10T16:59:19.503

5

The argument you gave is an opinion. I think that the importance of normality testing is to make sure that the data do not depart severely from the normal. I use it sometimes to decide between using a parametric versus a nonparametric test for my inference procedure. I think the test can be useful in moderate and large samples (when the central limit theorem does not come into play). I tend to use Shapiro-Wilk or Anderson-Darling tests, but running SAS I get them all and they generally agree pretty well. On a different note, I think that graphical procedures such as Q-Q plots work equally well. The advantage of a formal test is that it is objective. In small samples it is true that these goodness-of-fit tests have practically no power, and that makes intuitive sense because a small sample from a normal distribution might by chance look rather non-normal, and that is accounted for in the test. Also, the high skewness and kurtosis that distinguish many non-normal distributions from the normal distribution are not easily seen in small samples.

Michael Chernick

Posted 2010-09-08T17:47:21.820

Reputation: 32 399

I think we are talking opinions here. Then, in my view it is bad practice to teach that a normality test is an objective standard that checks/rejects normality. The result of a test is just the output of an algorithm that does not inform about the validity of assuming normality and moving forward. The Q-Q plot, instead, is explicit: YOU must decide what is or is not an important deviation, and it makes you wonder if maybe there is some alternative out there that makes it look better (even just a linear transformation) – FairMiles – 2013-07-28T17:15:25.627

2While it certainly can be used that way, I don't think you will be more objective than with a QQ-Plot. The subjective part with the tests is when to decide that your data is too non-normal. With a large sample rejecting at p=0.05 might very well be excessive. – Erik – 2012-05-04T17:56:01.333

3Pre-testing (as suggested here) can invalidate the Type I error rate of the overall process; one should take into account the fact that a pre-test was done when interpreting the results of whichever test it selected. More generally, hypothesis tests should be kept for testing null hypotheses one actually cares about, e.g. that there is no association between variables. The null hypothesis that the data is exactly Normal doesn't fall into this category. – guest – 2012-05-04T18:02:31.787

1(+1) There is excellent advice here. Erik, the use of "objective" took me aback too, until I realized Michael's right: two people correctly conducting the same test on the same data will always get the same p-value, but they might interpret the same Q-Q plot differently. Guest: thank you for the cautionary note about Type I error. But why should we not care about the data distribution? Frequently that is interesting and valuable information. I at least want to know whether the data are consistent with the assumptions my tests are making about them! – whuber – 2012-05-04T18:25:15.150

1I strongly disagree. Both people get the same Q-Q plot and the same p-value. To interpret the p-value you need to take into account the sample size and the violations of normality your test is particularly sensitive to. So deciding what to do with your p-value is just as subjective. The reason you might prefer the p-value is that you believe the data could follow a perfect normal distribution; otherwise it is just a question of how quickly the p-value falls with sample size. What is more, given a decent sample size the Q-Q plot looks pretty much the same and remains stable with more samples. – Erik – 2012-05-04T20:30:28.600

1Erik, I agree that test results and graphics require interpretation. But the test result is a number and there won't be any dispute about it. The QQ plot, however, admits of multiple descriptions. Although each may objectively be correct, the choice of what to pay attention to is...a choice. That's what "subjective" means: the result depends on the analyst, not just the procedure itself. This is why, for instance, in settings as varied as control charts and government regulations where "objectivity" is important, criteria are based on numerical tests and never graphical results. – whuber – 2012-05-04T21:54:30.117

1I am very surprised that anyone would argue that formal hypothesis testing is no more objective than studying a Q-Q plot. I think Bill Huber explained well what I would have said in rebuttal. I don't know if I can change Erik's mind on this, but I would add that you choose a test statistic and a critical value based on a significance level that you decide on (the choice of significance level could be by tradition, like picking 0.05, or it may be decided by your subjective reasoning about the risk you are willing to take of committing a type I error). – Michael Chernick – 2012-05-05T17:12:37.473

1All of this can be done prior to collecting any data. At that point the decision is deterministic. You collect the data, compute the test statistic and then reject if it exceeds the critical value and you don't reject if it doesn't. You do not change anything based on the data. With the QQ plot there is no predetermined rule. Basically you create the plot based on the data and decide for yourself based on what you see whether or not you think the data follows closely to a straight line. Two people can certainly differ based on personal judgement coming from looking at the result. – Michael Chernick – 2012-05-05T17:13:01.170

4

I think the first two questions have been thoroughly answered, but I don't think question 3 was addressed. Many tests compare the empirical distribution to a known hypothesized distribution. The critical value for the Kolmogorov-Smirnov test is based on F being completely specified. The test can be modified to test against a parametric distribution with estimated parameters. So if "fuzzier" means estimating more than two parameters, then the answer to the question is yes: these tests can be applied to families with three or more parameters. Some tests are designed to have better power when testing against a specific family of distributions. For example, when the null hypothesized distribution is normal, the Anderson-Darling and Shapiro-Wilk tests have greater power than the K-S or chi-square tests. Lilliefors devised a test that is preferred for exponential distributions.
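As a rough illustration of why the K-S critical values change once parameters are estimated, the sketch below simulates the null distribution of the K-S statistic when the mean and standard deviation are fitted from the same sample (the idea behind Lilliefors-type corrections). It assumes NumPy and SciPy; the helper names, the exponential example sample, and the simulation sizes are my own illustrative choices, not part of any library.

```python
import numpy as np
from scipy import stats


def ks_stat_fitted_normal(x):
    """K-S distance between a sample and a normal with mean/sd estimated from it."""
    z = (x - x.mean()) / x.std(ddof=1)
    statistic, _ = stats.kstest(z, "norm")
    return statistic


def null_ks_stats(n, reps=2000, seed=0):
    """Simulated null distribution of the fitted-parameter K-S statistic."""
    rng = np.random.default_rng(seed)
    return np.array([ks_stat_fitted_normal(rng.normal(size=n)) for _ in range(reps)])


def monte_carlo_pvalue(x, null_stats):
    """Share of simulated null statistics at least as large as the observed one."""
    d_obs = ks_stat_fitted_normal(x)
    return (1 + np.sum(null_stats >= d_obs)) / (len(null_stats) + 1)


x = np.random.default_rng(1).exponential(size=50)   # illustrative skewed sample
null_stats = null_ks_stats(len(x))

z = (x - x.mean()) / x.std(ddof=1)
_, p_naive = stats.kstest(z, "norm")                 # pretends the parameters were known
p_simulated = monte_carlo_pvalue(x, null_stats)
crit_simulated = np.quantile(null_stats, 0.95)       # simulated 5% critical value

print(f"naive K-S p-value:           {p_naive:.4f}")
print(f"simulated p-value:           {p_simulated:.4f}")
print(f"simulated 5% critical value: {crit_simulated:.3f}")
```

The simulated critical value should come out noticeably smaller than the tabulated K-S value for a fully specified F, which is why plugging estimated parameters into the standard test tends to be conservative.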

Michael Chernick

Posted 2010-09-08T17:47:21.820

Reputation: 32 399

3

I wouldn't say it is useless, but it really depends on the application. Note that you never really know the distribution the data come from; all you have is a small set of realizations. Your sample mean is always finite, but the population mean could be undefined or infinite for some types of probability density functions. Consider three Lévy stable distributions: the normal, the Lévy, and the Cauchy. Most samples do not have many observations in the tails (i.e., far from the sample mean). So empirically it is very hard to distinguish between the three, and the Cauchy (which has an undefined mean) and the Lévy (which has an infinite mean) could easily masquerade as a normal distribution.
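A small sketch of the point about sample means, assuming NumPy and with the sample size and number of repetitions chosen arbitrarily: every sample mean is a finite number, but for Cauchy data the means of repeated samples do not concentrate the way they do for normal data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 1000, 200   # illustrative sample size and number of repeated samples

# One sample mean per row: 200 independent samples of size 1000 from each distribution.
normal_means = rng.normal(size=(reps, n)).mean(axis=1)
cauchy_means = rng.standard_cauchy(size=(reps, n)).mean(axis=1)


def iqr(values):
    """Interquartile range, a spread measure that tolerates extreme values."""
    q25, q75 = np.percentile(values, [25, 75])
    return q75 - q25


iqr_normal = iqr(normal_means)
iqr_cauchy = iqr(cauchy_means)

print(f"IQR of sample means, normal: {iqr_normal:.4f}")
print(f"IQR of sample means, Cauchy: {iqr_cauchy:.4f}")
```

The spread of the Cauchy sample means stays on the order of the spread of a single observation, no matter how large n is made.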

kolonel

Posted 2010-09-08T17:47:21.820

Reputation: 252

1"...empirically it is very hard..." seems to argue against, rather than for, distributional testing. This is strange to read in a paragraph whose introduction suggests there are indeed uses for distributional testing. What, then, are you really trying to say here? – whuber – 2014-10-24T20:54:24.150

3I am against it, but I also want to be more careful than just saying it is useless, as I don't know the entire set of possible scenarios out there. There are many tests that depend on the normality assumption. Saying that normality testing is useless essentially debunks all such statistical tests, because you are saying that you cannot be sure you are doing the right thing. In that case you should not use them, which means giving up this large section of statistics. – kolonel – 2014-10-24T22:16:46.880

Thank you. The remarks in that comment seem to be better focused on the question than your original answer is! You might consider updating your answer at some point to make your opinions and advice more apparent. – whuber – 2014-10-24T22:18:59.380

@whuber No problem. Can you recommend an edit? – kolonel – 2014-10-24T22:21:08.770

You might start with combining the two posts--the answer and your comment--and then think about weeding out (or relegating to an appendix or clarifying) any material that may be tangential. For instance, the reference to undefined means as yet has no clear bearing on the question and so it remains somewhat mysterious. – whuber – 2014-10-24T22:23:39.163

@whuber Okay I will make an attempt to improve. thanks. – kolonel – 2014-10-24T22:24:27.633

2

I think a maximum entropy approach could be useful here. We can assign a normal distribution because we believe the data are "normally distributed" (whatever that means) or because we only expect to see deviations of about the same magnitude. Also, because the normal distribution has just two sufficient statistics, it is insensitive to changes in the data that do not alter these quantities. So in a sense you can think of a normal distribution as an "average" over all possible distributions with the same first and second moments. This provides one reason why least squares should work as well as it does.
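For readers who want the maximum entropy statement spelled out, here is a sketch of the standard argument. The symbols $p$, $\mu$, $\sigma^2$ and the multipliers $\lambda_0, \lambda_1, \lambda_2$ are notation introduced here, not taken from the answer. Among densities with a given mean and variance, we look for the one with the largest differential entropy:

$$
\max_{p}\; h(p) = -\int p(x)\,\ln p(x)\,dx
\quad\text{subject to}\quad
\int p(x)\,dx = 1,\quad \int x\,p(x)\,dx = \mu,\quad \int (x-\mu)^2\,p(x)\,dx = \sigma^2 .
$$

Setting the functional derivative of the Lagrangian to zero gives $-\ln p(x) - 1 + \lambda_0 + \lambda_1 x + \lambda_2 (x-\mu)^2 = 0$, so

$$
p(x) = \exp\!\left(\lambda_0 - 1 + \lambda_1 x + \lambda_2 (x-\mu)^2\right).
$$

The mean constraint forces $\lambda_1 = 0$, the variance constraint forces $\lambda_2 = -\tfrac{1}{2\sigma^2}$, and normalization fixes $\lambda_0$, which yields

$$
p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),
\qquad
h(p) = \tfrac{1}{2}\,\ln\!\left(2\pi e\,\sigma^2\right).
$$

In that sense the normal is the least committal choice once only the first two moments are fixed.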

probabilityislogic

Posted 2010-09-08T17:47:21.820

Reputation: 17 954

2

Tests where "something" important to the analysis is supported by high p-values are, I think, wrong-headed. As others pointed out, for large data sets a p-value below 0.05 is all but assured, because real data are never exactly normal. So the test essentially "rewards" small and fuzzy data sets, and "rewards" a lack of evidence. Something like Q-Q plots is much more useful. The desire to always have hard numbers decide things like this (yes/no, normal/not normal) misses that modeling is partially an art, and misses how hypotheses are actually supported.
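A quick sketch of the sample-size effect, under assumptions of my own choosing: data drawn from a Student t distribution with 20 degrees of freedom (very close to normal), SciPy's D'Agostino-Pearson `normaltest` as the test, and sample sizes of 100 and one million.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
df = 20   # t with 20 degrees of freedom: only slightly heavier tails than normal

# How far apart are the two distributions? Largest gap between their CDFs on a grid.
grid = np.linspace(-6, 6, 2001)
max_cdf_gap = np.max(np.abs(stats.t.cdf(grid, df) - stats.norm.cdf(grid)))

small = rng.standard_t(df, size=100)
large = rng.standard_t(df, size=1_000_000)

_, p_small = stats.normaltest(small)   # D'Agostino-Pearson omnibus test
_, p_large = stats.normaltest(large)

print(f"max CDF gap between t(20) and normal: {max_cdf_gap:.4f}")
print(f"p-value, n = 100:       {p_small:.4g}")
print(f"p-value, n = 1,000,000: {p_large:.4g}")
```

The same tiny departure from normality goes from usually undetected at n=100 to overwhelmingly rejected at a million points, which is the behaviour being objected to.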

wvguy8258

Posted 2010-09-08T17:47:21.820

Reputation: 92

2It remains that a large sample that is nearly normal will have a low p-value while a smaller sample that is not nearly as normal will often not. I do not think that large p-values are useful. Again, they reward for a lack of evidence. I can have a sample with several million data points, and it will nearly always reject the normality assumption under these tests while a smaller sample will not. Therefore, I find them not useful. If my thinking is flawed please show it using some deductive reasoning on this point. – wvguy8258 – 2014-07-09T07:43:41.183

This doesn't answer the question at all. – SmallChess – 2015-02-02T00:52:11.057

-3

One good use of a normality test that I don't think has been mentioned is to determine whether using z-scores is okay. Let's say you selected a random sample from a population, and you wish to find the probability of selecting one random individual from the population and getting a value of 80 or higher. This can be done with z-scores only if the distribution is normal, because z-scores assume that the population distribution is normal.

But then I guess I can see this being arguable too...
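For concreteness, a minimal sketch of the z-score calculation being described, assuming SciPy. The mean of 70 and standard deviation of 8 are hypothetical numbers I made up for illustration, and the gamma distribution is just one arbitrary skewed alternative with the same mean and standard deviation.

```python
from scipy import stats

# Hypothetical population summary; only the threshold of 80 comes from the text.
mean, sd, threshold = 70.0, 8.0, 80.0

# z-score route: valid if individual values are normally distributed.
z = (threshold - mean) / sd
p_normal = stats.norm.sf(z)          # P(X >= 80) under normality

# Same mean and sd, but a right-skewed gamma population (moment-matched).
shape = (mean / sd) ** 2
scale = sd ** 2 / mean
p_gamma = stats.gamma.sf(threshold, a=shape, scale=scale)

print(f"z = {z:.2f}")
print(f"P(X >= 80) assuming normal: {p_normal:.4f}")
print(f"P(X >= 80) assuming gamma:  {p_gamma:.4f}")
```

The two tail probabilities differ because the z-score answer depends on the shape of the distribution, not just on its mean and standard deviation.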

Hotaka

Posted 2010-09-08T17:47:21.820

Reputation: 986

Value of what? Mean, sum, variance, an individual observation? Only the last one relies on the assumed normality of the distribution. – whuber – 2013-09-29T16:12:06.077

I meant an individual observation. – Hotaka – 2013-09-29T16:29:58.107

1Thanks. Your answer remains so vague, though, that it is difficult to tell what procedures you are referring to and impossible to assess whether your conclusions are valid. – whuber – 2013-09-29T16:33:48.660

2The problem with this use is the same as with other uses: The test will be dependent on sample size, so, it's essentially useless. It doesn't tell you whether you can use z scores. – Peter Flom – 2014-05-31T00:24:19.213