What are common statistical sins?

213

217

I'm a grad student in psychology, and as I pursue more and more independent studies in statistics, I am increasingly amazed by the inadequacy of my formal training. Both personal and second-hand experience suggest that the paucity of statistical rigor in undergraduate and graduate training is rather ubiquitous within psychology. As such, I thought it would be useful for independent learners like myself to create a list of "Statistical Sins", tabulating statistical practices taught to grad students as standard practice that are in fact either superseded by superior (more powerful, or flexible, or robust, etc.) modern methods or shown to be frankly invalid. Anticipating that other fields might also experience a similar state of affairs, I propose a community wiki where we can collect a list of statistical sins across disciplines. Please, submit one "sin" per answer.

Mike Lawrence

Posted 2010-11-15T18:46:37.113

Reputation: 7 324

1@whuber There were some good answers, so I've merged them both. – mbq – 2011-02-06T11:02:41.973

I just gave a talk on this subject... A link to the video follows if you are interested. http://www.youtube.com/watch?v=1SNQQvY1ESo&feature=g-upl – None – 2012-10-21T01:13:53.057

1Hi @Amanda, could you give some indication here of what's in the talk? No-one likes the possibility of being rick-rolled. – naught101 – 2012-10-21T02:17:50.833

Applying statistics where it doesn't belong is the main sin. – Aksakal – 2016-08-11T19:19:29.643

5I'm aware that "sin" is possibly inflammatory and that some aspects of statistical analysis are not black-and-white. My intention is to solicit cases where a given commonly-taught practice is pretty clearly inappropriate.

5You can also add biology/life sciences students to the mix if you like ;) – nico – 2010-11-15T19:03:17.187

1maybe retitle it life science statistical sins?... or something else more specific... – John – 2010-11-15T19:27:28.377

Answers

111

Failing to look at (plot) the data.
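
As a minimal illustration of why this matters (a toy sketch, assuming numpy and matplotlib are available): two datasets can print similar-looking summary statistics while a single scatterplot immediately reveals that one of them is driven entirely by a lone extreme point.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Dataset A: a genuine, noisy linear relationship.
x_a = rng.uniform(0, 10, 50)
y_a = 1 + 0.8 * x_a + rng.normal(0, 2, 50)

# Dataset B: no relationship at all among 49 points, plus one extreme
# point that single-handedly manufactures a large correlation.
x_b = np.append(rng.uniform(0, 10, 49), 30.0)
y_b = np.append(rng.normal(5, 2, 49), 30.0)

for name, x, y in [("A", x_a, y_a), ("B", x_b, y_b)]:
    print(f"dataset {name}: r = {np.corrcoef(x, y)[0, 1]:.2f}")

# The printed correlations look comparably "strong", but the scatterplots
# tell completely different stories.
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].scatter(x_a, y_a)
axes[0].set_title("A: real trend")
axes[1].scatter(x_b, y_b)
axes[1].set_title("B: one point doing all the work")
plt.tight_layout()
plt.show()
```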

vqv

Posted 2010-11-15T18:46:37.113

Reputation: 2 139

1Very, very important! – deps_stats – 2011-02-04T17:18:08.080

1Probably the most common one. – Carlos Cinelli – 2014-03-16T00:28:04.160

+1 Well done! I'm shocked this hasn't been mentioned yet. – whuber – 2010-12-16T23:28:00.023

108

Most interpretations of p-values are sinful! The conventional usage of p-values is badly flawed, a fact that, in my opinion, calls into question the standard approaches to the teaching of hypothesis tests and tests of significance.

Haller and Krauss have found that statistics instructors are almost as likely as students to misinterpret p-values. (Take the test in their paper and see how you do.) Steve Goodman makes a good case for discarding the conventional (mis-)use of the p-value in favor of likelihoods. The Hubbard paper is also worth a look.

Haller and Krauss. Misinterpretations of significance: A problem students share with their teachers. Methods of Psychological Research (2002) vol. 7 (1) pp. 1-20 (PDF)

Hubbard and Bayarri. Confusion over Measures of Evidence (p's) versus Errors (α's) in Classical Statistical Testing. The American Statistician (2003) vol. 57 (3)

Goodman. Toward evidence-based medical statistics. 1: The P value fallacy. Ann Intern Med (1999) vol. 130 (12) pp. 995-1004 (PDF)

Also see:

Wagenmakers, E-J. A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5), 779-804.

for some clear-cut cases where even the nominally "correct" interpretation of a p-value has been made incorrect due to the choices made by the experimenter.

Update (2016): In 2016, the American Statistical Association issued a statement on p-values, see here. This was, in a way, a response to the "ban on p-values" issued by a psychology journal about a year earlier.
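
As a small illustrative aside (a toy simulation, not taken from any of the papers above; numpy and scipy assumed): mix experiments where the null is true with experiments where it is false and look at how often the null was actually true among results landing just under p = 0.05. That fraction is far from 5% and depends on the unknowable mix of hypotheses, which is one concrete way to see that a p-value is not P(H0 | data).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments = 20_000
n_per_group = 20
prop_true_nulls = 0.5      # assumed share of experiments where H0 really holds
effect_size = 0.5          # assumed true effect (in SD units) when H0 is false

null_true = rng.random(n_experiments) < prop_true_nulls
pvals = np.empty(n_experiments)
for i in range(n_experiments):
    mu = 0.0 if null_true[i] else effect_size
    x = rng.normal(0, 1, n_per_group)
    y = rng.normal(mu, 1, n_per_group)
    pvals[i] = stats.ttest_ind(x, y).pvalue

# Among experiments whose p-value lands just under 0.05, how many had a true null?
sel = (pvals > 0.03) & (pvals < 0.05)
print(f"experiments with 0.03 < p < 0.05: {sel.sum()}")
print(f"fraction of those where H0 was actually true: {null_true[sel].mean():.2f}")
# In this setup the fraction is typically much larger than 5%, and it depends
# on the (unknowable) mix of true and false nulls; that is exactly why a
# p-value cannot be read as P(H0 | data).
```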

Michael Lew

Posted 2010-11-15T18:46:37.113

Reputation: 6 938

See also: Gigerenzer, G., Krauss, S., & Vitouch, O. (2004). The null ritual. The Sage handbook of quantitative methodology for the social sciences, 391-408. http://library.mpib-berlin.mpg.de/ft/gg/GG_Null_2004.pdf – None – 2014-12-09T13:05:26.973

1Sadly, my medical boards had a question which encouraged misinterpretation of p-values. And they refused to correct it even after I contested in a letter. Insanity. – Ari B. Friedman – 2011-08-10T20:29:35.630

2@Michael (+1) I added links to abstracts and ungated PDFs. Hope you don't mind. – chl – 2010-11-16T10:24:17.297

@Chi Thanks! I should have done that myself. Next time... – Michael Lew – 2010-11-16T19:44:01.083

7+1, but I would like to make some critical comments. Regarding the opening line, one could just as well say that "almost all" (in the measure theoretic sense) interpretations of any well-defined concept are incorrect, because only one is correct. Second, to what do you refer when you say "the conventional usage" and "standard approaches"? These vague references sound like a straw man. They do not accord with what one can find in the literature on statistics education, for example. – whuber – 2010-11-17T13:43:58.040

4@Whuber Have a look at the Goodman paper. It accords pretty well with my experience in the field of pharmacology. Methods say "Results where P<0.05 were taken as statistical significant" and then results are presented with + for p<0.05, ++ for p<0.01 and +++ for p<0.0001. The statement implies the control of error rates a la Neyman and Pearson, but the use of different levels of p suggest Fisher's approach where the p value is an index of the strength of evidence against the null hypothesis. As Goodman points out, you cannot simultaneously control error rates and assess strength of evidence. – Michael Lew – 2010-11-18T02:41:07.073

8@Michael There are alternative, more generous interpretations of that kind of reporting. For example, the author might be aware that readers might want to apply their own thresholds of significance and therefore do the flagging of p-values to help them out. Alternatively, the author might be aware of possible multiple-comparisons problems and use the differing levels in a Bonferroni-like adjustment. Perhaps some portion of the blame for misuse of p-values should be laid at the feet of the reader, not the author. – whuber – 2010-11-18T14:04:04.577

4@Whuber I agree entirely, but only that what you suggest is true in some small fraction of cases (a restricted version of 'entirely'). There are some journals that specify that p values should be reported at one, two or three star levels rather than exact values, so those journals share some responsibility for the outcome. However, both that ill-considered requirement and the apparently naive use of p values might be a result of the lack of a clear explanation of the differences between error rates and evidence in the several introductory statistics texts that are on my shelves. – Michael Lew – 2010-11-19T03:40:25.297

@Michael All good points; thank you. And thank you for sharing the references: they make good reading (even though the papers addressed to medical practitioners belabor their arguments). – whuber – 2010-11-19T13:49:32.273

@Michael: (+1). The Goodman article was especially insightful in supporting your argument. – Christopher Aden – 2010-11-19T16:36:04.347

67

The most dangerous trap I encountered when working on a predictive model is failing to reserve a test dataset early on and dedicate it to the "final" performance evaluation.

It's really easy to overestimate the predictive accuracy of your model if you have a chance to somehow use the testing data when tweaking the parameters, selecting the prior, selecting the learning algorithm stopping criterion...

To avoid this issue, before starting your work on a new dataset, you should split your data into:

  • development set
  • evaluation set

Then split your development set into a "training development set" and a "testing development set", where you use the training development set to train various models with different parameters and select the best according to their performance on the testing development set. You can also do grid search with cross-validation, but only on the development set. Never use the evaluation set until model selection is 100% done.

Once you are confident with the model selection and parameters, perform 10-fold cross-validation on the evaluation set to have an idea of the "real" predictive accuracy of the selected model.

Also if your data is temporal, it is best to choose the development/evaluation split along the time axis: "It's hard to make predictions - especially about the future."
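
A rough sketch of this workflow with scikit-learn (assumed available; the dataset, estimator, and parameter grid are arbitrary placeholders, not a recommendation):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# 1) Split once, up front, and lock the evaluation set away.
X_dev, X_eval, y_dev, y_eval = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# 2) All tuning (grid search with cross-validation) happens on the
#    development set only.
pipe = make_pipeline(StandardScaler(), SVC())
grid = GridSearchCV(
    pipe,
    param_grid={"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]},
    cv=5)
grid.fit(X_dev, y_dev)
print("CV score on development data:", grid.best_score_)

# 3) Only after model selection is completely finished, touch the
#    evaluation set once for the "final" performance estimate.
print("held-out evaluation score:", grid.score(X_eval, y_eval))
```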

ogrisel

Posted 2010-11-15T18:46:37.113

Reputation: 3 089

9In general it takes an enormous dataset for data splitting to be reliable. That's why stringent internal validation with the bootstrap is so attractive. – Frank Harrell – 2013-06-15T15:53:57.770

Especially when the development set is past data and the evaluation set future data. Why not, after all model tuning, train the final model with its fixed parameters on the entire development set and predict the entire evaluation set with it. In a real scenario, you couldn't cross-validate through future data the way you describe anyway, so you would use all relevant past data. – David Ernst – 2017-09-15T14:50:14.787

5I agree with this in principle but in the case of a small data set (I often have only 20-40 cases) use of a separate evaluation set is not practical. Nested cross-validation can get around this but may lead to pessimistic estimates on small data sets – BGreene – 2012-07-24T16:39:46.160

61

Reporting p-values when you did data-mining (hypothesis discovery) instead of statistics (hypothesis testing).
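
A self-contained toy demonstration of why dredged p-values mislead (illustration only, assuming numpy and scipy are available): test one pure-noise outcome against 100 pure-noise predictors and count the "discoveries".

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, n_predictors = 100, 100

y = rng.normal(size=n)                      # pure noise outcome
X = rng.normal(size=(n, n_predictors))      # pure noise "predictors"

pvals = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(n_predictors)])
print(f"'significant' correlations at p < 0.05: {(pvals < 0.05).sum()} of {n_predictors}")
print(f"smallest p-value found by dredging: {pvals.min():.4f}")
# On average about 5 of the 100 tests come out "significant" even though
# every true correlation is exactly zero; reporting only those p-values
# as if they were pre-planned tests is the sin.
```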

Neil McGuigan

Posted 2010-11-15T18:46:37.113

Reputation: 4 921

2Can you (or someone) elaborate? – antoine-sac – 2015-09-22T22:05:24.020

1see https://en.wikipedia.org/wiki/Data_dredging – Neil McGuigan – 2015-11-17T19:17:54.250

What about p-values corrected for multiple hypothesis testing (with some flavour of Bonferroni method or a more advanced correction)? I would tend to think it is fine, even in the context of data mining? – antoine-sac – 2015-11-19T15:46:06.330

I like the general idea, but it's a distortion to equate statistics with hypothesis testing when the latter is a subset of the former. – rolando2 – 2017-01-25T16:34:03.700

44

Testing the hypotheses $H_0: \mu=0$ versus $H_1: \mu\neq 0$ (for example in a Gaussian setting)

to justify that $\mu=0$ in a model (i.e. conflating "$H_0$ is not rejected" with "$H_0$ is true").

A very good example of that type of (very bad) reasoning is when you test whether the variances of two Gaussians are equal (or not) before testing if their means are equal or not under the assumption of equal variance.

Another example occurs when you test normality (versus non-normality) to justify normality. Every statistician has done that in his or her life; it is baaad :) (and it should push people to check robustness to non-Gaussianity instead).
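
To make the normality example concrete, here is a small simulation (an illustrative sketch, assuming numpy and scipy are available): with small samples, a Shapiro-Wilk test frequently fails to reject normality even for clearly exponential data, so "not rejected" is very weak support for the normality assumption.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_sims, n = 2000, 10   # small samples, as in many lab experiments

failed_to_reject = 0
for _ in range(n_sims):
    x = rng.exponential(scale=1.0, size=n)   # decidedly non-normal data
    _, p = stats.shapiro(x)
    if p > 0.05:
        failed_to_reject += 1

print(f"Shapiro-Wilk failed to reject normality in "
      f"{100 * failed_to_reject / n_sims:.0f}% of samples of size {n}")
# A substantial fraction of these clearly exponential samples "pass" the
# normality test, so non-rejection cannot be read as evidence of normality.
```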

robin girard

Posted 2010-11-15T18:46:37.113

Reputation: 5 247

3I try to be statistically literate and still fall for this one from time to time. What are the alternatives? Change your model so the old null becomes $H_1$? The only other option I can think of is to power your study enough that a failure to reject the null is in practice close enough to confirming the null. E.g. if you want to make sure that adding a reagent to your cells won't kill off more than 2% of them, power to a satisfactory false negative rate. – DocBuckets – 2013-05-15T17:38:37.830

Great!! Yes, this drives me crazy.. – jpillow – 2011-07-11T06:32:45.423

@DocBuckets equivalence testing with two one sided tests is more rigorous than the power based approach. But you need to set a minimum relevant effect size below which you can speak of practical equivalence. – David Ernst – 2017-09-15T14:57:10.763

6The same logic (taking "absence of evidence in favor of H1" as "evidence of absence of H1") essentially underlies all goodness-of-fit tests. The reasoning also often crops up when people state "the test was non-significant, we can therefore conclude there is no effect of factor X / no influence of variable Y". I guess the sin is less severe if accompanied by reasoning about the test's power (e.g., a-priori estimation of sample size to reach a certain power given a certain relevant effect size). – caracal – 2010-11-16T23:07:06.833

If you do not make any consideration about the power, I would say claiming $H_0$ is true when it is not rejected is very, very bad, while claiming $H_1$ is true when $H_0$ is rejected is just a little wrong :). – robin girard – 2010-11-17T07:22:53.877

43

A few mistakes that bother me:

  1. Assuming unbiased estimators are always better than biased estimators.

  2. Assuming that a high $R^2$ implies a good model and a low $R^2$ implies a bad model.

  3. Incorrectly interpreting/applying correlation.

  4. Reporting point estimates without standard error.

  5. Using methods which assume some sort of Multivariate Normality (such as Linear Discriminant Analysis) when more robust, better performing, non/semiparametric methods are available.

  6. Using p-value as a measure of strength between a predictor and the response, rather than as a measure of how much evidence there is of some relationship.

HairyBeast

Posted 2010-11-15T18:46:37.113

Reputation: 471

5Would you break these out into separate options? – russellpierce – 2010-12-12T21:27:42.730

40

Ritualized Statistics.

This "sin" is when you apply whatever thing you were taught, regardless of its appropriateness, because it's how things are done. It's statistics by rote, one level above letting the machine choose your statistics for you.

Examples are Intro to Statistics-level students trying to make everything fit into their modest t-test and ANOVA toolkit, or any time one finds oneself going "Oh, I have categorical data, I should use X" without ever stopping to look at the data, or consider the question being asked.

A variation on this sin involves using code you don't understand to produce output you only kind of understand, but know "the fifth column, about 8 rows down" or whatever is the answer you're supposed to be looking for.

Fomite

Posted 2010-11-15T18:46:37.113

Reputation: 15 705

To me Epigrad's description is of someone who cares inordinately about inference and neglects things such as reflection, discovery, and consideration of causality. – rolando2 – 2013-11-26T03:03:27.110

6Unfortunately, if you aren't interested in statistical inference, or are scarce on time and/or resources, the ritual does seem very appealling... – probabilityislogic – 2012-03-09T05:32:49.393

40

Not really answering the question, but there's an entire book on this subject:

Phillip I. Good, James William Hardin (2003). Common errors in statistics (and how to avoid them). Wiley. ISBN 9780471460688

onestop

Posted 2010-11-15T18:46:37.113

Reputation: 15 459

6+1 I made sure to read this book shortly after it came out. I get plenty of opportunities to make statistical mistakes so I'm always grateful to have them pointed out before I make them! – whuber – 2010-12-12T23:02:01.107

39

Dichotomization of a continuous predictor variable to either "simplify" analysis or to solve the "problem" of non-linearity in the effect of the continuous predictor.
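
A small simulation of what the median split costs (a toy sketch, assuming numpy and scipy are available): the same linear effect is detected noticeably less often once the predictor has been dichotomized.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n_sims, n, beta = 2000, 60, 0.3

hits_continuous = hits_split = 0
for _ in range(n_sims):
    x = rng.normal(size=n)
    y = beta * x + rng.normal(size=n)

    # Analysis 1: keep the predictor continuous (simple correlation test).
    p_cont = stats.pearsonr(x, y)[1]

    # Analysis 2: median-split the predictor and compare the two halves.
    high = x > np.median(x)
    p_split = stats.ttest_ind(y[high], y[~high])[1]

    hits_continuous += p_cont < 0.05
    hits_split += p_split < 0.05

print(f"power, continuous predictor: {hits_continuous / n_sims:.2f}")
print(f"power, median split:         {hits_split / n_sims:.2f}")
# The dichotomized analysis tests the same underlying effect with
# systematically lower power because the split discards information.
```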

Mike Lawrence

Posted 2010-11-15T18:46:37.113

Reputation: 7 324

This one really bugs me. My clients do this all the time. And when I point out the error, they understand, but insist on doing it anyway – Peter Flom – 2011-02-06T15:54:50.273

2+1 and it becomes a serious sin when they start choosing the dichotomization cutoff so that it optimizes some sort of difference which is then tested. – Erik – 2013-09-16T10:36:26.687

@Iterator I for one would rather use lat/long if I had them, and the patience to interpret the results carefully. It could be quite interesting to find fine-grained / dimensional differences within countries or even cities, and it could help reduce residuals if analyzing something that changes dramatically across a border. E.g., even analyses of pro-Palestinian liberation attitudes might be more or less polarized closer to the border. Just calling it Palestine vs. Israel is a rather clumsy way to dumb down results, and I don't see what else is achieved beyond that...Less time learning stats? – Nick Stauner – 2014-03-02T21:41:45.713

4@Iterator you start to get at the real reason to aggregate (to two or more categories), which is because one has a priori theoretical reasons to believe that variance is meaningfully compartmentalized into those categories. For example, we do this all the time by assuming that collections of a trillion or so cells comprise an individual, or that a contiguous 24-hour period here on Earth is meaningfully interpreted as a unit. But arbitrary aggregation does not just "throw out" information (e.g. statistical power), but can lead to (serious) biases about relationships between phenomena. – Alexis – 2014-06-25T19:44:44.203

2This isn't even a sin if there are two or more distinct populations. Suppose you have separable classes or sub-populations, then it can make sense to discretize. A very trivial example: Would I rather use indicators for site/location/city/country or lat/long? – Iterator – 2011-08-09T23:38:34.560

17I don't think this is really a "sin" as the results obtained are not wrong. However, it does throw away a lot of useful information so is not good practice. – Rob Hyndman – 2010-11-15T22:51:35.173

2Along these lines, using extreme groups designs over-estimates effect sizes whereas the use of a mean or median split under-estimates effect sizes. – russellpierce – 2010-11-16T02:47:46.777

39

Interpreting Probability(data | hypothesis) as Probability(hypothesis | data) without the application of Bayes' theorem.
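
A minimal worked example with Bayes' theorem (illustrative numbers only): a test with 95% sensitivity can still leave the hypothesis quite improbable after a positive result when the prior is low.

```python
# Screening-test example: P(positive | disease) is high, yet
# P(disease | positive) can still be small when the disease is rare.
p_disease = 0.01              # assumed prevalence (prior)
p_pos_given_disease = 0.95    # sensitivity: P(data | hypothesis)
p_pos_given_healthy = 0.05    # false-positive rate

p_positive = (p_pos_given_disease * p_disease
              + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(hypothesis | data)
p_disease_given_pos = p_pos_given_disease * p_disease / p_positive

print(f"P(positive | disease) = {p_pos_given_disease:.2f}")
print(f"P(disease | positive) = {p_disease_given_pos:.2f}")   # about 0.16
```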

Andre Holzner

Posted 2010-11-15T18:46:37.113

Reputation: 885

5http://en.wikipedia.org/wiki/Prosecutor%27s_fallacy – finnw – 2011-12-19T23:38:24.327

30

Maybe stepwise regression and other forms of testing after model selection.

Selecting independent variables for modelling without having any a priori hypothesis behind the existing relationships can lead to logical fallacies or spurious correlations, among other mistakes.

Useful references (from a biological/biostatistical perspective):

  1. Kozak, M., & Azevedo, R. (2011). Does using stepwise variable selection to build sequential path analysis models make sense? Physiologia plantarum, 141(3), 197–200. doi:10.1111/j.1399-3054.2010.01431.x

  2. Whittingham, M. J., Stephens, P., Bradbury, R. B., & Freckleton, R. P. (2006). Why do we still use stepwise modelling in ecology and behaviour? The Journal of animal ecology, 75(5), 1182–9. doi:10.1111/j.1365-2656.2006.01141.x

  3. Frank Harrell, Regression Modeling Strategies, Springer 2001.
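
For illustration, a toy forward-selection sketch (assuming statsmodels is available; this is a simplified p-value-driven procedure, not the exact algorithm any particular package uses): applied to pure noise, it still tends to return a model whose retained coefficients look "significant".

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n, n_candidates = 100, 30

X = rng.normal(size=(n, n_candidates))   # candidate predictors: pure noise
y = rng.normal(size=n)                   # outcome: pure noise, unrelated to X

selected = []
while True:
    best_p, best_j = 1.0, None
    for j in range(n_candidates):
        if j in selected:
            continue
        cols = selected + [j]
        design = sm.add_constant(X[:, cols])
        fit = sm.OLS(y, design).fit()
        p = fit.pvalues[-1]              # p-value of the newly added variable
        if p < best_p:
            best_p, best_j = p, j
    if best_j is None or best_p > 0.05:
        break
    selected.append(best_j)

final = sm.OLS(y, sm.add_constant(X[:, selected])).fit() if selected else None
print("variables selected from pure noise:", selected)
if final is not None:
    print("their p-values in the final model:", np.round(final.pvalues[1:], 4))
# Even with no real signal anywhere, forward selection usually retains one or
# more predictors whose naive p-values look convincingly small.
```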

Ben Bolker

Posted 2010-11-15T18:46:37.113

Reputation: 17 243

29

Something I see surprisingly often in conference papers and even journals is making multiple comparisons (e.g. of bivariate correlations) and then reporting all the p<.05s as "significant" (ignoring the rightness or wrongness of that for the moment).

I know what you mean about psychology graduates as well; I've finished a PhD in psychology and I'm still only just learning, really. It's quite bad; I think psychology needs to take quantitative data analysis more seriously if we're going to use it (which, clearly, we should)
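
For completeness, a small sketch (assuming scipy and statsmodels are available) of adjusting a batch of correlation p-values for multiple comparisons instead of reporting every raw p < .05 as significant:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(2)
n, n_vars = 80, 12

data = rng.normal(size=(n, n_vars))   # 12 unrelated variables: 66 pairwise tests

pvals = np.array([stats.pearsonr(data[:, i], data[:, j])[1]
                  for i in range(n_vars) for j in range(i + 1, n_vars)])

reject_raw = pvals < 0.05
reject_holm, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")

print(f"raw 'significant' correlations:    {reject_raw.sum()} of {pvals.size}")
print(f"significant after Holm correction: {reject_holm.sum()} of {pvals.size}")
# With 66 tests on pure noise, a few raw p-values will typically dip below .05;
# a familywise correction (Holm here) removes essentially all of them.
```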

Chris Beeley

Posted 2010-11-15T18:46:37.113

Reputation: 2 565

8This is particularly important. I remember reading a study about whether Ramadan was bad for babies whose mothers were fasting. It looked plausible (less food, lower birth weight), but then I looked at the appendix. Thousands of hypotheses, and a few percent of them were in the "significant" range. You get weird "conclusions" like "it's bad for the kid if Ramadan is the 2nd, 4th or 6th month". – Carlos – 2011-01-31T13:06:21.333

27

Being exploratory but pretending to be confirmatory. This can happen when one modifies the analysis strategy (i.e. model fitting, variable selection and so on) in a data-driven or result-driven way without stating this openly, and then only reports the "best" (i.e. smallest p-value) results as if they had come from the only analysis performed. This also pertains to the point about multiple testing that Chris Beeley made, and it results in a high false positive rate in scientific reports.

psj

Posted 2010-11-15T18:46:37.113

Reputation: 733

25

The one that I see quite often and always grinds my gears is the assumption that a statistically significant main effect in one group and a non-statistically significant main effect in another group implies a significant effect x group interaction.

russellpierce

Posted 2010-11-15T18:46:37.113

Reputation: 9 334

23

Especially in epidemiology and public health: using an arithmetic instead of a logarithmic scale when reporting graphs of relative measures of association (hazard ratio, odds ratio or risk ratio).

More information here.

radek

Posted 2010-11-15T18:46:37.113

Reputation: 652

5Not to mention not labeling them at all http://xkcd.com/833/ – radek – 2010-12-13T22:18:00.607

22

Analysis of rate data (accuracy, etc.) using ANOVA, thereby assuming that rate data has Gaussian-distributed error when it is actually binomially distributed. Dixon (2008) provides a discussion of the consequences of this sin and an exploration of more appropriate analysis approaches.
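
A hedged sketch of the general kind of alternative: a binomial GLM (logistic regression) on the success/failure counts instead of an ANOVA on per-condition accuracies. statsmodels is assumed, the simulated design is a placeholder, and note that Dixon's paper argues for mixed-effects logistic models, of which this fixed-effects version is only a simplified cousin.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n_subjects, n_trials = 20, 40
p_correct = {"easy": 0.95, "hard": 0.80}    # assumed true accuracies

rows = []
for cond, p in p_correct.items():
    for s in range(n_subjects):
        correct = rng.binomial(n_trials, p)
        rows.append((cond, correct, n_trials - correct))

# Binomial GLM on (successes, failures) with a condition effect,
# instead of treating per-subject accuracy as Gaussian in an ANOVA.
is_hard = np.array([1.0 if cond == "hard" else 0.0 for cond, _, _ in rows])
endog = np.array([(c, f) for _, c, f in rows], dtype=float)
exog = sm.add_constant(is_hard)

fit = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()
print(fit.summary())
# The condition coefficient is on the log-odds scale, which respects the
# binomial error structure that accuracy data actually have.
```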

Mike Lawrence

Posted 2010-11-15T18:46:37.113

Reputation: 7 324

This is only as bad as the normal approximation to the binomial - should be fine, provided that each case is weighted by the denominator used in calculating the rate. Would expect it to perform poorly for rates below 10% and above 90%. – probabilityislogic – 2014-07-02T05:24:48.890

4How much does this decrease the power of the analysis? In what conditions is it most problematic? In many cases deviations from the assumptions of ANOVA do not substantially affect the outcomes to an important extent. – Michael Lew – 2010-11-16T07:48:44.663

What is the alternative do the ANOVA procedure? – Henrik – 2010-11-16T12:55:24.730

@Michael Lew & Henrik: I just updated this entry to include a link to Dixon (2008) – Mike Lawrence – 2010-11-16T15:48:33.303

2But in short, it is most problematic when probabilities observed are low or high as the range of values are constricted and unable to meet Gaussian assumptions. – russellpierce – 2010-11-17T15:14:31.483

21

Believing that correlation implies causation, which is not as bad as accepting the null hypothesis.

suncoolsu

Posted 2010-11-15T18:46:37.113

Reputation: 5 246

3Google makes all that money not caring about causation at all. Indeed, why would it? Prediction is the thing... – conjugateprior – 2012-03-30T14:56:11.567

but sometimes... sometimes the potential directions of causation have highly disparate probabilities. I'm certainly not going to think that a correlation between age and height could be caused by the height... or some intervening variable either. Also, I think that this is one that behavioural science training is generally quite sensitive to. – John – 2010-11-15T19:21:49.960

indeed, when inferring something from "A and B are correlated", people usually only see "A causes B" but not "B causes A"... (and forget about C, which causes both A and B) – Andre Holzner – 2010-12-01T20:13:59.237

12google makes $65B a year not caring about the difference... – Neil McGuigan – 2010-12-03T07:41:55.810

5I agree with your points and they all are valid. But does Google's profit imply: correlation => causation? – suncoolsu – 2010-12-03T07:53:58.967

17

A current popular one is plotting 95% confidence intervals around the raw performance values in repeated measures designs when they only relate to the variance of an effect. For example, a plot of reaction times in a repeated measures design with confidence intervals where the error term is derived from the MSE of a repeated measures ANOVA. These confidence intervals don't represent anything sensible. They certainly don't represent anything about the absolute reaction time. You could use the error term to generate confidence intervals around the effect but that is rarely done.

John

Posted 2010-11-15T18:46:37.113

Reputation: 17 822

Is there a standard article that can be cited to dissuade reviewers from demanding this all-too-common practice? – russellpierce – 2010-11-26T03:56:18.223

The only critique I know is Blouin & Riopelle (2005) but they don't get to the heart of the matter. I generally don't insist on not showing them but doing something correct as in the effect graphs of Masson & Loftus (2003, see figure 4, right panel... if they were removed from the left one you'd have done it right). – John – 2010-11-26T13:41:01.367

Just to be clear, the problem with those CI's is that they're purely used for inferential reasons with respect to differences among conditions and therefore are worse even than PLSD... in fact I prefer them. At least they're honest. – John – 2010-11-26T13:42:01.457

16

While I can relate to much of what Michael Lew says, abandoning p-values in favor of likelihood ratios still misses a more general problem--that of overemphasizing probabilistic results over effect sizes, which are required to give a result substantive meaning. This type of error comes in all shapes and sizes and I find it to be the most insidious statistical mistake. Drawing on J. Cohen and M. Oakes and others, I've written a piece on this at http://integrativestatistics.com/insidious.htm .

rolando2

Posted 2010-11-15T18:46:37.113

Reputation: 7 899

@rolando2 - You still want a likelihood ratio, but when there is a substantive value that has meaning, you should really be testing on this value rather than on $0$. So if $|\beta|=1$ has physical meaning in terms of the coefficient, then you should really be testing for $|\beta|>1$ vs $|\beta|\leq 1$. And not $\beta=0$ vs $\beta\neq 0$ – probabilityislogic – 2012-03-09T05:31:11.800

3I'm actually unclear as to how a likelihood ratio (LR) does not achieve everything that an effect size achieves, while also employing an easily interpretable scale (the data contains X times more evidence for Y than for Z). An effect size is usually just some form of ratio of explained to unexplained variability, and (in the nested case) the LR is the ratio of unexplained variability between a model that has an effect and one that doesn't. Shouldn't there at least be a strong correlation between effect size and LR, and if so, what is lost by moving to the likelihood ratio scale? – Mike Lawrence – 2011-01-06T19:54:51.660

Mike - You've got me interested, but do your points extend to effect sizes as simple as mean differences between groups? These can be easily interpreted by a lay person and can also be assigned confidence intervals. – rolando2 – 2011-01-06T20:22:53.617

Ah, so by effect size, you mean absolute effect size, a value that is meaningless unto itself, but that can be made meaningful by transformation into relative effect size (by dividing by some measure of variability, as I mentioned), or by computing a confidence interval for the absolute effect size. My argument above applies to the merits of LRs vs relative effect sizes. There may be utility in computing effect CIs in cases where the actual value of the effect is of interest (e.g. prediction), but I still stand by the LR as a more intuitive scale for talking about evidence for/against effects. – Mike Lawrence – 2011-01-06T23:11:56.483

I guess the use of LRs vs CIs will likely vary according to the context, which may be usefully summarized as follows: More exploratory stages of science, where theories are roughly characterized by the existence/absence of phenomena, may prefer LRs to quantify evidence. On the other hand, CIs may be preferred in more advanced stages of science, where theories are sufficiently refined to permit nuanced prediction including ranges of expected effects or, conversely, when different ranges of effect magnitudes support different theories. Finally, predictions generated from any model need CIs. – Mike Lawrence – 2011-01-06T23:18:49.550

14

Failing to test the assumption that error is normally distributed and has constant variance between treatments. These assumptions aren't always tested, thus least-squares model fitting is probably often used when it is actually inappropriate.

jebyrnes

Posted 2010-11-15T18:46:37.113

Reputation: 556

9What's inappropriate about least squares estimation when the data are non-normal or heteroskedastic? It is not fully efficient, but it is still unbiased and consistent. – Rob Hyndman – 2010-11-16T03:18:20.200

3If the data are heteroscedastic you can end up with very inaccurate out-of-sample predictions because the regression model will try too hard to minimise the error on samples in areas with high variance and not hard enough on samples from areas of low variance. This means you can end up with a very badly biased model. It also means that the error bars on the predictions will be wrong. – Dikran Marsupial – 2010-11-16T09:31:44.380

6No, it is unbiased, but the variance is larger than if you used a more efficient method for the reasons you explain. Yes, the prediction intervals are wrong. – Rob Hyndman – 2010-11-16T12:39:49.377

4Yes (I was using biased in a colloquial rather than a statistical sense to mean the model was systematically biased towards observations in high-variance regions of the feature space - mea culpa!) - it would be more accurate to say that the higher variance means there is an increased chance of getting a poor model using a finite dataset. That seems a reasonable answer to your question. I don't really view unbiasedness as being that much of a comfort - what is important is that the model should give good predictions on the data I actually have and often the variance is more important. – Dikran Marsupial – 2010-11-16T22:18:23.977

12

My intro psychometrics course in undergrad spent at least two weeks teaching how to perform a stepwise regression. Is there any situation where stepwise regression is a good idea?

Christopher Aden

Posted 2010-11-15T18:46:37.113

Reputation: 1 095

I'm working on a project that uses stepwise regression. The reason is because I have D >> N, where D is dimensionality and N is sample size (thus ruling out using one model with all the variables), subsets of the features are highly correlated with each other, I want a statistically principled way of selecting maybe 2-3 "best" features, and I don't intend to report the P-values, at least without some kind of fairly conservative correction. – dsimcha – 2011-01-29T04:16:14.197

6"Good idea" depends on the situation. When you want to maximize prediction it isn't a horrible idea - though it may lead to over fitting. There are some rare cases where it is inevitable - where there is no theory to guide the model selection. I wouldn't count stepwise regression as a "sin" but using it when theory is sufficient to drive model selection is. – russellpierce – 2010-11-16T02:52:32.267

19Perhaps the sin is doing statistical tests on a model obtained via stepwise regression. – Rob Hyndman – 2010-11-16T03:19:25.860

3It's fine if you use cross-validation, and don't extrapolate. Don't publish the p-values though, as they are meaningless. – Neil McGuigan – 2010-11-30T06:16:54.457

11

This may be more of a pop-stats answer than what you're looking for, but:

Using the mean as an indicator of location when data is highly skewed.

This isn't necessarily a problem, if you and your audience know what you're talking about, but this generally isn't the case, and the median is often likely to give a better idea of what's going on.

My favourite example is mean wages, which are usually reported as "average wages". Depending on the income/wealth inequality in a country, this can be vastly different from the median wage, which gives a much better indicator for where people are at in real life. For example, in Australia, where we have relatively low inequality, the median is 10-15% lower than the mean. In the US the difference is much starker: the median is less than 70% of the mean, and the gap is increasing.

Reporting on the "average" (mean) wage results in a rosier picture than is warranted, and could also give a large number of people the false impression that they aren't earning as much as "normal" people.

naught101

Posted 2010-11-15T18:46:37.113

Reputation: 2 163

1This is not just related to skewness; it is a general problem that the mean, or any other measure of central tendency, is not enough without considering dispersion. For example, the medians of two groups could be equal while the interquartile range is 100 times as big for one of them. Just looking at the medians, you would say they're the "same population distribution", when in reality they would be very different. Not to mention multiple modes creating problems... – probabilityislogic – 2014-07-02T04:48:41.930

But, for some purposes mean is relevant: wage is an extensive variable, meaning that sums of wages are meaningful. For questions where total wage income of some (sub)group is relevant, means are the right thing: The total can be recovered from the mean, not from the median. – kjetil b halvorsen – 2015-08-16T12:19:30.050

@kjetilbhalvorsen: Why not just use the total then? – naught101 – 2015-08-17T00:45:00.350

@naught101: Because different (sub)groups generally will have different $n$s. – kjetil b halvorsen – 2015-08-17T09:02:55.530

There's a semi-related discussion of this as it applies to trend analysis here: https://tamino.wordpress.com/2012/03/29/to-robust-or-not-to-robust-that-is-the-question/ – naught101 – 2012-04-11T01:05:40.757

10

My old stats prof had a "rule of thumb" for dealing with outliers: If you see an outlier on your scatterplot, cover it up with your thumb :)

Neil McGuigan

Posted 2010-11-15T18:46:37.113

Reputation: 4 921

This is akin to Winsorization which isn't too terrible. – Ari B. Friedman – 2011-08-10T20:28:23.353

9

That the p-value is the probability that the null hypothesis is true and (1-p) is the probability that the alternative hypothesis is true, or that failing to reject the null hypothesis means the alternative hypothesis is false, etc.

Dikran Marsupial

Posted 2010-11-15T18:46:37.113

Reputation: 32 924

1Interestingly, Aitkin shows the p-value is the posterior probability that the likelihood ratio is less than $1$ (for the fixed data that was observed) – probabilityislogic – 2012-05-05T01:36:24.833

Interesting, can you give me a reference to read up about it? – Dikran Marsupial – 2012-05-05T14:04:38.707

2Here you go: http://www.ece.uvic.ca/~bctill/papers/mocap/Aitkin_1997.pdf Personally, while I do find it interesting, I struggle with the question of why the posterior distribution of the likelihood ratio is the quantity of interest. – probabilityislogic – 2012-05-06T00:08:44.527

8

Repeating the same or similar experiment over 20 times on the same data and then reporting a statistically significant result with $\alpha = 0.05$. Incidentally there is a comic about this one.

And similarly to (or almost the same as) @ogrisel's answer, performing a Grid search and reporting only the best result.

Andrew

Posted 2010-11-15T18:46:37.113

Reputation: 883

I think you meant to link to a different comic, though that's an immortal one. – rolando2 – 2017-01-25T16:39:47.133

Possibly, if I remember well enough what I had in mind back then: https://xkcd.com/882/ – Andrew – 2017-01-25T20:27:47.753

8

(With a bit of luck this will be controversial.)

Using a Neyman-Pearson approach to statistical analysis of scientific experiments. Or, worse, using an ill-defined hybrid of Neyman-Pearson and Fisher.

Michael Lew

Posted 2010-11-15T18:46:37.113

Reputation: 6 938

sorry to be ignorant, but what's wrong with a Neyman-Pearson construction for the analysis of (the outcome of) scientific experiments ? – Andre Holzner – 2010-12-01T20:15:43.133

@Andre I think this remark may be closely related to another one offered by @Michael Lew elsewhere in this thread (http://stats.stackexchange.com/questions/4551/what-are-common-statistical-sins/4567#4567). – whuber – 2010-12-12T23:04:48.967

8

Using pie charts to illustrate relative frequencies. More here.

Andrej

Posted 2010-11-15T18:46:37.113

Reputation: 1 090

2Would be good to include some reasoning on-site. – naught101 – 2012-04-10T12:32:24.497

8

In a similar vein to @Dikran Marsupial's answer: the use of p-values as a formal measure of evidence about whether the null hypothesis is true. It does have some good heuristic and intuitively appealing features, but it is essentially an incomplete measure of evidence because it makes no reference to the alternative hypothesis. While the data may be unlikely under the null (leading to a small p-value), the data may be even more unlikely under the alternative hypothesis.

The other problem with p-values, which also relates to some styles of hypothesis testing, is that there is no principle telling you which statistic you should choose, apart from the very vague "large value" $\rightarrow$ "unlikely if null hypothesis is true". Once again, you can see the incompleteness showing up, for you should also have "large value" $\rightarrow$ "likely if alternative hypothesis is true" as an additional heuristic feature of the test statistic.

probabilityislogic

Posted 2010-11-15T18:46:37.113

Reputation: 17 954

I am not answering because I don't want to go to the trouble of thinking one up and, for that matter, wading through all the ones already given to make sure I don't repeat one! But I think I can be helpful. There is a book by Good and Hardin called "Common Errors in Statistics and How to Avoid Them." You can find a lot of great examples there. It is a popular book that is already going into its fourth edition. – Michael Chernick – 2012-05-04T18:04:35.533

Also Altman's book with Chapman & Hall/CRC, "Practical Statistics for Medical Research", has a chapter on the medical literature where many statistical sins are revealed that occurred in published papers. – Michael Chernick – 2012-05-04T18:04:42.553

8

Using statistics/probability in hypothesis testing to measure the "absolute truth". Statistics simply cannot do this; they can only be of use in deciding between alternatives, which must be specified from "outside" the statistical paradigm. Statements such as "the null hypothesis is proved true by the statistics" are just incorrect; statistics can only tell you "the null hypothesis is favoured by the data, compared to the alternative hypothesis". If you then assume that either the null hypothesis or the alternative must be true, you can say "the null is proved true", but this is only a trivial consequence of your assumption, not anything demonstrated by the data.

probabilityislogic

Posted 2010-11-15T18:46:37.113

Reputation: 17 954

7

Requesting, and perhaps obtaining, The Flow Chart: that graphical thing where you say what the levels of your variables are and what sort of relationship you're looking for, and you follow the arrows down to get a Brand Name Test or a Brand Name Statistic. Sometimes offered with mysterious 'parametric' and 'non-parametric' paths.

conjugateprior

Posted 2010-11-15T18:46:37.113

Reputation: 15 813

7

Perhaps the poor teaching of statistics to end consumers. The fact is that most courses offer a medieval menu: they do not include new theoretical developments, computational methods, or best practices, and they provide insufficient training in the modern, complete analysis of real data sets, at least in poor and developing countries. What is the situation in developed countries?

Washington S. Silva

Posted 2010-11-15T18:46:37.113

Reputation: 461

3The situation in developed countries is exactly the same. – Flounderer – 2014-08-17T01:41:40.493

7

The temptation to use advanced statistical methods without understanding them, just because they sound impressive or because they happen to better support the researcher's initial hypothesis.

When one uses an advanced method, he or she should have solid reasons as to why the method is appropriate.

Akavall

Posted 2010-11-15T18:46:37.113

Reputation: 1 390

6

In psychology, the cardinal sin (for me) is the use of principal components analysis to examine the hypothesised latent structure underlying a psychometric test.

Not testing for normality before using tests which assume this.

richiemorrisroe

Posted 2010-11-15T18:46:37.113

Reputation: 2 523

https://stats.stackexchange.com/questions/2492/is-normality-testing-essentially-useless ... I especially recommend Harvey's answer there. – Glen_b – 2017-10-18T04:41:45.243

1Take a look at the sin I have already proposed about testing. If you do the test and do not reject normality, it does not mean that you have a normal sample... it only means that you cannot say the sample is not normal. sin ! – robin girard – 2010-12-01T16:04:56.537

6

Probably not as applicable to psych stats (or is it? I'm not sure) but failing to account for a split-plot design in an analysis of an experiment. I've seen way too many people do this.

Dason

Posted 2010-11-15T18:46:37.113

Reputation: 400

A pre-post, experimental-control group design is extremely common in psychology. I agree that few people seem to be aware of the comparatively complicated and strict assumptions. Mixed models often seem to be beyond the statistical horizon. – caracal – 2010-12-01T21:36:20.573

5

I would say, doing tests and regressions on a small set of data.
Edit: Without looking at the confidence intervals, or when the confidence intervals/error bars are not easy to calculate.

RockScience

Posted 2010-11-15T18:46:37.113

Reputation: 1 353

4Perhaps I don't see why this is such a problem. Hypothesis testing a small sample size using a normal distribution, sure, but using a more conservative/nonparametric test, is this so bad? – Christopher Aden – 2010-11-16T10:18:30.767

I agree that using a more conservative model to fit the data is the best we can do. But in any case you will have to trust this model. It will be a fitting, not a model. A model requires a representative set of data otherwise it may not work in the future. – RockScience – 2010-11-17T02:17:33.240

If you use Bayesian regression then the error bars also indicate the uncertainty due to the finite nature of the dataset (given the prior), and you only trust the model as far as the error bars suggest you ought to trust it. If you don't have enough data to make a useful inference it will generally be evident in the posterior distributions for the parameters and/or predictions. The usual frequentist error bars will probably say pretty much the same thing. At the end of the day, sometimes only a small dataset is available; it just limits the confidence in your conclusions. – Dikran Marsupial – 2010-11-19T09:14:11.553

Agreed for the Bayesian regression. Thanks for pointing that out. But if you have two points that form a straight line, how do you calculate the frequentist error bars? And let's say that you have enough points to calculate the frequentist error bars, from how many points can you trust them (should we use the error bars of the error bars?) – RockScience – 2010-11-19T09:26:57.503

Sorry, I am not that familiar with the limitations of frequentist error bars (I only mentioned them so as not to appear bigoted ;o). IIRC frequentist error bars for ridge regression are very similar to the Bayesian ones for a fixed value of the ridge parameter (a full Bayesian analysis would marginalise over the ridge parameter as well). – Dikran Marsupial – 2010-11-19T09:50:49.527

1Isn't performing a test implicitly looking at the confidence intervals? Perhaps it would be that the sin test-wise is ignoring the power of the test? – Dikran Marsupial – 2010-11-19T13:11:43.460

5

Using Analysis of Covariance (ANCOVA) to try to "control for" or "regress out" the influence of a covariate that is known to be correlated with, or affect the influence of, other predictor variables. More discussion at this question: Checking factor/covariate independence in ANCOVA

Mike Lawrence

Posted 2010-11-15T18:46:37.113

Reputation: 7 324

4

This isn't generally considered a sin but I hope it will be one day: using a bad model that doesn't describe reality just because it's "interpretable."

dsaxton

Posted 2010-11-15T18:46:37.113

Reputation: 8 355

4

Rushing into modeling before spending enough time on understanding and preprocessing the data.

Aliweb

Posted 2010-11-15T18:46:37.113

Reputation: 150

This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post. - From Review

– kjetil b halvorsen – 2016-11-28T08:26:19.143

This is probably adequate, IMO, given the context of the thread to which it was posted. I think it's OK. @Aliweb, you might want to develop / elaborate the idea a little, though. – gung – 2016-11-28T13:06:57.247

@kjetilbhalvorsen This exactly answers the question. The author is asking for some statistical sins and if you read the question, my answer couldn't be seen as a criticism of the author. – Aliweb – 2016-11-28T17:42:15.177

1@gung I tried to answer the question the same way many others have done and considering what the asker is asking for. I think elaborating the idea would be kind of repeating myself or I should start talking about "how to understand/preprocess the data" which is not what the question is about. – Aliweb – 2016-11-28T17:44:36.117

1@Aliweb, that's fair. I only said "might"; if you can't go further w/o repeating yourself or moving off topic, then you're best staying put. – gung – 2016-11-28T18:31:42.700

I upvoted the answer though I bet many of us would benefit from seeing an example or 2. – rolando2 – 2017-01-25T16:44:44.697

4

Interpreting a $100\alpha \%$ Confidence Interval $I$ as the probability of finding the "real" parameter inside the interval.

The most common case is when someone calculates this C.I. ($I$) and interprets the number $\alpha$ as the probability of finding the "true mean" say, $\mu$, inside the interval, i.e., interpreting the C.I. as $P(\mu \in I)=\alpha$.
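
A small coverage simulation (illustrative only, assuming numpy and scipy are available) of what the 95% does refer to: the long-run behaviour of the interval-constructing procedure, not a probability attached to any single realized interval.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
true_mu, sigma, n, n_sims = 10.0, 3.0, 25, 10_000

covered = 0
for _ in range(n_sims):
    x = rng.normal(true_mu, sigma, n)
    half_width = stats.t.ppf(0.975, df=n - 1) * x.std(ddof=1) / np.sqrt(n)
    lo, hi = x.mean() - half_width, x.mean() + half_width
    covered += (lo <= true_mu <= hi)

print(f"fraction of intervals containing the true mean: {covered / n_sims:.3f}")
# Roughly 95% of intervals produced by this procedure contain mu; any single
# computed interval either contains mu or it does not, which is why assigning
# it a probability of 0.95 is the misinterpretation described above.
```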

Néstor

Posted 2010-11-15T18:46:37.113

Reputation: 2 634

3Is this such a bad thing? I understand the usual argument of "the mean is either in the interval or it isn't", or at least I think I do. But the endpoints of the confidence interval are random variables, so why is it wrong to talk about the probability that they take values above and below the true mean? – mark999 – 2012-04-08T22:43:34.040

It IS bad because you are giving a probability statement that doesn't exist. The phrase "the true population mean $\mu$ is either in the interval or isn't" states a beautiful fact: $P(\mu \in I)$ is either 1 or 0. On the other hand, recall that what you calculate when creating confidence intervals is $P(L_1<\bar{X}-\mu<L_2)=\alpha$, where $\bar{X}$ is a STATISTIC (i.e. a function of random variables) and is, therefore, the random variable on which you are calculating the probability. The 100$\alpha$% confidence interval that you calculate is one of many other (random) intervals that... – Néstor – 2012-04-09T03:58:15.360

...might "appear" as you sample from the population. Simply put, $I$ is one realization of many random intervals (say, $I_r$) that are generated as you sample from the population. You can only say that $I_r$ is a range in which the mean will occur 95% of the time, but that says nothing about your particular realization (or estimation) of interval $I$ that one usually calculates, and, therefore, says nothing about the probability of $I$ containing $\mu$. – Néstor – 2012-04-09T04:06:57.843

2Thanks for the explanation, I think I understand now. I wasn't suggesting that a particular realisation of a 95% confidence interval has 95% chance of containing the true value, although I didn't make that clear. What I meant was that saying "the probability that the (generic) interval contains the true value is 0.95" seems to me to be equivalent to saying "if repeated many times, 95% of the intervals will contain the true value". – mark999 – 2012-04-09T04:55:03.887

Oh, yes! That is in fact true :-). Just a misunderstanding then. – Néstor – 2012-04-09T04:57:47.943

4

Over-interpreting OLS regression in the presence of known outliers. If you know that there are particular data in your dataset which are generated by a different process to the process that generates most of the data, and this different process generates wildly different results which show up as outliers, then you have to be very careful in interpreting the model output because the outliers often do substantially move the OLS results. That's not to say OLS is bad, just that you need to think about the data when interpreting the results.

What's worse is that we often have "never throw away outliers" as common advice to early students. Sometimes it translates into an attitude of keeping the data, warts and all, without really discussing anomalies and outliers.

Better advice might be: "use a mixture model" or "use Huber/quantile-based/other robust techniques" or "go Bayes and use a hierarchical model". But everyone should at least learn to just "reanalyse without the suspect outliers and print both analyses and show us a plot" or even "talk qualitatively for a bit about outliers in the conclusion of your paper and suggest it might be a good idea to redo the experiment with fewer foul-ups".
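
As a sketch of the "print both analyses" advice (statsmodels assumed; all numbers invented for illustration), compare an ordinary least-squares fit, a Huber robust fit, and a refit without the suspect point:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 50

x = np.linspace(0, 10, n)
y = 2.0 + 0.5 * x + rng.normal(0, 1, n)
y[-1] += 30.0                      # one observation from a different process

X = sm.add_constant(x)
ols_fit = sm.OLS(y, X).fit()
rlm_fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()

print("OLS slope:   %.2f" % ols_fit.params[1])
print("Huber slope: %.2f" % rlm_fit.params[1])
print("OLS slope without the outlier: %.2f"
      % sm.OLS(y[:-1], X[:-1]).fit().params[1])
# Reporting both fits (and the re-analysis without the suspect point)
# makes the influence of the outlier visible instead of hiding it.
```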

Patrick Caldon

Posted 2010-11-15T18:46:37.113

Reputation: 1 138

Hah. I already added this as a comment to my answer, but it's more relevant to this answer: https://tamino.wordpress.com/2012/03/29/to-robust-or-not-to-robust-that-is-the-question/ discusses situations where OLS may be better than robust regression, even in cases with (apparent) outliers. – naught101 – 2012-04-11T01:28:54.183

4

Application of least-squares minimization when maximum-likelihood procedures exist.

Mike Lawrence

Posted 2010-11-15T18:46:37.113

Reputation: 7 324

3@MikeLawrence - "maximum likelihood" = "weighted least squares" in many cases. even maximum entropy is approximately least squares (when the initial measure isn't too far from the optimum measure). You could do a lot worse than least squares... – probabilityislogic – 2012-03-09T05:43:06.650

3Would you please explain the consequences of this sin? – russellpierce – 2010-11-16T02:50:28.617

1If the data are generated by a process with a heteroscedastic noise process, the regression model is likely to give very inacurate out-of-sample predictions. – Dikran Marsupial – 2010-11-16T09:26:46.770

This entry was originally inspired by the observation that some folks estimate non-linear psychometric functions by minimizing least squares. For example, Murd et al (2009, http://www.perceptionweb.com/abstract.cgi?id=p6145 , free pdf available by googling the title) fit a probit function through data by minimizing least-squares. – Mike Lawrence – 2010-11-16T16:04:32.697

So, should the "answer" be amended to "application of least-squares minimization on heteroscedatic data"? – russellpierce – 2010-11-17T15:13:00.360

if the noise process is skewed but homoscedastic then it is still inappropriate if you need the error bars. – Dikran Marsupial – 2010-11-19T13:10:33.167

3

Interpreting a statistically significant result as "meaningfully large".

TrynnaDoStat

Posted 2010-11-15T18:46:37.113

Reputation: 5 117

3

Specifically in psychology, and even more so in marketing, the technique of partial least squares (PLS) is used to "fit" structural equation models and path models, despite being deficient on almost any imaginable performance metric. See McIntosh, Edwards and Antonakis (2014 ORM), Rönkkö, McIntosh and Antonakis (2015 PID) and/or Rönkkö, McIntosh, Antonakis and Edwards (2016 JOM) for detailed treatment, including spelling out some natural requirements like bias and consistency, and demonstration of how PLS fails them (compared to other regression-type methods such as Bollen's model-implied instrumental variables). (I don't know how substitutable the papers are for one another, though; they must cover very similar topics, but may be aimed at somewhat different audiences.)

StasK

Posted 2010-11-15T18:46:37.113

Reputation: 24 230

Didn't you just list a different paper from the chemometrics literature somewhere? It might be worth adding to the list. – gung – 2016-08-11T20:00:31.303

Good point -- it was @amoeba who was familiar with the chemometrics literature; I am not qualified to speak on that. My understanding is that it is used with much less fanfare there than in psych and marketing, just to honestly reduce dimensionality rather than to claim that you found some underlying factors and quantified them perfectly. – StasK – 2016-08-11T21:08:25.157

3

Not paying attention to levels of measurement, and treating polytomous nominal scales as though they were ordinal, interval, or ratio scales (Ouch).

StatisticsDoc Consulting

Posted 2010-11-15T18:46:37.113

Reputation: 468

2

Using technical replication instead of true replication, and similarly, using MSE as the denominator in a nested ANOVA F-statistic.

Rik

Posted 2010-11-15T18:46:37.113

Reputation: 130

2

Completely forgetting about checking calibration or normalization when datasets come from different sensors, different times, different observers.

Laurent Duval

Posted 2010-11-15T18:46:37.113

Reputation: 1 349