How to interpret a QQ plot

130

182

I am working with a small dataset (21 observations) and have the following normal QQ plot in R:

enter image description here

Seeing that the plot does not support normality, what could I infer about the underlying distribution? It seems to me that a distribution more skewed to the right would be a better fit, is that right? Also, what other conclusions can we draw from the data?

JohnK

Posted 2014-06-05T10:44:37.823

Reputation: 11 114

You're correct that it indicates right skewness. I'll try to locate some of the posts on interpreting QQ plots. – Glen_b – 2014-06-05T10:56:25.503

You don't have to conclude; you just need to decide what to try next. Here I would consider square rooting or logging the data. – Nick Cox – 2014-06-05T12:56:51.167

Tukey's Three-Point Method works very well for using Q-Q plots to help you identify ways to re-express a variable in a way that makes it approximately normal. For instance, picking the penultimate points in the tails and the middle point in this graphic (which I estimate to be $(-1.5,2)$, $(1.5,220)$, and $(0,70)$), you will easily find that the square root comes close to linearizing them. Thus you can infer that the underlying distribution is approximately square root normal. – whuber – 2014-06-05T13:09:13.007
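
The three-point check described in the comment above can be tried numerically. This is only a minimal sketch: it uses the three points as estimated in that comment, and `slope_ratio` is a hypothetical helper written for this illustration, not a function from any package. For each candidate re-expression it compares the slope of the upper segment with the slope of the lower segment; a ratio near 1 means the three points are nearly collinear.

```r
# Three points read off the Q-Q plot, as estimated in the comment:
# (theoretical quantile, data value).
q <- c(-1.5, 0, 1.5)
y <- c(2, 70, 220)

# Hypothetical helper: ratio of the upper-segment slope to the lower-segment
# slope after re-expressing y with f. A ratio of 1 means the points are collinear.
slope_ratio <- function(f) {
  v <- f(y)
  upper <- (v[3] - v[2]) / (q[3] - q[2])
  lower <- (v[2] - v[1]) / (q[2] - q[1])
  upper / lower
}

ratios <- c(raw  = slope_ratio(identity),
            sqrt = slope_ratio(sqrt),
            log  = slope_ratio(log))
print(round(ratios, 2))

# The re-expression whose ratio is closest to 1 comes closest to linearizing.
best <- names(which.min(abs(ratios - 1)))
print(best)
```

With these three points the ratio is about 2.21 for the raw values, 0.93 for the square root and 0.32 for the log, which agrees with the comment's conclusion that the square root comes close to linearizing them.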

@Glen_b The answer to my question has some information: http://stats.stackexchange.com/questions/71065/what-distribution-to-use-for-this-qq-plot and the link in the answer has another good source: stats.stackexchange.com/questions/52212/qq-plot-does-not-match-histogram – tpg2114 – 2014-06-05T16:26:56.287

What of this? Does the QQ plot show normally distributed data? enter image description here – David – 2015-03-02T09:17:43.477

This question is one of our best, and I'm awarding a bounty to both answers as soon as the system lets me. – shadowtalker – 2016-04-13T15:18:22.460

Answers

225

If the values lie along a line the distribution has the same shape (up to location and scale) as the theoretical distribution we have supposed.
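
To make the construction concrete, here is a minimal sketch of a normal Q-Q plot built by hand. The data are simulated (the seed, mean and standard deviation are arbitrary choices for illustration, not values from the question); the plotting positions come from `ppoints()`, which is also what `qqnorm()` uses.

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
x <- rnorm(21, mean = 10, sd = 3) # simulated data, same size as in the question

# Build the normal Q-Q plot by hand: sorted data against normal quantiles
# evaluated at the plotting positions returned by ppoints().
theoretical <- qnorm(ppoints(length(x)))
observed    <- sort(x)

plot(theoretical, observed,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
qqline(x)                         # reference line through the quartiles

# qqnorm() computes the same coordinates (it keeps the original data order).
qq <- qqnorm(x, plot.it = FALSE)
same_points <- isTRUE(all.equal(sort(qq$x), theoretical)) &&
  isTRUE(all.equal(sort(qq$y), observed))
print(same_points)
```

Because the simulated data really are normal, the points should scatter around the reference line, with the location and scale of the data showing up only in the y-axis values.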

Local behaviour: With sorted sample values on the y-axis and (approximate) expected quantiles on the x-axis, we can see how the values in some section of the plot differ locally from the overall linear trend by checking whether they are more or less concentrated than the theoretical distribution would suppose in that section of the plot:

sections out of four Q-Q plots

As we see, points that are less concentrated than supposed increase more rapidly, and points that are more concentrated than supposed increase less rapidly, than an overall linear relation would suggest; in the extreme cases these correspond to a gap in the density of the sample (shown as a near-vertical jump) or a spike of constant values (values aligned horizontally). This allows us to spot a heavy tail or a light tail, and hence skewness greater or smaller than that of the theoretical distribution, and so on.
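
The tail patterns described above can be reproduced with simulated data. The sketch below is illustrative only: the three distributions, the seed and the sample size are arbitrary choices, and `end_residuals` is a hypothetical helper that measures how far the smallest and largest observations sit from the quartile-based reference line that `qqline()` draws by default.

```r
set.seed(2)   # arbitrary seed
n <- 2000     # large sample so the shapes are not hidden by noise

samples <- list(
  "Light tails (uniform)"    = runif(n),
  "Heavy tails (t, 3 df)"    = rt(n, df = 3),
  "Right skew (exponential)" = rexp(n)
)

# Hypothetical helper: distance of the smallest and largest observation from
# the line through the first and third quartiles of the Q-Q plot.
end_residuals <- function(x) {
  tq <- qnorm(ppoints(length(x)))
  qs <- quantile(x, c(0.25, 0.75), names = FALSE)
  slope <- diff(qs) / diff(qnorm(c(0.25, 0.75)))
  intercept <- qs[1] - slope * qnorm(0.25)
  c(lower = min(x) - (intercept + slope * min(tq)),
    upper = max(x) - (intercept + slope * max(tq)))
}

res <- sapply(samples, end_residuals)
print(round(res, 2))

op <- par(mfrow = c(1, 3))
for (nm in names(samples)) {
  qqnorm(samples[[nm]], main = nm)
  qqline(samples[[nm]])
}
par(op)
```

Reading the signs: a lower end below the line together with an upper end above it means both tails stretch further than the normal would (heavy tails); the reverse means both tails are shorter (light tails); both ends above the line means a short left tail and a long right tail (right skew).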

Overall appearance:

Here's what QQ-plots look like (for particular choices of distribution) on average:

enter image description here

But randomness tends to obscure things, especially with small samples:

enter image description here

Note that at $n=21$ the results may be much more variable than shown there - I generated several such sets of six plots and chose a 'nice' set where you could kind of see the shape in all six plots at the same time. Sometimes straight relationships look curved, curved relationships look straight, heavy-tails just look skew, and so on - with such small samples, often the situation may be much less clear:

enter image description here

It's possible to discern more features than those (such as discreteness, for one example), but with $n=21$, even such basic features may be hard to spot; we shouldn't try to 'over-interpret' every little wiggle. As sample sizes become larger, generally speaking the plots 'stabilize' and the features become more clearly interpretable rather than representing noise. [With some very heavy-tailed distributions, the rare large outlier might prevent the picture stabilizing nicely even at quite large sample sizes.]
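
One way to get a feel for how much of the wiggle at $n=21$ is just noise is to look at samples that are known to be normal. The sketch below is an illustration under assumptions: the seed and the number of replicates are arbitrary, and `qq_cor` (the correlation between sorted data and normal quantiles) is a hypothetical helper used only as a rough one-number summary of straightness.

```r
set.seed(3)  # arbitrary seed

# Correlation between sorted data and normal quantiles: 1 means a perfectly
# straight Q-Q plot. (Hypothetical helper, used here only as a rough summary.)
qq_cor <- function(x) cor(sort(x), qnorm(ppoints(length(x))))

# All samples below are truly normal; only the sample size differs.
cor_small <- replicate(1000, qq_cor(rnorm(21)))
cor_large <- replicate(1000, qq_cor(rnorm(500)))

print(round(quantile(cor_small, c(0.05, 0.5, 0.95)), 3))
print(round(quantile(cor_large, c(0.05, 0.5, 0.95)), 3))

# Nine normal samples of size 21: the visible wiggles are pure noise.
op <- par(mfrow = c(3, 3), mar = c(2, 2, 1, 1))
for (i in 1:9) {
  x <- rnorm(21)
  qqnorm(x, main = "", xlab = "", ylab = "")
  qqline(x)
}
par(op)
```

The straightness summary varies far more at $n=21$ than at $n=500$, so an observed plot is better judged against a panel like this one than against an idealized straight line.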

You may also find the suggestion here useful when trying to decide how much you should worry about a particular amount of curvature or wiggliness.

A more suitable guide for interpretation in general would also include displays at smaller and larger sample sizes.

Glen_b

Posted 2014-06-05T10:44:37.823

Reputation: 190 866

This is a very practical guide, thank you very much for gathering all that information. – JohnK – 2014-06-05T12:57:03.940

why are the scales of the axes different? – Macond – 2014-12-01T07:35:35.680

Because they're data from a variety of different distributions which have different means and standard deviations (the x-axis is based on quantiles of standard normals, of course). The axis values are of no consequence, though, since normality doesn't depend on the mean or standard deviation. You could remove them without changing the point here at all. – Glen_b – 2014-12-01T09:10:08.987

I understand that it is the shape and type of deviation from linearity that matters here, but it still looks odd that both axes are labeled " ... quantiles " and one axis goes as 0.2 0.4 0.6 and the other goes as -2 -1 0 1 2. Again, it looks ok that some data points are within the middle 40% of a theoretical distribution, but how can they be distributed between 3% of their own distribution, as the y-axis on your lower-right-most plot suggests? – Macond – 2014-12-02T15:27:23.887

Ah I see your confusion now. The two axes do not display quantiles of the same quantity, but sample quantiles plotted against quantiles of a standard normal (population mean=0, population sd=1). We do not for a moment expect the y-axis variable to be standard normal. If the original data are normal with (population) mean $\mu$, and sd $\sigma$, then the plot should tend to look like (some noise about) a straight line, with intercept $\mu$ and slope $\sigma$. The values on the axes are not expected to be similar unless it happens that $\mu\approx 0,\sigma\approx 1$. – Glen_b – 2014-12-02T22:22:09.740
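
The intercept-and-slope statement in the comment above can be checked by simulation. A minimal sketch, with illustrative values of `mu` and `sigma` that are assumptions for the example and do not come from the question:

```r
set.seed(4)                 # arbitrary seed
mu <- 50; sigma <- 10       # illustrative values, not from the question
x <- rnorm(1000, mean = mu, sd = sigma)

# Coordinates of the normal Q-Q plot, without drawing it.
qq <- qqnorm(x, plot.it = FALSE)

# Least-squares line through the Q-Q points.
fit <- lm(y ~ x, data = data.frame(x = qq$x, y = qq$y))
est <- coef(fit)
print(round(est, 2))        # intercept should be near mu, slope near sigma
```

The x-axis runs over standard normal quantiles (roughly -3 to 3 here) while the y-axis runs over the data values (roughly 20 to 80), which is why the two sets of axis labels are not expected to match.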

@Macond The y-axis shows the raw values of the data, not their quantiles. I agree that standardizing the y-axis would make things much clearer, and I have no idea why R doesn't do this by default. Could someone shed some light on this? – Gordon Gustafson – 2015-02-22T19:50:25.073

@Glen_b Could you post the R code used to generate those plots? It would be very useful for those of us trying to dig deeper into how qq plots work. Perhaps it should be at the end to avoid interrupting the flow of your excellent answer. – Gordon Gustafson – 2015-02-22T19:53:36.750

@GordonGustafson It's more than eight months since I posted that answer; I sometimes can grab the code for these things for a few days, but rarely longer; I generally see the ideas as more important than the specific code for any illustration. I should be able to generate something quite close to them. Were you more interested in the first set of plots (the clean-looking "ideal" ones) or the next two sets (nice-looking real data and not-so-nice-looking real data)? – Glen_b – 2015-02-23T03:41:13.930

@Glen_b Code for the ideal plots would be quite sufficient. The reason I ask is because I struggled with the same thing Macond brought up: the values on the y-axis appear to be gibberish. Now that I've grasped that those 'Sample Quantiles' correspond to the raw data rather than any sort of quantile I can see why it isn't necessary to analyze them, but they would be useful for explaining the heavy vs. light tails plots if the standard deviation were listed (the light tails plot probably spans around 1.5 standard deviations vs around 10 for the heavy tails plot). – Gordon Gustafson – 2015-02-23T04:05:37.637

@Glen_b I suppose I'm asking for the code to better explain that concept, specifically to find the standard deviations for the light/heavy-tailed plots, so perhaps the code itself isn't necessary. – Gordon Gustafson – 2015-02-23T04:08:53.060

@GordonGustafson in respect of your first comment to Macond there's a very good reason why you don't standardize the data -- because a QQ plot is a display of the data! It's designed to show information in the data you supply to the function (it would make as much sense to standardize the data you supply to a boxplot or a histogram). If you transform it, it's no longer a display of the data (though the shape in the plot may be similar, you no longer show the location or scale on the plot). I'm not sure what it is you think would be clearer in a standardized plot - can you clarify? – Glen_b – 2015-02-23T04:23:12.743
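
The point about standardizing can be illustrated directly: standardizing changes the numbers on the y-axis but not the pattern of points. The sketch below uses simulated right-skewed data (the exponential distribution, its rate and the seed are arbitrary assumptions for the example).

```r
set.seed(5)                      # arbitrary seed
x <- rexp(21, rate = 1 / 40)     # simulated right-skewed data (illustrative)
z <- as.numeric(scale(x))        # standardized copy: (x - mean) / sd

qq_x <- qqnorm(x, plot.it = FALSE)
qq_z <- qqnorm(z, plot.it = FALSE)

# Same theoretical quantiles, and the y values differ only by a linear map,
# so the pattern of points is identical; only the y-axis labels change.
same_x     <- isTRUE(all.equal(qq_x$x, qq_z$x))
same_shape <- isTRUE(all.equal(cor(qq_x$x, qq_x$y), cor(qq_z$x, qq_z$y)))
print(c(same_x = same_x, same_shape = same_shape))

op <- par(mfrow = c(1, 2))
qqnorm(x, main = "Raw data");          qqline(x)
qqnorm(z, main = "Standardized data"); qqline(z)
par(op)
```

The two panels look the same apart from the y-axis labels; the raw version additionally shows the location and scale of the data, which is the information the standardized version discards.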

@Glen_b If both axes were standardized (or neither axis was standardized!) it would be easy to compare the expected distribution on the x-axis with the actual distribution on the y-axis. Since they're in different 'units', it feels like I can't think about any of the specific axis values in a meaningful way (without converting one of them using the standard deviation). Obviously your plots should continue to use this standard (no pun intended...) convention, but everything made sense to me once I estimated the heavy-tailed standard deviation and imagined axes with the same units. – Gordon Gustafson – 2015-02-23T04:40:33.567

@Glen_b It's conceivable that I'm the only one who grasped some of the concepts through that line of reasoning, but I think the fact that the axes use different units deserves a mention for newbies like me. :) – Gordon Gustafson – 2015-02-23T04:44:53.283

@Glen_b - It would be great if you could answer a similar query – Elizabeth Susan Joseph – 2015-03-01T07:09:31.580

@GordonGustafson thank you for your clarification, axes are the data and theoretical dist. values and points are the quantiles, indeed. – Macond – 2015-03-09T14:17:03.403

Thanks for the amazing answer! I'm hopelessly confused by the plots - it seems to me that the light tailed and heavy tailed plots are not quite right? qqplot(rnorm(1000), runif(1000)) plots something similar to the light tailed plot you have, but surely uniform is heavy tailed compared to a Gaussian distribution? Thanks! – Ziyao Wei – 2015-05-13T22:58:24.433

@ZiyaoWei No, a uniform really has very light tails -- arguably, no tails at all. Everything is within 2 MADs of the center. The first paragraph of this answer gives a clear, general, way to think about what 'heavier-tailed' means. – Glen_b – 2015-05-14T03:48:38.383
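
The "within 2 MADs" remark can be checked by simulation. A minimal sketch, assuming MAD here means the unscaled median absolute deviation (for a uniform the mean absolute deviation gives the same value); the seed and sample size are arbitrary, and `max_dev_in_mads` is a hypothetical helper.

```r
set.seed(6)     # arbitrary seed
n <- 1e5

# Largest distance from the median, in units of the (unscaled) median
# absolute deviation. constant = 1 turns off R's default normal-consistency
# scaling of mad().
max_dev_in_mads <- function(x) max(abs(x - median(x))) / mad(x, constant = 1)

u <- max_dev_in_mads(runif(n))   # uniform: about 2
g <- max_dev_in_mads(rnorm(n))   # normal: much larger, and grows with n
print(round(c(uniform = u, normal = g), 2))

# The comparison from the question in the comment: uniform data on the y-axis
# against normal data on the x-axis gives the light-tailed S-shape.
qqplot(rnorm(1000), runif(1000))
```

The uniform sample never strays much beyond 2 MADs from its center, whereas the normal sample reaches several times further, which is the sense in which the uniform is the lighter-tailed of the two.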

49

I made a Shiny app to help interpret normal QQ plots. Try this link.

In this app, you can adjust the skewness, tailedness (kurtosis) and modality of the data and see how the histogram and QQ plot change. Conversely, you can start from a given QQ plot pattern and use the app to work out what skewness, tailedness and modality would produce it.

For further details, see the documentation therein.


I realized that I don't have enough free space to provide this app online. As requested, I will provide all three code chunks here: sample.R, server.R and ui.R. Those who are interested in running this app can load these files into RStudio and run it on their own PC.

The sample.R file:

# Compute the positive part of a real number x, which is $\max(x, 0)$.
positive_part <- function(x) {ifelse(x > 0, x, 0)}

# This function generates n data points from some unimodal population.
# Input: ----------------------------------------------------
# n: sample size;
# mu: the mode of the population, default value is 0.
# skewness: the parameter that reflects the skewness of the distribution, note it is not
#           the exact skewness defined in statistics textbook, the default value is 0.
# tailedness: the parameter that reflects the tailedness of the distribution, note it is
#             not the exact kurtosis defined in textbook, the default value is 0.

# When all arguments take their default values, the data will be generated from standard 
# normal distribution.

random_sample <- function(n, mu = 0, skewness = 0, tailedness = 0){
  sigma = 1

  # The sampling scheme resembles rejection sampling: each proposed data point is rejected or
  # accepted with a probability determined by the `skewness` and `tailedness` inputs.
  reject_skewness <- function(x){
      scale = 1
      # If `skewness` > 0 (right-skewed data), small values of x are rejected with higher
      # probability. plogis(a) equals exp(a) / (1 + exp(a)) but stays finite where exp(a)
      # would overflow to Inf and make the ratio NaN.
      plogis(-scale * skewness * x)
  }

  reject_tailedness <- function(x){
      scale = 1
      # If `tailedness` < 0 (light-tailed data), values of x that are large in absolute value
      # are rejected with higher probability.
      plogis(-scale * tailedness * abs(x))
  }

  # w is a second control on tailedness: it is the weight given to the t-distributed (5 df) term
  # in the proposal, so a larger w gives heavier tails. It is 0 whenever `tailedness` <= 0.
  w = positive_part((1 - exp(-0.5 * tailedness)))/(1 + exp(-0.5 * tailedness))

  filter <- function(x){
    # A proposed data point is accepted only if it satisfies the following condition, which is
    # how the skewness and tailedness of the data are controlled: each point is rejected with
    # probability reject_tailedness(x) * reject_skewness(x).
    accept <- runif(length(x)) > reject_tailedness(x) * reject_skewness(x)
    x[accept]
  }

  result <- filter(mu + sigma * ((1 - w) * rnorm(n) + w * rt(n, 5)))
  # Keep generating data points until the length of data vector reaches n.
  while (length(result) < n) {
    result <- c(result, filter(mu + sigma * ((1 - w) * rnorm(n) + w * rt(n, 5))))
  }
  result[1:n]
}

multimodal <- function(n, Mu, skewness = 0, tailedness = 0) {
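  # Deal with the bimodal case: each observation is assigned one of the modes in `Mu` at random
  # (with equal probability) and is then shifted by that mode.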
  mumu <- as.numeric(Mu %*% rmultinom(n, 1, rep(1, length(Mu))))
  mumu + random_sample(n, skewness = skewness, tailedness = tailedness)
}

The server.R file:

library(shiny)
# Need 'ggplot2' package to get a better aesthetic effect.
library(ggplot2)

# The 'sample.R' source code is used to generate data to be plotted, based on the input skewness, 
# tailedness and modality. For more information, see the source code in 'sample.R' code.
source("sample.R")

shinyServer(function(input, output) {
  # We generate 10000 data points from the distribution which reflects the specification of skewness,
  # tailedness and modality. 
  n = 10000

  # 'scale' is a parameter that controls the skewness and tailedness.
  scale = 1000

  # Wrapping the sampling in `reactive` means the data are generated only once per input change and
  # shared by both plots. The generated sample is retrieved later by calling `data()`.
  data <- reactive({
    # For `Unimodal` choice, we fix the mode at 0.
    if (input$modality == "Unimodal") {mu = 0}

    # For `Bimodal` choice, we fix the two modes at -2 and 2.
    if (input$modality == "Bimodal") {mu = c(-2, 2)}

    # Details will be explained in `sample.R` file.
    sample1 <- multimodal(n, mu, skewness = scale * input$skewness, tailedness = scale * input$kurtosis)
    data.frame(x = sample1)})

  output$histogram <- renderPlot({
    # Plot the histogram.
    ggplot(data(), aes(x = x)) + 
      geom_histogram(aes(y = ..density..), binwidth = .5, colour = "black", fill = "white") + 
      xlim(-6, 6) +
      # Overlay the density curve.
      geom_density(alpha = .5, fill = "blue") + ggtitle("Histogram of Data") + 
      theme(plot.title = element_text(lineheight = .8, face = "bold"))
  })

  output$qqplot <- renderPlot({
    # Plot the QQ plot.
    ggplot(data(), aes(sample = x)) + stat_qq() + ggtitle("QQplot of Data") + 
      theme(plot.title = element_text(lineheight=.8, face = "bold"))
    })
})

Finally, the ui.R file:

library(shiny)

# Define UI for application that helps students interpret the pattern of (normal) QQ plots. 
# By using this app, we can show students the different patterns of QQ plots (and the histograms,
# for completeness) for different types of data distributions, for example left-skewed heavy-tailed
# data.

# This app can be (and is encouraged to be) used in a reversed way, namely, show the QQ plot to the 
# students first, then tell them based on the pattern of the QQ plot, the data is right skewed, bimodal,
# heavy-tailed, etc.


shinyUI(fluidPage(
  # Application title
  titlePanel("Interpreting Normal QQ Plots"),

  sidebarLayout(
    sidebarPanel(
      # The first slider can control the skewness of input data. "-1" indicates the most left-skewed 
      # case while "1" indicates the most right-skewed case.
      sliderInput("skewness", "Skewness", min = -1, max = 1, value = 0, step = 0.1, ticks = FALSE),

      # The second slider controls the tailedness of the input data. "-1" indicates the lightest-tailed
      # case while "1" indicates the heaviest-tailed case.
      sliderInput("kurtosis", "Tailedness", min = -1, max = 1, value = 0, step = 0.1, ticks = FALSE),

      # This selectbox allows user to choose the number of modes of data, two options are provided:
      # "Unimodal" and "Bimodal".
      selectInput("modality", label = "Modality", 
                  choices = c("Unimodal" = "Unimodal", "Bimodal" = "Bimodal"),
                  selected = "Unimodal"),
      br(),
      # The following helper information will be shown on the user interface to give necessary
      # information to help users understand sliders.
      helpText(p("The skewness of data is controlled by moving the", strong("Skewness"), "slider,",
               "the left side means left skewed while the right side means right skewed."),
               p("The tailedness of data is controlled by moving the", strong("Tailedness"), "slider,",
                 "the left side means light tailed while the right side means heavy tailed."),
               p("The modality of data is controlled by selecting the modality from", strong("Modality"),
                 "select box.")
               )
  ),

  # The main panel outputs two plots. One plot is the histogram of data (with the nonparametric density
  # curve overlaid); for a better visualization, the range of the x-axis is restricted to -6 to 6, so
  # part of the data will not be shown when heavy-tailed input is chosen. The other plot is the
  # QQ plot of data; by convention, the x-axis shows the theoretical quantiles of the standard normal
  # distribution and the y-axis shows the sample quantiles of the data.
  mainPanel(
    plotOutput("histogram"),
    plotOutput("qqplot")
  )
)
)
)

Zhanxiong

Posted 2014-06-05T10:44:37.823

Reputation: 3 110

Looks like your Shiny app's capacity has maxed out. Maybe you could just provide the code – rsoren – 2016-01-21T19:38:18.123

@rsoren added, hope it helps and I am looking forward to hearing suggestions. – Zhanxiong – 2016-02-17T18:59:37.053

Nice work, @Zhanxiong I really appreciate it :) – Vilmantas – 2017-01-06T19:01:21.820

Very nice! I would suggest also adding options for changing the sample size and a degree of randomness. – Itamar – 2017-05-23T06:38:53.673

Link is not available !!!! @Zhanxiong – Alireza Sanaee – 2017-09-01T15:45:57.713

It seems that the link fails to respond after a limited number of clicks every month. That's the reason I pasted the source code here (as requested by other users who encountered the same issue as you). You can paste the files into RStudio and run them on your own PC (after loading the required packages). – Zhanxiong – 2017-09-01T19:24:49.480