## A Dialog Between a Teacher and a Thoughtful Student

Humbly submitted in the belief that not enough crayons have been used so far in this thread. A brief illustrated synopsis appears at the end.

*Student*: What does a p-value mean? A lot of people seem to agree it's the chance we will "see a sample mean greater than or equal to" a statistic or it's "the probability of observing this outcome ... given the null hypothesis is true" or where "my sample's statistic fell on [a simulated] distribution" and even "the probability of observing a test statistic at least as large as the one calculated assuming the null hypothesis is true".

*Teacher*: Properly understood, all those statements are correct in many circumstances.

*Student*: I don't see how most of them are relevant. Didn't you teach us that we have to state a null hypothesis $H_0$ and an alternative hypothesis $H_A$? How are they involved in these ideas of "greater than or equal to" or "at least as large" or the very popular "more extreme"?

*Teacher*: Because it can seem complicated in general, would it help for us to explore a concrete example?

*Student*: Sure. But please make it a realistic but simple one if you can.

*Teacher*: This theory of hypothesis testing historically began with the need of astronomers to analyze observational errors, so how about starting there. I was going through some old documents one day where a scientist described his efforts to reduce the measurement error in his apparatus. He had taken a lot of measurements of a star in a known position and recorded their displacements ahead of or behind that position. To visualize those displacements, he drew a histogram that--when smoothed a little--looked like this one.

*Student*: I remember how histograms work: the vertical axis is labeled "Density" to remind me that the relative frequencies of the measurements are represented by *area* rather than height.

*Teacher*: That's right. An "unusual" or "extreme" value would be located in a region with pretty small area. Here's a crayon. Do you think you could color in a region whose area is just one-tenth the total?

*Student*: Sure; that's easy. [Colors in the figure.]

*Teacher*: Very good! That looks like about 10% of the area to me. Remember, though, that the only areas in the histogram that matter are those between vertical lines: they represent the *chance* or *probability* that the displacement would be located between those lines *on the horizontal axis.* That means you needed to color all the way down to the bottom and that would be over half the area, wouldn't it?

*Student*: Oh, I see. Let me try again. I'm going to want to color in where the curve is really low, won't I? It's lowest at the two ends. Do I have to color in just one area or would it be ok to break it into several parts?

*Teacher*: Using several parts is a smart idea. Where would they be?

*Student* (pointing): Here and here. Because this crayon isn't very sharp, I used a pen to show you the lines I'm using.

*Teacher*: Very nice! Let me tell you the rest of the story. The scientist made some improvements to his device and then he took additional measurements. He wrote that the displacement of the first one was only $0.1$, which he thought was a good sign, but being a careful scientist he proceeded to take more measurements as a check. Unfortunately, those other measurements are lost--the manuscript breaks off at this point--and all we have is that single number, $0.1$.

*Student*: That's too bad. But isn't that much better than the wide spread of displacements in your figure?

*Teacher*: That's the question I would like you to answer. To start with, what should we posit as $H_0$?

*Student*: Well, a sceptic would wonder whether the improvements made to the device had any effect at all. The burden of proof is on the scientist: he would want to show that the sceptic is wrong. That makes me think the null hypothesis is kind of bad for the scientist: it says that all the new measurements--including the value of $0.1$ we know about--ought to behave as described by the first histogram. Or maybe even worse than that: they might be even more spread out.

*Teacher*: Go on, you're doing well.

*Student*: And so the alternative is that the new measurements would be *less* spread out, right?

*Teacher*: Very good! Could you draw me a picture of what a histogram with less spread would look like? Here's another copy of the first histogram; you can draw on top of it as a reference.

*Student* (drawing): I'm using a pen to outline the new histogram and I'm coloring in the area beneath it. I have made it so most of the curve is close to zero on the horizontal axis and so most of its area is near a (horizontal) value of zero: that's what it means to be less spread out or more precise.

*Teacher*: That's a good start. But remember that a histogram showing *chances* should have a total area of $1$. The total area of the first histogram therefore is $1$. How much area is inside your new histogram?

*Student*: Less than half, I think. I see that's a problem, but I don't know how to fix it. What should I do?

*Teacher*: The trick is to make the new histogram *higher* than the old so that its total area is $1$. Here, I'll show you a computer-generated version to illustrate.

*Student*: I see: you stretched it out vertically so its shape didn't really change but now the red area and gray area (including the part under the red) are the same amounts.
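The vertical stretching here is just the usual rescaling of a density: squeezing a density $f$ horizontally by a factor $s$ requires dividing by $s$ to keep the total area at $1$, i.e. $f_s(x) = f(x/s)/s$. A quick numerical check (using a standard normal curve purely as a stand-in for the null histogram, since the actual densities in the figures are not specified) confirms that both curves enclose unit area:

```python
import numpy as np

def bell(x):
    """An illustrative bell-shaped density (standard normal)."""
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

x = np.linspace(-6, 6, 2001)
dx = x[1] - x[0]

s = 0.5                          # a "less spread out" alternative
narrow = bell(x / s) / s         # squeeze horizontally, stretch vertically

area_null = bell(x).sum() * dx
area_narrow = narrow.sum() * dx
print(area_null, area_narrow)    # both very close to 1
```

The narrower curve is twice as tall at its peak, which is exactly the stretching the Teacher's computer-generated plot shows.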

*Teacher*: Right. You are looking at a picture of the null hypothesis (in blue, spread out) and *part* of the alternative hypothesis (in red, with less spread).

*Student*: What do you mean by "part" of the alternative? Isn't it just *the* alternative hypothesis?

*Teacher*: Statisticians and grammar don't seem to mix. :-) Seriously, what they mean by a "hypothesis" usually is a whole big set of possibilities. Here, the alternative (as you stated so well before) is that the measurements are "less spread out" than before. But *how much less*? There are many possibilities. Here, let me show you another. I drew it with yellow dashes. It's in between the previous two.

*Student*: I see: you can have different amounts of spread but you don't know in advance how much the spread will really be. But why did you make the funny shading in this picture?

*Teacher*: I wanted to highlight where and how the histograms differ. I shaded them in gray where the alternative histograms are *lower* than the null and in red where the alternatives are *higher*.

*Student*: Why would that matter?

*Teacher*: Do you remember how you colored the first histogram in both the tails? [Looking through the papers.] Ah, here it is. Let's color this picture in the same way.

*Student*: I remember: those are the extreme values. I found the places where the null density was as small as possible and colored in 10% of the area there.

*Teacher*: Tell me about the alternatives in those extreme areas.

*Student*: It's hard to see, because the crayon covered it up, but it looks like there's almost no chance for any alternative to be in the areas I colored. Their histograms are right down against the value axis and there's no room for any area beneath them.

*Teacher*: Let's continue that thought. If I told you, hypothetically, that a measurement had a displacement of $-2$, and asked you to pick which of these three histograms was the one it most likely came from, which would it be?

*Student*: The first one--the blue one. It's the most spread out and it's the only one where $-2$ seems to have any chance of occurring.

*Teacher*: And what about the value of $0.1$ in the manuscript?

*Student*: Hmmm... that's a different story. All three histograms are pretty high above the ground at $0.1$.

*Teacher*: OK, fair enough. But suppose I told you the value was somewhere near $0.1$, like between $0$ and $0.2$. Does that help you read some probabilities off of these graphs?

*Student*: Sure, because I can use areas. I just have to estimate the areas underneath each curve between $0$ and $0.2$. But that looks pretty hard.

*Teacher*: You don't need to go that far. Can you just tell which area is the largest?

*Student*: The one beneath the tallest curve, of course. All three areas have the same base, so the taller the curve, the more area there is between it and the base. That means the tallest histogram--the one I drew, with the red dashes--is the likeliest one for a displacement of $0.1$. I think I see where you're going with this, but I'm a little concerned: don't I have to look at *all* the histograms for *all* the alternatives, not just the one or two shown here? How could I possibly do that?

*Teacher*: You're good at picking up patterns, so tell me: as the measurement apparatus is made more and more precise, what happens to its histogram?

*Student*: It gets narrower--oh, and it has to get taller, too, so its total area stays the same. That makes it pretty hard to compare the histograms. The alternative ones are *all* higher than the null right at $0$, that's obvious. But at other values sometimes the alternatives are higher and sometimes they are lower! For example, [pointing at a value near $3/4$], right here *my* red histogram is the lowest, the yellow histogram is the highest, and the original null histogram is between them. But over on the right the null is the highest.

*Teacher*: In general, comparing histograms is a complicated business. To help us do it, I have asked the computer to make another plot: it has *divided* each of the alternative histogram heights (or "densities") by the null histogram height, creating values known as "likelihood ratios." As a result, a value greater than $1$ means the alternative is more likely, while a value less than $1$ means the alternative is less likely. It has drawn yet one more alternative: it's more spread out than the other two, but still less spread out than the original apparatus was.

*Teacher* (continuing): Could you show me where the alternatives tend to be more likely than the null?

*Student* (coloring): Here in the middle, obviously. And because these are not histograms anymore, I guess we should be looking at heights rather than areas, so I'm just marking a range of values on the horizontal axis. But how do I know how much of the middle to color in? Where do I stop coloring?
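The likelihood-ratio plot can be reproduced numerically. The sketch below assumes, purely for illustration, a standard normal null and normal alternatives with smaller standard deviations (the spreads $0.5$ and $0.75$ are guesses standing in for the red and yellow curves), and finds where each ratio exceeds $1$:

```python
import numpy as np
from scipy.stats import norm

x = np.linspace(-3, 3, 601)
null_pdf = norm.pdf(x)                 # the spread-out null (blue)

cutoffs = {}
for s in (0.5, 0.75):                  # hypothetical alternatives (red, yellow)
    ratio = norm.pdf(x, scale=s) / null_pdf  # likelihood ratio at each x
    favored = x[ratio > 1]                   # where the alternative beats the null
    cutoffs[s] = favored.max()
    print(f"sd={s}: ratio > 1 for |x| < {cutoffs[s]:.2f}")
```

As expected, the more concentrated the alternative, the narrower the central band of values favoring it: a band the Student is about to color in.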

*Teacher*: There's no firm rule. It all depends on how we plan to use our conclusions and how fierce the sceptics are. But sit back and think about what you have accomplished: you now realize that outcomes with large likelihood ratios are evidence *for* the alternative and outcomes with small likelihood ratios are evidence *against* the alternative. What I will ask you to do is to color in an area that, insofar as is possible, has a small chance of occurring under the null hypothesis and a relatively large chance of occurring under the alternatives. Going back to the first diagram you colored, way back at the start of our conversation, you colored in the two tails of the null because they were "extreme." Would they still do a good job?

*Student*: I don't think so. Even though they were pretty extreme and rare under the null hypothesis, they are practically impossible for any of the alternatives. If my new measurement were, say, $3.0$, I think I would side with the sceptic and deny that any improvement had occurred, even though $3.0$ was an unusual outcome in any case. I want to change that coloring. Here--let me have another crayon.

*Teacher*: What does that represent?

*Student*: We started out with you asking me to draw in just 10% of the area under the original histogram--the one describing the null. So now I drew in 10% of the area where the alternatives seem more likely to be occurring. I think that when a new measurement is in that area, it's telling us we ought to believe the alternative.

*Teacher*: And how should the sceptic react to that?

*Student*: A sceptic never has to admit he's wrong, does he? But I think his faith should be a little shaken. After all, we arranged it so that although a measurement *could* be inside the area I just drew, it only has a 10% chance of being there when the null is true. And it has a larger chance of being there when the alternative is true. I just can't tell you *how* much larger that chance is, because it would depend on how much the scientist improved the apparatus. I just know it's larger. So the evidence would be against the sceptic.

*Teacher*: All right. Would you mind summarizing your understanding so that we're perfectly clear about what you have learned?

*Student*: I learned that to compare alternative hypotheses to null hypotheses, we should compare their histograms. We divide the densities of the alternatives by the density of the null: that's what you called the "likelihood ratio." To make a good test, I should pick a small number like 10% or whatever might be enough to shake a sceptic. Then I should find values where the likelihood ratio is as high as possible and color them in until 10% (or whatever) has been colored.
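This recipe can be carried out numerically. The sketch below again assumes a standard normal null and a single normal alternative with half its spread (both assumptions are illustrative); it "colors in" narrow cells in order of decreasing likelihood ratio until the colored region has probability $0.10$ under the null:

```python
import numpy as np
from scipy.stats import norm

dx = 0.001
x = np.arange(-4, 4, dx) + dx / 2      # midpoints of narrow cells
null_pdf = norm.pdf(x)                  # standard normal null
alt_pdf = norm.pdf(x, scale=0.5)        # one hypothetical alternative
ratio = alt_pdf / null_pdf              # likelihood ratio per cell

# Take cells in order of decreasing likelihood ratio until the
# accumulated null probability reaches the test size of 10%.
order = np.argsort(ratio)[::-1]
cum_null = np.cumsum(null_pdf[order] * dx)
colored = order[cum_null <= 0.10]

print(f"rejection region: roughly |x| <= {np.abs(x[colored]).max():.3f}")
```

Because the likelihood ratio here decreases as $|x|$ grows, the colored cells form a single central band around $0$, just as the Student drew.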

*Teacher*: And how would you use that coloring?

*Student*: As you reminded me earlier, the coloring has to be between vertical lines. Values (on the horizontal axis) that lie under the coloring are evidence against the null hypothesis. Other values--well, it's hard to say what they might mean without taking a more detailed look at all the histograms involved.

*Teacher*: Going back to the value of $0.1$ in the manuscript, what would you conclude?

*Student*: That's within the area I last colored, so I think the scientist probably was right and the apparatus really was improved.

*Teacher*: One last thing. Your conclusion was based on picking 10% as the criterion, or "size" of the test. Many people like to use 5% instead. Some prefer 1%. What could you tell them?

*Student*: I couldn't do all those tests at once! Well, maybe I could in a way. I can see that no matter what size the test should be, I ought to start coloring from $0$, which is in this sense the "most extreme" value, and work outwards in both directions from there. If I were to stop right at $0.1$--the value actually observed--I think I would have colored in an area somewhere between $0.05$ and $0.1$, say $0.08$. The 5% and 1% people could tell right away that I colored too much: if they wanted to color just 5% or 1%, they could, but they wouldn't get as far out as $0.1$. They wouldn't come to the same conclusion I did: they would say there's not enough evidence that a change actually occurred.

*Teacher*: You have just told me what all those quotations at the beginning *really* mean. It should be obvious from this example that they cannot possibly intend "more extreme" or "greater than or equal" or "at least as large" in the sense of having a bigger *value* or even having a value where the null density is small. They really mean these things in the sense of *large likelihood ratios* that you have described. By the way, the number around $0.08$ that you computed is called the "p-value." It can only properly be understood in the way you have described: with respect to an analysis of relative histogram heights--the likelihood ratios.
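The Student's estimate of "about $0.08$" is easy to check. If, for illustration, the null histogram is taken to be a standard normal density, then the colored region stopping at the observation $t=0.1$ is the central band $(-0.1, 0.1)$, and the p-value is its probability under the null:

```python
from scipy.stats import norm

t = 0.1
# Null probability of the central band (-t, t): the outcomes whose
# likelihood ratios (under any less-spread alternative) are at least
# as large as the observed value's.
p_value = norm.cdf(t) - norm.cdf(-t)
print(f"p-value = {p_value:.4f}")
```

This comes out near $0.08$: small enough to reject at the 10% size, but not at 5% or 1%, in agreement with the dialog.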

*Student*: Thank you. I'm not confident I fully understand all of this yet, but you have given me a lot to think about.

*Teacher*: If you would like to go further, take a look at the Neyman-Pearson Lemma. You are probably ready to understand it now.

## Synopsis

Many tests that are based on a single statistic like the one in the dialog will call it "$z$" or "$t$". These are ways of hinting what the null histogram looks like, but they are only hints: what we name this number doesn't really matter. The construction summarized by the student, as illustrated here, shows how that statistic is related to the p-value. The p-value is the smallest test size that would cause an observation of $t=0.1$ to lead to a rejection of the null hypothesis.

*In this figure, which is zoomed to show detail, the null hypothesis is plotted in solid blue and two typical alternatives are plotted with dashed lines. The region where those alternatives tend to be much larger than the null is shaded in. The shading starts where the relative likelihoods of the alternatives are greatest (at $0$). The shading stops when the observation $t=0.1$ is reached. The p-value is the area of the shaded region under the null histogram: it is the chance, assuming the null is true, of observing an outcome whose likelihood ratios tend to be large regardless of which alternative happens to be true. In particular, this construction depends intimately on the alternative hypothesis. It cannot be carried out without specifying the possible alternatives.*
