Uncertainty Wednesday: The Problem with P-Values (Incentives)

Last Uncertainty Wednesday, I introduced the concept of p-values. We looked at the example of a null hypothesis (explanation) that a coin is fair, observing heads (H) or tails (T) six times in a row, and rejecting that the coin is fair because the probability of that happening with a fair coin is only 0.03125, which is less than 0.05 (a commonly used cutoff). I ended the post by asking readers to consider the following scenario:

You are a researcher who gets paid only if you reject the explanation of equal probability with a p-value cutoff of 0.05. How much work do you have to do to come up with a sequence of observations that gets you the desired result?

As a naive first reaction we might be tempted to say that this looks like a lot of work. After all, there is only a 0.03125 probability of this happening, so we might have to run this experiment many times to get this outcome with a fair coin. Let’s say we run it 30 times in a row; what, then, is the probability that we would *not* get 6 in a row? Each time we do 6 flips with a fair coin the probability of getting either HHHHHH or TTTTTT is 0.03125, and hence the probability of getting something else is 1 – 0.03125 = 0.96875. So if we try that 30 times in a row, the probability of never getting HHHHHH or TTTTTT is

0.96875 ^ 30 = 0.38579

Put differently, there is a 1 – 0.38579 = 0.61421, or slightly greater than 60%, chance that after trying 6 flips 30 times over we have gotten the result we wanted. But that is 180 flips, and it is nowhere close to certain that we find our result. So: a lot of work for a pretty uncertain payoff!
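
As a quick sanity check, here is the same arithmetic in Python:

```python
# Probability of NOT getting HHHHHH or TTTTTT in one batch of 6 fair flips
p_miss = 1 - 2 * 0.5**6        # = 0.96875
# Probability of never seeing a run of 6 across 30 independent batches
p_never = p_miss**30           # ≈ 0.38579
print(p_never, 1 - p_never)    # ≈ 0.38579 and ≈ 0.61421
```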

But this naive approach is far off. After all, our null hypothesis included that this coin, in addition to being fair, also has independent flips. So to produce an example of 6 in a row, we don’t have to do 6 flips each time. We could simply keep flipping the coin until we get 6 in a row! Let’s see how much work that is. Given that the binomial distribution (two outcomes) is relatively simple, we could actually try to solve this mathematically, but instead I wrote a bit of Python code to simulate flipping a fair coin repeatedly until getting either HHHHHH or TTTTTT. Here is a histogram of how many flips are needed:

[Histogram: number of flips needed until the first run of 6 identical outcomes]
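
A minimal sketch of what such a simulation might look like (the function name and trial count here are illustrative, not the exact code used):

```python
import random

def flips_until_run(run_length=6, rng=random):
    """Flip a fair coin until we see run_length identical outcomes
    in a row; return the total number of flips it took."""
    last, run, flips = None, 0, 0
    while run < run_length:
        flips += 1
        outcome = rng.choice("HT")
        if outcome == last:
            run += 1
        else:
            last, run = outcome, 1
    return flips

# Approximate the distribution of flips needed with many trials.
trials = sorted(flips_until_run() for _ in range(100_000))
print("median flips needed:", trials[len(trials) // 2])
print("P(run of 6 within 60 flips):", sum(t < 60 for t in trials) / len(trials))
```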

So that looks a lot more promising already from the perspective of a researcher who is intent on rejecting the null hypothesis. In fact, in the simulation above, with fewer than 60 flips you already have a 60% chance of finding a sequence that rejects the null hypothesis. We have thus cut the work by a factor of 3.

Now what if the researcher is willing to be a bit more aggressive? Let’s say we get HHHTHHH. What if we claim that the “T” in the middle was due to an observational error or an equipment malfunction and discard that data point? Or, even more aggressively, simply don’t report it in the first place? How many flips do we need now to get “6” heads or tails in a row? I modified my code to look for sequences that contained one “outlier” and to discard it. Here is the result:

[Histogram: number of flips needed when one “outlier” flip may be discarded]
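
A minimal sketch of the modified search (the exact rule for what counts as a discardable “outlier” is an assumption here: a window of 7 flips in which 6 are identical):

```python
import random

def flips_until_run_with_outlier(run_length=6, rng=random):
    """Flip a fair coin until the trailing flips form a run of
    run_length, either outright or after discarding one "outlier"
    flip (e.g. the T in HHHTHHH)."""
    seq = []
    while True:
        seq.append(rng.choice("HT"))
        # A clean run: the last run_length flips are all identical.
        if len(seq) >= run_length and len(set(seq[-run_length:])) == 1:
            return len(seq)
        # A run with one discarded outlier: of the last run_length + 1
        # flips, exactly run_length show the same outcome.
        if len(seq) > run_length:
            window = seq[-(run_length + 1):]
            if run_length in (window.count("H"), window.count("T")):
                return len(seq)

trials = [flips_until_run_with_outlier() for _ in range(100_000)]
print("P(done within 40 flips):", sum(t <= 40 for t in trials) / len(trials))
```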

Now we get to the “desired” result of rejecting the coin as fair with nearly 60% probability in, on average, just 40 flips.

So if your compensation depends on rejecting the null hypothesis, you don’t have to do a ton of work to have a good shot at finding data that supports this conclusion for a specific hypothesis (here: that the coin was a fair coin with independent tosses). As we will see next time, the problem is even bigger than that. Hint: think about having the data first and then looking for the hypothesis.