Why is it a dumb idea to interpret results by looking at p values?

May 30, 2010

It is a common (but wickedly bad) practice to make judgments about a research result by looking at p values. Even in top journals you’ll sometimes see the following decision rules applied:

–    if p ≥ .10, then the result is interpreted as providing “no support” for a hypothesis
–    if .05 ≤ p < .10, then this is interpreted as providing “marginal support”
–    if p < .05, then this is interpreted as evidence “supporting” the hypothesis
–    if p < .01 or .001, this is sometimes interpreted as “strong support” or “strong confirmation” or “strong evidence”

What’s wrong with this? Everything! A p value is a confounded index: it reflects both the size of the effect and the size of the sample. So it should not be used to make judgments about effects of interest.

Imagine that we had hypothesized that X has a positive effect on Y. We collect some data, run a test and get the following result:

N = 70,000, r = .01, p < .01

Looking at the very low p value of this result we might conclude that this test revealed good evidence in support of our hypothesis. But we would be wrong. We have confused statistical with substantive significance.

Look closely at the numbers again. Note how the effect size estimate (r) is tiny, virtually zero. We have in all likelihood detected nothing of significance, just a little fluff on the proverbial lens.

So how is it that the p value is so low in this case? Because this is an overpowered test. The sample size (N) is off the scale.
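To see how sample size alone drives the p value, here is a minimal sketch that holds r fixed at .01 and varies N. The function name `p_value_for_r` is made up for this illustration, and the calculation assumes a two-tailed test using the normal approximation to the t distribution, which is only reasonable for large samples.

```python
import math


def p_value_for_r(r: float, n: int) -> float:
    """Approximate two-tailed p value for a correlation r based on n observations.

    Assumption: the t statistic is referred to a standard normal distribution,
    an approximation that is only reasonable when n is large.
    """
    # Standard t statistic for testing a correlation against zero.
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    # Two-tailed tail area under the standard normal curve.
    return math.erfc(abs(t) / math.sqrt(2.0))


# Same tiny effect, increasingly large samples.
for n in (100, 1_000, 10_000, 70_000):
    print(f"N = {n:>6}, r = .01, p = {p_value_for_r(0.01, n):.3f}")
```

The effect size never changes in this loop. Only N does, and the p value falls with it.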

How can we avoid making what is essentially a Type I error in this situation? By ignoring the p value altogether and basing our interpretation on the tiny effect size. If we did this we would most likely conclude that X has no appreciable effect on Y.

Source: The Essential Guide to Effect Sizes

What are some conventions for interpreting different effect sizes?

May 30, 2010

Say you’ve got an effect size equivalent to r = .25. What does it mean? How do you interpret this effect size? Ideally you will be able to contextualize this effect against some meaningful frame of reference. But if that’s not possible another approach is to refer to conventions such as those developed by Jacob Cohen.

In his authoritative Statistical Power Analysis for the Behavioral Sciences, Cohen (1988) outlined a number of criteria for gauging small, medium and large effect sizes in different metrics, as follows:

r effects: small ≥ .10, medium ≥ .30, large ≥ .50

d effects: small ≥ .20, medium ≥ .50, large ≥ .80

According to Cohen, an effect size equivalent to r = .25 would qualify as small in size because it’s bigger than the minimum threshold of .10, but smaller than the cut-off of .30 required for a medium sized effect. So what can we say about r = .25? It’s small, and that’s about it.
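The comparison against Cohen's thresholds can be sketched in a few lines. The function name `cohen_label` and the label "below small" are made up for this illustration. Cohen's conventions, as listed above, only name the small, medium and large bands.

```python
# Cohen's (1988) minimum thresholds, as listed above, largest first.
COHEN_THRESHOLDS = {
    "r": [(0.50, "large"), (0.30, "medium"), (0.10, "small")],
    "d": [(0.80, "large"), (0.50, "medium"), (0.20, "small")],
}


def cohen_label(value: float, metric: str = "r") -> str:
    """Return the conventional label for an effect size in the r or d metric."""
    size = abs(value)  # the sign gives direction, not magnitude
    for threshold, label in COHEN_THRESHOLDS[metric]:
        if size >= threshold:
            return label
    # Hypothetical label for effects under the smallest threshold.
    return "below small"


print(cohen_label(0.25, "r"))
```

For r = .25 the function falls through the large and medium cut-offs and stops at the small one, which is the same reasoning as in the paragraph above.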

Cohen’s conventions are easy to use. You just compare your estimate with his thresholds and get a ready-made interpretation of your result. (For a fun illustration of this, check out the infamous Result Whacker.)

But Cohen’s conventions are somewhat arbitrary and it is not difficult to conceive of situations where a small effect observed in one setting might be considered more important than a large effect observed in another. As always, context matters when interpreting results.

For more on interpreting effect sizes, see Effect Size Matters.


When are small effects important?

May 30, 2010

Small effects can be very important in the right context.

In sport, a small effect size may be the difference between winning and losing. Just ask American swimmer Dara Torres. She attributed missing out on winning the gold medal in the 50m freestyle at the Beijing Olympics to having filed her fingernails the previous night.

Small effects may be considered meaningful if they trigger big consequences, if they change the perceived probability that larger outcomes might occur, or if they accumulate into larger effects.

For more on the significance of small effects, see The Essential Guide to Effect Sizes, chapter 2.

What is the “curse of multiplicity”?

May 30, 2010

The multiplicity problem arises when studies report the results of multiple statistical tests, raising the probability that at least some of the results will be found to be statistically significant even if there is no underlying effect. Run enough tests and you will eventually find something, even when there’s nothing to be found.

In their recommendations to the APA, Wilkinson and the Task Force on Statistical Inference (1999: 599) described the multiplicity problem as “the curse of the social sciences.” Consider a study that tests 14 null hypotheses all of which happen to be true (meaning there are no underlying effects). If each test is assessed according to a conventional alpha level of .05, the odds are better than even that at least one of the hypotheses will be found to be statistically significant purely as a result of chance. Crunch the data 45 different ways and the probability that at least one result will turn out to be statistically significant rises to 90%.

Where do these numbers come from?

If N independent tests are examined for statistical significance, and all of the individual null hypotheses are true, then the probability that at least one of them will be found to be statistically significant is equal to 1 – (1 – α)^N, for any given level of alpha (α).

If the critical alpha level for a single test is set at .05, this means the probability of erroneously attributing statistical significance to a result when the null is true is .05. But if two or three tests are run, the probability of achieving at least one statistically significant result rises to .10 and .14 respectively. For a study reporting 14 tests, the probability that at least one result will be found to be statistically significant is 1 – (1 – .05)^14 = .51.
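The figures above can be reproduced with a short sketch of the formula. The function name `familywise_error_rate` is made up for this illustration, and the calculation assumes, as the formula does, that the tests are independent and every null hypothesis is true.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one statistically significant result by chance.

    Assumes n_tests independent tests, all with true null hypotheses.
    """
    return 1.0 - (1.0 - alpha) ** n_tests


# The numbers of tests mentioned in the text.
for n_tests in (1, 2, 3, 14, 45):
    rate = familywise_error_rate(n_tests)
    print(f"{n_tests:>2} tests: {rate:.2f}")
```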

What’s the solution?

The standard cure for the multiplicity problem is to adjust alpha levels to account for the large number of tests being run on the data. One way to do this is to apply the Bonferroni correction of α/N where α represents the critical test level that would have been applied if only one hypothesis was being tested and N represents the number of tests being run on the same set of data. For more sophisticated variations of this remedy, see Keppel (1982, chapter 8).
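As a rough sketch of what the Bonferroni correction does, the snippet below computes the adjusted per-test alpha for the 14-test study and the familywise error rate that results. The function names are made up for this illustration, and the familywise calculation again assumes independent tests with true nulls.

```python
def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test critical level under the Bonferroni correction (alpha / N)."""
    return alpha / n_tests


def familywise_error_rate(n_tests: int, alpha: float) -> float:
    """Chance of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


n_tests = 14
adjusted = bonferroni_alpha(0.05, n_tests)

print(f"per-test alpha: {adjusted:.4f}")
print(f"familywise rate, unadjusted: {familywise_error_rate(n_tests, 0.05):.3f}")
print(f"familywise rate, adjusted:   {familywise_error_rate(n_tests, adjusted):.3f}")
```

With the adjustment the familywise rate drops back to just under .05, but each individual test now has to clear a much stricter threshold.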

Personally I don’t think adjusting alpha is the way to go. Adjusting alpha to compensate for the familywise error rate inevitably drains statistical power, making it harder to assign statistical significance to both chance fluctuations and genuine effects.

A better solution is to resist the temptation to test for effects other than the few you were actually looking for in the first place. Fishing around in the data in the hopes of finding a statistically significant result is not good science.

For more on the dangers of fishing (and HARKing, or Hypothesizing After the Results are Known), see The Essential Guide to Effect Sizes, chapter 4.