Why can’t I just judge my result by looking at the p value?

Because a low p value could reflect any number of things apart from the size of the underlying effect.

Consider two hypothetical studies examining the relationship between exam marking and academic happiness. Both studies used identical measures and procedures and generated the following results:

Study 1: N = 62, r = -.25, p > .05

Study 2: N = 63, r = -.25, p < .05
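These borderline p values can be reproduced from r and N alone using the standard t test for a correlation coefficient. A minimal sketch in Python, assuming SciPy is available (the helper name `p_from_r` is mine, not from either study):

```python
import math
from scipy import stats  # assumed available; supplies the t distribution

def p_from_r(r, n):
    """Two-tailed p value for a Pearson correlation r with sample size n,
    using the standard t test with n - 2 degrees of freedom."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    return 2 * stats.t.sf(abs(t), df=n - 2)

print(p_from_r(-0.25, 62))  # just above .05
print(p_from_r(-0.25, 63))  # just below .05
```

The same effect size lands on opposite sides of the .05 threshold purely because of one extra observation.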

In the first study the results were statistically nonsignificant (p > .05), leading the authors to conclude that exam marking has no effect on academic happiness. In the second study, however, the results were statistically significant (p < .05), leading the authors to conclude that marking adversely affects happiness.

But here’s the thing: in both studies the authors made identical estimates of the effect size (r = -.25). Both studies came up with essentially the same result. The conclusion that we should take away from either study is that marking has a negative effect on happiness equivalent to r = -.25.

So how is it that the authors of Study 1 reached a different conclusion?

Basically, they screwed up. The authors of Study 1 ignored their effect size estimate and examined only the p value associated with their test statistic. They incorrectly interpreted a statistically nonsignificant result as indicating no effect. A nonsignificant result is more accurately interpreted as an inconclusive result. There might be no effect, or there might be an effect that went undetected because the study lacked statistical power.

In this example the only real difference between the two studies was that the second study had one more observation and consequently just enough statistical power to push the result across the threshold of statistical significance. In other words, sample size, rather than the effect size, explained the different conclusions drawn.
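Just how little power either study had can be estimated with the Fisher z approximation, a standard textbook shortcut for power calculations involving correlations. A rough sketch using only the Python standard library (the helper name `approx_power` is hypothetical):

```python
import math
from statistics import NormalDist  # standard library

def approx_power(r, n, alpha=0.05):
    """Approximate power of a two-tailed test of H0: rho = 0,
    via the Fisher z transformation (an approximation, not exact)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)           # about 1.96
    expected_z = math.atanh(abs(r)) * math.sqrt(n - 3)     # expected test statistic
    # probability the observed z statistic clears the critical value
    return 1 - NormalDist().cdf(z_crit - expected_z)

print(approx_power(-0.25, 62))  # roughly a coin flip
print(approx_power(-0.25, 63))  # barely better
```

With power hovering around 50%, each study had about a coin-flip chance of detecting an effect of this size, which is exactly why identical effects landed on opposite sides of the significance threshold.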

You should never judge the substantive significance of a result by looking at a p value. P values are confounded indices, reflecting both effect size and sample size, and are no substitute for estimates of the effect size itself.

In this hypothetical example, both sets of authors would have arrived at the same conclusion if both had ignored their p values and focused on their correlation coefficients.

This entry was posted on Monday, May 31st, 2010 and is filed under effect size, interpreting results, p values.

“The primary product of a research inquiry is one or more measures of effect size, not p values.”
~ Jacob Cohen


“Statistical significance is the least interesting thing about the results. You should describe the results in terms of measures of magnitude – not just, does a treatment affect people, but how much does it affect them.”
~ Gene Glass