Because a low *p* value could reflect any number of things apart from the size of the underlying effect.

Consider two hypothetical studies examining the relationship between exam marking and academic happiness. Both studies used identical measures and procedures and generated the following results:

Study 1: *N* = 62, *r* = -.25, *p* > .05

Study 2: *N* = 63, *r* = -.25, *p* < .05

In the first study the results were statistically nonsignificant (*p* > .05), leading the authors to conclude that exam marking has no effect on academic happiness. The results of the second study, however, were statistically significant (*p* < .05), leading its authors to conclude that marking adversely affects happiness.

But here’s the thing: in both studies the authors made *identical* estimates of the **effect size** (*r* = -.25). Both studies came up with essentially the same result. The conclusion that we should take away from either study is that marking has a negative effect on happiness equivalent to *r* = -.25.

So how is it that the authors of Study 1 reached a different conclusion?

They screwed up, basically. The authors of Study 1 ignored their effect size estimate and examined only the *p* value associated with their test statistic. They incorrectly interpreted a statistically nonsignificant result as indicating no effect. A nonsignificant result is more accurately interpreted as an inconclusive result. There might be no effect, or there might be an effect that went undetected because the study lacked **statistical power**.
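To put a rough number on "lacked statistical power", here is a minimal sketch in Python. It assumes the true correlation equals the observed -.25 and uses the Fisher *z* normal approximation rather than an exact calculation, so treat the figures as ballpark only. The function names `approx_power` and `n_for_power` are made up for illustration.

```python
import math
from statistics import NormalDist


def approx_power(rho, n, alpha=0.05):
    """Approximate two-tailed power to detect a true correlation rho with n observations.

    Assumption: Fisher z approximation, where atanh(r) is roughly normal
    with standard error 1 / sqrt(n - 3).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = abs(math.atanh(rho)) * math.sqrt(n - 3)
    # Chance of landing beyond the critical value in the direction of the
    # effect; the tiny chance of rejecting in the wrong direction is ignored.
    return z.cdf(shift - z_crit)


def n_for_power(rho, power=0.80, alpha=0.05):
    """Approximate sample size needed to reach the requested power."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(((z_crit + z_power) / abs(math.atanh(rho))) ** 2 + 3)


power_study_1 = approx_power(-0.25, 62)
power_study_2 = approx_power(-0.25, 63)
needed_n = n_for_power(-0.25)

print(f"Approximate power, N = 62: {power_study_1:.2f}")
print(f"Approximate power, N = 63: {power_study_2:.2f}")
print(f"Approximate N for 80% power: {needed_n}")
```

Under these assumptions both studies had only about a coin-flip chance of detecting an effect of this size, and the approximation suggests that roughly twice as many observations would be needed for 80% power.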

In this example the only real difference between the two studies was that the second study had one more observation and consequently just enough statistical power to push the result across the threshold of **statistical significance**. In other words, sample size, rather than the effect size, explained the different conclusions drawn.
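The arithmetic behind the two *p* values can be checked with a short sketch. It assumes both studies used the usual two-tailed *t* test for a Pearson correlation (the example does not say which test was used), and it integrates the *t* density numerically so that only the Python standard library is needed. The helper names `t_pdf` and `two_tailed_p` are illustrative.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_tailed_p(r, n, steps=2000):
    """Return (|t|, two-tailed p) for a Pearson correlation r from n observations."""
    df = n - 2
    t = abs(r) * math.sqrt(df / (1 - r * r))
    # Simpson's rule (steps must be even) for the area under the
    # t density between 0 and |t|.
    h = t / steps
    area = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    # Everything outside (-|t|, |t|) is the two-tailed p value.
    return t, 2 * (0.5 - area)


t_1, p_1 = two_tailed_p(-0.25, 62)
t_2, p_2 = two_tailed_p(-0.25, 63)

for n, t, p in ((62, t_1, p_1), (63, t_2, p_2)):
    verdict = "p < .05" if p < 0.05 else "p > .05"
    print(f"N = {n}: |t| = {t:.4f}, p = {p:.5f} ({verdict})")
```

With *r* = -.25 the test statistic is exactly 2 when *N* = 62, a hair below the critical value of about 2.0003 for 60 degrees of freedom. The extra observation in Study 2 nudges the statistic just past the slightly lower critical value for 61 degrees of freedom, even though the effect size estimate has not moved at all.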

You should never judge the **substantive significance** of a result by looking at a *p* value. *P* values are confounded indexes that reflect both the size of the effect and the size of the sample, so they are no substitute for estimates of the effect size.

In this hypothetical example, both sets of authors would have arrived at the same conclusion if both had ignored their *p* values and focused on their correlation coefficients.

*Source: The Essential Guide to Effect Sizes*