Why is it a bad idea to interpret results by looking at p values?

It is a common practice to make judgments about a research result by looking at p values. Even in top journals you’ll sometimes see the following decision rules applied:

–    if p ≥ .10, then the result is interpreted as providing “no support” for a hypothesis
–    if .05 ≤ p < .10, then this is interpreted as providing “marginal support”
–    if p < .05, then this is interpreted as evidence “supporting” the hypothesis
–    if p < .01 or p < .001, this is sometimes interpreted as “strong support” or “strong confirmation” or “strong evidence”

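What’s wrong with this? Everything! A p value is a confounded index: it reflects both the size of the effect and the size of the sample. On its own it cannot tell us how large or how important an effect is, so it cannot be used to make judgments about effects of interest.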

Imagine that we had hypothesized that X has a positive effect on Y. We collect some data, run a test and get the following result:

N = 70,000, r = .01, p < .01

Looking at the very low p value of this result we might conclude that this test revealed good evidence in support of our hypothesis. But we would be wrong. We would have confused statistical significance with substantive significance.

Look closely at the numbers again. Note how the effect size estimate (r) is tiny, virtually zero. We have in all likelihood detected nothing of substantive significance, just a little fluff on the proverbial lens.

So how is it that the p value is so low in this case? Because this is an overpowered test. The sample size (N) is off the scale, and with that many observations even a trivially small correlation will produce a low p value.
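
To see how sample size alone drives the p value, here is a minimal Python sketch that holds r fixed at .01 and varies N. The function name approx_p_value is hypothetical, and the calculation assumes a two-tailed test and uses the normal approximation to the t distribution, which is close for large N and only rough for small N.

```python
import math

def approx_p_value(r: float, n: int) -> float:
    """Approximate two-tailed p value for a Pearson correlation r at sample size n.

    Assumption: the t statistic is referred to a standard normal distribution,
    which is a good approximation only when n is large.
    """
    # t statistic for testing whether the correlation differs from zero
    t = r * math.sqrt((n - 2) / (1 - r ** 2))
    # two-tailed tail area under the standard normal curve
    return math.erfc(abs(t) / math.sqrt(2))

r = 0.01  # the same tiny effect size in every row
for n in (100, 1_000, 10_000, 70_000):
    print(f"N = {n:>6}, r = {r}, p ~ {approx_p_value(r, n):.4f}")
```

The effect size is identical in every row. Only the last row, N = 70,000, falls below .01.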

How can we avoid making what is essentially a Type I error in this situation? By ignoring the p value altogether and basing our interpretation on the tiny effect size. If we did this we would most likely conclude that X has no appreciable effect on Y.
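
One way to base the interpretation on the effect size is to look at r itself, at the share of variance it implies (r squared), and at a plausible range for r. The sketch below is an illustration rather than a method taken from the source: the function name describe_effect is hypothetical, and the interval assumes the standard Fisher z approximation with a 1.96 critical value for roughly 95% coverage.

```python
import math

def describe_effect(r: float, n: int, z_crit: float = 1.96):
    """Return r squared and an approximate confidence interval for r.

    Assumption: the interval uses the Fisher z transformation, with
    z_crit = 1.96 giving roughly 95% coverage.
    """
    z = math.atanh(r)            # Fisher z transform of r
    se = 1 / math.sqrt(n - 3)    # approximate standard error of z
    lower = math.tanh(z - z_crit * se)
    upper = math.tanh(z + z_crit * se)
    return r ** 2, lower, upper

r_squared, lower, upper = describe_effect(0.01, 70_000)
print(f"variance explained: {r_squared:.2%}")
print(f"approximate 95% interval for r: [{lower:.3f}, {upper:.3f}]")
```

Under these assumptions X accounts for about one hundredth of one percent of the variance in Y, and even the upper end of the interval is below .02.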

Source: The Essential Guide to Effect Sizes