Why are journal editors increasingly asking authors to report effect sizes?

May 31, 2010

Because the whole point of doing research is that we may learn something about real world effects.

Editors are increasingly asking authors to provide effect size estimates because of the growing realization that tests of statistical significance don’t tell us what we really want to know. As Cohen (1990: 1310) famously said:

“The primary product of a research inquiry is one or more measures of effect size, not p values.”

In the bad old days, researchers looked at their p values to see whether their hypotheses were supported. Get a low p value and, voila!, you had a result. But p values are confounded indexes that actually tell us very little about the phenomena we study. At best, they tell us the direction of an effect, but they don’t tell us how big it is. And if we can’t say whether the effect is large or trivial in size, how can we interpret our result?

The estimation of effect sizes is essential to the interpretation of a study’s results. In the fifth edition of its Publication Manual, the American Psychological Association (APA) identified the “failure to report effect sizes” as one of seven common defects editors observed in submitted manuscripts. To help readers understand the importance of a study’s findings, authors were advised that “it is almost always necessary to include some index of effect” (APA 2001: 25).

Many editors have made similar calls, and it is thus increasingly common for submission guidelines to either encourage or mandate the reporting of effect sizes.

Why can’t I just judge my result by looking at the p value?

May 31, 2010

Because a low p value could reflect any number of things apart from the size of the underlying effect.

Consider two hypothetical studies examining the relationship between exam marking and academic happiness. Both studies used identical measures and procedures and generated the following results:

Study 1: N = 62, r = -.25, p > .05

Study 2: N = 63, r = -.25, p < .05

In the first study the results were statistically nonsignificant (p > .05), leading the authors to conclude that exam marking has no effect on academic happiness. In the second study, however, the results were statistically significant (p < .05), leading the authors to conclude that marking adversely affects happiness.

But here’s the thing: both studies produced identical estimates of the effect size (r = -.25). In other words, both studies essentially came up with the same result. The conclusion we should take away from either study is that marking has a negative effect on happiness equivalent to r = -.25.

So how is it that the authors of Study 1 reached a different conclusion?

Basically, they screwed up. The authors of Study 1 ignored their effect size estimate and examined only the p value associated with their test statistic. They incorrectly interpreted a statistically nonsignificant result as indicating no effect. A nonsignificant result is more accurately interpreted as an inconclusive result: there might be no effect, or there might be an effect that went undetected because the study lacked statistical power.

In this example the only real difference between the two studies was that the second study had one more observation and consequently just enough statistical power to push the result across the threshold of statistical significance. In other words, sample size, rather than the effect size, explained the different conclusions drawn.
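We can check this arithmetic ourselves. The book doesn’t show the calculation, but assuming the p values come from the standard t test of a Pearson correlation (with N − 2 degrees of freedom), a minimal sketch looks like this:

```python
from math import sqrt
from scipy import stats

def p_from_r(r, n):
    """Two-tailed p value for a Pearson correlation r from a sample of n,
    using the standard t test with n - 2 degrees of freedom."""
    t = abs(r) * sqrt(n - 2) / sqrt(1 - r ** 2)
    return 2 * stats.t.sf(t, df=n - 2)

print(p_from_r(-0.25, 62))  # Study 1: just above .05
print(p_from_r(-0.25, 63))  # Study 2: just below .05
```

One extra observation nudges an identical correlation across the .05 threshold, which is exactly the point: the p values differ, the effect sizes don’t.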

You should never judge the substantive significance of a result by looking at a p value. P values are confounded indexes and are no substitute for estimates of the effect size.

In this hypothetical example, both sets of authors would have arrived at the same conclusion if both had ignored their p values and focused on their correlation coefficients.

Source: The Essential Guide to Effect Sizes

It’s been a long time since I studied statistics. Remind me, what does a p value represent?

May 30, 2010

A common misconception is that p = .05 means there is a 5% probability of obtaining the observed result by chance. The correct interpretation is that there is a 5% probability of getting a result this large (or larger) if the effect size equals zero.

A p value is the answer to the question: if the null hypothesis were true, how likely is this result? A low p says “highly unlikely”, giving us grounds to reject the null hypothesis.
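A small simulation makes this concrete (the sample sizes and seed here are illustrative, not from the book): when the true effect is exactly zero, about 5% of tests still come out “significant” at the .05 level purely by chance.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
sims, n = 2000, 30
false_alarms = 0
for _ in range(sims):
    # Two unrelated variables: the true correlation is exactly zero
    x, y = rng.standard_normal(n), rng.standard_normal(n)
    if pearsonr(x, y)[1] < 0.05:
        false_alarms += 1

# Under a true null, roughly 5% of tests cross the .05 threshold anyway
print(false_alarms / sims)
```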

In substantive terms, a p value really tells us very little. As it is a confounded index, it is not a good idea to interpret results on the basis of p values.

For better ways of interpreting results, see this.

Why is it a dumb idea to interpret results by looking at p values?

May 30, 2010

It is a common (but wickedly bad) practice to make judgments about a research result by looking at p values. Even in top journals you’ll sometimes see the following decision rules applied:

–    if p ≥ .10, then the result is interpreted as providing “no support” for a hypothesis
–    if .05 ≤ p < .10, then this is interpreted as providing “marginal support”
–    if p < .05, then this is interpreted as evidence “supporting” the hypothesis
–    if p < .01 or p < .001, this is sometimes interpreted as “strong support”, “strong confirmation” or “strong evidence”

What’s wrong with this? Everything! A p value is a confounded index so it should not be used to make judgments about effects of interest.

Imagine that we had hypothesized that X has a positive effect on Y. We collect some data, run a test and get the following result:

N = 70,000, r = .01, p < .01

Looking at the very low p value of this result we might conclude that this test revealed good evidence in support of our hypothesis. But we would be wrong. We have confused statistical with substantive significance.

Look closely at the numbers again. Note how the effect size estimate (r) is tiny, virtually zero. We have in all likelihood detected nothing of significance, just a little fluff on the proverbial lens.

So how is it that the p value is so low in this case? Because this is an overpowered test. The sample size (N) is off the scale.
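Again assuming the p value comes from the usual t test for a correlation (a sketch, not the book’s own calculation), we can confirm that a trivial r becomes “significant” once N is huge:

```python
from math import sqrt
from scipy import stats

def p_from_r(r, n):
    """Two-tailed p value for a Pearson correlation r from a sample of n."""
    t = abs(r) * sqrt(n - 2) / sqrt(1 - r ** 2)
    return 2 * stats.t.sf(t, df=n - 2)

print(p_from_r(0.01, 70_000))  # below .01 despite a near-zero effect
print(p_from_r(0.01, 500))     # the same tiny effect is nowhere near significant
```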

How can we avoid making what is essentially a Type I error in this situation? By ignoring the p value altogether and basing our interpretation on the tiny effect size. If we did this we would most likely conclude that X has no appreciable effect on Y.

Source: The Essential Guide to Effect Sizes

Why do you say a p value is a confounded index?

May 30, 2010

Because it never turns out the way I want it, that confounded thing!

Seriously, the p value is literally a confounded index because it reflects both the size of the underlying effect and the size of the sample. Hence any information included in the p value is ambiguous (Lang et al. 1998).

Consider the following equation, which comes from Rosenthal and Rosnow (1984):

Statistical significance = Effect size × Sample size

Now let’s hold the effect size constant for a moment and consider what happens to statistical significance when we fiddle with the sample size (N). Basically, as N goes up, p will go down automatically. It has to. It has absolutely no choice. This is not a question of careful measurement or anything like that. It’s a basic mathematical equation. The bigger the sample, the more likely the result will be statistically significant, regardless of other factors.

Conversely, as N goes down, p must go up. The smaller the sample, the less likely the result will be statistically significant.
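This relationship is easy to see by holding the effect size constant and varying only N. The sketch below (assuming the standard t test for a correlation, with an illustrative r of .20) shows p falling as the sample grows:

```python
from math import sqrt
from scipy import stats

r = 0.20  # effect size held constant
ps = []
for n in (20, 50, 100, 200, 500):
    t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
    p = 2 * stats.t.sf(t, df=n - 2)
    ps.append(p)
    print(f"N = {n:4d}  ->  p = {p:.4f}")
```

Nothing about the effect changes from row to row; only the sample size does, and p falls every time.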

So if you happen to get a statistically significant result (a low p value), it could mean that (a) you have found something, or (b) you found nothing but your test was super-powerful because you had a large sample.

Researchers often confuse statistical significance with substantive significance. But smart researchers understand that p values should never be used to inform judgments about real world effects.

Source: The Essential Guide to Effect Sizes