Can you give me some examples of an effect size?

May 31, 2010

Examples of effect sizes are all around us. Consider the following claims which you might find advertised in your newspaper:

– “enjoy immediate pain relief through acupuncture”

– “change service providers now and save 30%”

– “look 10 years younger with Botox”

Notice how each claim promises an effect (“look younger with Botox”) of measurable size (“10 years younger”). No understanding of statistical significance is necessary to gauge the merits of each claim. Each effect is being promoted as if it were intrinsically meaningful. (Whether it is or not is up to the newspaper reader to decide.)

Many of our daily decisions are based on some analysis of effect size. We sign up for courses that we believe will enhance our career prospects. We cut back on carbohydrates to lose weight. We stop at red lights to reduce the risk of accidents. We buy stock we believe will appreciate in value. We take an umbrella if we perceive a high chance of rain.

The interpretation of effect sizes is how we make sense of the world.

In this sense researchers are no different from anybody else. Where researchers do differ is in the care taken to generate accurate effect size estimates. But while we may spend a lot of our time looking for ways to reduce sampling and measurement error, among other things, ultimately our goal is a better understanding of real world effects.

And this is why it is essential that we interpret not only the statistical significance of our results but their real world or substantive significance as well.

For more on how to do this, check out the e-book Effect Size Matters.


Why can’t I just judge my result by looking at the p value?

May 31, 2010

Because a low p value could reflect any number of things apart from the size of the underlying effect.

Consider two hypothetical studies examining the relationship between exam marking and academic happiness. Both studies used identical measures and procedures and generated the following results:

Study 1: N = 62, r = -.25, p > .05

Study 2: N = 63, r = -.25, p < .05

In the first study the results were found to be statistically nonsignificant (p > .05) leading the authors to conclude that exam marking has no effect on academic happiness. However, the results of the second study were found to be statistically significant (p < .05) leading the authors of Study 2 to conclude that marking adversely affects happiness.

But here’s the thing: in both studies the authors made identical estimates of the effect size (r = -.25). Both studies essentially came up with the exact same results. The conclusion that we should take away from either study is that marking has a negative effect on happiness equivalent to r = -.25.

So how is it that the authors of Study 1 reached a different conclusion?

They screwed up basically. The authors of Study 1 ignored their effect size estimate and examined only the p value associated with their test statistic. They incorrectly interpreted a statistically nonsignificant result as indicating no effect. A nonsignificant result is more accurately interpreted as an inconclusive result. There might be no effect or there might be an effect which went undetected because the study lacked statistical power.

In this example the only real difference between the two studies was that the second study had one more observation and consequently just enough statistical power to push the result across the threshold of statistical significance. In other words, sample size, rather than the effect size, explained the different conclusions drawn.
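
To see how a single extra observation can tip a result across the .05 threshold, here is a quick sketch (Python with SciPy; the code is mine, not part of the original example) that converts a Pearson correlation into its t statistic and two-tailed p value:

```python
from math import sqrt
from scipy import stats

def p_from_r(r, n):
    """Two-tailed p value for a Pearson correlation r observed in a sample of size n."""
    t = r * sqrt(n - 2) / sqrt(1 - r ** 2)   # t statistic with n - 2 degrees of freedom
    return 2 * stats.t.sf(abs(t), df=n - 2)  # two-tailed p value

for n in (62, 63):
    print(f"N = {n}: r = -.25, p = {p_from_r(-0.25, n):.4f}")

# N = 62 gives p just above .05 (about .0500); N = 63 gives p just below .05 (about .048).
# The effect size estimate is identical in both runs; only the sample size changed.
```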

You should never judge the substantive significance of a result by looking at a p value. P values are confounded indexes and are no substitute for estimates of the effect size.

In this hypothetical example, both sets of authors would have arrived at the same conclusion if both had ignored their p values and focused on their correlation coefficients.

Source: The Essential Guide to Effect Sizes


What is the difference between statistical and substantive significance?

May 30, 2010

Statistical significance reflects the improbability of findings drawn from samples, given certain assumptions (chiefly, that the null hypothesis is true).

Substantive significance is concerned with meaning, as in, what do the findings say about population effects themselves?

Researchers typically estimate population effects by examining representative samples. Although researchers may invest considerable effort in minimizing measurement and sampling error and thereby producing more accurate effect size estimates, ultimately the goal is a better understanding of real world effects. This distinction between real world effects and researchers’ sample-based estimates of those effects is critical to understanding the difference between statistical and substantive significance.

The statistical significance of any test result is determined by gauging the probability of getting a result at least this large if there were no underlying effect. The outcome of any test is a conditional probability or p value. If the p value falls below a conventionally accepted threshold (say .05), we might judge the result to be statistically significant.
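
In symbols (a small formalization added here for clarity, not anything stated in the original post), the p value of a two-tailed test is the conditional probability

```latex
p = \Pr\bigl(\, |T| \ge |t_{\mathrm{obs}}| \;\big|\; H_0 \text{ is true} \,\bigr)
```

where T is the test statistic, t_obs is its observed value, and H_0 is the null hypothesis of no underlying effect.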

The substantive significance of a result, in contrast, has nothing to do with the p value and everything to do with the estimated effect size. Only when we know whether we’re dealing with a large or a trivially small effect will we be able to interpret its meaning and so speak to the substantive significance of our results. Note, though, that while the size of an effect will be correlated with its importance, there will be plenty of occasions when even small effects may be judged important.

For more, see the brilliantly helpful ebook Effect Size Matters.


How do researchers confuse statistical with substantive significance?

May 30, 2010

Researchers can confuse statistical significance with substantive significance in one of two ways:

  1. Results that are found to be statistically significant are interpreted as if they were practically meaningful. This happens when a researcher interprets a statistically significant result as being “significant” or “highly significant” in the everyday sense of the word.
  2. Results that are statistically nonsignificant are interpreted as evidence of no effect, even in the face of evidence to the contrary (e.g., a noteworthy effect size).

In some settings statistical significance will be completely unrelated to substantive significance. It is entirely possible for a result to be statistically significant and trivial or statistically nonsignificant yet important. (Click here for an example.)

Researchers get confused about these things when they misattribute meaning to p values. Remember, a p value is a confounded index. A statistically significant p could reflect either a large effect, or a large sample size, or both. Judgments about substantive significance should never be based on p values.
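
As a rough illustration (again Python with SciPy, and with numbers chosen purely for the sake of the demonstration rather than taken from any real study), a trivial correlation can reach significance in a large sample while a large correlation fails to reach it in a small one:

```python
from math import sqrt
from scipy import stats

def p_from_r(r, n):
    """Two-tailed p value for a Pearson correlation r observed in a sample of size n."""
    t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
    return 2 * stats.t.sf(abs(t), df=n - 2)

print(f"trivial effect, big sample:  r = .10, N = 1000, p = {p_from_r(0.10, 1000):.4f}")
print(f"large effect, small sample:  r = .50, N = 12,   p = {p_from_r(0.50, 12):.4f}")

# The first result is "highly significant" (p is roughly .002) even though the effect is tiny;
# the second is nonsignificant (p is roughly .10) even though the effect is large.
```

The p value on its own cannot tell us which situation we are in; only the effect size can.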

It is essential that researchers learn to distinguish between statistical and substantive significance. Failure to do so leads to Type I and Type II errors, wastes resources, and potentially misleads further research on the topic.

Source: The Essential Guide to Effect Sizes


Is it possible for a result to be statistically nonsignificant but substantively significant?

May 30, 2010

It is quite possible, and unfortunately quite common, for a result to be statistically significant and trivial. It is also possible for a result to be statistically nonsignificant and important.

Consider the case of a new drug that researchers hope will cure Alzheimer’s disease (Kirk 1996). They set up a trial study involving two groups each with 6 patients. One group receives the experimental treatment while the other receives a placebo. At the end of the trial they notice a 13 point improvement in the IQ of the treated group and no improvement in the control group. The drug seems to have an effect. However, the t statistic is statistically nonsignificant. The results could be a fluke. What to do?

Which of the following choices makes more sense to you:

(a) abandon the study – the result is statistically nonsignificant so the drug is ineffective
(b) conduct a larger study – a 13 point improvement seems promising

If you chose “a” you may have misinterpreted an inconclusive result as evidence of no effect. You may have confused statistical significance with substantive significance. Are you prepared to risk a Type II error when there is potentially much to be gained?

If you chose “b” then you clearly think it is possible for a result to be statistically nonsignificant yet important at the same time. You have distinguished statistical significance from substantive significance.
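
For the curious, here is a back-of-the-envelope version of the drug trial (Python with SciPy). The 13-point gain and the two groups of 6 come from the example above; the within-group standard deviation of 15 IQ points is my own assumption, since the summary here does not report the raw data:

```python
from math import sqrt
from scipy import stats

n = 6                                   # patients per group (from the example above)
gain_treated, gain_control = 13.0, 0.0  # mean IQ improvement in each group
sd = 15.0                               # assumed within-group SD in IQ points (not given in the example)

se = sd * sqrt(2 / n)                         # standard error of the difference in means
t = (gain_treated - gain_control) / se        # independent-samples t statistic (equal n, equal SDs assumed)
p = 2 * stats.t.sf(abs(t), df=2 * n - 2)      # two-tailed p value
d = (gain_treated - gain_control) / sd        # Cohen's d, a standardized effect size

print(f"t = {t:.2f}, p = {p:.3f}, d = {d:.2f}")
# Roughly t = 1.50, p = 0.16, d = 0.87: a large effect by most standards,
# yet statistically nonsignificant because the groups are so small.
```

Under these assumptions the trial is badly underpowered, which is exactly why option (b), a larger study, is the sensible choice.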

For more, see The Essential Guide to Effect Sizes, chapter 1.


I’m a scientist. I deal with abstract concepts. Why should I try and interpret the substantive significance of my results?

May 30, 2010

Short answer #1: we have a moral obligation to ourselves and society to do so.
Short answer #2: why do research that has no substantive value?

Consider two researchers conducting the exact same project investigating the relationship between, say, worker training and productivity. The first researcher draws his conclusions solely by looking at the p values generated by tests of statistical significance. The second researcher draws her conclusions by examining her estimates of the effect size. Judge for yourself who tells the more compelling story:

Researcher 1: “The results of my study reveal that worker training has a statistically significant and positive effect on productivity.”

Researcher 2: “The results of my study reveal that an annual $4000 investment in worker training boosts productivity by 15% leading to a net gain for a typical middle-sized firm of $1m in revenue.”

Did you spot the difference?

Researcher 1’s conclusion is yawn-inducing. He tells us nothing we couldn’t have guessed before the study was done.

Researcher 2’s conclusion, on the other hand, will make the front page of Managers’ Monthly. By translating her observed effect into a meaningful metric, she has made her result relevant to stakeholders beyond the academic community.

Imagine that. Imagine if we all did research that actually mattered. Then perhaps regular folks would stop making jokes about people who work in ivory towers.

For more on dealing with the challenge of interpretation, see The Essential Guide to Effect Sizes.


You say journal editors and academy presidents have called for researchers to interpret the substantive, as opposed to the statistical, significance of their results. Which editors exactly?

May 30, 2010

So far: Campbell (1982), Cummings (2007), Hambrick (1994), JEP (2003), Kendall (1997), La Greca (2005), Levant (1992), Lustig and Strauser (2004), Shaver (2006, 2008), Thompson (2002).

See also Wilkinson and the Task Force on Statistical Inference (1999), the Publication Manual of the APA (2010, p. 35), and the AERA’s Standards for Reporting (AERA 2006, p. 10).

For a full list of references, click here.