What is the difference between statistical and substantive significance?

May 30, 2010

Statistical significance reflects the improbability of findings drawn from samples given certain assumptions about the null hypothesis.

Substantive significance is concerned with meaning, as in, what do the findings say about population effects themselves?

Researchers typically estimate population effects by examining representative samples. Although researchers may invest considerable effort in minimizing measurement and sampling error and thereby producing more accurate effect size estimates, ultimately the goal is a better understanding of real world effects. This distinction between real world effects and researchers’ sample-based estimates of those effects is critical to understanding the difference between statistical and substantive significance.

The statistical significance of any test result is determined by gauging the probability of getting a result at least as large as the one observed if there were no underlying effect. The outcome of any test is a conditional probability, or p value. If the p value falls below a conventionally accepted threshold (say .05), we might judge the result to be statistically significant.

The substantive significance of a result, in contrast, has nothing to do with the p value and everything to do with the estimated effect size. Only when we know whether we’re dealing with a large or a trivial effect will we be able to interpret its meaning and so speak to the substantive significance of our results. Note, though, that while the size of an effect will be correlated with its importance, there will be plenty of occasions when even small effects may be judged important.
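To make the distinction concrete, here is a minimal sketch in Python (the data are simulated and purely illustrative, not from any real study). It computes two separate quantities for the same two groups: a p value, which speaks to statistical significance, and Cohen’s d, which speaks to substantive significance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two hypothetical groups; the means, SD, and group sizes are invented for illustration.
treatment = rng.normal(loc=102, scale=15, size=40)
control = rng.normal(loc=100, scale=15, size=40)

# Statistical significance: the p value from an independent-samples t test.
t_stat, p_value = stats.ttest_ind(treatment, control)

# Substantive significance: an effect size estimate, here Cohen's d
# (mean difference divided by the pooled standard deviation).
pooled_sd = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
cohens_d = (treatment.mean() - control.mean()) / pooled_sd

print(f"p value   = {p_value:.3f}  (speaks to statistical significance)")
print(f"Cohen's d = {cohens_d:.2f}  (speaks to substantive significance)")
```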

For more, see the brilliantly helpful ebook Effect Size Matters.



How do researchers confuse statistical with substantive significance?

May 30, 2010

Researchers can confuse statistical significance with substantive significance in one of two ways:

  1. Results that are found to be statistically significant are interpreted as if they were practically meaningful. This happens when a researcher interprets a statistically significant result as being “significant” or “highly significant” in the everyday sense of the word.
  2. Results that are statistically nonsignificant are interpreted as evidence of no effect, even in the face of evidence to the contrary (e.g., a noteworthy effect size).

In some settings statistical significance will be completely unrelated to substantive significance. It is entirely possible for a result to be statistically significant and trivial or statistically nonsignificant yet important. (Click here for an example.)

Researchers get confused about these things when they misattribute meaning to p values. Remember, a p value is a confounded index. A statistically significant p could reflect either a large effect, or a large sample size, or both. Judgments about substantive significance should never be based on p values.
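A quick sketch shows the confounding in action. The correlations and sample sizes below are invented for illustration: a large effect in a small sample and a trivial effect in a huge sample can both produce a “statistically significant” p value of roughly the same size.

```python
from scipy import stats

# Scenario A: large effect (r = .60) in a small sample (N = 20)
# Scenario B: trivial effect (r = .03) in a huge sample (N = 10,000)
for label, r, n in [("large effect, small N ", 0.60, 20),
                    ("trivial effect, huge N", 0.03, 10_000)]:
    # t statistic for testing a Pearson correlation against zero
    t = r * ((n - 2) / (1 - r**2)) ** 0.5
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    print(f"{label}: r = {r}, N = {n}, p = {p:.4f}")
```

Both scenarios come out “significant” at the .01 level, yet only one of them describes an effect worth talking about.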

It is essential that researchers learn to distinguish between statistical and substantive significance. Failure to do so leads to Type I and Type II errors, wastes resources, and potentially misleads further research on the topic.

Source: The Essential Guide to Effect Sizes


Is it possible for a result to be statistically nonsignificant but substantively significant?

May 30, 2010

It is quite possible, and unfortunately quite common, for a result to be statistically significant and trivial. It is also possible for a result to be statistically nonsignificant and important.

Consider the case of a new drug that researchers hope will cure Alzheimer’s disease (Kirk 1996). They set up a trial involving two groups, each with 6 patients. One group receives the experimental treatment while the other receives a placebo. At the end of the trial they observe a 13-point improvement in the IQ of the treated group and no improvement in the control group. The drug seems to have an effect. However, the t statistic is statistically nonsignificant. The result could be a fluke. What to do?

Which of the following choices makes more sense to you:

(a) abandon the study – the result is statistically nonsignificant so the drug is ineffective
(b) conduct a larger study – a 13-point improvement seems promising

If you chose “a” you may have misinterpreted an inconclusive result as evidence of no effect. You may have confused statistical significance with substantive significance. Are you prepared to risk a Type II error when there is potentially much to be gained?

If you chose “b” then you clearly think it is possible for a result to be statistically nonsignificant yet important at the same time. You have distinguished statistical significance from substantive significance.
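To see the arithmetic behind this example, here is a rough sketch in Python. The group size (6 per group) and the 13-point gain come from the example above; the 15-point standard deviation is an assumption on my part (a typical SD for IQ scores), not a figure taken from Kirk (1996).

```python
from scipy import stats

n_per_group = 6    # patients per group, as in the example
mean_gain = 13.0   # IQ-point improvement in the treated group vs the control group
sd = 15.0          # assumed standard deviation of IQ scores (not from Kirk 1996)

# Independent-samples t test for the difference between the two group means
se_diff = sd * (2 / n_per_group) ** 0.5
t = mean_gain / se_diff
p = 2 * stats.t.sf(abs(t), df=2 * n_per_group - 2)

# Effect size for the same difference: Cohen's d
d = mean_gain / sd

print(f"t = {t:.2f}, p = {p:.3f}  (statistically nonsignificant at the .05 level)")
print(f"Cohen's d = {d:.2f}       (a large effect by conventional benchmarks)")
```

Under these assumptions the p value is about .16, yet the estimated effect (d ≈ 0.87) is large: exactly the combination of “nonsignificant yet potentially important” described above.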

For more, see The Essential Guide to Effect Sizes, chapter 1.


I’m a scientist. I deal with abstract concepts. Why should I try to interpret the substantive significance of my results?

May 30, 2010

Short answer #1: we have a moral obligation to ourselves and society to do so.
Short answer #2: why do research that has no substantive value?

Consider two researchers conducting the exact same project investigating the relationship between, say, worker training and productivity. The first researcher draws his conclusions solely by looking at the p values generated by tests of statistical significance. The second researcher draws her conclusions by examining her estimates of the effect size. Judge for yourself who tells the more compelling story:

Researcher 1: “The results of my study reveal that worker training has a statistically significant and positive effect on productivity.”

Researcher 2: “The results of my study reveal that an annual $4000 investment in worker training boosts productivity by 15%, leading to a net gain of $1m in revenue for a typical middle-sized firm.”

Did you spot the difference?

Researcher 1’s conclusion is yawn-inducing. He tells us nothing we couldn’t have guessed before the study was done.

Researcher 2’s conclusion, on the other hand, will make the front page of Managers’ Monthly. By translating her observed effect into a meaningful metric, she has made her result relevant to stakeholders beyond the academic community.

Imagine that. Imagine if we all did research that actually mattered. Then perhaps regular folks would stop making jokes about people who work in ivory towers.

For more on dealing with the challenge of interpretation, see The Essential Guide to Effect Sizes.


You say journal editors and academy presidents have called for researchers to interpret the substantive, as opposed to the statistical, significance of their results. Which editors exactly?

May 30, 2010

So far: Campbell (1982), Cummings (2007), Hambrick (1994), JEP (2003), Kendall (1997), La Greca (2005), Levant (1992), Lustig and Strauser (2004), Shaver (2006, 2008), Thompson (2002).

See also Wilkinson and the Task Force on Statistical Inference (1999), the Publication Manual of the APA (2010, p. 35), and the AERA’s Standards for Reporting (AERA 2006, p. 10).

For a full list of references, click here.


Why is it a dumb idea to interpret results by looking at p values?

May 30, 2010

It is a common (but wickedly bad) practice to make judgments about a research result by looking at p values. Even in top journals you’ll sometimes see the following decision rules applied:

–    if p ≥ .10, then the result is interpreted as providing “no support” for a hypothesis
–    if .05 ≤ p < .10, then this is interpreted as providing “marginal support”
–    if p < .05, then this is interpreted as evidence “supporting” the hypothesis
–    if p < .01 or .001, this is sometimes interpreted as “strong support” or “strong confirmation” or “strong evidence”

What’s wrong with this? Everything! A p value is a confounded index, so it should not be used to make judgments about effects of interest.

Imagine that we had hypothesized that X has a positive effect on Y. We collect some data, run a test and get the following result:

N = 70,000, r = .01, p < .01

Looking at the very low p value of this result we might conclude that this test revealed good evidence in support of our hypothesis. But we would be wrong. We have confused statistical with substantive significance.

Look closely at the numbers again. Note how the effect size estimate (r) is tiny, virtually zero. We have in all likelihood detected nothing of significance, just a little fluff on the proverbial lens.

So how is it that the p value is so low in this case? Because this is an overpowered test. The sample size (N) is off the scale.

How can we avoid making what is essentially a Type I error in this situation? By ignoring the p value altogether and basing our interpretation on the tiny effect size. If we did this we would most likely conclude that X has no appreciable effect on Y.
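If you want to check the arithmetic for yourself, here is a minimal sketch in Python that reproduces the numbers above by testing a Pearson correlation of .01 against zero with N = 70,000:

```python
from scipy import stats

n, r = 70_000, 0.01

# t statistic for testing a correlation coefficient against zero
t = r * ((n - 2) / (1 - r**2)) ** 0.5
p = 2 * stats.t.sf(abs(t), df=n - 2)

# The p value clears the .01 threshold, yet r tells us the effect is trivial:
# r**2 = 0.0001, i.e. X accounts for just 0.01% of the variance in Y.
print(f"r = {r}, N = {n}: t = {t:.2f}, p = {p:.4f}, r squared = {r**2:.4f}")
```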

Source: The Essential Guide to Effect Sizes


Why do you say a p value is a confounded index?

May 30, 2010

Because it never turns out the way I want it, that confounded thing!

Seriously, the p value is literally a confounded index because it reflects both the size of the underlying effect and the size of the sample. Hence any information included in the p value is ambiguous (Lang et al. 1998).

Consider the following equation, which comes from Rosenthal and Rosnow (1984):

Statistical significance = Effect size × Sample size

Now let’s hold the effect size constant for a moment and consider what happens to statistical significance when we fiddle with the sample size (N). Basically, as N goes up, p will go down automatically. It has to. It has absolutely no choice. This is not a question of careful measurement or anything like that. It’s a basic mathematical equation. The bigger the sample, the more likely the result will be statistically significant, regardless of other factors.

Conversely, as N goes down, p must go up. The smaller the sample, the less likely the result will be statistically significant.
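Here is a minimal sketch of that arithmetic in Python. A small, made-up effect (Cohen’s d = 0.2) is held constant while the sample size grows, and the p value falls purely as a consequence of N:

```python
from scipy import stats

d = 0.2  # a small effect size (Cohen's d), held constant throughout

# Same effect, bigger and bigger samples: watch the p value shrink.
for n_per_group in (20, 50, 200, 1000, 5000):
    # t statistic for a two-sample t test when the observed standardized
    # mean difference equals d and both groups contain n_per_group cases
    t = d / (2 / n_per_group) ** 0.5
    p = 2 * stats.t.sf(abs(t), df=2 * n_per_group - 2)
    print(f"n per group = {n_per_group:>5}: t = {t:5.2f}, p = {p:.4f}")
```

With 20 cases per group the result is nowhere near significant; by a few hundred per group the same small effect sails under the .05 threshold.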

So if you happen to get a statistically significant result (a low p value), it could mean that (a) you have found something, or (b) you found next to nothing, but your test was super-powerful because you had a large sample.

Researchers often confuse statistical significance with substantive significance. But smart researchers understand that p values should never be used to inform judgments about real world effects.

Source: The Essential Guide to Effect Sizes