The **multiplicity problem** arises when studies report the results of multiple statistical tests, raising the probability that at least some of the results will be found to be statistically significant even if there is no underlying effect. Run enough tests and you will eventually find something, even when there’s nothing to be found.

In their recommendations to the APA, Wilkinson and the Taskforce on Statistical Inference (1999: 599) described the multiplicity problem as “the curse of the social sciences.” Consider a study that tests 14 null hypotheses, all of which happen to be true (meaning there are no underlying effects). If each test is assessed according to a conventional alpha level of .05, the odds are better than even that at least one of the results will be found to be statistically significant purely as a result of chance. Crunch the data 45 different ways and the probability that at least one result will turn out to be statistically significant rises to 90%.

Where do these numbers come from?

If *N* independent tests are examined for statistical significance, and all of the individual null hypotheses are true, then the probability that at least one of them will be found to be statistically significant is equal to 1 – (1 – *α*)^{*N*}, for any given level of alpha (*α*).

If the critical alpha level for a single test is set at .05, this means the probability of erroneously attributing statistical significance to a result when the null is true is .05. But if two or three tests are run, the probability of achieving at least one statistically significant result rises to about .10 and .14, respectively. For a study reporting 14 tests, the probability that at least one result will be found to be statistically significant is 1 – (1 – .05)^{14} = .51.
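If you would like to check these numbers yourself, here is a minimal Python sketch of the formula. The function name `familywise_error_rate` is my own label for illustration, and the calculation assumes, as above, that the tests are independent and every null hypothesis is true.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across n_tests
    independent tests when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests


# Test counts mentioned in the text: 1, 2, 3, 14 and 45.
for n in (1, 2, 3, 14, 45):
    print(f"{n:>2} tests: {familywise_error_rate(n):.4f}")
```

With 13 tests the probability is still just under one half; the 14th test is the one that tips the odds past even.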

What’s the solution?

The standard cure for the multiplicity problem is to adjust alpha levels to account for the large number of tests being run on the data. One way to do this is to apply the Bonferroni correction, which assesses each individual test against an alpha level of α/*N*, where α represents the critical test level that would have been applied if only one hypothesis were being tested and *N* represents the number of tests being run on the same set of data. For more sophisticated variations of this remedy, see Keppel (1982, chapter 8).
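As a rough sketch of how the Bonferroni correction works in practice, the Python below divides alpha by the number of tests and checks each p-value against the adjusted threshold. The function names and the p-values are hypothetical, made up purely for illustration, and I have treated "significant" as p strictly below the threshold.

```python
def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test alpha level after the Bonferroni correction (alpha / N)."""
    return alpha / n_tests


def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag each p-value that falls below the Bonferroni-adjusted threshold."""
    threshold = bonferroni_alpha(alpha, len(p_values))
    return [p < threshold for p in p_values]


# Hypothetical p-values from three tests run on the same data set.
p_values = [0.001, 0.02, 0.04]
print(bonferroni_alpha(0.05, len(p_values)))   # adjusted per-test threshold
print(significant_after_bonferroni(p_values))  # which results survive

# With 14 independent tests of true nulls, the adjusted threshold keeps the
# chance of at least one false positive just under the original .05.
adjusted = bonferroni_alpha(0.05, 14)
print(1 - (1 - adjusted) ** 14)
```

All three hypothetical results would pass an unadjusted .05 threshold, but only the first survives the correction.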

Personally, I don’t think adjusting alpha is the way to go. Adjusting alpha to control the familywise error rate (the probability of at least one false positive across the whole set of tests) inevitably drains statistical power, making it harder to assign statistical significance to both chance fluctuations and genuine effects.
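To get a feel for how much power is lost, here is a back-of-the-envelope sketch. It assumes a two-sided, one-sample z-test with known variance, and the effect size (0.5 standard deviations) and sample size (30) are numbers I have picked for illustration; none of this comes from a real study. It uses `NormalDist` from Python's standard library (Python 3.8 or later).

```python
from statistics import NormalDist


def z_test_power(effect_size: float, n: int, alpha: float) -> float:
    """Approximate power of a two-sided one-sample z-test (known variance).

    effect_size is the true mean shift in standard deviation units.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    shift = effect_size * n ** 0.5              # expected test statistic
    # Chance that the statistic lands in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Illustrative assumptions: a genuine effect of 0.5 SD, 30 observations,
# and 14 tests sharing a Bonferroni-adjusted alpha.
effect_size, n, n_tests = 0.5, 30, 14
print(f"unadjusted alpha: power = {z_test_power(effect_size, n, 0.05):.2f}")
print(f"Bonferroni alpha: power = {z_test_power(effect_size, n, 0.05 / n_tests):.2f}")
```

The stricter threshold protects against false positives, but under these assumptions it also makes the genuine effect noticeably harder to detect.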

A better solution is to resist the temptation to test for effects other than the few you were actually looking for in the first place. Aimlessly fishing around in the data in the hope of finding a statistically significant result is not good science.