Does “Statistical Significance” Imply “Actually Significant”?

P-values below 0.05: finding and reporting them might be considered the backbone of most psychological research. Conceptually, these values are supposed to represent the notion that, if the null hypothesis is true, the odds of observing a set of results at least as extreme as the ones obtained are under 5%. As such, if one observes a result unlikely to be obtained by chance, this would seem to carry the implication that the null hypothesis is unlikely to be true and there are likely real differences between the group means under examination. Despite null hypothesis significance testing becoming the standard means of statistical testing in psychology, the method is not without its flaws, on both the conceptual and practical levels. According to a paper by Simmons et al (2011), on the practical end of things, some of the ways in which researchers are able to selectively collect and analyze data can dramatically inflate the odds of obtaining a statistically significant result.

Don’t worry though; it probably won’t blow up in your face until much later in your career.

Before getting to their paper, it’s worth covering some of the conceptual issues inherent with null hypothesis significance testing, as the practical issues can be said to apply just as well to other kinds of statistical testing. Brandstaetter (1999) raises two large concerns about null hypothesis significance testing, though really they’re more like two parts of the same concern, and, ironically enough, almost sound as if they’re opposing points. The first part of this concern is that classic significance testing does not tell us whether the results we observed came from a sample with a mean that was actually different from the null hypothesis. In other words, a statistically significant result does not tell us that the null hypothesis is false; in fact, it doesn’t even tell us the null hypothesis is unlikely. According to Brandstaetter (1999), this is due to the logic underlying significance testing being invalid. The specific example that Brandstaetter uses references the rolling of dice: if you roll a twenty-sided die, it’s unlikely (5%) that you will observe a 1; however, if you observe a 1, it doesn’t follow that it’s unlikely you rolled the die.

While that example addresses null hypothesis testing at a strictly logical level, this objection can be dealt with fairly easily, I feel: in Brandstaetter’s example, the hypothesis that one would be testing is not “the die was rolled”, so that specific example seems a bit strange. If you were comparing the heights of two different groups (say, men and women), and you found that one group was, in your sample, an average of six inches taller than the other, it might be reasonable to conclude that the two samples are unlikely to come from populations with the same mean. This is where the second part of the criticism comes into play: in reality, the means of different groups are almost guaranteed to be different in some way, no matter how small or large that difference is. This means that, strictly speaking, the null hypothesis (there is no mean difference) is pretty much always false; the matter then becomes whether your test has enough power to reach statistical significance, and increasing your sample size can generally do the trick in that regard. So, in addition to not telling us about whether the null hypothesis is true or false, the best that this kind of significance testing can do is tell us a specific value that a population mean is not. However, since there are an infinite number of possible values that a population mean could hypothetically take, the value of this information may be minimal.
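To put rough numbers on the sample-size point, here is a minimal sketch of how the power to detect a tiny but real difference grows with the number of subjects. It is not taken from Brandstaetter's paper; it assumes a two-sided, two-sample z-test with known and equal variances (a simplification of the t-tests usually reported), and the function name `two_sample_power` and the effect size of 0.05 standard deviations are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(effect_size, n_per_group, alpha=0.05):
    """Power of a two-sided, two-sample z-test with known, equal variances.

    effect_size is the true mean difference in standard-deviation units
    (an illustrative assumption, not a value from the post).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    shift = effect_size * sqrt(n_per_group / 2)  # expected value of the z statistic
    # Probability that the z statistic lands in either rejection region.
    return std_normal.cdf(-z_crit + shift) + std_normal.cdf(-z_crit - shift)


if __name__ == "__main__":
    # A trivially small true difference: 0.05 standard deviations.
    for n in (100, 1_000, 10_000, 100_000):
        print(f"n per group = {n:>7}: power = {two_sample_power(0.05, n):.3f}")
```

Under these assumptions, a difference that would almost never reach significance with 100 subjects per group is detected most of the time once the groups are large enough, even though the difference itself has not become any more important.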

Even in the best of times, then, significance testing has some rather definite conceptual concerns. These two conceptual issues, however, seem to be overshadowed in importance by the practical issues that arise while conducting research: what Simmons et al (2011) call “researcher degrees of freedom”. This term is designed to capture some of the various decisions that researchers might make over the course of collecting and analyzing data while hunting for statistically significant results capable of being published. As publications are important for any researcher’s career, and statistically significant results are the kind that are most likely to be published (or so I’ve been told), this combination of pressures can lead to researchers making choices, albeit not typically malicious ones, that increase their chances of finding such results.

“There’s a significant p-value in this mountain of data somewhere, I tell you!”

Simmons et al (2011) began by generating random samples, all pulled from a normal distribution, across 15,000 independent simulations. Since every sample came from the same distribution, any statistically significant effect they found was a false positive; if they were using classic significance testing, the rate of such findings should not tend to exceed 5%. When there were two dependent measures capable of being analyzed (in their example, these were willingness to pay and liking), the ability to analyze these two measures separately or in combination nearly doubled the chances of finding a statistically significant “effect” at the 0.05 level. That is to say, the odds of finding an effect by chance were no longer 5%, but closer to 10%. A similar inflation was found when the researchers had the option of controlling for gender. This makes intuitive sense, as it’s basically the same manipulation as the former two-measure case, just with a different label.
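The two-measure case can be sketched with a small simulation. This is not Simmons et al's code: it assumes 20 subjects per condition, two measures correlated at 0.5, and, to stay within the Python standard library, a z-test with a known standard deviation in place of a t-test. The function names are invented for the illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 5% cutoff, about 1.96


def is_significant(group_a, group_b, sd):
    """Two-sample z-test for equal-sized groups with known sd (a simplification)."""
    n = len(group_a)
    z = (fmean(group_a) - fmean(group_b)) / (sd * sqrt(2 / n))
    return abs(z) > Z_CRIT


def draw_measures(n, r, rng):
    """Draw n pairs of standard-normal measures with correlation r."""
    first = [rng.gauss(0, 1) for _ in range(n)]
    second = [r * x + sqrt(1 - r * r) * rng.gauss(0, 1) for x in first]
    return first, second


def false_positive_rates(n_sims=5000, n=20, r=0.5, seed=1):
    """Return (rate testing one measure, rate testing either measure or their average)."""
    rng = random.Random(seed)
    sd_avg = sqrt((1 + r) / 2)  # sd of the average of two unit-variance measures
    single_hits = flexible_hits = 0
    for _ in range(n_sims):
        a1, a2 = draw_measures(n, r, rng)  # condition A
        b1, b2 = draw_measures(n, r, rng)  # condition B, same population: no true effect
        a_avg = [(x + y) / 2 for x, y in zip(a1, a2)]
        b_avg = [(x + y) / 2 for x, y in zip(b1, b2)]
        first = is_significant(a1, b1, 1.0)
        second = is_significant(a2, b2, 1.0)
        average = is_significant(a_avg, b_avg, sd_avg)
        single_hits += first
        flexible_hits += first or second or average
    return single_hits / n_sims, flexible_hits / n_sims


if __name__ == "__main__":
    single, flexible = false_positive_rates()
    print(f"one pre-chosen measure:       {single:.3f}")
    print(f"best of two measures/average: {flexible:.3f}")
```

Because the flexible analysis counts a hit whenever any of the three tests is significant, its false-positive rate can never be lower than that of the single pre-chosen measure, and in runs like this it should land well above the nominal 5%.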

There’s similar bad news for the peek-and-test method that some researchers make use of with their data. In these cases, a researcher will collect some number of subjects for each condition (say, 20) and conduct a test to see if they found an effect. If an effect is found, the researcher will stop collecting data; if the effect isn’t found, the researcher will then collect another number of observations per condition (say, another 10) and then retest for significance. A researcher’s ability to peek at their data increased the odds of finding an effect by chance up to about 8%. Finally, if the researcher decides to run multiple levels of a condition (Simmons et al’s example concerned splitting the sample into low, medium, and high conditions), the ability to selectively compare these conditions to each other brought the false positive rate up to 12.6%. Worryingly, if these four degrees of researcher freedom were combined, the odds of finding a false positive were as high as 60%; that is, the odds are better that you would find some effect strictly by chance than that you wouldn’t. While these results might have been statistically significant, they are not actually significant. This is a fine example of Brandstaetter’s (1999) initial point: a statistically significant result does not tell us that the null hypothesis is false or even unlikely, since the null hypothesis was true in every one of these simulated cases.
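The peek-and-test procedure described above can be sketched the same way. Again, this is an illustration rather than the authors' code: it uses 20 subjects per condition plus 10 more if the first test fails, as in the example above, but swaps in a z-test with a known standard deviation of 1 so that only the Python standard library is needed. The name `peeking_rates` is invented here.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 5% cutoff, about 1.96


def is_significant(group_a, group_b):
    """Two-sample z-test for equal-sized groups with known sd of 1 (a simplification)."""
    n = len(group_a)
    z = (fmean(group_a) - fmean(group_b)) / sqrt(2 / n)
    return abs(z) > Z_CRIT


def peeking_rates(n_sims=5000, first_n=20, extra_n=10, seed=1):
    """Return (false-positive rate at the first look, rate when allowed a second look)."""
    rng = random.Random(seed)
    first_look_hits = peeking_hits = 0
    for _ in range(n_sims):
        # Both conditions come from the same population, so any "effect" is chance.
        a = [rng.gauss(0, 1) for _ in range(first_n)]
        b = [rng.gauss(0, 1) for _ in range(first_n)]
        hit_first = is_significant(a, b)
        hit_any = hit_first
        if not hit_first:
            # No effect yet: collect more subjects per condition and test again.
            a += [rng.gauss(0, 1) for _ in range(extra_n)]
            b += [rng.gauss(0, 1) for _ in range(extra_n)]
            hit_any = is_significant(a, b)
        first_look_hits += hit_first
        peeking_hits += hit_any
    return first_look_hits / n_sims, peeking_hits / n_sims


if __name__ == "__main__":
    first_look, with_peek = peeking_rates()
    print(f"test once at n = 20:       {first_look:.3f}")
    print(f"peek, then add 10 if null: {with_peek:.3f}")
```

The second look only ever adds significant results to the tally; it never takes one away. That asymmetry is what pushes the overall false-positive rate above the nominal 5%.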

As Simmons et al (2011) also note, this rate of false positives might even be conservative, given that there are other, unconsidered liberties that researchers can take. Making matters even worse, there’s the aforementioned publication bias, in that, at least as far as I’ve been led to believe, journals tend to favor publications that (a) find statistically significant results and (b) are novel in their design (i.e. journals tend to not publish replications). This means that when false positives are found, they’re both more likely to make their way into journals and less likely to subsequently be corrected. In turn, those false positives could lead to poor research outcomes, such as researchers wasting time and money chasing effects that are unlikely to be found again, or, in the event that a researcher goes chasing after one and happens to find it again by chance, reinforcing the initial false positive when that result is subsequently published.

“With such a solid foundation, it’s difficult to see how this could have happened”

Simmons et al (2011) do put forth some suggestions as to how these problems could begin to be remedied. While I think their suggestions are all, in the abstract, good ideas, they would likely also generate a good deal more paperwork for researchers to deal with, and I don’t know a researcher alive who craves more paperwork. While there might be some tradeoff, in this case, between some amount of paperwork and eventual research quality, there is one point that Simmons et al (2011) do not discuss when it comes to remedying this issue, and that’s the matter I have been writing about for some time: the inclusion of theory in research. In my experience, a typical paper in psychology will give one of two explicit reasons for its being conducted: (1) an effect was found previously, so the researchers are looking to either find it again (or not find it), or (2) the authors have a hunch they will find an effect. Without a real theoretical framework surrounding these research projects, there is little need to make sense of or actually explain a finding; one can simply say they discovered a “bias” or a “cognitive blindness” and leave it at that. While I can’t say how much of the false-positive problem, if any, could be dealt with by requiring the inclusion of some theoretical framework for understanding one’s results when submitting a manuscript, I feel some theory requirement would still go a long way towards improving the quality of research that ends up getting published. It would encourage researchers to think more deeply about why they’re doing what they’re doing, as well as help readers to understand (and critique) the results they end up seeing. While dealing with false positives should certainly be a concern, merely cutting down on their appearance is not enough to help research quality in psychology progress appreciably.

References: Brandstaetter (1999). Confidence intervals as an alternative to significance testing. Methods of Psychological Research Online, 4.

Simmons, J., Nelson, L., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science, 22 (11), 1359-1366. DOI: 10.1177/0956797611417632

Leigh Caldwell on November 20, 2012 at 2:51 pm said:

One of the authors, Simonsohn, also just presented at the SJDM conference a clever method for detecting what he calls “p-hacking”, the collective name for the p-value-reducing practices mentioned above.

It doesn’t work on single papers but can be used across all the papers in a group: say, all papers on the endowment effect or all by a particular author. The technique simply requires drawing a histogram of all the p-values in all the papers.

If the papers are describing real effects and conducted correctly, we would expect to see more p=0.01 than p=0.02 and more p=0.03 than p=0.05 – the histogram will be skewed towards zero. If there is no effect and the results are purely due to chance, we’d see a flat graph. And if there is a lot of p-hacking, we’d see it skewed the other way, towards the 0.05 end.

He analysed a couple of groups of papers (based on specific keyword criteria) and found that p-hacking could indeed be detected in certain bodies of work. Clever technique and very practical at the meta-level to see whether the overall research in a particular field can be relied on.


Pop Psychology

The Internet's Best Evolutionary Psycholo-guy
