Does “Statistical Significance” Imply “Actually Significant”?

P-values below 0.05; the finding and reporting of these values might be considered the backbone of most psychological research. Conceptually, a p-value is supposed to represent the probability of observing results at least as extreme as the ones obtained, assuming the null hypothesis is true; a value below 0.05 means that probability is under 5%. As such, if one observes a result unlikely to be obtained by chance, this would seem to carry the implication that the null hypothesis is unlikely to be true and that there are likely real differences between the group means under examination. Despite null hypothesis significance testing becoming the standard means of statistical testing in psychology, the method is not without its flaws, both on the conceptual and practical levels. According to a paper by Simmons et al. (2011), on the practical end of things, some of the ways in which researchers are able to selectively collect and analyze data can dramatically inflate the odds of obtaining a statistically significant result.
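To make that a bit more concrete, here's a quick simulation sketch of my own (Python, using numpy and scipy; nothing here comes from the papers being discussed): when the null hypothesis really is true, significant results should show up at roughly that advertised 5% rate.

```python
# Sketch: when the null hypothesis is true, p < .05 occurs about 5% of the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, alpha, false_positives = 10_000, 0.05, 0

for _ in range(n_sims):
    a = rng.normal(0, 1, 20)   # both groups drawn from the same distribution,
    b = rng.normal(0, 1, 20)   # so the null hypothesis is true by construction
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1

print(f"False-positive rate: {false_positives / n_sims:.3f}")  # ~0.05
```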

Don’t worry though; it probably won’t blow up in your face until much later in your career.

Before getting to their paper, it's worth covering some of the conceptual issues inherent in null hypothesis significance testing, as the practical issues can be said to apply just as well to other kinds of statistical testing. Brandstaetter (1999) raises two large concerns about null hypothesis significance testing, though really they're more like two parts of the same concern, and, ironically enough, they almost sound like opposing points. The first part of this concern is that classic significance testing does not tell us whether the results we observed came from a population with a mean that actually differs from the one specified by the null hypothesis. In other words, a statistically significant result does not tell us that the null hypothesis is false; in fact, it doesn't even tell us the null hypothesis is unlikely. According to Brandstaetter (1999), this is because the logic underlying significance testing is invalid. The specific example Brandstaetter uses involves rolling dice: if you roll a twenty-sided die, it's unlikely (5%) that you will observe a 1; however, if you observe a 1, it doesn't follow that it's unlikely you rolled the die.

While that example addresses null hypothesis testing at a strictly logical level, this objection can be dealt with fairly easily, I feel: in Brandstaetter's example, the hypothesis that one would be testing is not "the die was rolled", so that specific example seems a bit strange. If you were comparing the heights of two different groups (say, men and women), and you found that one group was, in your sample, an average of six inches taller than the other, it might be reasonable to conclude that it's unlikely the two samples come from populations with the same mean. This is where the second part of the criticism comes into play: in reality, the means of different groups are almost guaranteed to be different in some way, no matter how small or large that difference is. This means that, strictly speaking, the null hypothesis (that there is no mean difference) is pretty much always false; the matter then becomes whether your test has enough power to reach statistical significance, and increasing your sample size can generally do the trick in that regard. So, in addition to not telling us whether the null hypothesis is true or false, the best that this kind of significance testing can do is tell us a specific value that a population mean is not. However, since there are an infinite number of possible values that a population mean could hypothetically take, the value of this information may be minimal.
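To illustrate that second point, here's a small sketch of my own (the numbers are made up, not Brandstaetter's) in which a trivially small but real difference becomes "statistically significant" simply by piling on observations:

```python
# Sketch: the null is almost never exactly true, so a negligible real difference
# will eventually reach p < .05 once the sample gets large enough.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_diff = 0.02  # a 0.02-standard-deviation difference: real, but practically meaningless

for n in (50, 500, 50_000, 500_000):
    a = rng.normal(0.0, 1, n)
    b = rng.normal(true_diff, 1, n)
    print(f"n per group = {n:>7,}  p = {stats.ttest_ind(a, b).pvalue:.4f}")
# With enough observations, p drops below .05 despite the effect being negligible.
```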

Even in the best of times, then, significance testing has some rather definite conceptual concerns. These two conceptual issues, however, seem to be overshadowed in importance by the practical issues that arise in the course of conducting research; what Simmons et al. (2011) call "researcher degrees of freedom". This term is designed to capture the various decisions that researchers might make over the course of collecting and analyzing data while hunting for statistically significant results capable of being published. As publications are important for any researcher's career, and statistically significant results are the kind most likely to be published (or so I've been told), this combination of pressures can lead to researchers making choices – albeit not typically malicious ones – that increase their chances of finding such results.

“There’s a significant p-value in this mountain of data somewhere, I tell you!”

Simmons et al. (2011) began by generating random samples, all drawn from a single normal distribution, across 15,000 independent simulations. Since every sample came from the same distribution, any statistically significant "effect" was a false positive, and with classic significance testing the rate of such findings should not tend to exceed 5%. When there were two dependent measures capable of being analyzed (in their example, willingness to pay and liking), the ability to analyze these two measures separately or in combination nearly doubled the chances of finding a statistically significant "effect" at the 0.05 level. That is to say, the odds of finding an effect by chance were no longer 5%, but closer to 10%. A similar inflation showed up when researchers had the option of controlling for gender or not. This makes intuitive sense, as it's basically the same manipulation as the two-measure case, just with a different label.
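For the curious, here's roughly what that kind of simulation looks like. This is my own re-creation of the two-measure scenario rather than Simmons et al.'s actual code, and I've assumed the two measures are correlated at r = .5:

```python
# Sketch: with two dependent measures, a result counts as a "finding" if either
# measure, or their average, differs significantly between two identical groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_sims, n, hits = 15_000, 20, 0
cov = [[1, 0.5], [0.5, 1]]  # two dependent measures correlated at r = .5

for _ in range(n_sims):
    g1 = rng.multivariate_normal([0, 0], cov, n)  # same population for both groups,
    g2 = rng.multivariate_normal([0, 0], cov, n)  # so the null is true by construction
    ps = (stats.ttest_ind(g1[:, 0], g2[:, 0]).pvalue,
          stats.ttest_ind(g1[:, 1], g2[:, 1]).pvalue,
          stats.ttest_ind(g1.mean(axis=1), g2.mean(axis=1)).pvalue)
    if min(ps) < 0.05:
        hits += 1

print(f"False-positive rate with flexible measure choice: {hits / n_sims:.3f}")  # ~0.10
```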

There’s similar bad news for the peek-and-test method that some researchers make use of with their data. In these cases, a researcher will collect some number of subjects for each condition – say 20 – and conduct a test to see if they found an effect. If an effect is found, the researcher will stop collecting data; if the effect isn’t found, the researcher will then collect another number of observations per condition – say another 10 – and retest for significance. A researcher’s ability to peek at their data in this fashion increased the odds of finding an effect by chance to about 8%. Finally, if the researcher decides to run multiple levels of a condition (Simmons et al.’s example concerned splitting the sample into low, medium, and high conditions), the ability to selectively compare these conditions to each other brought the false-positive rate up to 12.6%. Worryingly, if these four degrees of researcher freedom were combined, the odds of finding a false positive were as high as 60%; that is, the odds are better that you would find some effect strictly by chance than that you wouldn’t. While these results might have been statistically significant, they are not actually significant. This is a fine example of Brandstaetter’s (1999) initial point: a significant result does not tell us that the null hypothesis is false or even unlikely, since the null was, by construction, true in every one of these simulated cases.
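And here's a comparable sketch of the peek-and-test scenario (again my own illustration, not the original simulation code): start with 20 observations per condition, test, and add 10 more per condition only if the first test comes up short.

```python
# Sketch: optional stopping. The null is true throughout, yet the two chances to
# reach p < .05 push the false-positive rate above 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_sims, hits = 15_000, 0

for _ in range(n_sims):
    a, b = rng.normal(0, 1, 20), rng.normal(0, 1, 20)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        hits += 1          # "effect" found on the first peek; stop collecting
        continue
    a = np.concatenate([a, rng.normal(0, 1, 10)])  # otherwise add 10 more per condition
    b = np.concatenate([b, rng.normal(0, 1, 10)])
    if stats.ttest_ind(a, b).pvalue < 0.05:
        hits += 1

print(f"False-positive rate with one peek: {hits / n_sims:.3f}")  # roughly 0.07-0.08
```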

As Simmons et al. (2011) also note, this rate of false positives might even be conservative, given that there are other, unconsidered liberties that researchers can take. Making matters even worse, there’s the aforementioned publication bias, in that, at least as far as I’ve been led to believe, journals tend to favor publications that (a) find statistically significant results and (b) are novel in their design (i.e. journals tend not to publish replications). This means that when false positives are found, they’re both more likely to make their way into journals and less likely to subsequently be corrected. In turn, those false positives can lead to poor research outcomes: researchers may waste time and money chasing effects that are unlikely to be found again, or, should they chase an initial false positive and happen to “find” it again by chance, the spurious effect gets published a second time and becomes further entrenched.

“With such a solid foundation, it’s difficult to see how this could have happened”

Simmons et al. (2011) do put forth some suggestions as to how these problems could begin to be remedied. While I think their suggestions are all, in the abstract, good ideas, they would likely also generate a good deal more paperwork for researchers to deal with, and I don’t know a researcher alive who craves more paperwork. While there might be some tradeoff, in this case, between some amount of paperwork and eventual research quality, there is one point that Simmons et al. (2011) do not discuss when it comes to remedying this issue, and that’s the matter I have been writing about for some time: the inclusion of theory in research. In my experience, a typical paper in psychology will give one of two explicit reasons for its being conducted: (1) an effect was found previously, so the researchers are looking to either find it again (or not find it), or (2) the authors have a hunch they will find an effect. Without a real theoretical framework surrounding these research projects, there is little need to make sense of or actually explain a finding; one can simply say they discovered a “bias” or a “cognitive blindness” and leave it at that. While I can’t say how much of the false-positive problem could be dealt with by requiring the inclusion of some theoretical framework for understanding one’s results when submitting a manuscript, if any, I feel some theory requirement would still go a long way towards improving the quality of research that ends up getting published. It would encourage researchers to think more deeply about why they’re doing what they’re doing, as well as help readers to understand (and critique) the results they end up seeing. While dealing with false positives should certainly be a concern, merely cutting down on their appearance will not be enough to help research quality in psychology progress appreciably.

References: Brandstaetter, E. (1999). Confidence intervals as an alternative to significance testing. Methods of Psychological Research Online, 4.

Simmons, J., Nelson, L., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366. DOI: 10.1177/0956797611417632

A Frequentist And A Bayesian Walk Into Infinity…

I’m going to preface this post by stating that statistics is not my primary area of expertise. Admittedly, this might not be the best way of generating interest, but non-expertise hasn’t seemed to stop many a teacher or writer, so I’m hoping it won’t be too much of a problem here. This non-expertise, however, has apparently also not stopped me from stumbling upon an interesting question concerning Bayesian statistics. Whether this conceptual problem I’ve been mulling over would actually prove to be a problem in real-world data collection is another matter entirely. Then again, there doesn’t appear to be a required link between academia and reality, so I won’t worry too much about that while I indulge in the pleasure of a little bit of philosophical play time.

The link between academia and reality is about as strong as the link between my degree and a good job.

So first, let’s run through a quick problem using Bayesian statistics. This is the classic example by which I was introduced to the idea: say that you’re a doctor trying to treat an infection that has broken out among a specific population of people. You happen to know that 5% of the people in this population are actually infected, and you’re trying to figure out who those people are so you can at least quarantine them. Luckily for you, you happen to have a device that can test for the presence of this infection. If you use this device to test an individual who actually has the disease, it will come back positive 95% of the time; if the individual does not have the disease, it will come back positive 5% of the time. Given that an individual has tested positive for the disease, what is the probability that they actually have it? The answer, unintuitive to most, is 50%.
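If you want to check that number yourself, the arithmetic is short enough to fit in a few lines (the function name here is just my own):

```python
# Bayes' theorem for the infection example: P(infected | positive test).
def posterior(prior, sensitivity, false_positive_rate):
    true_positives = prior * sensitivity                  # infected people who test positive
    false_positives = (1 - prior) * false_positive_rate   # healthy people who test positive
    return true_positives / (true_positives + false_positives)

print(posterior(prior=0.05, sensitivity=0.95, false_positive_rate=0.05))  # 0.5
```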

Though the odds of someone testing positive if they have the disease are high (95%), very few people actually have the disease (5%). So 5% of the 95% of people who don’t have the infection will test positive (4.75% of the population), and 95% of the 5% of people who do have it will test positive as well (another 4.75%), leaving the two groups of positive testers equally large. In case that example ran by too quickly, here’s another brief video example, using hipsters drinking beer in place of an infection being treated. This method of statistical testing would seem to have some distinct benefits: for example, it will tell you the probability of your hypothesis, given your data (which, I’m told, is what most people actually want to be calculating), rather than the probability of your data, given your hypothesis. That said, I see two (possibly major) conceptual issues with this type of statistical analysis. If anyone more versed in these matters feels they have good answers to them, I’d be happy to hear them in the comments section.

The first issue was raised by Gelman (2008), who was discussing the usefulness of our prior knowledge. In the above examples, we know some information ahead of time (the prevalence of an infection or of hipsters); in real life, we frequently don’t know this information; in fact, it’s often what we’re trying to estimate when we’re doing our hypothesis tests. This puts us in something of a bind when it comes to using Bayes’ formula. Lacking objective knowledge, one could use what are called subjective priors, which represent your own set of preexisting beliefs about how likely certain hypotheses are. Of course, subjective priors have two issues. First, they’re unlikely to be shared uniformly between people, and if your subjective beliefs are not my subjective beliefs, we’ll end up coming to two different conclusions given the same set of data. It’s also probably worth mentioning that subjective beliefs do not, to the best of my knowledge, actually affect the goings-on in the world: that I believe it’s highly probable it won’t rain tomorrow doesn’t matter; it either will rain or it won’t, and no amount of belief will change that. The second issue concerns the point of the hypothesis test: if you already have a strong prior belief about the truth of a hypothesis, for whatever reason you do, there would seem to be little need for you to actually collect any new data.

On the plus side, doing research just got way easier!

One could attempt to get around this problem by using a subjective, but uninformative, prior; that is, by distributing your belief uniformly over the set of possible outcomes, entering into your data analysis with no preconceptions about how it’ll turn out. This might seem like a good solution to the problem, but it would also seem to make your prior all but useless: if you’re multiplying every hypothesis by the same constant, you can just drop that constant from your analysis. So it would seem that in both cases priors don’t do you a lot of good: they’re either strong, in which case you don’t need to collect more data, or uninformative, in which case they’re pointless to include in the analysis. Now perhaps there are good arguments to be made for subjective priors, but that’s not the primary point I hoped to address; my main criticism involves what’s known as the gambler’s fallacy.
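To see what I mean about the constant dropping out, here's a small sketch using a coin's heads-probability as the thing being estimated (the grid and the data are arbitrary choices of mine): with a flat prior, the posterior is just the normalized likelihood.

```python
# Sketch: a uniform prior multiplies every hypothesis by the same constant,
# so the posterior ends up identical to the normalized likelihood.
import numpy as np

theta = np.linspace(0.001, 0.999, 999)    # candidate values for P(heads)
heads, tails = 7, 3                       # some made-up observed flips
likelihood = theta**heads * (1 - theta)**tails

flat_prior = np.ones_like(theta)          # the same constant everywhere
posterior = likelihood * flat_prior
posterior /= posterior.sum()

print(np.allclose(posterior, likelihood / likelihood.sum()))  # True: the prior did no work
```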

This logical fallacy can be demonstrated with the following example: say you’re flipping a fair coin; given that this coin has come up heads 10 times in a row, how likely is a tails outcome on the next flip? The answer, of course, is 50%, as a fair coin is one that is unbiased with respect to which outcome will obtain when you flip it; a heads outcome is always exactly as likely as a tails outcome. However, someone committing the gambler’s fallacy will suggest that the coin is more likely to come up tails, as all the heads outcomes make a tails outcome feel more likely; as if a tails outcome is “due” to come up. This is incorrect, as each flip of this coin is independent of the other flips, so knowing what the previous outcomes of this coin have been tells you nothing about what its future outcomes will be, or, as others have put it, the coin has no memory. As I see it, Bayesian analysis could lead one to engage in this fallacy (or, more precisely, something like the reverse gambler’s fallacy).
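If you'd rather see the "no memory" point in action than take it on faith, a quick simulation (mine, with arbitrary numbers) does the trick:

```python
# Sketch: among fair-coin flips that follow a run of 10 heads, heads still
# occurs about half the time; the previous flips carry no information.
import numpy as np

rng = np.random.default_rng(5)
flips = rng.integers(0, 2, 2_000_000)  # 1 = heads, 0 = tails

after_streak, streak = [], 0
for f in flips:
    if streak >= 10:               # this flip follows at least 10 consecutive heads
        after_streak.append(f)
    streak = streak + 1 if f == 1 else 0

print(f"P(heads | preceded by 10 heads) = {np.mean(after_streak):.3f}")  # ~0.50
```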

Here’s the example I’ve been thinking about: consider that you have a fair coin and an infinite stretch of time over which you’ll be flipping it. Long strings of heads or tails outcomes (say 10,000 in a row, or even 1,000,000 and beyond) are certainly improbable, but given an infinite amount of time, they become an inevitability; they are outcomes that will obtain eventually. Now, if you’re a good Bayesian, you’ll update your posterior beliefs following each outcome. In essence, after the coin comes up heads, you’ll be more inclined to think it will come up heads on the subsequent flip; since heads have been coming up, more heads seem due to come up. Essentially, you’ll be treating these independent events as if they were not actually independent of each other, at least with respect to your posterior beliefs. Given the long strings of heads and tails that will inevitably crop up, over time you will go from believing the coin is fair, to believing it is nearly completely biased towards heads, to believing it is nearly completely biased towards tails, and back again.
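Here's a sketch of the kind of updating I have in mind, using the standard Beta-prior-over-coin-bias setup (my choice of model, not anything from Gelman's paper): during a long enough run of heads, a Bayesian who entertains the possibility of a biased coin will come to expect heads on the next flip with near certainty.

```python
# Sketch: sequential Bayesian updating of beliefs about a coin's bias, assuming a
# flat Beta(1, 1) prior over P(heads). During a long run of heads, the posterior
# probability assigned to "heads on the next flip" drifts far above .5.
alpha, beta = 1.0, 1.0  # Beta(1, 1): a uniform prior over the coin's heads-probability

for flips_so_far in range(1, 10_001):
    alpha += 1          # suppose every flip in this stretch comes up heads
    if flips_so_far in (10, 100, 1_000, 10_000):
        print(f"After {flips_so_far:>6} heads in a row: "
              f"P(heads next) = {alpha / (alpha + beta):.4f}")
```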

Though your beliefs about the world can never have enough pairs of flip-flops…

It seems to me, then, that you want some statistical test that will, to some extent, try to take into account data that you did not obtain, but might have, if you want to more accurately estimate the parameter in question (in this case, the fairness of the coin: what might have happened had I flipped the coin another X number of times?). This is, generally speaking, anathema to Bayesian statistics as I understand it, which concerns itself only with the data that were actually collected. Of course, that does raise the question of how one can accurately predict what data they might have obtained, but did not, for which I don’t have a good answer. There’s also the matter of precisely how large a problem this hypothetical example poses for Bayesian statistics when you’re not dealing with an infinite number of random observations; in the real world, this conceptual problem might not be much of one, as these events are highly improbable, so it’s rare that anyone would actually end up making this kind of mistake. That said, it is generally a good thing to be as conceptually aware of possible problems as we can be if we want any hope of fixing them.

References: Gelman, A. (2008). Objections to Bayesian statistics. Bayesian Analysis, 3, 445-450. DOI: 10.1214/08-BA318