Let’s say you find yourself in charge of a group of children. Since you’re a relatively-average psychologist, you have a relatively strange hypothesis you want to test: you want to see whether wearing a red shirt will make children better at dodge ball. You happen to think that it will. I say this hypothesis is strange because you derived it from, basically, nothing; it’s just a hunch. Little more than a “wouldn’t it be cool if it were true?” idea. In any case, you want to run a test of your hypothesis.

You begin by lining the students up, then you walk past them and count aloud: “1, 2, 1, 2, 1…”. All the children with a “1” go and put on a red shirt and are on a team together; all the children with a “2” go and pick a new shirt to put on from a pile of non-red shirts. They serve as your control group. The two teams then play each other in a round of dodge ball. The team wearing the red shirts comes out victorious. In fact, they win by a substantial margin. This must mean that wearing the red shirts made the students better at dodge ball, right? Well, since you’re a relatively-average psychologist, you would probably conclude that, yes, the red shirts clearly have some effect. Sure, your conclusion is, at the very least, hasty and likely wrong, but you are only an average psychologist: we can’t set the bar too high.
A critical evaluation of the research could note that just because the children were randomly assigned to groups, it doesn’t mean that both groups were equally matched to begin with. If the children in the red shirt group were just better beforehand, that could drive the effect. It’s also quite possible that the red shirts had very little to do with which team ended up winning. The pressing question here would seem to be: why would we expect red shirts to have any effect? It’s not as if a red shirt makes a child quicker, stronger, or better able to catch or throw than before; at least not for any theoretical reason that comes to mind. Again, this hypothesis is a strange one when you consider its basis. Let’s assume, however, that wearing red shirts actually did make children perform better, because it helped children tap into some preexisting skill set. This raises the somewhat obvious question: why would children require a red shirt to tap into that previously-untapped resource? If being good at the game is important socially – after all, you don’t want to get teased by the other children for your poor performance – and children could do better, it seems, well, odd that they would ever do worse. One would need to posit some kind of trade-off effected by shirt color, which sounds like kind of an odd variable for some cognitive mechanism to take into account.
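The baseline-matching worry is easy to demonstrate with a quick simulation of the counting-off scenario. This is just a sketch with made-up numbers (a normally-distributed “skill” score and an arbitrary gap of 5 points as “noticeable”), not a model of any actual dodge ball data:

```python
import random
import statistics

random.seed(1)

def assign_and_compare(n_children=20):
    """Give each child a random 'skill' score, assign teams by
    alternate counting ('1, 2, 1, 2...'), and return the gap in
    mean skill between the two teams."""
    skills = [random.gauss(100, 15) for _ in range(n_children)]
    red = skills[0::2]      # the children who counted "1"
    control = skills[1::2]  # the children who counted "2"
    return statistics.mean(red) - statistics.mean(control)

# Across many hypothetical classrooms, one team frequently starts out
# noticeably better than the other purely by chance.
gaps = [abs(assign_and_compare()) for _ in range(1000)]
big_gaps = sum(g > 5 for g in gaps) / len(gaps)
```

With groups this small, a sizable fraction of classrooms show a pre-existing skill gap bigger than a third of a standard deviation before anyone puts on a shirt; randomization only guarantees balance on average, across many studies, not within any single one.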
Nevertheless, like any psychologist hoping to further their academic career, you publish your results in the Journal of Inexplicable Findings. The “Red Shirt Effect” becomes something of a classic, reported in Intro to Psychology textbooks. Published reports start cropping up from different people who have had other children wear red shirts and perform various athletic tasks relatively better. While none of these papers are direct replications of your initial study, they also have children wearing red shirts outperforming their peers, so they get labeled “conceptual replications”. After all, since the concepts seem to be in order, they’re likely tapping the same underlying mechanism. Of course, these replications still don’t deal with the theoretical concerns discussed previously, so some other researchers begin to get somewhat suspicious about whether the “Red Shirt Effect” is all it’s made out to be. Part of these concerns is based on an odd facet of how publication works: positive results – those that find effects – tend to be favored for publication over studies that don’t find effects. This means that there may well be other researchers who attempted to make use of the Red Shirt Effect, failed to find anything and, because of their null or contradictory results, also failed to publish anything.
Eventually, word reaches you of a research team that attempted to replicate the Red Shirt Effect a dozen times in the same paper and failed to find anything. More troubling still, for your academic career, anyway, their results saw publication. Naturally, you feel pretty upset by this. Clearly the research team was doing something wrong: maybe they didn’t use the proper shade of red shirt; maybe they used a different brand of dodge balls in their study; maybe the experimenters behaved in some subtle way that was enough to counteract the Red Shirt Effect entirely. Then again, maybe the journal the results were published in doesn’t have good enough standards for their reviewers. Something must be wrong here; you know as much because your Red Shirt Effect was conceptually replicated many times by other labs. The Red Shirt Effect just must be there; you’ve been counting the hits in the literature faithfully. Of course, you also haven’t been counting the misses which were never published. Further, you were counting the slightly-altered hits as “conceptual replications” but not the slightly-altered misses as “conceptual disconfirmations”. You still haven’t managed to explain, theoretically, why we should expect to see the Red Shirt Effect anyway, either. Then again, why would any of that matter to you? Part of your reputation is at stake.
In somewhat-related news, there have been some salty comments from social psychologist Ap Dijksterhuis aimed at a recent study (and coverage of the study, and the journal it was published in) concerning nine failures to replicate some work Ap did on intelligence priming, as well as work done by others on the same effect (Shanks et al., 2013). The initial idea of intelligence priming, apparently, was that priming subjects with professor-related cues made them better at answering multiple-choice, general-knowledge questions, whereas priming subjects with soccer-hooligan-related cues made them perform worse (and no; I’m not kidding. It really was that odd). Intelligence itself is a rather fuzzy concept, and it seems that priming people to think about professors – people typically considered higher in some domains of that fuzzy concept – is a poor way to make them better at multiple-choice questions. As far as I can tell, there was no theory surrounding why primes should work that way or, more precisely, why people should lack access to such knowledge in the absence of some vague, unrelated prime. At the very least, none was discussed.
It wasn’t just that the failures to replicate reported by Shanks et al. (2013) were non-significant but in the right direction, mind you; they often seemed to go in the wrong direction. Shanks et al. (2013) even looked for demand characteristics explicitly, but couldn’t find them either. Nine consecutive failures are surprising in light of the fact that the intelligence priming effects were previously reported as being rather large. It seems rather peculiar that large effects can disappear so quickly; they should have had a very good chance of replicating, were they real. Shanks et al. (2013) rightly suggest that many of the confirmatory studies of intelligence priming, then, might represent publication bias, researcher degrees of freedom in analyzing data, or both. Thankfully, the salty comments of Ap reminded readers that: “the finding that one can prime intelligence has been obtained in 25 studies in 10 different labs”. Sure; and if a batter in the MLB counted only the times he hit the ball while at bat, his batting average would be a staggering 1.000. Counting only the hits and not the misses will surely make it seem like hits are common, no matter how rare they are. Perhaps Ap should have thought about professors more before writing his comments (though I’m told thinking about primes ruins them as well, so maybe he’s out of luck).
I would like to add that there were similarly salty comments leveled by another social psychologist, John Bargh, when his work on priming elderly stereotypes to slow walking speed failed to replicate (though John has since deleted his posts). The two cases bear some striking similarities: claims of other “conceptual replications”, but no claims of “conceptual failures to replicate”; personal attacks on the credibility of the journal publishing the results; personal attacks on the researchers who failed to replicate the finding; even personal attacks on the people reporting about the failures to replicate. More interestingly, John also suggested that the priming effect was apparently so fragile that even minor deviations from the initial experiment could throw the entire thing into disarray. Now it seems to me that if your “effect” is so fleeting that even minor tweaks to the research protocol can cancel it out completely, then the effect, even were it real, isn’t much of an important one. That’s precisely the kind of shooting-yourself-in-the-foot a “smarter” person might have considered leaving out of their otherwise persuasive tantrum.
I would also add, for the sake of completeness, that priming effects of stereotype threat haven’t replicated well either. Oh, and the effects of depressive realism don’t show much promise. This brings me to my final point on the matter: given the risks posed by researcher degrees of freedom and publication bias, it would be wise to enact better safeguards against this kind of problem. Replications, however, only go so far: they require researchers willing to do them (and they can be low-reward, discouraged activities) and journals willing to publish them with sufficient frequency (which many do not, currently). Accordingly, I feel replications can only take us part of the way in fixing the problem. A simple – though only partial – remedy for the issue is, I feel, to require the inclusion of actual theory in psychological research; evolutionary theory in particular. While this does not stop false positives from being published, it at least allows other researchers and reviewers to more thoroughly assess the claims being made in papers. Poor assumptions can then be better weeded out, and better research projects crafted to address them directly. Further, updating old theory and providing new material is a personally-valuable enterprise. Without theory, all you have is a grab bag of findings, some positive, some negative, and no idea what to do with them or how they are to be understood. Without theory, things like intelligence priming – or Red Shirt Effects – sound valid.
References: Shanks, D., Newell, B., Lee, E., Balakrishnan, D., Ekelund, L., Cenac, Z., Kavvadia, F., & Moore, C. (2013). Priming intelligent behavior: An elusive phenomenon. PLoS ONE, 8(4). DOI: 10.1371/journal.pone.0056515