More About Psychology Research Replicating

By now, many of you have no doubt heard about the Reproducibility Project, where 100 psychological findings were subjected to replication attempts. In case you’re not familiar with it, the results of this project were less than a ringing endorsement of research in the field: of the expected 89 replications, only 37 were obtained, and the average size of the effects fell dramatically; social psychology research in particular seemed uniquely bad in this regard. This suggests that, in many cases, one would be well served by taking many psychological findings with a couple of grains of salt. Naturally, this leads many people to wonder whether there’s any way they might be more confident that an effect is real, so to speak. One possible means through which your confidence might be bolstered is whether or not the research in question contains conceptual replications. This refers to cases where the authors of a manuscript report the results of several different studies purporting to measure the same underlying thing with varying methods; that is, they are studying topic A with methods X, Y, and Z. If all of these turn up positive, you ought to be more confident that the effect is real. Indeed, I have had a paper rejected more than once for containing only a single experiment. Journals often want to see several studies in one paper, and that is likely part of the reason why: a single experiment is surely less reliable than multiple ones.

It doesn’t go anywhere, but at least it does so reliably

According to the unknown moderator account of replication failure, psychological research findings are, in essence, often fickle. Some findings might depend on the time of day that measurements were taken, the country of the sample, some particular detail of the stimulus material, whether the experimenter is a man or a woman; you name it. In other words, it is possible that these published effects are real, but only occur in some rather specific contexts of which we are not adequately aware; that is to say, they are moderated by unknown variables. If that’s the case, many replication attempts are unlikely to succeed, as it is improbable that all of the unique, unknown, and unappreciated moderators will be present in the replication as well. This is where conceptual replications come in: if a paper contains two, three, or more different attempts at studying the same topic, we should expect that the effect it turns up is more likely to extend beyond a very limited set of contexts and should replicate more readily.

That’s a flattering hypothesis for explaining these replication failures: there’s just not enough replication going on pre-publication, so limited findings are getting published as if they were more generalizable. The less-flattering hypothesis is that many researchers are, for lack of a better word, cheating by employing dishonest research tactics. These tactics can include hypothesizing after the data have been collected, collecting participants only until the data say what the researchers want and then stopping, splitting samples up into different groups until differences are discovered, and so on. There’s also the notorious issue of journals publishing only positive results rather than negative ones (creating a large incentive to cheat, as punishment for doing so is all but non-existent so long as you aren’t just making up the data). It is for these reasons that requiring pre-registration of research – explicitly stating what you’re going to look at ahead of time – markedly reduces the rate of positive findings. If research is failing to replicate because the system is being cheated, more internal replications (those from the same authors) don’t help much when it comes to predicting external replications (those conducted by outside parties); internal replications just give researchers the opportunity to report multiple attempts at cheating.

These two hypotheses make different predictions concerning the data from the aforementioned reproducibility project: specifically, research containing internal replications ought to be more likely to successfully replicate if the unknown moderator hypothesis is accurate. It certainly would be a strange state of affairs from a “this finding is true” perspective if multiple conceptual replications were no more likely to prove reproducible than single-study papers. It would be similar to saying that effects which have been replicated are no more likely to subsequently replicate than effects which have not. By contrast, the cheating hypothesis (or, more politely, questionable research practices hypothesis) has no problem at all with the idea that internal replications might prove to be as externally replicable as single-study papers; cheating a finding out three times doesn’t mean it’s more likely to be true than cheating it out once.

It’s not cheating; it’s just a “questionable testing strategy”

This brings me to a new paper by Kunert (2016), who reexamined some of the data from the reproducibility project. Of the 100 original papers, 44 contained internal replications: 20 contained one internal replication, 10 contained two, 9 contained three, and 5 contained more than three. These were compared against the 56 papers which did not contain internal replications to see which would subsequently replicate better (as measured by achieving statistical significance). As it turned out, papers with internal replications externally replicated about 30% of the time, whereas papers without internal replications externally replicated about 40% of the time. Not only were the internally-replicated papers not substantially better, they were actually slightly worse in that regard. A similar conclusion was reached regarding the average effect size: internally-replicated papers were no more likely than papers without such replications to yield larger effect sizes upon replication.
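To make that comparison concrete, here is a minimal back-of-envelope sketch in Python of the kind of two-proportion comparison being described. The counts are reconstructed from the rounded percentages quoted above (roughly 13 of 44 versus 22 of 56); they are not taken from Kunert’s paper, and this is not his actual analysis, just an illustration of why a 30% versus 40% split across samples of this size is unimpressive.

```python
from math import sqrt
from statistics import NormalDist

# Approximate counts reconstructed from the rounded percentages quoted above;
# the exact figures in Kunert (2016) may differ slightly.
internal = {"n": 44, "successes": round(0.30 * 44)}     # ~13 of 44 papers replicated
no_internal = {"n": 56, "successes": round(0.40 * 56)}  # ~22 of 56 papers replicated

def two_proportion_z(s1, n1, s2, n2):
    """Pooled two-proportion z-test; returns both rates, z, and a two-sided p-value."""
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return p1, p2, z, p_two_sided

p1, p2, z, p = two_proportion_z(internal["successes"], internal["n"],
                                no_internal["successes"], no_internal["n"])
print(f"with internal replications: {p1:.0%}, without: {p2:.0%}, z = {z:.2f}, p = {p:.2f}")
```

Under these reconstructed counts the difference comes nowhere near significance; the point is simply that, if anything, the trend runs in the wrong direction for the unknown moderator account.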

It is possible, of course, that papers containing internal replications differ from papers which do not contain such replications. This means it might be possible that internal replications are actually a good thing, but that their positive effects are being outweighed by other, negative factors. For example, someone proposing a particularly novel hypothesis might be inclined to include more internal replications in their paper than someone studying an established one; the latter researcher doesn’t need more replications in his paper to get it published, because the effect has already been replicated in other work. To examine this point, Kunert (2016) made use of the seven reproducibility predictors identified by the Open Science Collaboration – field of study, effect type, original P-value, original effect size, replication power, surprisingness of the original effect, and the challenge of conducting the replication – to assess whether internally-replicated work differed in any notable ways from the non-internally-replicated sample. As it turns out, the two samples were pretty similar overall on all the factors except one: field of study. Internally-replicated effects came from social psychology more often (70%) than non-internally-replicated effects did (54%). As I mentioned before, social psychology papers did tend to replicate less often. However, the unknown moderator account was not particularly well supported for either field when examined individually.

In summary, then, papers containing internal replications were no more likely to do well when it came to external replications, which, to my mind, suggests that something is going very wrong in the process somewhere. Perhaps researchers are making use of their freedom to analyze and collect data as they see fit in order to deliver the conclusions they want to see; perhaps journals are preferentially publishing the findings of people who got lucky, relative to those who got it right. These possibilities, of course, are not mutually exclusive. Now I suppose one could continue to make an argument that goes something like, “papers that contain conceptual replications are more likely to be doing something else different, relative to papers with only a single study,” which could potentially explain why internal replications added no predictive value, and whatever that “something” is might not be directly tapped by the variables considered in the current paper. In essence, such an argument would suggest that there are unknown moderators all the way down.

“…and that turtle stands on the shell of an even larger turtle…”

While it’s true enough that such an explanation is not ruled out by the current results, it should not be taken as any kind of default stance on why this research is failing to replicate. The “researchers are cheating” explanation strikes me as a bit more plausible at this stage, given that there aren’t many other obvious explanations for why ostensibly replicated papers are no better at replicating. As Kunert (2016) plainly puts it:

This report suggests that, without widespread changes to psychological science, it will become difficult to distinguish it from informal observations, anecdotes and guess work.

This brings us to the matter of what might be done about the issue. There are procedural ways of attempting to address the problem – such as Kunert’s (2016) recommendation that journals publish papers independent of their results – but my focus has been, and continues to be, on the theoretical aspects of publication. Too many papers in psychology get published without any apparent need for the researchers to explain their findings in any meaningful sense; instead, they usually just restate and label their findings, or they posit some biologically-implausible function for what they found. Without the serious and consistent application of evolutionary theory to psychological research, implausible effects will continue to be published and subsequently fail to replicate, because there’s otherwise little way to tell whether a finding makes sense. By contrast, I find it plausible that unlikely effects can be more readily spotted – by reviewers, readers, and replicators – if they are all couched within the same theoretical framework; even better, problems in design can be more easily identified and rectified by considering the underlying functional logic, leading to productive future research.

References: Kunert, R. (2016). Internal conceptual replications do not increase independent replication success. Psychonomic Bulletin & Review. DOI: 10.3758/s13423-016-1030-9

Morality, Alliances, And Altruism

Having one’s research ideas scooped is part of academic life. Today, for instance, I’d like to talk about some research quite similar in spirit to work I intended to do as part of my dissertation (but did not, as it didn’t end up making the cut in the final approved package). Even if my name isn’t on it, it is still pleasing to see the results I had anticipated. The idea itself arose about four years ago, when I was discussing the curious case of Tucker Max’s donation to Planned Parenthood being (eventually) rejected by the organization. To quickly recap, Tucker was attempting to donate half a million dollars to the organization, essentially receiving little more than a plaque in return. However, the donation was rejected, it would seem, out of fear of building an association between the organization and Tucker, as some people perceived Tucker to be a less-than-desirable social asset. This, of course, is rather strange behavior, and we would recognize it as such if it were observed in any other species (e.g., “this cheetah refused a free meal for her and her cubs because the wrong cheetah was offering it”); refusing free benefits is just peculiar.

“Too rich for my blood…”

As it turns out, this pattern of behavior is not unique to the Tucker Max case (or the Kim Kardashian one…); it has recently been empirically demonstrated by Tasimi & Wynn (2016), who examined how children respond to altruistic offers from others, contingent on the moral character of said others. In their first experiment, 160 children between the ages of 5 and 8 were recruited to make an easy decision; they were shown two pictures of people and told that the people in the pictures wanted to give them stickers, and they had to pick which one they wanted to receive the stickers from. In the baseline conditions, one person was offering 1 sticker, while the other was offering either 2, 4, 8, or 16 stickers. As such, it should come as no surprise that the person offering more stickers was almost universally preferred (71 of the 80 children wanted the person offering more, regardless of how many more).

Now that we’ve established that more is better, we can consider what happened in the second condition, where the children received character information about their benefactors. One of the individuals was said to always be mean, having hit someone the other day while playing; the other was said to always be nice, having hugged someone the other day instead. The mean person was always offering more stickers than the nice one. In this condition, the children tended to shun the larger quantity of stickers in most cases: when the sticker ratio was 2:1, fewer than 25% of children accepted the larger offer from the mean person; the 4:1 and 8:1 ratios were accepted about 40% of the time, and the 16:1 ratio 65% of the time. While more is better in general, it is apparently not enough better for children to overlook the character information in many cases. People appear willing to forgo receiving altruism when it comes from the wrong type of person. Fascinating stuff, especially when one considers that such refusals end up leaving the wrongdoers with more resources than they would otherwise have (if you think someone is mean, wouldn’t you be better off taking those resources from them, rather than letting them keep them?).

This basic pattern was replicated in 64 very young children (approximately one year old). In this experiment, the children observed a puppet show in which two puppets offered them crackers, with one offering a single cracker and the other offering either 2 or 8. Again, unsurprisingly, the majority of children accepted the larger offer, regardless of how much larger it was (24 of 32 children). In the character-information condition, one puppet was shown to be a helper, assisting another puppet in retrieving a toy from a chest, whereas the other puppet was a hinderer, preventing another from retrieving a toy. As before, the hindering puppet offered the greater number of crackers, whereas the helper offered only one. When the hindering puppet was offering 8 crackers, his offer was accepted about 70% of the time, which did not differ from the baseline group. However, when the hindering puppet was offering only 2, the acceptance rate was a mere 19%. Even young children, it would seem, are willing to avoid accepting altruism from wrongdoers, provided the difference in offers isn’t too large.

“He’s not such a bad guy once you get $10 from him”

While neat, these results beg for a deeper explanation as to why we should expect such altruism to be rejected. I believe hints of this explanation are provided by the way Tasimi & Wynn (2016) write about their results:

Taken together, these findings indicate that when the stakes are modest, children show a strong tendency to go against their baseline desire to optimize gain to avoid ‘‘doing business” with a wrongdoer; however, when the stakes are high, children show more willingness to ‘‘deal with the devil…”

What I find strange about that passage is that the children in the current experiments were not “doing business” or “making deals” with the altruists; there was no quid pro quo going on. The children were no more doing business with these benefactors than a nursing infant is doing business with its mother. Nevertheless, there appears to be an implicit assumption being made here: an individual who accepts altruism from another is expected to pay that altruism back in the future. In other words, merely receiving altruism from another generates the perception of a social association between the donor and recipient.

This creates an uncomfortable situation for the recipient in cases where the donor has enemies. Those enemies are often interested in inflicting costs on the donor or, at the very least, withholding benefits from him. In the latter case, the social association with the donor becomes less beneficial than it otherwise might be, since the donor will have fewer expected future resources to invest in others if others don’t help him; in the former case, not only does the previous logic hold, but the enemies of your donor might begin to inflict costs on you as well, so as to dissuade you from helping him. To put this into a quick example: Jon – your friend – goes out and hurts Bob, say, by sleeping with Bob’s wife. Bob and his friends, in response, both withhold altruism from Jon (as punishment) and might even be inclined to attack him for his transgression. If they perceive you as helping Jon – either by providing him with benefits or by preventing them from hurting him – they might be inclined to withhold benefits from you, or punish you as well, as a form of indirect punishment until you stop helping Jon. To turn the classic phrase, the friend of my enemy is also my enemy (just as the enemy of my enemy is my friend).

What cues might they use to determine whether you’re Jon’s ally? One likely useful cue is whether Jon directs altruism towards you. If you are accepting his altruism, this is probably a good indication that you will be inclined to reciprocate it later (else risk being labeled a social cheater or free rider). If you wish to avoid condemnation and punishment by proxy, then, one route to take is to refuse benefits from questionable sources. This risk can be overcome, however, in cases where the morally-questionable donor is providing you a large enough benefit, which was precisely the pattern of results observed here. What counts as “large enough” should be expected to vary as a function of a few things, most notably the size and nature of the transgressions, as well as the degree of expected reciprocity. For example, receiving large donations from morally-questionable donors should be more acceptable to the extent that the donation is made anonymously rather than publicly, as anonymity might reduce the perceived social association between donor and recipient.

You might also try only using “morally clean” money

Importantly (as far as I’m concerned), these data fit well within my theory of morality – where morality is hypothesized to function as an association-management mechanism – but not particularly well with other accounts: altruistic accounts of morality should predict that more altruism is still better; dynamic coordination says nothing about accepting altruism, as giving isn’t morally condemned; and self-interest/mutualistic accounts would, I think, also suggest that taking more money would still be preferable, since you’re not trying to dissuade others from giving. While I can’t help but feel some disappointment that I didn’t carry this research out myself, I am both happy with the results that came of it and satisfied with the methods the authors used. Getting research ideas scooped isn’t so bad when they turn out well anyway; I’m happy enough to see my main theory supported.

References: Tasimi, A. & Wynn, K. (2016). Costly rejection of wrongdoers by infants and children. Cognition, 151, 76-79.

Benefiting Others: Motives Or Ends?

The world is full of needy people; they need places to live, food to eat, medical care to combat biological threats, and, if you ask certain populations in the first world, a college education. Plenty of ink has been spilled over the matter of how best to meet the needs of others, typically with a focus on uniquely needy populations, such as the homeless, the poverty-stricken, the sick, and those otherwise severely disadvantaged. In order to make meaningful progress in such discussions, there arises the matter of precisely why – in the functional sense of the word – people are interested in helping others, as I believe the answer(s) to that question will be greatly informative when it comes to determining the most effective strategies for doing so. What is very interesting about these discussions is that the focus is frequently placed on helping others altruistically: delivering benefits to others in ways that are costly for the person doing the helping. The typical example involves charitable donations, where I give up some of my money so that someone else can benefit. The trouble with this focus is that our altruistic systems often seem to face quite a bit of pushback from other parts of our psychology when it comes to helping others, resulting in fairly poor delivery of benefits. It represents a focus on the means by which we help others, rather than on the end of helping effectively.

For instance, this sign isn’t asking for donations

As a matter of fact, the most common ways of improving the lives of others don’t involve any altruism at all. For an alternative focus, we might consider the classic Adam Smith quote pertaining to butchers and bakers:

But man has almost constant occasion for the help of his brethren, and it is in vain for him to expect it from their benevolence only. He will be more likely to prevail if he can interest their self-love in his favour, and show them that it is for their own advantage to do for him what he requires of them. Whoever offers to another a bargain of any kind, proposes to do this. Give me that which I want, and you shall have this which you want, is the meaning of every such offer; and it is in this manner that we obtain from one another the far greater part of those good offices which we stand in need of. It is not from the benevolence of the butcher, the brewer, or the baker that we expect our dinner, but from their regard to their own interest.

In short, Smith appears to recommend that, if we wish to effectively meet the needs of others (or have them meet our needs), we must properly incentivize that other-benefiting behavior instead of just hoping people will be willing to continuously suffer costs. Smith’s system, then, is more mutualistic or reciprocal in nature. There are a lot of benefits to engaging these mutualistic and reciprocally-altruistic cognitive mechanisms, rather than altruistic ones, some of which I outlined last week. Specifically, altruistic systems typically direct benefits preferentially towards kin and social allies, and such a provincial focus is unlikely to deliver benefits particularly well to needy individuals in the wider world (i.e., people who aren’t kin or allies). If, however, you get people to behave in a way that benefits themselves and just so happens to benefit others as a result, you’ll often end up with some pretty good benefit delivery. This is because you don’t need to coerce people into helping themselves.

So let’s say we’re faced with a very real-world problem: there is a general shortage of organs available for people in need of transplants. What cognitive systems do we want to engage to solve that problem? We could, as some might suggest, make people more empathetic to the plight of those suffering in hospitals, dying from organ failure; we might also try to convince people that signing up as an organ donor is the morally-virtuous thing to do. Both of these plans might increase the number of people willing to posthumously donate their organs, but perhaps there are much easier and more effective ways to get people to become organ donors, even if they have no particular interest in helping others. I wanted to review two such candidate methods today, neither of which requires that people’s altruistic cognitive systems be particularly engaged.

The first method comes to us from Johnson & Goldstein (2003), who examined some cross-national data on rates of organ donor status. Specifically, they note an oddity in the data: very large and stable differences exist between nations in organ donor status, even after controlling for a number of potentially-relevant variables. Might these different rates exist because people’s preferences for being an organ donor vary markedly between countries? It seems unlikely, unless organ donation happens to be exceedingly unpopular in Germany (14% are donors, from the figures cited) and exceedingly popular in Sweden (86%). In fact, in the US, support for organ donation is at near-ceiling levels, yet a large gap persists between those who support it (95%) and those who indicated on a driver’s license that they were donors (51% in 2005; 60% in 2015) or who had signed a donor card (30%). If it’s not people’s lack of support for such a policy, what explains the difference?

A poor national sense for graphic design?

Johnson & Goldstein (2003) float a simple explanation for most of the national differences: whether donor programs were opt-in or opt-out. What that refers to is the matter of, assuming someone has made no explicit decision as to what happens to their organs after they die, what decision would be treated as the default? In opt-in countries (like Germany and the US), non-donor status would be assumed unless someone signs up to be a donor; in opt-out countries, like Sweden, people are assumed to be donors unless they indicate that they do not wish to be one. As the authors report, the opt-in countries have much lower effective consent rates (on average, 60% lower) and the two groups represent non-overlapping populations. That data supplements the other experimental findings from Johnson & Goldstein (2003) as well. The authors had 161 participants take part in an experiment where they were asked to imagine they had moved to a new state. This state either treated organ donation as the default option or non-donation as the default, and participants were asked whether they would like to confirm or change their status. There was also a third condition where no default answer was provided. When no default answer was given, 79% of participants said they would be willing to be an organ donor; a percentage which did not differ from those who confirmed their donor status when it was the default (82%). However, when non-donor status was the default, only 42% of the participants changed their status to donor. 

So defaults seem to matter quite a bit, but let’s assume that a nation isn’t going to change its policy from opt-in to opt-out anytime soon. What else might we do if we wanted to improve the rates of people signing up to be an organ donor in the short term? Eyting et al. (2016) tested a rather simple method: paying people €10. The researchers recruited 320 German university students who did not currently have an organ donor card and provided them the opportunity to fill one out. These participants were split into three groups: one in which no compensation was offered for filling out the card, one in which they would personally receive €10 for filling out a card (regardless of which choice they picked: donor or non-donor), and a final condition in which €10 would be donated to a charitable organization (the Red Cross) if they filled out a card. No difference was observed in the percentage of participants who filled out the card between the control (35%) and charity (36%) conditions. However, in the personal-benefit group, there was a spike in the number of people filling out the card (72%). Not all those who filled out the cards opted for donor status, though. Across conditions, the percentage of people who both (a) filled out the card and (b) indicated they wanted to be a donor was about 44% in the personal payment condition, 28% in the control condition, and only 19% in the charity group. Not only did the charity appeal not seem particularly effective, it was even nominally counterproductive.
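As a quick illustrative calculation – using only the rounded percentages quoted above, not the exact figures reported by Eyting et al. (2016) – one can back out roughly what share of the people who filled out a card in each condition actually chose donor status:

```python
# Rounded figures from the summary above (approximate, not the paper's exact values):
# fraction of participants filling out a card, and fraction who both filled one out
# and chose donor status, per condition.
conditions = {
    "control":          {"filled_card": 0.35, "filled_and_donor": 0.28},
    "charity donation": {"filled_card": 0.36, "filled_and_donor": 0.19},
    "personal payment": {"filled_card": 0.72, "filled_and_donor": 0.44},
}

for name, c in conditions.items():
    # Implied share of card-fillers choosing donor status: a back-of-envelope figure only.
    donor_rate_among_fillers = c["filled_and_donor"] / c["filled_card"]
    print(f"{name:>16}: ~{donor_rate_among_fillers:.0%} of card-fillers chose donor status")
```

On these rough numbers, the personal payment mostly worked by getting far more people to fill out a card at all; among those who did, the implied rate of choosing donor status was, if anything, no higher than in the control condition.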

“I already donated $10 to charity and now they want my organs too?!”

Now, admittedly, helping others because there’s something in it for you isn’t quite as sexy (figuratively speaking) as helping because you’re driven by an overwhelming sense of empathy or conscience, or as helping for no benefit at all. This is because there’s lower signal value in that kind of self-beneficial helping; it doesn’t predict future behavior in the absence of those benefits. As such, it’s unlikely to be particularly effective at building meaningful social connections between helpers and others. However, if the current data are any indication, such helping is also likely to be consistently effective. If one’s goal is to increase the benefits being delivered to others (rather than building social connections), that will often involve providing valued incentives for the people doing the helping.

On one final note, it’s worth mentioning that these papers deal only with people becoming donors after death, not the prospect of donating organs while alive. If one wanted to, say, incentivize someone to donate a kidney while alive, a good way to do so might be to offer them money; that is, allow people to buy and sell organs they are already capable of donating. If people were allowed to engage in mutually-beneficial interactions when it came to selling organs, it is likely we would see certain organ shortages decrease as well. Unfortunately for those in need of organs and/or money, our moral systems often oppose this course of action (Tetlock, 2000), likely contingent on perceptions about which groups would be benefiting the most. I think this serves as yet another demonstration that our moral sense might not be well-suited for maximizing the welfare of people in the wider social world, much as our empathetic systems are not.

References: Eyting, M., Hosemann, A., & Johannesson, M. (2016). Can monetary incentives increase organ donations? Economics Letters, 142, 56-58.

Johnson, E. & Goldstein, D. (2003). Do defaults save lives? Science, 132, 1338-1339.

Tetlock, P. (2000). Coping with trade-offs: Psychological constraints and political implications. In A. Lupia, M. McCubbins, & S. Popkin (Eds.), Elements of Reason: Cognition, Choice, and the Bounds of Rationality (pp. 239-322).