Count The Hits; Not The Misses

At various points in our lives, we have all read or been told anecdotes about how someone turned a bit of their life around. Some of these (or at least variations of them) likely sound familiar: “I cut out bread from my diet and all of a sudden felt so much better”; “Amy made a fortune working from home selling diet pills online”; “After the doctors couldn’t figure out what was wrong with me, I started drinking this tea and my infection suddenly cleared up”. The whole point of such stories is to try to draw a causal link, in these cases: (1) eating bread makes you feel sick, (2) selling diet pills is a good way to make money, and (3) tea is useful for combating infections. Some or all of these statements may well be true, but the real problem with these stories is the paucity of data upon which they are based. If you wanted to be more certain about those statements, you would want more information. Sure, you might have felt better after drinking that tea, but what about the other 10 people who drank similar tea and saw no results? How about all the other people selling diet pills who were in the financial hole from day one and never crawled out of it because it’s actually a scam? If you want to get closer to understanding the truth value of those statements, you need to consider the data as a whole: both stories of success and stories of failure. However, stories of someone not getting rich from selling diet pills aren’t quite as moving, and so don’t see the light of day; at least not initially. This facet of anecdotes was made light of by The Onion several years ago (and Clickhole had their own take more recently).

“At first he failed, but with some positive thinking he continued to fail over and over again”

These anecdotes often try to throw the spotlight on successful cases (hits) while ignoring the unsuccessful ones (misses), resulting in a biased picture of how things will work out. They don’t get us much closer to the truth. Most people who create and consume psychology research would like to think that psychologists go beyond these kinds of anecdotes and generate useful insights into how the mind works, but there have been a lot of concerns raised lately about precisely how much further they go on average, largely owing to the results of the reproducibility project. There have been numerous issues raised about the way psychology research is conducted: either in the form of advocacy for particular political and social positions (which distorts experimental designs and statistical interpretations) or the selective ways in which data is manipulated or reported to draw attention to successful data without acknowledging failed predictions. The result has been quite a number of false positives, and overstated real ones, cropping up in the literature.

While these concerns are warranted, it is difficult to quantify the extent of the problems. After all, very few researchers are going to come out and say they manipulated their experiments or data to find the results they wanted, because (a) it would only hurt their careers and (b) in some cases, they aren’t even aware that they’re doing it, or that what they’re doing is wrong. Further, because most psychological research isn’t preregistered and null findings aren’t usually published, figuring out what researchers hoped to find (but did not) becomes a difficult undertaking just by reading the literature. Thankfully, a new paper from Franco et al (2016) brings some data to bear on the matter of how much underreporting is going on. While these data will not be the final word on the subject by any means (largely owing to the small sample size), they do provide some of the first steps in the right direction.

Franco et al (2016) report on a group of psychology experiments whose questionnaires and data were made publicly available. Specifically, these come from the Time-sharing Experiments for the Social Sciences (TESS), an NSF program in which online experiments are embedded in nationally-representative population surveys. Those researchers making use of TESS face strict limits on the number of questions they can ask, we are told, meaning that we ought to expect they would restrict their questions to the most theoretically-meaningful ones. In other words, we can be fairly confident that the researchers had some specific predictions they hoped to test for each experimental condition and outcome measure, and that these predictions were made in advance of actually getting the data. Franco et al (2016) were then able to track the TESS studies through to the eventual published versions of the papers to see what experimental manipulations and results were and were not reported. This provided the authors with a set of 32 semi-preregistered psychology experiments to examine for reporting biases.

A small sample I will recklessly generalize to all of psychology research

The first step was to compare the number of experimental conditions and outcome variables that were present in the TESS studies to the number that ultimately turned up in published manuscripts (i.e., are the authors reporting what they did and what they measured?). Overall, 41% of the TESS studies failed to report at least one of their experimental conditions; while there were an average of 2.5 experimental conditions in the studies, the published papers only mentioned an average of 1.8. In addition, 72% of the papers failed to report all of their outcome variables; while there were an average of 15.4 outcome variables in the questionnaires, the published reports only mentioned 10.4. Taken together, only about 1-in-4 of the experiments reported all of what they did and what they measured. Unsurprisingly, this pattern extended to the size of the reported effects as well. In terms of statistical significance, the median reported p-value was significant (.02), while the median unreported p-value was not (.32); two-thirds of the reported tests were significant, while only one-fourth of the unreported tests were. Finally, published effect sizes were approximately twice as large as unreported ones.

Taken together, the pattern that emerged is that psychology research tends to underreport failed experimental manipulations, measures that didn’t pan out, and smaller effects. This should come as no surprise to almost anyone who has spent much time around psychology researchers, or to the researchers themselves who have tried to publish null findings (or, in fact, almost anything). Data is often messy and uncooperative, and people are less interested in reading about the things that didn’t work out (unless they’re placed in the proper contexts, where failures to find effects can actually be considered meaningful, such as when you’re trying to provide evidence against a theory). Nevertheless, the result of such selective reporting on what appears to be a fairly large scale is that the overall trustworthiness of reported psychology research dips ever lower, one false positive at a time.

So what can be done about this issue? One suggestion that is often tossed around is that researchers should register their work in advance, making it clear what analyses they will be conducting and what predictions they have made. This was (sort of) the case in the present data, and Franco et al (2016) endorse this option. It allows people to assess research more as a whole, rather than relying solely on the published accounts of it. While that’s a fine suggestion, it only goes so far toward improving the state of the literature. Specifically, it doesn’t really help the problem of journals not publishing null findings in the first place, nor does it necessarily prevent researchers from running post-hoc analyses of their data and turning up additional false positives. A more ambitious way of alleviating these problems would be to collectively change the way journals accept papers for publication. In this alternate system, researchers would submit an outline of their article to a journal before the research is conducted, making clear (a) what their manipulations will be, (b) what their outcome measures will be, and (c) what statistical analyses they will undertake. Then, and this is important, before either the researchers or the journals know what the results will be, the decision would be made to publish the paper or not. This would allow null results to make their way into mainstream journals while also allowing the researchers to build up their own resumes if things don’t work out well. In essence, it removes some of the incentives for researchers to cheat statistically. The assessment of the journals would then be based not on whether interesting results emerged, but rather on whether a sufficiently important research question had been asked.

Which is good, considering how often real, strong results seem to show up

There are some downsides to that suggestion, however. For one, the plan would take some time to enact even if everyone were on board. Journals would need to accept a paper for publication weeks or months in advance of the paper itself actually being completed. This would pose some additional complications for journals, inasmuch as researchers will occasionally fail to complete the research at all, fail to complete it in a timely manner, or submit sub-par papers not yet worthy of print, leaving possible publication gaps. Further, it will sometimes mean that an issue of a journal goes out without containing any major advancements to the field of psychological research (no one happened to find anything this time), which might negatively affect the impact factor of the journals in question. Indeed, that last part is probably the biggest impediment to making major overhauls to the publication system that’s currently in place: most psychology research probably won’t work out all that well, and that will probably mean fewer people ultimately interested in reading about and citing it. While it is possible, I suppose, that null findings would actually be cited at similar rates to positive ones, that remains to be seen, and in the absence of that information I don’t foresee journals being terribly interested in changing their policies and taking that risk.

References: Franco, A., Malhotra, N., & Simonovits, G. (2016). Underreporting in psychology experiments: Evidence from a study registry. Social Psychological & Personality Science, 7, 8-12.

Who Deserves Healthcare And Unemployment Benefits?

As I find myself currently recovering from a cold, it’s a happy coincidence that I had planned to write about people’s intuitions about healthcare this week. In particular, a new paper by Jensen & Petersen (2016) attempted to demonstrate a fairly automatic cognitive link between the mental representation of someone as “sick” and of that same target as “deserving of help.” Sickness is fairly unique in this respect, it is argued, because of our evolutionary history with it: as compared with what many refer to as diseases of modern lifestyle (including those resulting from obesity and smoking), infections tended to strike people randomly; not randomly in the sense that anyone is equally as likely to get sick, but more in the sense that people often had little control over when they did. Infections were rarely the result of people intentionally seeking them out or behaving in certain ways. In essence, then, people view those who are sick as unlucky, and unlucky individuals are correspondingly viewed as being more deserving of help than those who are responsible for their own situation.

…and more deserving of delicious, delicious pills

This cognitive link between luck and deservingness can be partially explained by examining expected returns on investment in the social world (Tooby & Cosmides, 1996). In brief, helping others takes time and energy, and it would only be adaptive for an organism to sacrifice resources to help another if doing so was beneficial to the helper in the long term. This is often achieved by me helping you at a time when you need it (when my investment is more valuable to you than it is to me), and then you helping me in the future when I need it (when your investment is more valuable to me than it is to you). This is reciprocal altruism, known by the phrase, “I scratch your back and you scratch mine.” Crucially, the probability of receiving reciprocation from the target you help should depend on why that target needed help in the first place: if the person you’re helping is needy because of their own behavior (i.e., they’re lazy), their need today is indicative of their need tomorrow. They won’t be able to help you later for the same reasons they need help now. By contrast, if someone is needy because they’re unlucky, their current need is not as diagnostic of their future need, and so it is more likely they will repay you later. Because the latter type is more likely to repay than the former, our intuitions about who deserves help shift accordingly.

As previously mentioned, infections tend to be distributed more randomly; my being sick today (generally) doesn’t tell you much about the probability of my future ability to help you once I recover. Because of that, the need generated by infections tends to make sick individuals look like valuable targets of investment: their need state suggests they value your help and will be grateful for it, both of which likely translate into their helping you in the future. Moreover, the needs generated by illnesses can frequently be harmful, even to the point of death if assistance isn’t provided. The greater the need state to be filled, the greater the potential for alliances to be formed, both with and against you. To place that point in a quick, yet extreme, example: pulling someone from a burning building is more likely to endear you to them than just helping them move; conversely, failing to save someone’s life when it’s well within your capabilities can set their existing allies against you.

The sum total of this reasoning is that people should intuitively perceive the sick as more deserving of help than those suffering from other problems that cause need. The particular other problem that Jensen & Petersen (2016) contrast sickness with is unemployment, which they suggest is a fairly modern problem. The conclusion drawn by the authors from these points is that the human mind – given its extensive history with infections and their random nature – should automatically tag sick individuals as deserving of assistance (i.e., broad support for government healthcare programs), while our intuitions about whether the unemployed deserve assistance should be much more varied, contingent on the extent to which unemployment is viewed as being more luck- or character-based. This fits well with the initial data that Jensen & Petersen (2016) present about the relative, cross-national support for government spending on healthcare and unemployment: not only is healthcare much more broadly supported than unemployment benefits (in the US, 90% vs 52% of the population support government assistance), but support for healthcare is also quite a bit less variable across countries.

Probably because the unemployed don’t have enough bake sales or ribbons

Some additional predictions drawn by the authors were examined across a number of studies in the paper, only two of which I would like to focus on for length constraints. The first of these studies presented 228 Danish participants with one of four scenarios: two in which the target was sick and two in which the target was unemployed. In each of these conditions, the target was also said to be lazy (hasn’t done much in life and only enjoys playing video games) or hardworking (is active and does volunteer work; of note, the authors label the lazy/hardworking conditions as high/low control, respectively, but I’m not sure that really captures the nature of the frame well). Participants were asked how much an individual like that deserved aid from the government when sick/unemployed on a 7-point scale (which was converted to a 0-1 scale for ease of interpretation).

Overall, support for government aid was lower in both conditions when the target was framed as being lazy, but this effect was much larger in the case of unemployment. When it came to the sick individual, support for healthcare for the hardworking target was about 0.9, while support for the lazy one dipped to about 0.75; by contrast, the hardworking unemployed individual was supported with benefits at about 0.8, while the lazy one only received support around the 0.5 point. In other words, the deservingness information was roughly half as influential when it came to sickness as it was when it came to unemployment.

There is an obvious shortcoming in that study, however: being lazy has quite a bit less to do with getting sick than it does with getting a job. This issue was addressed better in the third study, where the stimuli were more tailored to the problems. In the case of unemployment, the individual was described as an unskilled worker who was told by his union to get further training, with the union even offering to help. The individual either takes or does not take the additional training, but either way eventually ends up unemployed. In the case of healthcare, the individual is described as a long-term smoker who was repeatedly told by his doctor to quit. The person either eventually quits smoking or does not, but either way ends up getting lung cancer. The general pattern of results from study two replicated again: for the smoker, support for government aid hovered around 0.8 when he quit and 0.7 when he did not; for the unemployed person, support was about 0.75 when he took the training and around 0.55 when he did not.

“He deserves all that healthcare for looking so cool while smoking”

While there does seem to be evidence for sicknesses being cognitively tagged as more deserving of assistance than unemployment (there were also some association studies I won’t cover in detail), there is a recurrent point in the paper that I am hesitant about endorsing fully. The first mention of this point is found early on in the manuscript, and reads:

“Citizens appear to reason as if exposure to health problems is randomly distributed across social strata, not noting or caring that this is not, in fact, the case…we argue that the deservingness heuristic is built to automatically tag sickness-based needs as random events…”

A similar theme is mentioned later in the paper as well:

“Even using extremely well-tailored stimuli, we find that subjects are reluctant to accept explicit information that suggests that sick people are undeserving.”

In general I find the data they present to be fairly supportive of this idea, but I feel it could do with some additional precision. First and foremost, participants did utilize this information when determining deservingness. The dips might not have been as large as they were for unemployment (more on that later), but they were present. Second, participants were asked about helping one individual in particular. If, however, sickness is truly being automatically tagged as randomly distributed, then deservingness factors should not be expected to come into play when decisions involve making trade-offs between the welfare of two individuals. In a simple case, a hospital could be faced with a dilemma in which two patients need a lung transplant, but only a single lung is available. These two patients are otherwise identical except one has lung cancer due to a long history of smoking, while the other has lung cancer due to a rare infection. If you were to ask people which patient should get the organ, a psychological system that was treating all illness as approximately random should be indifferent between giving it to the smoker or the non-smoker. A similar analysis could be undertaken when it comes to trading-off spending on healthcare and non-healthcare items as well (such as making budget cuts to education or infrastructure in favor of healthcare). 

Finally, there are two additional factors which I would like to see explored by future research in this area. First, the costs of sickness and unemployment tend to be rather asymmetric in a number of ways: not only might sickness be more often life-threatening than unemployment (thus generating more need, which can swamp the effects of deservingness to some degree), but unemployment benefits might well need to be paid out over longer periods of time than medical ones (assuming sickness tends to be more transitory than unemployment). In fact, unemployment benefits might actively encourage people to remain unemployed, whereas medical benefits do not encourage people to remain sick. If these factors could somehow be held constant or removed, a different picture might begin to emerge. I could imagine deservingness information mattering more when a drug is required to alleviate discomfort, rather than save a life. Second – though I don’t know to what extent this is likely to be relevant – the stimulus materials in this research all ask about whether the government ought to be providing aid to sick/unemployed people. It is possible that somewhat different responses might have been obtained if measures had been taken of participants’ own willingness to provide that aid. After all, it is much less of a burden on me to insist that someone else ought to be taking care of a problem than to take care of it myself.

References: Jensen, C. & Petersen, M. (2016). The deservingness heuristic and the politics of health care. American Journal of Political Science, DOI: 10.1111/ajps.12251

Tooby, J. & Cosmides, L. (1996). Friendship and the banker’s paradox: Other pathways to the evolution of adaptations for altruism. Proceedings of the British Academy, 88, 119-143.

Absolute Vs Relative Mate Preferences

As the comedian Louis CK quipped some time ago, “Everything is amazing right now and nobody is happy.” In that instance he was referring to the massive technological improvements that have arisen in the fairly-recent past which served to make our lives easier and more comfortable. Reflecting on the level of benefit that this technology has added to our lives (e.g., advanced medical treatments, the ability to communicate with people globally in an instant, or to travel globally in a matter of a few hours, etc.), it might feel kind of silly that we aren’t content with the world; this kind of lifestyle sure beats living in the wilderness in a constant contest to find food, ward off predators and parasites, and endure the elements. So why aren’t we happy all the time? There are many ways to answer this question, but I wanted to focus on one in particular: specifically, given our nature as a social species, much of our happiness is determined by relative factors. If everyone is fairly well off in the absolute sense, you being well off doesn’t help you when it comes to being selected as a friend, cooperative partner, or mate, because it doesn’t signal anything special about your value to others. What you are looking for in that context is not to be doing well on an absolute level, but to be doing better than others.

 If everyone has an iPhone, no one has an iPhone

To place this in a simple example, if you want to get picked for the basketball team, you’re looking to be taller than other people; increasing everyone’s height by 3 inches doesn’t uniquely benefit you, as your relative position and desirability have remained the same. On a related note, if you are doing well on some absolute metric but could be doing better, remaining content with one’s lot in life and forgoing those additional benefits is not the type of psychology one would predict to have proven adaptive. All else being equal, the male satisfied with a single mate who forgoes an additional one will be out-reproduced by the male who takes the second as well. Examples like these help to highlight the positional aspects of human satisfaction: even though our day-to-day lives are no doubt generally happier to some degree because people aren’t dying from smallpox and we have cell phones, people are often less happy than we might expect because so much of that happiness is not determined by one’s absolute state. Instead, our happiness is determined by our relative state: how well we could be doing relative to our current status, and how much we offer socially, relative to others.

A similar logic was applied in a recent paper by Conroy-Beam, Goetz, & Buss (2016) that examined people’s relationship satisfaction. The researchers were interested in testing the hypothesis that it’s not about how well one’s partner matches their ideal preferences on some absolute threshold when it comes to relationship satisfaction; instead, partner satisfaction is more likely to be a product of (a) whether more attractive alternative partners are available and (b) whether one is desirable enough to attract one of them. One might say that people are less concerned with how much they like their spouse and more concerned with whether they could get a better possible spouse: if one can move up in the dating world, then their satisfaction with their current partner should be relatively low; if one can’t move up, they ought to be satisfied with what they already have. After all, it makes little sense to abandon your mate for not meeting your preferences if your other options are worse.

These hypotheses were tested in a rather elegant and unique way across three studies, all of which utilized a broadly-similar methodology (though I’ll only be discussing two). The core of each involved participants who were currently in relationships completing four measures: one concerning how important 27 traits would be in an ideal mate (on a 7-point scale), another concerning how well those same traits described their current partner, a third regarding how those traits described themselves, and finally rating their relationship satisfaction.

To determine how well a participant’s current partner fulfilled their preferences, the squared difference between the participant’s ideal and actual partner was summed for all 27 traits and then the square root of that value was taken. This process generated a single number that provided a sense for how far off from some ideal an actual partner was across a large number of traits: the larger this number, the worse of a fit the actual partner was. A similar transformation was then carried out with respect to how all the other participants rated their partners on those traits. In other words, the authors calculated what percentage of other people’s actual mates fit the preferences of each participant better than their current partner. Finally, the authors calculated the discrepancy in mate value between the participant and their partner. This was done in a three-step process, the gist of which is that they calculated how well the participant and their partner met the average ideals of the opposite sex. If you are closer to the average ideal partner of the opposite sex than your partner, you have the higher mate value (i.e., are more desirable to others); if you are further away, you have the lower mate value.
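To make the mechanics of those scores concrete, here is a minimal sketch of how such values could be computed, assuming the trait ratings are stored as NumPy arrays; the function and variable names are my own, and this is an illustration of the general approach rather than the authors’ actual analysis code.

```python
import numpy as np

def preference_fit(ideal, partner):
    """Euclidean distance between a participant's 27 ideal-trait ratings and
    their actual partner's ratings; larger values mean a worse fit."""
    return np.sqrt(np.sum((np.asarray(ideal) - np.asarray(partner)) ** 2))

def percent_better_alternatives(ideal, own_partner, other_partners):
    """Share of other participants' partners who fit this participant's ideal
    better (i.e., sit at a smaller distance) than their own partner does."""
    own_distance = preference_fit(ideal, own_partner)
    other_distances = np.array([preference_fit(ideal, p) for p in other_partners])
    return np.mean(other_distances < own_distance)

def mate_value_discrepancy(self_ratings, partner_ratings, opposite_sex_ideal):
    """Positive when the participant is closer to the opposite sex's average
    ideal than their partner is (i.e., is the more desirable of the pair)."""
    return (preference_fit(opposite_sex_ideal, partner_ratings)
            - preference_fit(opposite_sex_ideal, self_ratings))
```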

 It’s just that simple!

In the interest of cutting through the mathematical complexity, there were three values calculated. Assuming you were taking the survey, they would correspond to (1) how well your actual partner matched your ideal, (2) what percent of possible real mates out in the world are better overall fits, and (3) how much more or less desirable you are to others, relative to your partner. These values were then plugged into a regression predicting relationship satisfaction. As it turned out, in the first study (N = 260), the first value – how well one’s partner matched their ideal – barely predicted relationship satisfaction at all (β = .06); by contrast, the number of other potential people who might make better fits was a much stronger predictor (β = -.53), as was the difference in relative mate value between the participant and their partner (β = .11). There was also an interaction between these latter two values (β = .21). As the authors summarized these results:
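For readers who want to see the shape of that analysis, here is a hedged sketch of such a regression run on synthetic data; the column names, the coefficients used to generate the fake data, and the statsmodels workflow are all my own choices, meant only to mirror the structure described above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 260  # sample size of the first study

# Synthetic stand-ins for the three predictors; names are illustrative.
df = pd.DataFrame({
    "pref_fit": rng.normal(size=n),        # distance from one's ideal partner
    "pct_better": rng.normal(size=n),      # share of better-fitting alternatives
    "mv_discrepancy": rng.normal(size=n),  # own mate value minus partner's
})
# Fake satisfaction scores built with effects roughly in the reported directions.
df["satisfaction"] = (-0.5 * df["pct_better"] + 0.1 * df["mv_discrepancy"]
                      + 0.2 * df["pct_better"] * df["mv_discrepancy"]
                      + rng.normal(size=n))

# Standardize so the coefficients read like the betas reported in the paper.
dfz = (df - df.mean()) / df.std()
model = smf.ols("satisfaction ~ pref_fit + pct_better * mv_discrepancy", data=dfz).fit()
print(model.params)
```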

“Participants lower in mate value than their partners were generally satisfied regardless of the pool of potential mates; participants higher in mate value than their partners became increasingly dissatisfied with their relationships as better alternative partners became available”

So, if your partner is already more attractive than you, then you probably consider yourself pretty lucky. Even if there are a great number of better possible partners out there for you, you’re not likely to be able to attract them (you got lucky once dating up; better to not try your luck a second time). By contrast, if you are more attractive than your partner, then it might make sense to start looking around for better options. If few alternatives exist, you might want to stick around; if many do, then switching might be beneficial.

The second study addressed the point that partners in these relationships are not passive bystanders when it comes to being dumped; they’re wary about the possibility of their partner seeking greener pastures. For instance, if you understand that your partner is more attractive than you, you likely also understand (at least intuitively) that they might try to find someone who suits them better than you do (because they have that option). If you view being dumped as a bad thing (perhaps because you can’t do better than your current partner), you might try to do more to keep them around. Translating that into a survey, Conroy-Beam et al (2016) asked participants to indicate how often they engaged in 38 mate retention tactics over the course of the past year. These include a broad range of behaviors, including calling to check up on one’s partner, asking to deepen commitment to them, derogating potential alternative mates, buying gifts, or performing sexual favors, among others. Participants also filled out the mate preference measures as before.

The results from the first study regarding satisfaction were replicated. Additionally, as expected, there was a positive relationship between these retention behaviors and relationship satisfaction (β = .20): the more satisfied one was with their partner, the more they behaved in ways that might help keep that partner around. There was also a negative relationship between trust and these mate retention behaviors (β = -.38): the less one trusted their partner, the more they behaved in ways that might discourage them from leaving. While that might sound strange at first – why encourage someone you don’t trust to stick around? – it is fairly easy to understand to the extent that perceptions of partner trust intuitively track the probability that your partner can do better than you: it’s easier to trust someone who doesn’t have alternatives than it is to trust one who might be tempted.

It’s much easier to avoid sinning when you don’t live around an orchard

Overall, I found this research an ingenious way to examine relationship satisfaction and partner fit across a wide range of different traits. There are, of course, some shortcomings to the paper which the authors do mention, including the fact that all the traits were given equal weighting (meaning that the fit for “intelligent” would be rated as being as important as the fit for “dominant” when determining how well your partner suited you) and the pool of potential mates was not considered in the context of a local sample (that is, it matters less if people across the country fit your ideal better than your current mate, relative to if people in your immediate vicinity do). However, given the fairly universal features of human mating psychology and the strength of the obtained results, these do not strike me as fatal to the design in any way; if anything, they raise the prospect that the predictive strength of this approach could actually be improved by tailoring it to specific populations.

References: Conroy-Beam, D., Goetz, C., & Buss, D. (2016). What predicts romantic relationship satisfaction and mate retention intensity: mate preference fulfillment or mate value discrepancies? Evolution & Human Behavior, DOI: http://dx.doi.org/10.1016/j.evolhumbehav.2016.04.003

Psychology Research And Advocacy

I get the sense that many people get a degree in psychology because they’re looking to help others (since most clearly aren’t doing it for the pay). For those who get a degree in the clinical side of the field, this observation seems easy to make; at the very least, I don’t know of any counselors or therapists who seek to make their clients feel worse about the state their life is in and keep them there. For those who become involved in the research end of psychology, I believe this desire to help others is still a major motivator. Rather than trying to help specific clients, however, many psychological researchers are driven by a motivation to help particular groups in society: women, certain racial groups, the sexually promiscuous, the outliers, the politically liberal, or any group that the researcher believes to be unfairly marginalized, undervalued, or maligned. Their work is driven by a desire to show that the particular group in question has been misjudged by others, with those doing the misjudging being biased and, importantly, wrong. In other words, their role as a researcher is often driven by their role as an advocate, and the quality of their work and thinking can often take a back seat to their social goals.

When megaphones fail, try using research to make yourself louder

Two such examples are highlighted in a recent paper by Eagly (2016), both of which can broadly be considered to focus on the topic of diversity in the workplace. I want to summarize them quickly before turning to some of the other facets of the paper I find noteworthy. The first case concerns the prospect that having more women on corporate boards tends to increase their profitability, a point driven by a finding that Fortune 500 companies in the top quarter of female representation on boards of directors performed better than those in the bottom quarter of representation. Eagly (2016) rightly notes that such a basic data set would be all but unpublishable in academia for failing to do a lot of important things. Indeed, when more sophisticated research was considered in a meta-analysis of 140 studies, the gender diversity of the board of directors had about as close to no effect as possible on financial outcomes: the average correlations across all the studies ranged from about r = .01 all the way up to r = .05 depending on what measures were considered. Gender diversity per se seemed to have no meaningful effect despite a variety of advocacy sources claiming that increasing female representation would provide financial benefits. Rather than considering the full scope of the research, the advocates tended to cite only the most simplistic analyses that provided the conclusion they wanted (others) to hear.

The second area of research concerned how demographic diversity in work groups can affect performance. The general assumption that is often made about diversity is that it is a positive force for improving outcomes, given that a more cognitively-varied group of people can bring a greater number of skills and perspectives to bear on solving tasks than more homogeneous groups can. As it turns out, however, another meta-analysis of 146 studies concluded that demographic diversity (both in terms of gender and racial makeup) had effectively no impact on performance outcomes: the correlation for gender was r = -.01 and was r = -.05 for racial diversity. By contrast, differences in skill sets and knowledge had a positive, but still very small effect (r = .05). In summary, findings like these would suggest that groups don’t get better at solving problems just because they’re made up of enough [men/women/Blacks/Whites/Asians/etc]. Diversity in demographics per se, unsurprisingly, doesn’t help to magically solve complex problems.

While Eagly (2016) appears to generally be condemning the role of advocacy in research when it comes to getting things right (a laudable position), there were some passages in the paper that caught my eye. The first of these concerns what advocates for causes should do when the research, taken as a whole, doesn’t exactly agree with their preferred stance. In this case, Eagly (2016) focuses on the diversity research that did not show good evidence for diverse groups leading to positive outcomes. The first route one might take is to simply misrepresent the state of the research, which is obviously a bad idea. Instead, Eagly suggests advocates take one of two alternative routes: first, she recommends that researchers might conduct research into more specific conditions under which diversity (or whatever one’s preferred topic is) might be a good thing. This is an interesting suggestion to evaluate: on the one hand, people would often be inclined to say it’s a good idea; in some particular contexts diversity might be a good thing, even if it’s not always, or even generally, useful. This wouldn’t be the first time effects in psychology are found to be context-dependent. On the other hand, this suggestion also runs some serious risks of inflating type 1 errors. Specifically, if you keep slicing up data and looking at the issue in a number of different contexts, you will eventually uncover positive results even if they’re just due to chance. Repeated subgroup or subcontext analysis doesn’t sound much different from the questionable statistical practices currently being blamed for psychology’s replication problem: just keep conducting research and only report the parts of it that happened to work, or keep massaging the data until the right conclusion falls out.    
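To illustrate why that practice is worrying, here is a small simulation of my own (purely illustrative, not drawn from Eagly’s paper) showing how testing a null effect separately within many arbitrary subgroups inflates the chance of finding at least one “significant” result even when no true effect exists anywhere.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, n_sims, n_subgroups = 400, 2000, 8
hits = 0

for _ in range(n_sims):
    outcome = rng.normal(size=n)                      # no true effect of condition
    condition = rng.integers(0, 2, size=n)            # e.g., diverse vs. homogeneous groups
    subgroup = rng.integers(0, n_subgroups, size=n)   # arbitrary contexts to slice by
    p_values = []
    for g in range(n_subgroups):
        a = outcome[(subgroup == g) & (condition == 0)]
        b = outcome[(subgroup == g) & (condition == 1)]
        if len(a) > 1 and len(b) > 1:
            p_values.append(stats.ttest_ind(a, b).pvalue)
    if p_values and min(p_values) < .05:              # report whichever context "worked"
        hits += 1

print(f"At least one 'significant' subgroup in {hits / n_sims:.0%} of null datasets")
```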

“…the rest goes in the dumpster out back”

Eagly’s second suggestion I find a bit more worrisome: arguing that relevant factors – like increases in profits, productivity, or finding better solutions – aren’t actually all that relevant when it comes to justifying why companies should increase diversity. What I find odd about this is that it seems to suggest that the advocates begin with their conclusion (in this case, that diversity in the work force ought to be increased) and then just keep looking for ways to justify it in spite of previous failures to do so. Again, while it is possible that there are benefits to diversity which aren’t yet being considered in the literature, bad research would likely result from a process where someone starts their analysis with the conclusion and keeps going until they justify it to others, no matter how often it requires shifting the goal posts. A major problem with that suggestion mirrors other aspects of the questionable research practices I mentioned before: when researchers find the conclusion they’re looking for, they stop looking. They only collect data up until the point it is useful, which rigs the system in favor of finding positive results where there are none. That could well mean, then, that there will be negative consequences to these diversity policies which are not being considered.

What I think is a good example of this justification problem leading to shoddy research practices/interpretation follows shortly thereafter. In talking about some of these alternative benefits that more female hires might have, Eagly (2016) notes that women tend to be more compassionate and egalitarian than men; as such, hiring more women should be expected to increase less-considered benefits, such as a reduction in the laying-off of employees during economic downturns (referred to as labor hoarding), or more favorable policies towards time off for family care. Now something like this should be expected: if you have different people making the decisions, different decisions will be made. Setting aside for the moment the question of whether those different policies are better in some objective sense of the word, if one is interested in encouraging those outcomes (that is, they’re preferred by the advocate), then one might wish to address those issues directly, rather than by proxy. That is to say, if you are looking to make the leadership of some company more compassionate, then it makes sense to test for and hire more compassionate people, rather than to hire more women under the assumption that you will thereby be increasing compassion.

This is an important matter because people are not perfect statistical representations of the groups to which they belong. On average, women may be more compassionate than men; the type of woman who is interested in actively pursuing a CEO position in a Fortune 500 company might not be as compassionate as your average woman, however, and, in fact, might even be less compassionate than a particular male candidate. What Eagly (2016) has ended up reaching, then, is not a justification for hiring more women; it’s a justification for hiring compassionate or egalitarian people. What is conspicuously absent from this section is a call for more research to be conducted on contexts in which men might be more compassionate than women; once the conclusion that hiring women is a good thing has been justified (in the advocate’s mind, anyway), the concerns for more information seem to sputter out. It should go without saying, but such a course of action wouldn’t be expected to lead to the most accurate scientific understanding of our world.

The solution to that problem being more diversity, of course…

To place this point in another quick example, if you’re looking to assemble a group of tall people, it would be better to use people’s height when making that decision rather than their sex, even if men do tend to be taller than women. Some advocates might suggest that being male is a good enough proxy for height, so you should favor male candidates; others would suggest that you shouldn’t be trying to assemble a group of tall people in the first place, as short people offer benefits that tall ones don’t; others still will argue that it doesn’t matter if short people don’t offer benefits, as they should be preferentially selected to combat negative attitudes towards the short regardless (at the expense of selecting tall candidates). For what it’s worth, I find the attitude of “keep doing research until you justify your predetermined conclusion” to be unproductive and indicative of why the relationship between advocates and researchers ought not be a close one. Advocacy can only serve as a cognitive constraint that decreases research quality, as the goal of advocacy is decidedly not truth. Advocates should update their conclusions in light of the research; not vice versa.

References: Eagly, A. (2016). When passionate advocates meet research on diversity, does the honest broker stand a chance? Journal of Social Issues, 72, 199-222.

More About Psychology Research Replicating

By now, many of you have no doubt heard about the reproducibility project, where 100 psychological findings were subjected to replication attempts. In case you’re not familiar with it, the results of this project were less than a ringing endorsement of research in the field: of the expected 89 replications, only 37 were obtained, and the average size of the effects fell dramatically; social psychology research in particular seemed uniquely bad in this regard. This suggests that, in many cases, one would be well served by taking many psychological findings with a couple grains of salt. Naturally, this leads many people to wonder whether there’s any way they might be more confident that an effect is real, so to speak. One possible means through which your confidence might be bolstered is whether or not the research in question contains conceptual replications. What this refers to are cases where the authors of a manuscript report the results of several different studies purporting to measure the same underlying thing with varying methods; that is, they are studying topic A with methods X, Y, and Z. If all of these turn up positive, you ought to be more confident that an effect is real. Indeed, I have had a paper rejected more than once for only containing a single experiment. Journals often want to see several studies in one paper, and that is likely part of the reason why: a single experiment is surely less reliable than multiple ones.

It doesn’t go anywhere, but at least it does so reliably

According to the unknown moderator account of replication failure, psychological research findings are, in essence, often fickle. Some findings might depend on the time of day that measurements were taken, the country of the sample, some particular detail of the stimulus material, whether the experimenter is a man or a woman; you name it. In other words, it is possible that these published effects are real, but only occur in some rather specific contexts of which we are not adequately aware; that is to say, they are moderated by unknown variables. If that’s the case, it should come as no surprise that some replication efforts fail, as it is quite unlikely that all of the unique, unknown, and unappreciated moderators will be recreated as well. This is where conceptual replications come in: if a paper contains two, three, or more different attempts at studying the same topic, we should expect that the effect they turn up is more likely to extend beyond a very limited set of contexts and should replicate more readily.

That’s a flattering hypothesis for explaining these replication failures; there’s just not enough replication going on prepublication, so limited findings are getting published as if they were more generalizable. The less-flattering hypothesis is that many researchers are, for lack of a better word, cheating by employing dishonest research tactics. These tactics can include hypothesizing after data is collected, only collecting participants until the data says what the researchers want and then stopping, splitting samples up into different groups until differences are discovered, and so on. There’s also the notorious issue of journals only publishing positive results rather than negative ones (creating a large incentive to cheat, as punishment for doing so is all but non-existent so long as you aren’t just making up the data). It is for these reasons that requiring the pre-registering of research – explicitly stating what you’re going to look at ahead of time – drops positive findings markedly. If research is failing to replicate because the system is being cheated, more internal replications (those from the same authors) don’t really help that much when it comes to predicting external replications (those conducted by outside parties). Internal replications just provide researchers the ability to report multiple attempts at cheating.
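One of those tactics, collecting participants only until the data say what you want, is easy to see in a quick simulation (again, my own illustrative sketch, not anything from the papers discussed here): even with no true effect, checking the data repeatedly and stopping the moment p dips below .05 pushes the false-positive rate well above the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sims, start_n, max_n, batch = 2000, 20, 100, 10
false_positives = 0

for _ in range(n_sims):
    # Both groups come from the same distribution: any "effect" is pure noise.
    a = list(rng.normal(size=start_n))
    b = list(rng.normal(size=start_n))
    while True:
        p = stats.ttest_ind(a, b).pvalue
        if p < .05:              # stop as soon as the result looks publishable
            false_positives += 1
            break
        if len(a) >= max_n:      # give up once the sample cap is reached
            break
        a.extend(rng.normal(size=batch))   # otherwise, collect a few more people
        b.extend(rng.normal(size=batch))

print(f"False-positive rate with optional stopping: {false_positives / n_sims:.0%}")
```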

These two hypotheses make different predictions concerning the data from the aforementioned reproducibility project: specifically, research containing internal replications ought to be more likely to successfully replicate if the unknown moderator hypothesis is accurate. It certainly would be a strange state of affairs from a “this finding is true” perspective if multiple conceptual replications were no more likely to prove reproducible than single-study papers. It would be similar to saying that effects which have been replicated are no more likely to subsequently replicate than effects which have not. By contrast, the cheating hypothesis (or, more politely, questionable research practices hypothesis) has no problem at all with the idea that internal replications might prove to be as externally replicable as single-study papers; cheating a finding out three times doesn’t mean it’s more likely to be true than cheating it out once.

It’s not cheating; it’s just a “questionable testing strategy”

This brings me to a new paper by Kunert (2016) who reexamined some of the data from the reproducibility project. Of the 100 original papers, 44 contained internal replications: 20 contained just one replication, 10 were replicated twice, 9 were replicated 3 times, and 5 contained more than three. These were compared against the 56 papers which did not contain internal replications to see which would subsequently replicate better (as measured by achieving statistical significance). As it turned out, papers with internal replications externally replicated about 30% of the time, whereas papers without internal replications externally replicated about 40% of the time. Not only were the internally-replicated papers not substantially better, they were actually slightly worse in that regard. A similar conclusion was reached regarding the average effect size: papers with internal replications were no more likely to subsequently contain a larger effect size, relative to papers without such replications.

It is possible, of course, that papers containing internal replications are different than papers which do not contain such replications. This means it might be possible that internal replications are actually a good thing, but their positive effects are being outweighed by other, negative factors. For example, someone proposing a particularly novel hypothesis might be inclined to include more internal replications in their paper than someone studying an established one; the latter researcher doesn’t need more replications in his paper to get it published because the effect has already been replicated in other work. Towards examining this point, Kunert (2016) made use of the 7 identified reproducibility predictors from the Open Science Collaboration – field of study, effect type, original P-value, original effect size, replication power, surprisingness of original effect, and the challenge of conducting the replication – to assess whether internally-replicated work differed in any notable ways from the non-internally-replicated sample. As it turns out, the two samples were pretty similar overall on all the factors except one: field of study. Internally-replicated effects tended to come from social psychology more frequently (70%) than cognitive psychology (54%). As I mentioned before, social psychology papers did tend to replicate less often. However, the unknown moderator effect was not particularly well supported for either field when examined individually.

In summary, then, papers containing internal replications were no more likely to do well when it came to external replications which, in my mind, suggests that something is going very wrong in the process somewhere. Perhaps researchers are making use of their freedom to analyze and collect data as they see fit in order to deliver the conclusions they want to see; perhaps journals are preferentially publishing the findings of people who got lucky, relative to those who got it right. These possibilities, of course, are not mutually exclusive. Now I suppose one could continue to make an argument that goes something like, “papers that contain conceptual replications are more likely to be doing something else different, relative to papers with only a single study,” which could potentially explain the lack of predictive strength provided by internal replications, and whatever that “something” is might not be directly tapped by the variables considered in the current paper. In essence, such an argument would suggest that there are unknown moderators all the way down.

“…and that turtle stands on the shell of an even larger turtle…”

While it’s true enough that such an explanation is not ruled out by the current results, it should not be taken as any kind of default stance on why this research is failing to replicate. The “researchers are cheating” explanation strikes me as a bit more plausible at this stage, given that there aren’t many other obvious explanations for why ostensibly replicated papers are no better at replicating. As Kunert (2016) plainly puts it:

This report suggests that, without widespread changes to psychological science, it will become difficult to distinguish it from informal observations, anecdotes and guess work.

This brings us to the matter of what might be done about the issue. There are procedural ways of attempting to address the problem – such as Kunert’s (2016) recommendation that journals publish papers independent of their results – but my focus has been, and continues to be, on the theoretical aspects of publication. Too many papers in psychology get published without any apparent need for the researchers to explain their findings in any meaningful sense; instead, they usually just restate and label their findings, or they posit some biologically-implausible function for what they found. Without the serious and consistent application of evolutionary theory to psychological research, implausible effects will continue to be published and subsequently fail to replicate because there’s otherwise little way to tell whether a finding makes sense. By contrast, I find it plausible that unlikely effects can be more plainly spotted – by reviewers, readers, and replicators – if they are all couched within the same theoretical framework; even better, the problems in design can be more easily identified and rectified by considering the underlying functional logic, leading to productive future research.

References: Kunert, R. (2016). Internal conceptual replications do not increase independent replication success. Psychonomic Bulletin & Review, DOI: 10.3758/s13423-016-1030-9

Morality, Alliances, And Altruism

Having one’s research ideas scooped is part of academic life. Today, for instance, I’d like to talk about some research quite similar in spirit to work I had intended to do as part of my dissertation (but did not, as it didn’t end up making the cut in the final approved package). Even if my name isn’t on it, it is still pleasing to see the results I had anticipated. The idea itself arose about four years ago, when I was discussing the curious case of Tucker Max’s donation to Planned Parenthood being (eventually) rejected by the organization. To quickly recap, Tucker was attempting to donate half a million dollars to the organization, essentially receiving little more than a plaque in return. However, the donation was rejected, it would seem, out of fear of building an association between the organization and Tucker, as some people perceived Tucker to be a less-than-desirable social asset. This, of course, is rather strange behavior, and we would recognize it as such if it were observed in any other species (e.g., “this cheetah refused a free meal for her and her cubs because the wrong cheetah was offering it”); refusing free benefits is just peculiar.

“Too rich for my blood…”

As it turns out, this pattern of behavior is not unique to the Tucker Max case (or the Kim Kardashian one…); it has recently been empirically demonstrated by Tasimi & Wynn (2016), who examined how children respond to altruistic offers from others, contingent on the moral character of said others. In their first experiment, 160 children between the ages of 5 and 8 were recruited to make an easy decision; they were shown two pictures of people and told that the people in the pictures wanted to give them stickers, and they had to pick which one they wanted to receive the stickers from. In the baseline conditions, one person was offering 1 sticker, while the other was offering either 2, 4, 8, or 16 stickers. As such, it should come as no surprise that the person offering more stickers was almost universally preferred (71 of the 80 children wanted the person offering more, regardless of how many more).

Now that we’ve established that more is better, we can consider what happened in the second condition, where the children received character information about their benefactors. One of the individuals was said to always be mean, having hit someone the other day while playing; the other was said to always be nice, having hugged someone the other day instead. The mean person was always offering more stickers than the nice one. In this condition, the children tended to shun the larger quantity of stickers: when the sticker ratio was 2:1, less than 25% of children accepted the larger offer from the mean person; the 4:1 and 8:1 ratios were accepted about 40% of the time, and the 16:1 ratio 65% of the time. While more is better in general, it is apparently not better enough to lead children to overlook the character information much of the time. People appear willing to forgo receiving altruism when it’s coming from the wrong type of person. Fascinating stuff, especially when one considers that such refusals end up leaving the wrongdoers with more resources than they would otherwise have (if you think someone is mean, wouldn’t you be better off taking those resources from them, rather than letting them keep them?).

This pattern was replicated in 64 very young children (approximately one year old). In this experiment, the children observed a puppet show in which two puppets offered them crackers, with one offering a single cracker and the other offering either 2 or 8. Again, unsurprisingly, the majority of children accepted the larger offer, regardless of how much larger it was (24 of 32 children). In the character-information condition, one puppet was shown to be a helper, assisting another puppet in retrieving a toy from a chest, whereas the other puppet was a hinderer, preventing another from retrieving a toy. The hindering puppet, as before, offered the greater number of crackers, whereas the helper only offered one cracker. When the hindering puppet was offering 8 crackers, his offer was accepted about 70% of the time, which did not differ from the baseline group. However, when the hindering puppet was only offering 2, the acceptance rate was a mere 19%. Even young children, it would seem, are willing to avoid accepting altruism from wrongdoers, assuming the difference in offers isn’t too large.

“He’s not such a bad guy once you get $10 from him”

While neat, these results beg for a deeper explanation as to why we should expect such altruism to be rejected. I believe hints of this explanation are provided by the way Tasimi & Wynn (2016) write about their results:

Taken together, these findings indicate that when the stakes are modest, children show a strong tendency to go against their baseline desire to optimize gain to avoid ‘‘doing business” with a wrongdoer; however, when the stakes are high, children show more willingness to ‘‘deal with the devil…”

What I find strange about that passage is that children in the current experiments were not “doing business” or “making deals” with the altruists; there was no quid pro quo going on. The children were no more doing business with the others than they are doing business with a breastfeeding mother. Nevertheless, there appears to be an implicit assumption being made here: an individual who accepts altruism from another is expected to pay that altruism back in the future. In other words, merely receiving altruism from another generates the perception of a social association between the donor and recipient.

This creates an uncomfortable situation for the recipient in cases where the donor has enemies. Those enemies are often interested in inflicting costs on the donor or, at the very least, withholding benefits from him. In the latter case, this makes that social association with the donor less beneficial than it otherwise might be, since the donor will have fewer expected future resources to invest in others if others don’t help him; in the former case, not only does the previous logic hold, but the enemies of your donor might begin to inflict costs on you as well, so as to dissuade you from helping him. To put this into a quick example: Jon – your friend – goes out and hurts Bob, say, by sleeping with Bob’s wife. Bob and his friends, in response, both withhold altruism from Jon (as punishment) and might even be inclined to attack him for his transgression. If they perceive you as helping Jon – either by providing him with benefits or by preventing them from hurting him – they might be inclined to withhold benefits from you or punish you as well, as a means of indirect punishment, until you stop helping Jon. To turn the classic phrase, the friend of my enemy is also my enemy (just as the enemy of my enemy is my friend).

What cues might they use to determine if you’re Jon’s ally? Well, one likely useful cue is whether Jon directs altruism towards you. If you are accepting his altruism, this is probably a good indication that you will be inclined to reciprocate it later (else risk being labeled a social cheater or free rider). If you wish to avoid condemnation and punishment by proxy, then, one route to take is to refuse benefits from questionable sources. This risk can be overcome, however, in cases where the morally-questionable donor is providing you a large enough benefit – which, indeed, was precisely the pattern of results observed here. What counts as “large enough” should be expected to vary as a function of a few things, most notably the size and nature of the transgressions, as well as the degree of expected reciprocity. For example, receiving large donations from morally-questionable donors should be expected to be more acceptable to the extent the donation is made anonymously rather than publicly, as anonymity might reduce the perceived social association between donor and recipient.

You might also try only using “morally clean” money

Importantly (as far as I’m concerned), this data fits well within my theory of morality – where morality is hypothesized to function as an association-management mechanism – but not particularly well with other accounts: altruistic accounts of morality should predict that more altruism is still better; dynamic coordination says nothing about accepting altruism, as giving isn’t morally condemned; and self-interest/mutualistic accounts would, I think, also suggest that taking more money would still be preferable, since you’re not trying to dissuade others from giving. While I can’t help but feel some disappointment that I didn’t carry this research out myself, I am both happy with the results that came of it and satisfied with the methods utilized by the authors. Getting research ideas scooped isn’t so bad when they turn out well anyway; I’m just happy enough to see my main theory supported.

References: Tasimi, A. & Wynn, K. (2016). Costly rejection of wrongdoers by infants and children. Cognition, 151, 76-79.

Benefiting Others: Motives Or Ends?

The world is full of needy people; they need places to live, food to eat, medical care to combat biological threats, and, if you ask certain populations in the first world, a college education. Plenty of ink has been spilled over the matter of how to best meet the needs of others, typically with a focus on uniquely needy populations, such as the homeless, poverty-stricken, sick, and those otherwise severely disadvantaged. In order to make meaningful progress in such discussions, there arises the matter of precisely why – in the functional sense of the word – people are interested in helping others, as I believe the answer(s) to that question will be greatly informative when it comes to determining the most effective strategies for doing so. What is very interesting about these discussions is that the focus is frequently placed on helping others altruistically: delivering benefits to others in ways that are costly for the person doing the helping. The typical example of this involves charitable donations, where I would give up some of my money so that someone else can benefit. What is curious about this focus is that our altruistic systems often seem to face quite a bit of pushback from other parts of our psychology when it comes to helping others, resulting in fairly poor deliveries of benefits. It represents a focus on the means by which we help others, rather than really serving to improve the ends of effective helping.

For instance, this sign isn’t asking for donations

As a matter of fact, the most common ways of improving the lives of others don’t involve any altruism at all. For an alternative focus, we might consider the classic Adam Smith quote pertaining to butchers and bakers:

But man has almost constant occasion for the help of his brethren, and it is in vain for him to expect it from their benevolence only. He will be more likely to prevail if he can interest their self-love in his favour, and show them that it is for their own advantage to do for him what he requires of them. Whoever offers to another a bargain of any kind, proposes to do this. Give me that which I want, and you shall have this which you want, is the meaning of every such offer; and it is in this manner that we obtain from one another the far greater part of those good offices which we stand in need of. It is not from the benevolence of the butcher, the brewer, or the baker that we expect our dinner, but from their regard to their own interest.

In short, Smith appears to recommend that, if we wish to effectively meet the needs of others (or have them meet our needs), we must properly incentivize that other-benefiting behavior instead of just hoping people will be willing to continuously suffer costs. Smith’s system, then, is more mutualistic or reciprocal in nature. There are a lot of benefits to trying to use these mutualistic and reciprocally-altruistic cognitive mechanisms, rather than altruistic ones, some of which I outlined last week. Specifically, altruistic systems typically direct benefits preferentially towards kin and social allies, and such a provincial focus is unlikely to deliver benefits particularly well to needy individuals in the wider world (e.g., people who aren’t kin or allies). If, however, you get people to behave in a way that benefits themselves and just so happens to benefit others as a result, you’ll often end up with some pretty good benefit delivery. This is because you don’t need to coerce people into helping themselves.

So let’s say we’re faced with a very real-world problem: there is a general shortage of organs available for people in need of transplants. What cognitive systems do we want to engage to solve that problem? We could, as some might suggest, make people more empathetic to the plight of those suffering in hospitals, dying from organ failure; we might also try to convince people that signing up as an organ donor is the morally-virtuous thing to do. Both of these plans might increase the number of people willing to posthumously donate their organs, but perhaps there are much easier and more effective ways to get people to become organ donors even if they have no particular interest in helping others. I wanted to review two such candidate methods today, neither of which requires that people’s altruistic cognitive systems be particularly engaged.

The first method comes to us from Johnson & Goldstein (2003), who examine some cross-national data on rates of organ donor status. Specifically, they note an oddity in the data: very large and stable differences exist between nations in organ donor status, even after controlling for a number of potentially-relevant variables. Might these different rates exist because people’s preferences for being an organ donor vary markedly between countries? It seems unlikely, unless people in Germany hold an exceedingly negative view of becoming an organ donor (14% are donors, from the figures cited), while people in Sweden are particularly interested in it (86%). In fact, in the US, support for organ donation is at near-ceiling levels, yet a large gap persists between those who support it (95%) and those who indicated on a driver’s license that they were donors (51% in 2005; 60% in 2015) or who had signed a donor card (30%). If it’s not people’s lack of support for such a policy, what is explaining the difference?

A poor national sense for graphic design?

Johnson & Goldstein (2003) float a simple explanation for most of the national differences: whether donor programs were opt-in or opt-out. What that refers to is which decision gets treated as the default when someone has made no explicit choice about what happens to their organs after they die. In opt-in countries (like Germany and the US), non-donor status is assumed unless someone signs up to be a donor; in opt-out countries, like Sweden, people are assumed to be donors unless they indicate that they do not wish to be one. As the authors report, the opt-in countries have much lower effective consent rates (on average, 60% lower), and the two groups represent non-overlapping populations. That data supplements the other experimental findings from Johnson & Goldstein (2003) as well. The authors had 161 participants take part in an experiment where they were asked to imagine they had moved to a new state. This state either treated organ donation as the default option or non-donation as the default, and participants were asked whether they would like to confirm or change their status. There was also a third condition in which no default answer was provided. When no default answer was given, 79% of participants said they would be willing to be an organ donor; a percentage which did not differ from those who confirmed their donor status when it was the default (82%). However, when non-donor status was the default, only 42% of the participants changed their status to donor.

So defaults seem to matter quite a bit, but let’s assume that a nation isn’t going to change its policy from opt-in to opt-out anytime soon. What else might we do if we wanted to improve the rates of people signing up to be an organ donor in the short term? Eyting et al (2016) tested a rather simple method: paying people €10. The researchers recruited 320 German university students who did not currently have an organ donor card and provided them the opportunity to fill one out. These participants were split into three groups: one in which there was no compensation offered for filling out the card, one in which they would personally receive €10 for filling out a card (regardless of which choice they picked: donor or non-donor), and a final condition in which €10 would be donated to a charitable organization (the Red Cross) if they filled out a card. No differences were observed in the percentage of participants who filled out the card between the control (35%) and charity (36%) conditions. However, in the personal-benefit group, there was a spike in the number of people filling out the card (72%). Not all those who filled out the cards opted for donor status, though. Between conditions, the percentage of people who both (a) filled out the card and (b) indicated they wanted to be a donor was about 44% in the personal-payment condition, 28% in the control condition, and only 19% in the charity group. Not only did the charity appeal not seem particularly effective, it was even nominally counterproductive.

“I already donated $10 to charity and now they want my organs too?!”

Now, admittedly, helping others because there’s something in it for you isn’t quite as sexy (figuratively speaking) as helping because you’re driven by an overwhelming sense of empathy, conscience, or simply helping for no benefit at all. This is because there’s a lower signal value in that kind of self-beneficial helping; it doesn’t predict future behavior in the absence of those benefits. As such, it’s unlikely to be particularly effective at building meaningful social connections between helpers and others. However, if the current data is any indication, such helping is also likely to be consistently effective. If one’s goal is to increase the benefits being delivered to others (rather than building social connections), that will often involve providing valued incentives for the people doing the helping.

On one final note, it’s worth mentioning that these papers only deal with people becoming donors after death, not the prospect of donating organs while alive. If one wanted to, say, incentivize someone to donate a kidney while alive, a good way to do so might be to offer them money; that is, allow people to buy and sell organs they are already capable of donating. If people were allowed to engage in mutually-beneficial interactions when it came to selling organs, it is likely we would see certain organ shortages decrease as well. Unfortunately for those in need of organs and/or money, our moral systems often oppose this course of action (Tetlock, 2000), likely contingent on perceptions about which groups would be benefiting the most. I think this serves as yet another demonstration that our moral sense might not be well-suited for maximizing the welfare of people in the wider social world, much as our empathetic systems are not.

References: Eyting, M., Hosemann, A., & Johannesson, M. (2016). Can monetary incentives increase organ donations? Economics Letters, 142, 56-58.

Johnson, E. & Goldstein, D. (2003). Do defaults save lives? Science, 302, 1338-1339.

Tetlock, P. (2000). Coping with trade-offs: Psychological constraints and political implications. In Elements of Reason: Cognition, Choice, & the Bounds of Rationality. Ed. Lupia, A., McCubbins, M., & Popkin, S. 239-322.  

Morality, Empathy, And The Value Of Theory

Let’s solve a problem together: I have some raw ingredients that I would like to transform into my dinner. I’ve already managed to prepare and combine the ingredients, so all I have left to do is cook them. How am I to solve this problem of cooking my food? Well, I need a good source of heat. Right now, my best plan is to get in my car and drive around for a bit, as I have noticed that, after I have been driving for some time, the engine in my car gets quite hot. I figure I can use the heat generated by driving to cook my food. It would come as no surprise to anyone if you had a couple of objections to my suggestion, mostly focused on the point that cars were never designed to solve the problems posed by cooking. Sure, they do generate heat, but that’s really more of a byproduct of their intended function. Further, the heat they do produce isn’t particularly well-controlled or evenly-distributed. Depending on how I position my ingredients or the temperature they require, I might end up with a partially-burnt, partially-raw dinner that is likely also full of oil, gravel, and other debris that has been kicked up into the engine. Not only is the car engine not very efficient at cooking, then, it’s also not very sanitary. You’d probably recommend that I try using a stove or oven instead.

“I’m not convinced. Get me another pound of bacon; I’m going to try again”

Admittedly, this example is egregious in its silliness, but it does make its point well: while I noted that my car produces heat, I misunderstood the function of the device more generally and tried to use it to solve a problem inappropriately as a result. The same logic also holds in cases where you’re dealing with evolved cognitive mechanisms. I examined such an issue recently, noting that punishment doesn’t seem to do a good job as a mechanism for inspiring trust, at least not relative to its alternatives. Today I wanted to take another run at the underlying issue of matching proximate problem to adaptive function, this time examining a different context: directing aid to the great number of people around the world who need altruism to stave off death and non-lethal, but still quite severe, suffering (issues like alleviating malnutrition and infectious diseases). If you want to inspire people to increase the amount of altruism directed towards these needy populations, you will need to appeal to some component parts of our psychology, so what parts should those be?

The first step in solving this problem is to think about what cognitive systems might increase the amount of altruism directed towards others, and then examine the adaptive function of each to determine whether they will solve the problem particularly efficiently. Paul Bloom attempted a similar analysis (about three years ago, but I’m just reading it now), arguing that empathetic cognitive systems seem like a poor fit for the global altruism problem. Specifically, Bloom makes the case that empathy seems more suited to dealing with single-target instances of altruism, rather than large-scale projects. Empathy, he writes, requires an identifiable victim, as people are giving (at least proximately) because they identify with the particular target and feel their pain. This becomes a problem, however, when you are talking about a population of 100 or 1000 people, since we simply can’t identify with that many targets at the same time. Our empathetic systems weren’t designed to work that way and, as such, augmenting their outputs somehow is unlikely to lead to a productive solution to the resource problems plaguing certain populations. Rather than cause us to give more effectively to those in need, these systems might instead lead us to over-invest further in a single target. Though Bloom isn’t explicit on this point, I feel he would likely agree that this has something to do with empathetic systems not having evolved because they solved the problems of others per se, but rather because they did things like help the empathetic person build relationships with specific targets, or signal their qualities as an associate to those observing the altruistic behavior.

Nothing about that analysis strikes me as distinctly wrong. However, provided I have understood his meaning properly, Bloom goes on to suggest that the matter of helping others involves the engagement of our moral systems instead (as he explains in this video, he believes empathy “fundamentally…makes the world worse,” in the moral sense of the term, and he also writes that there’s more to morality – in this case, helping others – than empathy). The real problem with this idea is that our moral systems are not altruistic systems, even if they do contain altruistic components (in much the same way that my car is not a cooking mechanism even if it does generate heat). This can be summed up in a number of ways, but the simplest is a study by Kurzban, DeScioli, & Fein (2012) in which participants were presented with the footbridge dilemma (“Would you push one person in front of a train – killing them – to save five people from getting killed by it in turn?”). If one were interested in being an effective altruist in the sense of delivering the greatest number of benefits to others, pushing is definitely the way to go, under the simple logic that five lives saved is better than one life spared (assuming all lives have equal value). Our moral systems typically oppose this conclusion, however, suggesting that saving the lives of the five is impermissible if it means we need to kill the one. What is noteworthy about the Kurzban et al (2012) paper is that you can increase people’s willingness to push the one if the people in the dilemma (both being pushed and saved) are kin.

Family always has your back in that way…

The reason for this increase in pushing when dealing with kin, rather than strangers, seems to have something to do with our altruistic systems that evolved for delivering benefits to close genetic relatives; what we call kin-selected mechanisms (mammary glands being a prime example). This pattern of results from the footbridge dilemma suggests there is a distinction between our altruistic systems (that benefit others) and our moral ones; they function to do different things and, as it seems, our moral systems are not much better suited to dealing with the global altruism problem than empathetic ones. Indeed, one of the main features of our moral systems is nonconsequentialism: the idea that the moral value of an act depends on more than just the net consequences to others. If one is seeking to be an effective altruist, then, using the moral system to guide behavior seems to be a poor way to solve that problem because our moral system frequently focuses on behavior per se at the expense of its consequences. 

That’s not the only reason to be wary of the power of morality to solve effective altruism problems either. As I have argued elsewhere, our moral systems function to manage associations with others, most typically by strategically manipulating our side-taking behavior in conflicts (Marczyk, 2015). Provided this description of morality’s adaptive function is close to accurate, the metaphorical goal of the moral system is to generate and maintain partial social relationships. These partial relationships, by their very nature, oppose the goals of effective altruism, which are decidedly impartial in scope. The reasoning of effective altruism might, for instance, suggest that it would be better for parents to spend their money not on their child’s college tuition, but rather on relieving dehydration in a population across the world. Such a conclusion would conflict not only with the outputs of our kin-selected altruistic systems, but also with other aspects of our moral systems. As some of my own forthcoming research finds, people do not appear to perceive much of a moral obligation for strangers to direct altruism towards other strangers, but they do perceive something of an obligation for friends and family to help each other (specifically when threatened by outside harm). Our moral obligations towards existing associates make us worse effective altruists (and, in Bloom’s sense of the word, morally worse people in turn).

While Bloom does mention that no one wants to live in that kind of strictly utilitarian world – one in which the welfare of strangers is treated equally to the welfare of friends and kin – he does seem to be advocating we attempt something close to it when he writes:

Our best hope for the future is not to get people to think of all humanity as family—that’s impossible. It lies, instead, in an appreciation of the fact that, even if we don’t empathize with distant strangers, their lives have the same value as the lives of those we love.

Appreciation of the fact that the lives of others have value is decidedly not the same thing as behaving as if they have the same value as the ones we love. Like most everyone else in the world, I want my friends and family to value my welfare above the welfare of others; substantially so, in fact. There are obvious adaptive benefits to such relationships, such as knowing that I will be taken care of in times of need. By contrast, if others showed no particular care for my welfare, but rather just sought to relieve as much suffering as they could wherever it existed in the world, there would be no benefit to my retaining them as associates; they would provide me with assistance or they wouldn’t, regardless of the energy I spent (or didn’t) maintaining a social relationship with them. Asking the moral system to be a general-purpose altruism device is unlikely to be much more successful than asking my car to be an efficient oven, asking people to treat others the world over as if they were kin, or asking you to empathize with 1,000 people. It represents an incomplete view as to the functions of our moral psychology. While morality might be impartial with respect to behavior, it is unlikely to be impartial with regard to the social value of others (which is why, also in my forthcoming research, I find that stealing to defend against an outside agent of harm is rated as more morally acceptable than doing so to buy recreational drugs).

“You have just as much value to me as anyone else; even people who aren’t alive yet”

To top this discussion off, it is also worth mentioning those pesky, unintended consequences that sometimes accompany even the best of intentions. By relieving deaths from dehydration, malaria, and starvation today, you might be ensuring greater harm in future generations in the form of increasing the rate of climate change, species extinction, and habitat destruction brought about by sustaining larger global human populations. Assuming for the moment that were true, would that mean that feeding starving people and keeping them alive today would be morally wrong? Both options – withholding altruism when it could be provided and ensuring harm for future generations – might get the moral stamp of disapproval, depending on the reference group (from the perspective of future generations dealing with global warming, it’s bad to feed; from the perspective of the starving people, it’s bad not to feed). This is why a slight majority of participants in Kurzban et al (2012) reported that pushing and not pushing can both be morally unacceptable courses of action. If we are relying on our moral sense to guide our behavior in this instance, then, we are unlikely to be very successful in our altruistic endeavors.

References: Kurzban, R., DeScioli, P., & Fein, D. (2012). Hamilton vs. Kant: Pitting adaptations for altruism against adaptations for moral judgment. Evolution & Human Behavior, 33, 323-333.

Marczyk, J. (2015). Moral alliance strategies theory. Evolutionary Psychological Science, 1, 77-90.

Examining Some Limited Data On Open Relationships

Thanks to Facebook, the topic of non-monogamous relationships has been crossing my screen with some regularity lately. One of the first instances involved the topic of cuckoldry: cases in which a man’s committed female partner will have sex with, and become pregnant by, another man, often while the man in the relationship is fully aware of the situation; perhaps he’s even watching. The article discussing the matter came from Playboy which, at one point, suggested that cuckoldry porn is the second most common type of porn sought out in online searches; a statement that struck me as rather strange. While I was debating discussing that point – specifically because it doesn’t seem to be true (not only does cuckold porn, or related terms, not hold the number 2 slot in PornHub’s data searches, it doesn’t even crack the top 10 or 20 searches in any area of the world) – I decided it wasn’t worth a full-length feature, in no small part because I have no way of figuring out how such data was collected barring purchasing a book.

“To put our findings in context, please light $30 on fire”

The topic for today is not cuckoldry per se, but it is somewhat adjacent to the matter: open relationships and polyamory. Though the specifics of these relationships vary from couple to couple, the general arrangements being considered are relationships that are consensually non-monogamous, permitting one or more of the members to engage in sexual relationships with individuals outside of the usual dyad pair, at least in some contexts. Such relationships are indeed curious, as a quick framing of the issue in a nonhuman example would show. Imagine, for instance, that a researcher in the field observed a pair-bonded dyad of penguins. Every now and again, the resident male would allow – perhaps even encourage – his partner to go out and mate with another male. While such an arrangement might have its benefits for the female – such as securing paternity from a male of higher status than her mate – it would seem to be a behavior that is quite costly from the male’s perspective. The example can just as easily be flipped with regard to sex: a female that permitted her partner to go off and mate with/invest in the offspring of another female would seem to be suffering a cost, relative to a female that retained such benefits for herself. Within this nonhuman example, I suspect no one would be proposing that the penguins benefit from such an arrangement by removing pressure from themselves to spend time with their partners, or by allowing the other to do things they don’t want to do, like go out dancing. While humans are not penguins, discussing the behavior in the context of other animals can remove some of the less-useful explanations for it that get floated by people (in this case, people might quickly understand that couples can spend time apart and do different things without needing to have sex with other partners).

The very real costs of such non-monogamous behavior can be seen in the form of psychological mechanisms governing sexual jealousy in men and women. If such behavior did not reliably carry costs for the other partner, mechanisms for sexual jealousy would not be expected to exist (and, in fact, they may well not exist in other species where associations between parents end following copulation). The expectation of monogamy seems to be the key factor separating pair-bonds from other social associations – such as friendship and kinship – and when that expectation is broken in the form of infidelity, it often leads to the dissolution of the bond. Given that theoretical foundation, what are we to make of open relationships? Why do they exist? How stable are they, compared to monogamous relationships? Is it a lifestyle that just anyone might adopt successfully? At the outset, it’s worth noting that there doesn’t seem to exist a wealth of good empirical data on the matter, making it hard to answer such questions definitively. There are, however, two papers discussing the topic that I wanted to examine today, as a start towards making some progress on those fronts.

The first study (Rubin & Adams, 1986) examined marital stability in monogamous and open relationships over a five-year period from 1978-1983 (though precisely how open these relationships were is unknown). Their total sample was unfortunately small, beginning with 41 demographically-matched couples per group and ending with 34 sexually-open couples and 39 monogamous ones (the authors refer to this as an “embarrassingly small” number). As for the attrition, two of the non-monogamous couples couldn’t be located and five of the couples had suffered a death, compared with one missing and one death in the monogamous group. Why so many deaths appeared to be concentrated in the open group is not mentioned but, as the average age of the sample at follow-up was about 46 and the ages of the participants ranged from 20 to 80, it is possible that age-related factors were responsible.

Concerning the stability of these relationships over those five years, the monogamous group reported a separation rate of 18%, while 32% of those in the open relationships reported no longer being together with their primary partner. Though this difference was not statistically significant, those in open relationships were nominally almost twice as likely to have broken up with their primary partner. Again, the sample size here is small, so interpreting those numbers is not a straightforward task. That said, Rubin & Adams (1986) also mention that both monogamous and open couples reported similar levels of jealousy and happiness in those relationships, regardless of whether they broke up or stayed together.
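For readers wondering how a near-doubling of the separation rate can still fail to reach significance, a quick back-of-the-envelope test makes the point. The counts below are my own reconstruction from the reported percentages (roughly 7 of 39 monogamous couples and 11 of 34 open couples separating), not figures taken from the paper:

```python
# A rough two-proportion z-test on counts back-calculated from the reported percentages.
# This is my own illustration, not an analysis from Rubin & Adams (1986).
import math

mono_sep, mono_n = 7, 39    # ~18% separation among monogamous couples
open_sep, open_n = 11, 34   # ~32% separation among open couples

p1, p2 = mono_sep / mono_n, open_sep / open_n
pooled = (mono_sep + open_sep) / (mono_n + open_n)
se = math.sqrt(pooled * (1 - pooled) * (1 / mono_n + 1 / open_n))
z = (p2 - p1) / se

# two-sided p-value from the normal approximation
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
print(round(z, 2), round(p_value, 2))  # roughly z = 1.4, p = 0.15 – not significant
```

With samples this small, even a difference of that apparent size is well within the range of chance.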

However, there’s the matter of representativeness….

It’s difficult to determine how many couples we ought to have expected to have broken up during that time period, however. This study was conducted during the early 80s, and that time period apparently marked a high-point in US divorce frequency. That might put the separation figures in some different context, though it’s not easy to say what that context is: perhaps the monogamous/open couples were unusually likely to have stayed together/broken up, relative to the population they were drawn from. On top of being small, then, the sample might also fail to represent the general population. The authors insinuate as much, noting that they were using an opportunity sample for their research. Worth noting, for instance, is that about 90% of their subjects held a college degree, which is exceedingly high even by today’s standards (about 35% of contemporary US citizens do); a full half of them even had MAs, and 20% had PhDs (11% and 2% today). As such, getting a sense for the demographics of the broader polyamorous community – and how well they match the general population – might provide some hints (but not strong conclusions) as to whether such a lifestyle would work well for just anyone. 

Thankfully, a larger data set containing some demographics from polyamorous individuals does exist. Approximately 1,100 polyamorous people from English-speaking countries were recruited by Mitchell et al (2014) via hundreds of online sources. For inclusion, the participants needed to be at least 19 years old, currently involved in two or more relationships, and have partners that did not participate in the survey (so as to make the results independent of each other). Again, roughly 70% of their sample held an undergraduate degree or higher, suggesting that the more sexually-open lifestyle appears to disproportionately attract the well-educated (that, or their recruitment procedure was only capturing individuals very selectively). However, another piece of the demographic information from that study sticks out: reported sexual orientations. The males in Mitchell et al (2014) reported a heterosexual orientation about 60% of the time, whereas the females reported a heterosexual orientation a mere 20% of the time. The numbers for other orientations (male/female) were similarly striking: bisexual or pansexual (28%/68%), homosexual (3%/4%), or other (7%/9%).

There are two very remarkable things about that finding: first, the demographics from the polyamorous group are divergent – wildly so – from the general population. In terms of heterosexuality, general populations tend to report such an orientation about 97-99% of the time. To find, then, that heterosexual orientations dropped to about 60% in men and 20% in women represents a rather enormous gulf. Now it is possible that those reporting their orientation in the polyamorous sample were not being entirely truthful – perhaps by exaggerating – but I have no good reason to assume that is the case, nor would I be able to accurately estimate by how much those reports might be driven by social desirability concerns, assuming they are at all. That point aside, however, the second remarkable thing about this finding is that Mitchell et al (2014) don’t seem to even notice how strange it is, failing to make mention of that difference at all. Perhaps that’s a factor of it not really being the main thrust of their analysis, but I certainly find that piece of information worthy of deeper consideration. If your sample has a much greater degree of education and incidence of non-heterosexuality than is usual, that fact shouldn’t be overlooked.
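To put a rough number on just how large that gulf is, here is a simple check; the subsample size is hypothetical (the male/female split isn’t reported above), and the ~98% base rate is just the midpoint of the figures I cited:

```python
# Back-of-the-envelope check (mine, not from Mitchell et al., 2014): how surprising
# is a sample that is only ~60% heterosexual if the population base rate is ~98%?
from scipy.stats import binomtest

n = 300                    # hypothetical number of male respondents
observed = int(0.60 * n)   # ~60% reporting a heterosexual orientation
result = binomtest(observed, n, p=0.98, alternative="less")
print(result.pvalue)       # effectively zero – far too large a gap to be sampling error alone
```

Whatever the explanation, then, the sample is clearly not a random draw from the general population.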

Their most common major was in gettin’ down

In general, from this limited peek into the less-monogamous relationships and individuals in the world, the soundest conclusion one might be able to draw is that those who engage in such relationships are likely different from those who do not in some important regards; we can see that in the form of educational attainment and sexual orientation in the present data sets, and it’s likely that other, unaccounted-for differences exist as well. What those differences might or might not be, I can’t rightly say at the moment. Nevertheless, this non-representativeness could well explain why polyamorists and monogamists have such difficulty seeing eye-to-eye on the issue of exclusivity. However, sexual topics tend to receive quite a bit of moralization in all directions, and this can impede good scientific progress in understanding the issue. If, for instance, one is seeking to make polyamory appear more normative, important psychological differences between groups might be overlooked (or not asked about/reported in the first place) in the interests of building acceptance; if one views such relationships as something to be discouraged, one’s interpretation of the results will likely follow suit as well.

References: Mitchell, M., Bartholomew, K., & Cobb, R. (2014). Need fulfillment in polyamorous relationships. Journal of Sex Research, 51, 329-339.

Rubin, A. & Adams, J. (1986). Outcomes of sexually open marriages. The Journal of Sex Research, 22, 311-319.

Punishment Might Signal Trustworthiness, But Maybe…

As one well-known saying attributed to Maslow goes, “when all you have is a hammer, everything looks like a nail.” If you can only do one thing, you will often apply that thing as a solution to a problem it doesn’t fit particularly well. For example, while a hammer might make for a poor cooking utensil in many cases, if you are tasked with cooking a meal and given only a hammer, you might try to make the best of a bad situation, using the hammer as an inefficient, makeshift knife, spoon, and spatula. That you might meet with some degree of success in doing so does not tell you that hammers function as cooking implements. Relatedly, if I then gave you a hammer and a knife, and tasked you with the same cooking jobs, I would likely observe that hammer use drops precipitously while knife use increases quite a bit. It is also worth bearing in mind that if the only task you have to do is cooking, the only conclusion I’m realistically capable of drawing concerns whether a tool is designed for cooking. That is, if I give you a hammer and a knife and tell you to cook something, I won’t be able to draw the inference that hammers are designed for dealing with nails, because nails just aren’t present in the task.

Unless one eats nails for breakfast, that is

While all that probably sounds pretty obvious in the cooking context, a very similar setup appears to have been used recently to study whether third-party punishment (the punishment of actors by people not directly affected by their behavior; hereafter TPP) functions to signal the trustworthiness of the punisher. In their study, Jordan et al (2016) had participants play a two-stage economic game. The first stage was a TPP game. In this game, there are three players: player A is the helper and is given 30 cents; player B is the recipient and is given nothing; player C is the punisher and is given 20 cents. The helper can choose to either give the recipient 15 cents or nothing. If the helper decides to give nothing, the punisher then has the option to pay 5 cents to reduce the helper’s pay by 15 cents, or not do so. In this first stage, the first participant would either play one round as a helper or a punisher, or play two rounds: one in the role of the helper and another in the role of the punisher.

The second stage of this game involved a second participant. This participant observed the behavior of the people playing the first game, and then played a trust game with the first participant. In this trust game, the second participant is given 30 cents and decides how much, if any, to send to the first participant. Any amount sent is tripled, and then the first participant decides how much of that amount, if any, to send back. The working hypothesis of Jordan et al (2016) is that TPP will be used as a signal of trustworthiness, but only when it is the only possible signal; when participants have an option to send better signals of trustworthiness – such as when they are in the role of the helper, rather than the punisher – punishment will lose its value as a signal for trust. By contrast, helping should always serve as a good signal of trustworthiness, regardless of whether punishment is an option.
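For anyone who wants the payoff structure laid out explicitly, here is a minimal sketch of the two stages as just described, with all amounts in cents. The function and variable names are my own shorthand, not anything taken from Jordan et al (2016):

```python
# A minimal sketch of the two-stage game described above (amounts in cents).
# Names are my own; only the payoffs follow the description in the text.

def tpp_stage(helper_gives: bool, punisher_punishes: bool):
    """Stage 1: the third-party punishment (TPP) game."""
    helper, recipient, punisher = 30, 0, 20
    if helper_gives:
        helper -= 15                 # the helper passes 15 cents...
        recipient += 15              # ...to the recipient
    elif punisher_punishes:          # punishment is only an option after non-giving
        punisher -= 5                # the punisher pays 5 cents...
        helper -= 15                 # ...to remove 15 cents from the helper
    return helper, recipient, punisher

def trust_stage(sent: int, returned: int):
    """Stage 2: the trust game between the observer (sender) and the first participant."""
    assert 0 <= sent <= 30 and 0 <= returned <= sent * 3
    sender = 30 - sent + returned            # keeps the remainder plus whatever comes back
    first_participant = sent * 3 - returned  # receives the tripled amount, minus the return
    return sender, first_participant

# Example: a non-giving helper gets punished, then a sender trusts 20 cents and gets 30 back.
print(tpp_stage(helper_gives=False, punisher_punishes=True))  # (15, 0, 15)
print(trust_stage(sent=20, returned=30))                      # (40, 30)
```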

Indeed, this is precisely what they found. When the first participant was only able to punish, the second participant tended to trust punishers more, sending them 16% more in the trust game than non-punishers; in turn, the punishers also tended to be slightly more trustworthy, sending back 8% more than non-punishers. So, the punishers were slightly, though not substantially, more trustworthy than the non-punishers when punishing was all they could do. However, when participants were in the helper role (and not the punisher role), those who transferred money to the recipient were in turn trusted more – being sent an average of 39% more in the trust game than non-helpers – and were, in fact, more trustworthy – returning an average of 25% more than non-helpers. Finally, when the first participant was in the role of both the punisher and the helper, punishment was less common (30% of participants in both roles punished, whereas 41% of participants who were only punishers did) and, controlling for helping, punishers were only trusted with 4% more in the second stage and actually returned 0.3% less.

The final task was less about trust and more about upper-body strength

To sum up, then, when people only had the option to punish others, punishment behavior was used by observers as a cue to trustworthiness. However, when helping was possible as well, punishment ceased to predict trustworthiness. From this set of findings, the authors make the rather strange conclusion that “clear support” was found for their model of punishment as signaling trustworthiness. My enthusiasm for that interpretation is a bit more tepid. To understand why, we can return to my initial example: you have given people a tool (a hammer/punishment) and a task (cooking/a trust game). When they use this tool in the task, you see some results, but they aren’t terribly efficient (16% more trusted and 8% more returned). Then, you give them a second tool (a knife/helping) to solve the same task. Now the results are much better (39% more trusted, 25% more returned). In fact, when they have both tools, they don’t seem to use the first one to accomplish the task as much (punishment falls 11%) and, when they do, they don’t end up with better outcomes (4% more trusted, 0.3% less returned). From that data alone, I would say that the evidence does not support the inference that punishment is a mechanism for signaling trustworthiness. People might try using it in a pinch, but its value seems greatly diminished compared to other behaviors.  

Further, the only tasks people were engaged in involved playing dictator and trust games. If punishment serves some other purpose beyond signaling trustworthiness, you wouldn’t be able to observe it there, because people aren’t in the right contexts for it to be observed. To make that point clear, we could consider other examples. First, let’s consider murder. If I condemn murder morally and, as a third party, punish someone for engaging in murder, does this tell you that I am more trustworthy than someone else who doesn’t punish it themselves? Probably not; almost everyone condemns murder, at least in the abstract, but the costs of engaging in punishment aren’t the same for all people. Someone who is just as trustworthy might not be willing or able to suffer the associated costs. What about something a bit more controversial: let’s say that, as a third party, I punish people for obtaining or providing abortions. Does hearing about my punishment make me seem like a more trustworthy person? That probably depends on what side of the abortion issue you fall on.

To put this in more precise detail, here’s what I think is going on: the second participant – the one sending money in the trust game, so let’s call him the sender – primarily wants to get as much money back as possible in this context. Accordingly, they are looking for cues that the first participant – the one they’re trusting, or the recipient – is an altruist. One good cue for altruism is, well, altruism. If the sender sees that the recipient has behaved altruistically by giving someone else money, this is a pretty good cue for future altruism. Punishment, however, is not the same thing as altruism. From the point of view of the person benefiting from the punishment, TPP is indeed altruistic; from the point of view of the target of that TPP, the punishment is spiteful. While punishment can contain this altruistic component, it is more about trading off the welfare of others, rather than providing benefits to people per se. While that altruistic component of punishment can be used as a cue for trustworthiness in a pinch when no other information is available, that does not suggest to me that sending such a signal is its only, or even its primary, function.

Sure, they can clean the floors, but that’s not really why I hired them

In the real world, people’s behaviors are not ever limited to just the punishment of perpetrators. If there are almost always better ways to signal one’s trustworthiness, then TPP’s role in that regard is likely quite small. For what it’s worth, I happen to think that the role of TPP has more to do with using transient states of need to manage associations (friendships) with others, as such an explanation works well outside the narrow boundaries of the present paper, when things other than unfairness are being punished and people are seeking to do more than make as much money as possible. Finding a good friend is not the same thing as finding a good altruist, and friendships do not usually resemble trust games. However, when all you are observing is unfairness and cooperation, TPP might end up looking a little bit like a mechanism for building trust. Sometimes. If you sort of squint a bit.

References: Jordan, J., Hoffman, M., Bloom, P., & Rand, D. (2016). Third-party punishment as a costly signal of trustworthiness. Nature, 530, 473-476.