Getting To Know Your Outliers: More About Video Games

As I mentioned in my last post, I’m a big fan of games. For the last couple of years, the game which has held the majority of my attention has been a digital card game. In this game, people have the ability to design decks with different strategies, and the success of your strategy will depend on your opponent’s strategy; you can think of it as a more complicated version of rock-paper-scissors. The players in this game are often interested in understanding how well certain strategies match up against others, so some have taken it upon themselves to collect data from players to answer those questions. You don’t need to know much about the game to understand the example I’m about to discuss, but let’s just consider two decks: deck A and deck B. Those collecting the data managed to aggregate the outcome of approximately 2,200 matches between the two and found that, overall, deck A was favored to win the match 55% of the time. Given the large sample size, this should be some pretty convincing data when it comes to getting a sense for how things generally work out.
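As a rough sanity check on that “large sample size” intuition, here is a quick back-of-the-envelope calculation; the match count and win rate come from the post, but the confidence-interval math is my own addition:

```python
import math

# Figures from the post: ~2,200 recorded matches, deck A winning 55% of them.
games = 2200
win_rate = 0.55

# Standard error of a binomial proportion, and an approximate 95% interval.
se = math.sqrt(win_rate * (1 - win_rate) / games)
low, high = win_rate - 1.96 * se, win_rate + 1.96 * se
print(f"95% CI for deck A's win rate: {low:.3f} to {high:.3f}")
```

With that many games the interval only spans roughly 53% to 57%, so ordinary sampling noise can’t plausibly turn a 55% favorite into an even match.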

Only about 466 more games to Legend with that win rate

However, this data will only be as useful to us as our ability to correctly interpret it. A 55% success rate captures the average performance, but there is at least one well-known outlier in that match: a player who manages to consistently perform at a substantially higher level than average, winning that same match up around 70-90% of the time across large sample sizes. What are we to make of that particular data point? How should it affect our interpretation of the match? One possible interpretation is that his massively positive success rate is simply due to variance and that, given enough games, his win rate should be expected to drop. It hasn’t yet, as far as I know. Another possible explanation is that this player is particularly good relative to his opponents, and that factor of general skill explains the difference. In much the same way, an absolutely weak 15-year-old might look pretty strong if you put him in a boxing match against a young child. However, the way the game is set up, you can be assured that he will be matched against people of (relatively) equal skill, and that difference shouldn’t account for such a large disparity.

A third interpretation – one which I find more appealing, given my deep experience with the game – is that skill matters, but in a different way. Specifically, deck A is more difficult to play correctly than deck B: it’s easier to make meaningful mistakes with it, and you usually have a greater number of options available to you. As such, if you give two players of average skill decks A and B, you might observe the 55% win rate initially cited. On the other hand, if you give an expert player both decks (one who understands that match as well as possible), you might see something closer to the 80% figure. Expertise matters for one deck a lot more than the other. Depending on how you want to interpret the data, then, you’ll end up with two conclusions that are quite different: either the match is almost even, or the match is heavily lopsided. I bring this example up because it can tell us something very important about outliers: data points that are, in some way, quite unusual. Sometimes these data points are flukes, worth disregarding if we want to learn about how relationships in the world tend to work; other times, however, these outliers can provide valuable and novel insights that re-contextualize the way we look at vast swaths of other data points. It all hinges on why that point is an outlier in the first place.

This point bears on some reactions I received to my last post, which discussed a fairly new study finding no relationship between violent content in video games and subsequent measures of aggression once you account for the difficulty of a game (or, perhaps more precisely, the ability of a game to impede people’s feelings of competence). Glossing the results into a single sentence, the general finding is that the frustration induced by a game, but not violent content per se, is a predictor of short-term changes in aggression (the gaming community tends to agree with such a conclusion, for whatever that’s worth). In conducting this research, the authors hoped to address what they perceived to be a shortcoming in the literature: many previous studies had participants play either violent or non-violent games, but they usually achieved this manipulation by having them play entirely different games. This means that while violent content did vary between conditions, so too could have a number of other factors, and the presence of those other factors poses some confounds in interpreting the data. Since more than violence varied, any subsequent changes in aggression are not necessarily attributable to violent content per se.

Other causes include being out $60 for a new controller

The study I wrote about, which found no effect of violence, stands in contrast to a somewhat older meta-analysis of the relationship between violent games and aggression. A meta-analysis – for those not in the know – is when a large number of studies are examined jointly to better estimate the size of some effect. As any individual study only provides us with a snapshot of information and could be unreliable, we should expect that a greater number of studies will provide us with a more accurate view of the world, just as running 50 participants through an experiment should give us a better sense than asking a single person or two. The results of some of those meta-analyses seem to settle on a pretty small relationship between violent video games and aggression/violence (approximately r = .15 to .20 for non-serious aggression, and about r = .04 for serious aggression, depending on who you ask and what you look at; Anderson et al., 2010; Ferguson & Kilburn, 2010; Bushman et al., 2010), but there have been concerns raised about publication bias and the use of non-standardized measures of aggression.
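For readers unfamiliar with the mechanics, here is a minimal sketch of how a fixed-effect meta-analysis pools correlations; the four (r, n) study pairs are invented purely for illustration and are not drawn from the actual literature:

```python
import math

# Hypothetical (correlation, sample size) pairs -- invented for illustration.
studies = [(0.22, 120), (0.10, 300), (0.18, 85), (0.15, 200)]

# Convert each r to Fisher's z, weight by inverse variance (n - 3 for
# Fisher-z values), then back-transform the pooled z into a correlation.
zs = [(0.5 * math.log((1 + r) / (1 - r)), n - 3) for r, n in studies]
pooled_z = sum(z * w for z, w in zs) / sum(w for _, w in zs)
pooled_r = math.tanh(pooled_z)
print(f"Pooled estimate: r = {pooled_r:.3f}")
```

The pooled estimate is just a precision-weighted average, which is why a meta-analysis inherits whatever confounds its component studies share: averaging many studies with the same design flaw averages the flaw right in.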

Further, even were there no publication bias to worry about, that does not mean the topic itself is being researched by people without biases, which can affect how data gets analyzed, research gets conducted, measures get created and interpreted, and so on. If r = .2 is about the best one can do with those degrees of freedom (in other words, assuming the people conducting such research are looking for the largest possible effect and develop their research accordingly), then it seems unlikely that this kind of effect is worth worrying too much about. As Ferguson & Kilburn (2010) note, youth violent crime rates have been steadily decreasing as the sales of violent games have been increasing (r = -.95; as well, the quality of that violence, not just the quantity, has improved over time: look at the violence in Doom over the years to get a better sense for that improvement). Now it’s true enough that the relationship between youth violent crime and violent video game sales is by no means a great examination of the relationship in question, but I do not doubt that if the relationship ran in the opposite direction (especially if it were as large), many of the same people who disregard it as unimportant would never leave it alone.

Again, however, we run into that issue where our data is only as good as our ability to interpret it. We want to know why the meta-analysis turned up a positive (albeit small) relationship whereas the single paper did not, despite multiple chances to find it. Perhaps the paper I wrote about was simply a statistical fluke: for whatever reason, the samples recruited for those studies didn’t end up showing the effect of violent content, but the effect is still real in general (perhaps it’s just too small to be reliably detected). That seems to be the conclusion contained in some responses I received. In fact, I had one commenter who cited the results of three different studies suggesting there was a causal link between violent content and aggression. However, when I dug up those studies and looked at the methods sections, what I found was that, as I mentioned before, all of them had participants play entirely different games between violent and non-violent conditions. This messes with your ability to interpret the data only in light of violent content, because you are varying more than just violence (even if unintentionally). On the other hand, the paper I mentioned in my last post had participants playing the same game between conditions, just with content (like difficulty or violence levels) manipulated. As far as I can tell, then, the methods of the paper I discussed last week were superior, since they were able to control more of the apparently-important factors.

This returns us to the card game example I raised initially: when people play a particular deck incorrectly, they find it is slightly favored to win; when someone plays it correctly they find it is massively favored. To turn that point to this analysis, when you conduct research that lacks the proper controls, you might find an effect; when you add those controls in, the effect vanishes. If one data point is an outlier because it reflects research done better than the others, you want to pay more attention to it. Now I’m not about to go digging through over 130 studies for the sake of a single post – I do have other things on my plate – but I wanted to make this point clear: if a meta-analysis contains 130 papers which all reflect the same basic confound, then looking at them together makes me no more convinced of their conclusion than looking at any of them alone (and given that the specific studies that were cited in response to my post all did contain that confound, I’ve seen no evidence inconsistent with that proposal yet). Repeating the same mistake a lot does not make it cease to be a mistake, and it doesn’t impress me concerning the weight of the evidence. The evidence acquired through weak methodologies is light indeed.  

Research: Making the same mistakes over and over again for similar results

So, in summation, you want to really get to know your data and understand why it looks the way it does before you draw much in the way of meaningful conclusions from it. A single outlier can potentially tell you more about what you want to know than lots of worse data points (in fact, poorly-interpreted data might not even be recognized as such until contrary evidence rears its head). This isn’t always the case, but to write off any particular data point because it doesn’t conform to the rest of the average pattern – or to assume its value is equal to that of other points – isn’t always right either. Getting to know your data, your methods, and your measures is quite important for getting a sense of how to interpret it all.

For instance, it has been proposed that – sure – the relationship between violent game content and aggression is small at best (there seems to be some heated debate over whether it’s closer to r = .1 or .2), but it could still be important because lots of small effects can add up over time into a big one. In other words, maybe you ought to be really wary of that guy who has been playing a violent game for an hour each night for the last three years. He could be about to snap at the slightest hint of a threat and harm you…at least to the extent that you’re afraid he might suggest you listen to loud noises or eat slightly more of something spicy: two methods used to assess “physical” aggression in this literature due to ethical limitations (despite the fact that, “Naturally, children (and adults) wishing to be aggressive do not chase after their targets with jars of hot sauce or headphones with which to administer bursts of white noise”). That small, r = .2 correlation I referenced before concerns behavior like that in a lab setting where experimental demand characteristics are almost surely present, suggesting the effect on aggressive behavior in naturalistic settings is likely overstated.

Then again, in terms of meaningful impact, perhaps all those small effects weren’t really amounting to much. Indeed, the longitudinal research in this area seems to find the smallest effects (Anderson et al., 2010). To put that into what I think is a good example, imagine going to the gym. Listening to music helps many people work out, and the choice of music is relevant there. The type of music I would listen to when at the gym is not always the same kind I would listen to if I wanted to relax, or dance, or set a romantic mood. In fact, the music I listen to at the gym might even make me somewhat more aggressive in a manner of speaking (e.g., for an hour, aggressive thoughts might be more accessible to me while I listen than if I had no music), but that doesn’t actually lead to any observable, meaningful changes in my violent behavior while at the gym or once I leave. In that case, repeated exposure to this kind of aggressive music would not really make me any more aggressive in my day-to-day life over time.

Thankfully, these warnings managed to save people from dangerous music

That’s not to say that media has no impact on people whatsoever: I fully suspect that people watching a horror movie probably feel more afraid than they otherwise would; I also suspect someone who just watched an action movie might have some violent fantasies in their head. However, I also suspect such changes are rather specific and of a short duration: watching that horror movie might increase someone’s fear of being eaten by zombies or ability to be startled, but not their fear of dying from the flu or their probability of being scared next week; that action movie might make someone think about attacking an enemy military base in the jungle with two machine guns, but it probably won’t increase their interest in kicking a puppy for fun, or lead to them fighting with their boss next month. These effects might push some feelings around in the very short term, but they’re not going to have lasting and general effects. As I said at the beginning of last week, things like violence are strategic acts, and it doesn’t seem plausible that violent media (like, say, comic books) will make them any more advisable.

References: Anderson, C. et al. (2010). Violent video game effects on aggression, empathy, and prosocial behavior in eastern and western countries: A meta-analytic review. Psychological Bulletin, 136, 151-173.

Bushman, B., Rothstein, H., & Anderson, C. (2010). Much ado about something: Violent video game effects and a school of red herring: Reply to Ferguson & Kilburn (2010). Psychological Bulletin, 136, 182-187.

Elson, M. & Ferguson, C. (2013). Twenty-five years of research on violence in digital games and aggression: Empirical evidence, perspectives, and a debate gone astray. European Psychologist, 19, 33-46.

Ferguson, C. & Kilburn, J. (2010). Much ado about nothing: The misestimation and overinterpretation of violent video game effects in eastern and western nations: Comment on Anderson et al (2010). Psychological Bulletin, 136, 174-178.

Count The Hits; Not The Misses

At various points in our lives, we have all read or been told anecdotes about how someone turned a bit of their life around. Some of these (or at least variations of them) likely sound familiar: “I cut out bread from my diet and all of a sudden felt so much better”; “Amy made a fortune working from home selling diet pills online”; “After the doctors couldn’t figure out what was wrong with me, I started drinking this tea and my infection suddenly cleared up”. The whole point of such stories is to try and draw a causal link, in these cases: (1) eating bread makes you feel sick, (2) selling diet pills is a good way to make money, and (3) tea is useful for combating infections. Some or all of these statements may well be true, but the real problem with these stories is the paucity of data upon which they are based. If you want to be more certain about those statements, you need more information. Sure, you might have felt better after drinking that tea, but what about the other 10 people who drank similar tea and saw no results? How about all the other people selling diet pills who were in the financial hole from day one and never crawled out of it because it’s actually a scam? If you want to get closer to understanding the truth value of those statements, you need to consider the data as a whole: both stories of success and stories of failure. However, stories of someone not getting rich from selling diet pills aren’t quite as moving, and so don’t see the light of day; at least not initially. This facet of anecdotes was made light of by The Onion several years ago (and Clickhole had their own take more recently).
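The selection effect at work in those anecdotes can be sketched in a few lines of code; the 5% success rate for the diet-pill business below is an invented figure, used only to show how reporting hits without misses distorts the picture:

```python
import random

random.seed(42)  # fixed seed so the toy numbers are reproducible

# Suppose only 5% of people who try selling diet pills succeed (invented rate).
trials = 10_000
outcomes = [random.random() < 0.05 for _ in range(trials)]
true_rate = sum(outcomes) / trials

# If only the success stories get told, every story you hear is a hit.
stories_told = [o for o in outcomes if o]
told_rate = sum(stories_told) / len(stories_told)
print(f"True success rate: {true_rate:.1%}; rate among told stories: {told_rate:.0%}")
```

Judged only by the stories in circulation, the business looks like a sure thing, even though the underlying success rate is tiny.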

“At first he failed, but with some positive thinking he continued to fail over and over again”

These anecdotes often try and throw the spotlight on successful cases (hits) while ignoring the unsuccessful ones (misses), resulting in a biased picture of how things will work out. They don’t get us much closer to the truth. Most people who create and consume psychology research would like to think that psychologists go beyond these kinds of anecdotes and generate useful insights into how the mind works, but there have been a lot of concerns raised lately about precisely how much further they go on average, largely owing to the results of the reproducibility project. There have been numerous issues raised about the way psychology research is conducted: either in the form of advocacy for particular political and social positions (which distorts experimental designs and statistical interpretations) or the selective ways in which data is manipulated or reported to draw attention to successful data without acknowledging failed predictions. The result has been quite a number of false positives, and overstated real ones, cropping up in the literature.

While these concerns are warranted, it is difficult to quantify the extent of the problems. After all, very few researchers are going to come out and say they manipulated their experiments or data to find the results they wanted because (a) it would only hurt their careers and (b) in some cases, they aren’t even aware that they’re doing it, or that what they’re doing is wrong. Further, because most psychological research isn’t preregistered and null findings aren’t usually published, figuring out what researchers hoped to find (but did not) becomes a difficult undertaking just by reading the literature. Thankfully, a new paper from Franco et al (2016) brings some data to bear on the matter of how much underreporting is going on. While this data will not be the final word on the subject by any means (largely owing to their small sample size), they do provide some of the first steps in the right direction.

Franco et al (2016) report on a group of psychology experiments whose questionnaires and data were made publicly available. Specifically, these come from the Time-sharing Experiments for the Social Sciences (TESS), an NSF program in which online experiments are embedded in nationally-representative population surveys. Those researchers making use of TESS face strict limits on the number of questions they can ask, we are told, meaning that we ought to expect they would restrict their questions to the most theoretically-meaningful ones. In other words, we can be fairly confident that the researchers had some specific predictions they hoped to test for each experimental condition and outcome measure, and that these predictions were made in advance of actually getting the data. Franco et al (2016) were then able to track the TESS studies through to the eventual published versions of the papers to see what experimental manipulations and results were and were not reported. This provided the authors with a set of 32 semi-preregistered psychology experiments to examine for reporting biases.

A small sample I will recklessly generalize to all of psychology research

The first step was to compare the number of experimental conditions and outcome variables that were present in the TESS studies to the number that ultimately turned up in published manuscripts (i.e., are the authors reporting what they did and what they measured?). Overall, 41% of the TESS studies failed to report at least one of their experimental conditions; while there were an average of 2.5 experimental conditions in the studies, the published papers only mentioned an average of 1.8. In addition, 72% of the papers failed to report all their outcome variables; while there were an average of 15.4 outcome variables in the questionnaires, the published reports only mentioned 10.4. Taken together, only about 1-in-4 of the experiments reported all of what they did and what they measured. Unsurprisingly, this pattern extended to the size of the reported effects as well. In terms of statistical significance, the median reported p-value was significant (.02), while the median unreported p-value was not (.32); two-thirds of the reported tests were significant, while only one-fourth of the unreported tests were. Finally, published effect sizes were approximately twice as large as unreported ones.

Taken together, the pattern that emerged is that psychology research tends to underreport failed experimental manipulations, measures that didn’t pan out, and smaller effects. This should come as no surprise to almost anyone who has spent much time around psychology researchers or the researchers themselves who have tried to publish null findings (or, in fact, have tried to publish almost anything). Data is often messy and uncooperative, and people are less interested in reading about the things that didn’t work out (unless they’re placed in the proper contexts, where failures to find effects can actually be considered meaningful, such as when you’re trying to provide evidence against a theory). Nevertheless, the result of such selective reporting on what appears to be a fairly large scale is that the overall trustworthiness of reported psychology research dips ever lower, one false-positive at a time.

So what can be done about this issue? One suggestion that is often tossed around is the prospect that researchers should register their work in advance, making it clear what analyses they will be conducting and what predictions they have made. This was (sort of) the case in the present data, and Franco et al (2016) endorse this option. It allows people to assess research as more of a whole than just relying on the published accounts of it. While that’s a fine suggestion, it only goes so far toward improving the state of the literature. Specifically, it doesn’t really help the problem of journals not publishing null findings in the first place, nor does it necessarily prevent researchers from conducting post-hoc analyses of their data and turning up additional false positives. A more ambitious way of alleviating these problems that comes to mind would be to collectively change the way journals accept papers for publication. In this alternate system, researchers would submit an outline of their article to a journal before the research is conducted, making clear (a) what their manipulations will be, (b) what their outcome measures will be, and (c) what statistical analyses they will undertake. Then, and this is important, before either the researcher or the journal knows what the results will be, the decision will be made to publish the paper or not. This would allow null results to make their way into mainstream journals while also allowing the researchers to build up their own resumes if things don’t work out well. In essence, it removes some of the incentives for researchers to cheat statistically. The assessment of the journals would then be based not on whether interesting results emerged, but rather on whether a sufficiently important research question had been asked.

Which is good, considering how often real, strong results seem to show up

There are some downsides to that suggestion, however. For one, the plan would take some time to enact even if everyone was on board. Journals would need to accept a paper for publication weeks or months in advance of the paper itself actually being completed. This would pose some additional complications for journals inasmuch as researchers will occasionally fail to complete the research at all, fail to complete it in a timely manner, or submit sub-par papers not yet worthy of print, leaving possible publication gaps. Further, it will sometimes mean that an issue of a journal goes out without containing any major advancements to the field of psychological research (no one happened to find anything this time), which might negatively affect the impact factor of the journals in question. Indeed, that last part is probably the biggest impediment to making major overhauls to the publication system that’s currently in place: most psychology research probably won’t work out all that well, and that will probably mean fewer people ultimately interested in reading about and citing it. While it is possible, I suppose, that null findings would actually be cited at similar rates to positive ones, that remains to be seen, and in the absence of that information I don’t foresee journals being terribly interested in changing their policies and taking that risk.

References: Franco, A., Malhotra, N., & Simonovits, G. (2016). Underreporting in psychology experiments: Evidence from a study registry. Social Psychological & Personality Science, 7, 8-12.

Smart People Are Good At Being Dumb In Politics

While I do my best to keep politics out of my life – usually by selectively blocking people who engage in too much proselytizing via link spamming on social media – I will never truly be rid of it. I do my best to cull my exposure to politics, not because I am lazy and looking to stay uninformed about the issues, but rather because I don’t particularly trust most of the sources of information I receive to leave me better informed than when I began. Putting this idea in a simple phrase, people are biased. In these socially-contentious domains, we tend to look for evidence that supports our favored conclusions first, and only stop to evaluate it later, if we do at all. If I can’t trust the conclusions of such pieces to be accurate, I would rather not waste my time with them at all, as I’m not looking to impress a particular partisan group with my agreeable beliefs. Naturally, since I find myself uninterested in politics – perhaps even going so far as to say I’m biased against such matters – this should mean I am more likely to approve of research concluding that people engaged with political issues aren’t quite good at reaching empirically-correct conclusions. Speaking of which…

“Holy coincidences, Batman; let’s hit them with some knowledge!”

A recent paper by Kahan et al (2013) examined how people’s political beliefs affected their ability to reach empirically-sound conclusions in the face of relevant evidence. Specifically, the authors were testing two competing theories for explaining why people tended to get certain issues wrong. The first of these is referred to as the Science Comprehension Thesis (SCT), which proposes that people tend to get different answers to questions like, “Is global warming affected by human behavior?” or “Are GMOs safe to eat?” simply because they lack sufficient education on such topics or possess poor reasoning skills. Put in more blunt terms, we might (and frequently do) say that people get the answers to such questions wrong because they’re stupid or ignorant. The competing theory the authors propose is called the Identity-Protective Cognition Thesis (ICT) which suggests that these debates are driven more by people’s desire to not be ostracized by their in-group, effectively shutting off their ability to reach accurate conclusions. Again, putting this in more blunt terms, we might (and I did) say that people get the answers to such questions wrong because they’re biased. They have a conclusion they want to support first, and evidence is only useful inasmuch as it helps them do that.

Before getting to the matter of politics, though, let’s first consider skin cream. Sometimes people develop unpleasant rashes on their skin and, when that happens, people will create a variety of creams and lotions designed to help heal the rash and remove its associated discomfort. However, we want to know if these treatments actually work; after all, some rashes will go away on their own, and some rashes might even get worse following the treatment. So we do what any good scientist does: we conduct an experiment. Some people will use the cream while others will not, and we track who gets better and who gets worse. Imagine, then, that you are faced with the following results from your research: of the people who did use the skin cream, 223 of them got better, while 75 got worse; of the people who did not use the cream, 107 got better, while 21 got worse. From this, can we conclude that the skin cream works?

A little bit of division tells us that, among those who used the cream, about 3 people got better for each 1 who got worse; among those not using the cream, roughly 5 people got better for each 1 who got worse. Comparing the two ratios, we can conclude that the skin cream is not effective; if anything, it’s having precisely the opposite result. If you haven’t guessed by now, this is precisely the problem that Kahan et al (2013) posed to 1,111 US adults (though they also flipped the numbers between the conditions so that sometimes the treatment was effective). As it turns out, this problem is by no means easy for a lot of people to solve: only about half the sample was able to reach the correct conclusion. As one might expect, though, participants’ numeracy – their ability to use quantitative skills – did predict their ability to get the right answer: the highly-numerate participants got the answer right about 75% of the time; those in the low-to-moderate end of numeracy ability got it right only about 50% of the time.
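The arithmetic behind that comparison, using the exact counts from the skin-cream version of the problem, can be spelled out in a few lines:

```python
# Counts from the skin-cream problem described above.
treated_better, treated_worse = 223, 75
untreated_better, untreated_worse = 107, 21

# The key step participants tend to miss: compare improvement *ratios*,
# not raw counts (223 merely looks more impressive than 107).
treated_ratio = treated_better / treated_worse        # ~3 improved per 1 worsened
untreated_ratio = untreated_better / untreated_worse  # ~5 improved per 1 worsened
print(f"With cream {treated_ratio:.2f}:1, without cream {untreated_ratio:.2f}:1")
print("Cream effective?", treated_ratio > untreated_ratio)
```

The untreated group improved at a better rate, so the correct answer is that the cream did not help; getting there requires doing the division rather than eyeballing the raw counts.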

“I need it for a rash. That’s my story and I’m sticking to it”

Kahan et al (2013) then switched up the story. Instead of participants reading about a skin cream, they instead read about gun legislation that banned citizens from carrying handguns concealed in public; instead of looking at whether a rash went away, they examined whether crime in the cities that enacted such bans went up or down, relative to those cities that did not. Beyond the change in variables, all the numbers remained exactly the same. Participants were asked whether the gun ban was effective at reducing crime.  Again, people were not particularly good at solving this problem either – as we would expect – but an interesting result emerged: the most numerate subjects were now only solving the problem correctly 57% of the time, as compared with 75% in the skin-cream group. The change of topic seemed to make people’s ability to reason about these numbers quite a bit worse.

Breaking the data down by political affiliations made it clear what was going on. The more numerate subjects were, again, more likely to get the answer to the question correct, but only when it accorded with their political views. The most numerate liberal democrats, for instance, got the answer right when the data showed that concealed carry bans resulted in decreased crime; when crime increased, however, they were not appreciably better at reaching that conclusion relative to the less-numerate democrats. This pattern was reversed in the case of conservative republicans: when the concealed carry bans resulted in increased crime, the more numerate ones got the question right more often; when the ban resulted in decreased crime, performance plummeted.

More interestingly still, the gap in performance was greatest for the more-numerate subjects. Among the highly-numerate individuals, the difference in accuracy between cases in which the conclusion of the experiment did or did not support their view was about 45%; among the less-numerate ones, it was only 20%. Worth noting is that these differences did not appear when people were thinking about the non-partisan skin-cream issue. In essence, smart people were either not using their numeracy skills in cases where doing so meant drawing unpalatable political conclusions, or they were using them and subsequently discarding the “bad” results. This is an empirical validation of my complaints about people ignoring base rates when discussing Islamic terrorism. Highly-intelligent people will often get the answers to these questions wrong because of their partisan biases, not because of a lack of education. They ought to know better – indeed, they do know better – but that knowledge isn’t doing them much good when it comes to being right in cases where being right means alienating members of their social group.

That future generations will appreciate your accuracy is only a cold comfort

At the risk of repeating the point, numeracy seemed to increase political polarization, not reduce it. These abilities were being used more to metaphorically high-five in-group members than to be accurate. Kahan et al (2013) try to explain this effect in two ways, one of which I think is more plausible than the other. On the implausible front, the authors suggest that using these numeracy abilities is a taxing, high-effort activity that people try to avoid whenever possible. As such, people with numeracy ability would only engage in effortful reasoning when their initial beliefs were threatened by some portion of the data. I find this idea strange because I don’t think that – metabolically – these kinds of tasks are particularly costly or effortful. On the more plausible front, Kahan et al (2013) suggest that these conclusions have a certain kind of rationality behind them: if drawing an unpalatable conclusion would alienate important social relations that one depends on for one’s own well-being, then an immediate cost/benefit analysis can favor being wrong. If you are wrong about whether GMOs are harmful, the immediate effects on you are likely quite small (unless you’re starving); on the other hand, if your opinion about them puts off your friends, the immediate social effects are quite large.

In other words, I think people sometimes interpret data in incorrect ways to suit their social goals, but I don’t think they avoid interpreting it properly because doing so is difficult.

References: Kahan, D., Peters, E., Dawson, E., & Slovic, P. (2013). Motivated numeracy and enlightened self-government. Yale Law School, Public Law Working Paper No. 307.

Exaggerating With Statistics (About Rape)

“As a professional psychology researcher, it’s my job to lie to the participants in my experiments so I can lie to others with statistics using their data”. -On understanding the role of deception in psychology research

In my last post, I discussed the topic of fear: specifically, how social and political agendas can distort the way people reason about statistics. The probable function of such distortions is to convince other people to accept a conclusion which is not exactly well supported by the available evidence. While such behavior is not exactly lying – inasmuch as the people making these claims don’t necessarily know they’re engaged in such cognitive distortions – it is certainly on the spectrum of dishonesty, as they would (and do) reject such reasoning in other contexts. In the academic world, related kinds of statistical manipulations go by a few names, the one I like most being “researcher degrees of freedom”. The spirit of this idea refers to the problem of researchers selectively interpreting their data in a variety of ways until they find a result they want to publish, and then omitting all the ways their data did not work out or might otherwise be interpreted. On that note, here’s a scary statistic: 1-in-3 college men would rape a woman if they could get away with it. Fortunately (or unfortunately, depending on your perspective) the statistic is not at all what it seems.

“…But the researchers failed to adequately report their methods! Spooky!”

The paper in question (Edwards et al, 2014) seeks to understand the apparent mystery behind the following finding: when asked if they have ever raped anyone, most men will say “no”; when asked instead whether they have ever held someone down to coerce them into having sex, a greater percentage of men will indicate that they have. Women’s perceptions about the matter seem to follow suit. As I wrote when discussing the figure that 25% of college women will be raped:

The difference was so stark that roughly 75% of the participants that Koss had labeled as having experienced rape did not, themselves, consider the experience to be rape.

What strikes me as curious about these findings is not the discrepancy in responses; that much can likely be explained by positing that these questions are perceived by the participants to be asking about categorically different behaviors. After all, if they were actually perceived to be asking about the same thing, you would see greater agreement between the responses of both men and women across questions, which we do not. Instead, the curious part is that authors – like Edwards et al (2014) – continue to insist that all those participants must be wrong, writing, “…some men who rape do not seem to classify their behavior as such” (Jesse Singal expresses a similar view, writing: “At the end of the day, after all, the two groups are saying the exact same thing“). Rather than conclude there is something wrong with the questions being asked (such as, say, that they are capturing a portion of the population who would have rough, but consensual, sex), they instead conclude there is something wrong with everyone else (both men and women) answering them. This latter explanation strikes me as unlikely.

There’s already something of a bait-and-switch taking place, then, but this is far from the only methodological issue involved in deriving that scary-sounding 1-in-3 figure. Specifically, Edwards et al (2014) asked their 86 male participants to fill out part of the “attraction to sexual aggression” scale (Malamuth, 1989). On this scale, participants indicate, from 1 to 5, how likely they would be to engage in a variety of behaviors, with “1” corresponding to “not likely at all” and “5” corresponding to “very likely”. Included on the scale are two questions, one concerning whether the respondent would “rape” a woman, and another asking whether he would “force her to do something she did not want to do” in a sexual setting. The participants were asked about their likelihood of engaging in such behaviors “if nobody would ever know and there wouldn’t be any consequences”. Edwards et al (2014) report that, if such criteria were met, 31% of the men would force a woman to do something sexually, whereas only 13% would rape a woman.

If you’re perceptive, you might have noticed something strange already: that 1-in-3 figure cannot be straightforwardly derived from the sexual aggression scale, as the scale is a 5-point measure, whereas the 1-in-3 statistic is clearly dichotomous. This raises the question of how one translates the scale into a yes/no response format. Edwards et al (2014) do not explicitly mention how they managed such a feat, but I think the answer is clear from the labeling in one of their tables: “Any intention to rape a woman” (emphasis mine). What the researchers did, then, was code any response other than a “1” as an affirmative; the statistical equivalent of saying that 2 is closer to 5 than it is to 1. In other words, the question was, “Would you rape a woman if you could get away with it?”, and the answers were, effectively, “No, Yes, Yes, Yes, or Yes”. Making the matter even worse is that all participants answered both questions. This means they saw one question asking about “rape” and another asking about “forcing a woman to do something she didn’t want to”. As participants likely figured there was no reason the researchers would ask the same question twice, they had very good reason to think these questions referred to categorically different things. For the authors to then conflate the two questions after the fact as being identical is stunningly disingenuous.
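The dichotomization trick is easy to demonstrate. Here is a small sketch (the responses below are hypothetical, not the study’s data) of how the “any response above 1 counts as yes” coding inflates an affirmative rate relative to a more defensible cut at the scale midpoint:

```python
# Hypothetical 1-5 Likert responses ("not likely at all" ... "very likely").
responses = [1, 1, 1, 2, 3, 1, 2, 5, 1, 1]

# The coding described above: anything other than "1" counts as "yes".
any_intention = sum(r > 1 for r in responses) / len(responses)

# A more defensible cut: only responses above the scale midpoint of 3.
above_midpoint = sum(r > 3 for r in responses) / len(responses)

print(any_intention, above_midpoint)  # → 0.4 0.1
```

The same ten answers yield either “40% would!” or “10% would”, depending entirely on an unreported coding choice.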

“The problem isn’t me; it’s everyone else”

To put these figures in better context, we can consider the results reported by Malamuth (1989). In response to the “Would you rape if you wouldn’t get caught?” question, 74% of men indicated a “1” and 14% indicated a “2”, meaning a full 88% of them fell below the midpoint of the scale; by contrast, only 7% fell above the midpoint, with about 5% indicating a “4” and 2% indicating a “5”. Of course, reporting that “1-in-3 men would rape” if they could get away with it sounds much different than saying “less than 1-in-10 probably would”. The authors appear interested in deriving the most-damning interpretation of their data possible, however, as evidenced by their unreported and, in my mind, unjustifiable grouping of the responses. That fact alone should raise alarm bells as to whether the statistics they provide would do a good job of predicting reality.
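Running the two coding schemes over Malamuth’s reported distribution (the percentages quoted above; the 5% at “3” is inferred from the remainder) makes the gap concrete:

```python
# Percentage of men giving each response to the "would you rape if you
# wouldn't get caught?" item, per Malamuth (1989) as quoted above.
distribution = {1: 74, 2: 14, 3: 5, 4: 5, 5: 2}

# "Anything above 1 = yes" coding: roughly 1-in-4 men.
any_affirmative = sum(p for r, p in distribution.items() if r > 1)

# "Above the midpoint = yes" coding: less than 1-in-10.
above_midpoint = sum(p for r, p in distribution.items() if r > 3)

print(any_affirmative, above_midpoint)  # → 26 7
```

Same data, same question; the headline figure swings from about 26% to about 7% depending on where the line is drawn.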

But let’s go ahead and take these responses at face value anyway, even if we shouldn’t: somewhere between 10-30% of men would rape a woman if there were no consequences for doing so. How alarming should that figure be? On the first front, the hypothetical world of “no consequence” doesn’t exist. Some proportion of men who would be interested in doing such things are indeed restrained from doing so by the probability of being punished. Even within that hypothetical world of freedom from consequences, however, there are likely other problems to worry about, in that you will always find some percentage of the population willing to engage in anti-social behavior that harms others when there are no costs for doing so (in fact, the truly strange part is that lots of people indicate they would avoid such behaviors).

Starting off small, for instance, about 70% of men and women indicate that they would cheat on their committed partner if they wouldn’t get caught (and slightly over 50% have cheated in spite of those possible consequences). What about other acts, like stealing or murder? How many people might kill someone else if there would be no consequences for it? One informal poll I found placed that number around 40%; another puts it a little above 50% and, when broken up by sex, finds that 32% of women and a full 68% of men would. Just let those numbers sink in for a moment: comparing the figures for rape and murder, the men in Edwards et al (2014) were somewhere between 2 and 7 times less likely to say they would rape a woman than kill someone if they could, depending on how one interprets their answers. That’s a tremendous difference; one that might even suggest that rape is viewed as a less desirable activity than murder. Now that likely has quite a bit to do with some portion of that murder being viewed as defensive in nature, rather than exploitative, but it’s still some food for thought.

There are proportionately fewer defensive rapes than defensive stabbings…

This returns us nicely to the politics of fear. The last post addressed people purposefully downplaying the risks posed by terrorist attacks; in this case, we see people purposefully inflating reported propensities to rape. The 1-in-3 statistic is clearly crafted in the hopes of making an issue seem particularly threatening and large, as larger issues tend to have more altruism directed towards their solution. As there are social stakes in trying to make one’s problems seem especially threatening, however, such statistics should immediately make people skeptical, for the same reasons you shouldn’t let me tell you about how smart or nice I am. There is a very real risk in artificially puffing one’s statistics up: people might eventually come to distrust you by default, even on different topics entirely, and this should hold true especially if they belong to a group targeted by such misleading results. The outcome of such a process would be, rather than increased altruism and sympathy devoted to a real problem, apathy and hostility. Lessons learned from fables like The Boy Who Cried Wolf are as timely as ever, it would seem.

References: Edwards, S., Bradshaw, K., & Hinsz, V. (2014). Denying rape but endorsing forceful intercourse: Exploring differences among responders. Violence & Gender, 1, 188-193.

Malamuth, N. (1989). The attraction to sexual aggression scale: Part 1. The Journal of Sex Research, 26, 26-49.

An Eye For Talent

Rejection can be a painful process for almost anyone (unless you’re English). For many, rejection is what happens when a (perhaps overly-bloated) ego ends up facing the reality that it really isn’t as good as it likes to tell people it is. For others, rejection is what happens when the person in charge of making the decision doesn’t possess the accuracy of assessment that they think they do (or wish they did), and fails to recognize your genius. One of the most notable examples of the latter is The Beatles’ Decca audition in 1962, during which the band was told they had no future in show business. Well over 250 million certified sales later, “oops” kind of fails to cut it with respect to how large a blunder that decision was. This is by no means a phenomenon unique to The Beatles, either: plenty of notable celebrities were previously discouraged or rejected from their eventual profession by others. So we have a bit of error management going on here: record labels want to (a) avoid signing artists that are unlikely to go anywhere while (b) avoiding failures to sign the best-selling band of all time. As they can’t do either of those things with perfect accuracy, they’re bound to make some mistakes.

“Yet again, our talents have gone unnoticed despite our sick riffs”

Part of the problem facing companies that put out products such as albums, books, movies, and the rest, is that popularity can be a terribly finicky thing, since popularity can often snowball on itself. It’s not necessarily the objective properties of a song or book that make it popular; a healthy portion of popularity depends on who else likes it (which might sound circular, but it’s not). This tends to make the former problem of weeding out the bad artists easier than finding the superstars: in most cases, people who can’t sing well won’t sell, but just because one can sing well it doesn’t mean they’re going to be a hit. As we’re about to see, these problems are shared not only by people who put out products like music or movies; they’re also shared by people who publish (or fail to publish) scientific research. A recent paper by Siler, Lee, & Bero (2014) sought to examine how good the peer review process – the process through which journal editors and reviewers decide what gets published and what does not – is at catching good papers and filtering out bad ones.

The data examined by the authors focused on approximately 1,000 papers that had been submitted to three of the top medical journals between 2003 and 2004: Annals of Internal Medicine, British Medical Journal, and The Lancet. Of the 1,008 manuscripts, 946 – or about 94% of them – were rejected. The vast majority of those rejections – about 80% – were desk rejections, which is when an article is not sent out for review before the journal decides to not publish it. From that statistic alone, we can already see that these journals are getting way more submissions than they could conceivably publish or review and, accordingly, lots of people are going to be unhappy with their decision letters. Thankfully, publication isn’t a one-time effort; authors can, and frequently do, resubmit their papers to other journals for publication. In fact, 757 of the rejected papers were found to have been subsequently published in other journals (more might have been published after being modified substantially, which would make them more difficult to track). This allowed Siler, Lee, & Bero (2014) the opportunity to compare the articles that were accepted to those which were rejected in terms of their quality and importance.

Now determining an article’s importance is a rather subjective task, so the authors decided to focus instead on each paper’s citation count – how often other papers had referenced it – as of April 2014. While by no means a perfect metric, it’s certainly a reasonable one, as most citations tend to be positive in nature. First, let’s consider the rejected articles. Of the articles that had been desk rejected by one of the three major journals but eventually published in other outlets, the average citation count was 69.8 per article; somewhat lower than the articles which had been sent out for review before they were rejected (M = 94.65). This overstates the “average” difference by a bit, however, as citation counts are not normally distributed. In the academic world, some superstar papers receive hundreds or thousands of citations, whereas many others hardly receive any. To help account for this, the authors also examined the log-transformed number of citations. When they did so, the mean citation count was 3.44 for the desk-rejected papers and 3.92 for the reviewed-then-rejected ones. So that is some evidence that the editors who decide whether or not to send papers out for review work as advertised: the less popular papers (with popularity serving as our proxy for quality) were, on average, rejected more readily.
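The effect of the log transform on skewed counts is easy to see with made-up numbers (the citation counts below are hypothetical; log(x + 1) is one common variant of the transform, used here so that zero-citation papers wouldn’t break it):

```python
import math

# Hypothetical citation counts: one superstar paper dominates the mean.
citations = [2, 5, 8, 12, 20, 30, 45, 60, 150, 1200]

mean = sum(citations) / len(citations)
log_mean = sum(math.log(c + 1) for c in citations) / len(citations)

print(round(mean, 1))      # → 153.2 (dragged up by the single outlier)
print(round(log_mean, 2))  # → 3.42 (far less sensitive to that outlier)
```

The raw mean says the “typical” paper gets about 153 citations, which describes none of the papers; the log-scale mean is a much better summary of the bulk of the distribution, which is why the authors reported both.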

“I just don’t think they’re room for you on the team this season…”

There’s also evidence that, if a paper does get sent out to reviewers, the peer reviewers are able to assess its quality with some accuracy. When reviewers send their reviews back to the journal, they suggest that the paper be published as is, published with minor/major revisions, or rejected. If those suggestions are coded as numerical values, each paper’s mean reviewer score can be calculated (e.g., fewer recommendations to reject = better paper). As it turns out, these scores correlated weakly – but positively – with an article’s subsequent citation count (r = 0.28 and 0.21 with citation and logged citation counts, respectively), so it seems the reviewers have at least some grasp on a paper’s importance and quality as well. That said, the number of times an article was revised prior to acceptance had no noticeable effect on its citation count. While reviewers might be able to discern the good papers from the bad at better-than-chance rates, the revisions they suggested did not appear to have a noticeable impact on later popularity.

What about the lucky papers that managed to get accepted by these prestigious journals? As they had all gone out for peer review, the reviewer’s scores were again compared against citation count, revealing a similarly small but positive correlation (0.21 and 0.26 with citation and logged citation counts). Additionally, the published articles that did not receive any recommendations to reject from the reviewers received higher citation counts on average (162.8 and 4.72) relative to those with at least one recommendation to reject (115.24 and 4.33). Comparing these numbers to the citation counts of the rejected articles, we can see a rather larger difference: articles being accepted by the high-end journals tended to garner substantially more citations than the ones that were rejected, whether before or after peer review.

That said, there’s a complication present in all this: papers rejected from the most prestigious journals tend to subsequently get published in less-prestigious outlets, which fewer people read. As fewer eyes see papers published in those outlets, even good articles might receive less attention than they deserve. Indeed, the impact factor of the journal (the average citation count of the recent articles published in it) in which an article appeared correlated 0.54 with citation and 0.42 with logged citation counts. To help get around that issue, the authors compared the accepted papers to the rejected-then-published papers within journals with an impact factor of 8 or greater. When they did so, the authors found, interestingly, that the rejected articles were actually cited more than the accepted ones (212.77 vs 143.22 citations; 4.77 vs 4.53 logged citations). While such an analysis might bias the number of “mistaken” rejections upwards (as it doesn’t count the papers that were “correctly” bumped down into lower journals), it’s a worthwhile point to bear in mind. It suggests that, above a certain threshold of quality, the acceptance or rejection of a paper might reflect chance differences more than meaningful ones.

But what about the superstar papers? Of the 15 most-cited papers, 12 (80%) had been desk rejected. As the authors put it, “This finding suggests that in our case study, articles that would eventually become highly cited were roughly equally likely to be desk-rejected as a random submission“. Of the remaining three papers, two had been rejected after review (one of which had been rejected by two of the top three journals in question). While it was generally the case, then, that peer review appears to help weed out the “worst” papers, the process does not seem to be particularly good at recognizing the “best” work. Much like The Beatles’ Decca audition, then, rockstar papers are often not recognized as such immediately. Towards the end of the paper, the authors make reference to some other notable cases of important papers being rejected (one of which was rejected twice for being trivial and then a third time for being too novel).

“Your blindingly-obvious finding is just too novel”

It is worth bearing in mind that academic journals are looking to do more than just publish papers that will have the highest citation count down the line: sometimes good articles are rejected because they don’t fit the scope of the journal; others are rejected simply because the journals don’t have the space to publish them. When that happens, they thankfully tend to get published elsewhere relatively soon after; though “soon” can be a relative term for academics, it’s often within about half a year.

There are also cases where papers will be rejected because of some personal biases on the part of the reviewers, though, and those are the cases most people agree we want to avoid. It is then that the gatekeepers of scientific thought can do the most damage in hindering new and useful ideas because they find them personally unpalatable. If a particularly good idea ends up published in a particularly bad journal, so much the worse for the scientific community. Unfortunately, most of those biases remain hidden and hard to definitively demonstrate in any given instance, so I don’t know how much there is to do about reducing them. It’s a matter worth thinking about.

References: Siler, K., Lee, K., & Bero, L. (2014). Measuring the effectiveness of scientific gatekeeping. Proceedings of the National Academy of Sciences (US), DOI: 10.1073/pnas.1418218112

Sexed-Up Statistics – Female Genital Mutilation

“A lie can travel halfway around the world while the truth is putting on its shoes” – Mark Twain.

I had planned on finishing up another post today (which will likely be up tomorrow now) until a news story caught my eye this morning, changing my plans somewhat. The news story (found on Alternet) is titled, “Evidence shows that female genital cutting is a growing phenomenon in the US“. Yikes; that certainly sounds worrying. From that title and the subsequent article, the reader is likely to infer two things: (1) there is more female genital cutting in the US in recent years than there was in the past, and (2) some kind of evidence supports that claim. There were several facets of the article that struck me as suspect, however, most of which speak to the second point: I don’t think the author has the evidence required to substantiate their claims about FGC. Just to clear up a few initial points before moving forward with this analysis: no, I’m not trying to claim that FGC doesn’t occur at all in the US or on overseas trips from the US. Also, I personally oppose the practice in both the male and female varieties; cutting pieces off a non-consenting individual is, on my moral scale, a bad thing. My points here only concern accurate scholarship in reporting. They also raise the possibility that the problem may well be overstated – something which, I think, ought to be good news.

It means we can start with just the pitchforks; the torches aren’t required…yet.

So let’s look at the first major alarmist claim of the article: a report put out by the Sanctuary for Families claimed that approximately 200,000 women living in the US were at risk of genital cutting. That number sounds pretty troubling, but the latter part of the claim sounds a bit strange: what does “at risk” mean? I suppose, for instance, that I’m living “at risk” of being involved in a fatal car accident, just as everyone else who drives a car is. Saying that there are approximately 200,000,000 people in the US living at risk of a fatal car crash is useless on its own, though: it requires some qualification. So what’s the context behind the FGC number? The report itself references a 1997 paper by the CDC that estimated between 150,000 and 200,000 women in the US were at risk of being forced to undergo FGC (which we’ll return to later). Given that the reference for the claim is a paper by the CDC, it seems very peculiar that the Sanctuary for Families attaches a citation that instead directs you to another news site that just reiterates the claim.

This is peculiar for two reasons. First, it’s a useless reference. It would be a bit like my writing down on a sheet of paper, “I think FGC is on the rise” because I had read it somewhere, and then referencing the fact that I wrote that down when I say it again the next time. Without directing one to the initial source of the claim, it’s not a proper citation and doesn’t add any information. The second reason the reference is peculiar is that the 1997 CDC paper (or at least what I assume is the paper) is actually freely available online. It took me all of 15 seconds to find it through a Google search. While I’m not prepared to infer any sinister motivation on the part of the Sanctuary for Families for not citing the actual paper, it does, I think, speak to the quality of scholarship that went into drafting the report, and in a negative way. It makes one wonder whether they actually read the key report in the first place.

Thankfully, the CDC paper does finally provide us with the context as to how the estimated number was arrived at. The first point worth noting is that the estimate the paper delivers (168,000) is a reflection of people living in the US who had either already undergone the procedure before they moved here or who might undergo it in the future (but not necessarily within the US). The estimate is silent on when or where the procedure might have taken place: if it happened in another country years or decades ago, it would still be part of this estimate. In any case, the authors began with the 1990 US census data. On the census, respondents were asked about their country of origin and how long they had lived in the US. From that data, the authors cross-referenced the estimated rates of FGC in people’s home countries to estimate whether or not they were likely to have undergone the procedure. Further, the authors assumed throughout that immigrants were no different from the populations they came from with respect to the practice of FGC: if 50% of the population in a family’s country of origin practiced it, then 50% of immigrants from that country were expected to have practiced it or to do so in the future. In other words, the 168,000 number is an estimate, based on other estimates, based on an assumption.
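The estimation logic described above amounts to a weighted sum. A minimal sketch, where the country names, immigrant counts, and prevalence rates are all made up for illustration (the CDC’s actual inputs came from the census and country-level surveys):

```python
# Census-derived immigrant counts by (hypothetical) country of origin.
immigrants = {"Country A": 40_000, "Country B": 25_000, "Country C": 100_000}

# Estimated FGC prevalence in each country of origin, assumed (as the
# CDC paper did) to carry over unchanged to the immigrant population.
prevalence = {"Country A": 0.90, "Country B": 0.50, "Country C": 0.05}

at_risk = sum(immigrants[c] * prevalence[c] for c in immigrants)
print(int(at_risk))  # → 53500
```

Every term in that sum inherits the uncertainty of both the census counts and the prevalence estimates, plus the carry-over assumption, which is why calling the output “evidence” about FGC within the US is such a stretch.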

It’s an impressive number, but I worry about its foundation.

I would call this figure, well, a very-rough estimate, and not exactly solid evidence. Further, it’s an estimate of FGC in other countries; not in the US. The authors of the CDC paper were explicit about this point, writing, “No direct information is available on FGC in the United States”. It is curious, then, that the Sanctuary report and the Alternet article both reference the threat of FGC that girls in the US face while referencing the CDC estimate. For example, here’s how the Sanctuary report phrased the estimate:

In 1997, however, the Centers for Disease Control and Prevention (CDC) estimated that as many as 150,000 to 200,000 girls in the United States were at risk of being forced to undergo female genital mutilation.

See the important differences? The CDC estimate wasn’t one concerning people at risk of being forced to undergo the practice; it was an estimate of people who might undergo it and who might have already undergone it at some point in the past in some other country. Indeed, the CDC document could more accurately be considered an immigration report, rather than a paper on FGC itself. So, when the Sanctuary report and the Alternet article suggest that the number of women at risk for FGC is rising, what they appear to mean is that immigration from certain countries where the practice is more common is rising, but that doesn’t have quite the same emotional effect. Importantly, the level of risk isn’t ever qualified. Approximately 200,000,000 people are “at risk” of being involved in a fatal car crash; how many of them actually are involved in one? (About 40,000 a year, and on the decline.) So how many of the 168,000 women “at risk” for FGC have already undergone the procedure, how many might still be “at risk”, and how many of those “at risk” will end up actually undergoing it? Good evidence is missing on these points.

This kind of not-entirely-accurate reporting reminds me of a piece by Neuroskeptic on what he called “sexed-up statistics”. These are statistics presented or reported in such a way as to make some problem seem as bad as possible, most likely with the goal of furthering some social, political, or funding agenda (big problems attract money for their solution). It has come up before in the debate over the wage gap between men and women, and when considering the extent of rape among college-aged (and non-college-aged) women, to name just two prominent cases. This ought not be terribly surprising in light of the fact that the pursuit of dispassionate accuracy is likely not the function of human reasoning. The speed with which people can either accept or reject previously-unknown information (such as the rate of FGC in the US and whether it’s a growing problem) tells us that concerns for accuracy per se are not driving these decisions. This is probably why the initial quote by Mark Twain carries the intuitive appeal that it does.

“Everyone but me and the people I agree with are so easily fooled!”

FGC ought to be opposed, but it’s important not to let one’s opposition to it (or, for that matter, one’s opposition to or support for any other specific issue) get in the way of accurately considering and reporting on the evidence at hand (or at least doing the best one can in that regard). The evidence – and that term is used rather loosely here – presented certainly does not show that illegal FGC is a “growing phenomenon in the US”, as Jodie at Alternet suggests. How could the evidence even show it was a growing problem if one grants that determining the initial and current scope of the problem hasn’t been done and couldn’t even feasibly be done? As far as the “evidence” suggests, the problem could be on the rise, on the decline, or static. One of those options just happens to make for the “sexier” story; the story more capable of making its way halfway around the world in an instant.

How Hard Is Psychology?

The scientific method is a pretty useful tool for assisting people in testing hypotheses and discerning truth – or coming as close to it as one can. Like the famous Churchill quote about democracy, the scientific method is the worst system we have for doing so, except for all the others. That said, the scientists who use the method are often not doing so in the single-minded pursuit of truth. Perhaps phrased more aptly, testing hypotheses is generally not done for its own sake: people testing hypotheses are typically doing so for other reasons, such as raising their status and furthering their careers in the process. So, while the scientific method could be used to test any number of hypotheses, scientists tend to use it for certain ends and to test certain types of ideas: those perceived to be interesting, novel, or useful. I imagine that none of that is particularly groundbreaking information to most people: science in theory is different from science in practice. A curious question, then, is this: given that we ought to expect scientists from all fields to use the method for similar reasons, why are some topics to which the scientific method is applied viewed as “soft” and others as “hard” (like psychology and physics, respectively)?

Very clever, Chemistry, but you’ll never top Freud jokes.

One potential reason for this impression is that these non-truth-seeking (what some might consider questionable) uses to which people attempt to put the scientific method could simply be more prevalent in some fields than in others. The further one strays from science in theory toward science in practice, the softer one’s field might be seen as being. If, for instance, psychology were particularly prone to biases that compromise the quality or validity of its data, relative to other fields, then people would be justified in taking a more critical stance towards its findings. One such possible bias involves tending to report only the data consistent with one hypothesis or another. As the scientific method requires reporting data both consistent and inconsistent with one’s hypothesis, if only the former is being done, then the validity of the method can be compromised and you’re no longer doing “hard” science. A 2010 paper by Fanelli provides us with some reason to worry on that front. In that paper, Fanelli examined approximately 2,500 papers randomly drawn from various disciplines to determine the extent to which positive results (those which statistically support one or more of the hypotheses being tested) dominate the published literature. The Psychology/Psychiatry category sat at the top of the list, with 91.5% of all published papers reporting positive results.

While that number may seem high, it is important to put the figure into perspective: the field at the bottom of that list – the one which reported the fewest positive results overall – was the Space Sciences, with 70.2% of all the sampled published work reporting positive results. Other fields ran a relatively smooth line between the upper and lower limits, so the extent to which fields differ in the dominance of positive results is a matter of degree, not kind. Physics and Chemistry, for instance, both ran about 85% in terms of positive results, despite both being considered “harder” sciences than psychology. Now that the 91% figure might seem a little less worrying, let’s add some more context to reintroduce the concern: those percentages only consider whether any positive results were reported, so papers that tested multiple hypotheses tended to have a better chance of reporting something positive. It also happened that papers within psychology tended to test more hypotheses on average than papers in other fields. When correcting for that issue, positive results in psychology were approximately five times more likely than positive results in the space sciences. By comparison, positive results in physics and chemistry were only about two-and-a-half times more likely. How much cause for concern should this bring us?

There are two questions to consider before answering that last one: (1) what are the causes of these different rates of positive results, and (2) are these differences driving the perception among people that some sciences are “softer” than others? Taking these in order, there are still more reasons to worry about the prevalence of positive results in psychology: according to Fanelli, studies in psychology tend to have lower statistical power than studies in the physical sciences. Lower statistical power means that, all else being equal, psychological research should find smaller – not greater – percentages of positive results overall. If psychological studies tend not to be as statistically powerful, where else might the causes of the high proportion of positive results reside? One possibility is that psychologists are particularly likely to be predicting things that happen to be true. In other words, “predicting” things in psychology tends to be easy because hypotheses tend to be made only after a good deal of anecdata has been “collected” through personal experience (incidentally, personal experience is a not-uncommonly cited source of research hypotheses within psychology). Essentially, then, predictions in psychology are being made once a good deal of data is already in, at least informally, making them less predictions and more restatements of already-known facts.

“I predict that you would like a psychic reading, on the basis of you asking for one, just now.”

A related possibility is that psychologists might be more likely to engage in outright dishonest tactics, such as actually collecting their data formally first (rather than just informally), and then making up “predictions” that restate their data after the fact. In the event that publishers within different fields are more or less interested in positive results, we ought to expect researchers within those fields to attempt this kind of dishonesty on a greater scale (it should be noted, however, that the data is still the data, regardless of whether it was predicted ahead of time, so the effects on its truth-value ought to be minimal). Though greater amounts of outright dishonesty are a possibility, it is unclear why psychology would be particularly prone to this relative to any other field, so it might not be worth worrying too much about. Another possibility is that psychologists are particularly prone to using questionable statistical practices that tend to boost their false-positive rates substantially, an issue which I’ve discussed before.

Two issues stand out to me above all the others, though, and they might help to answer the second question – why psychology is viewed as “soft” and physics as “hard”. The first has to do with what Fanelli refers to as the distinction between the “core” and the “frontier” of a discipline. The core of a field of study represents the agreed-upon theories and concepts on which the field rests; the frontier, by contrast, is where most of the new research is being conducted and new concepts are being minted. Psychology, as it currently stands, is largely frontier-based. This lack of a core can be exemplified by a recent post concerning “101 great insights from psychology 101”. In the list, you’ll find the word “theory” used a collective three times, and two of those mentions concern Freud. If you consider the plural – “theories” – instead, you’ll find five novel uses of the term, four of which mention no specific theory. The extent to which the remaining two uses represent actual theories, as opposed to redescriptions of findings, is another matter entirely. If one is left with only a core-less frontier of research, that could well send the message that the people within the field don’t have a good handle on what it is they’re studying; thus the “soft” reputation.

The second issue involves the subject matter itself. The “soft” sciences – psychology and its variants (like sociology and economics) – seem to dabble in human affairs. This can be troublesome for more than one reason. A first reason might involve the fact that the other humans reading about psychological research are all intuitive psychologists, so to speak. We all have an interest in understanding the psychological factors that motivate other people in order to predict what they’re going to do. This seems to give many people the impression that psychology, as a field, doesn’t have much new information to offer them. If they can already “do” psychology without needing explicit instructions, they might come to view psychology as “soft” precisely because it’s perceived as being easy. I would also note that this suggestion ties neatly into the point about psychologists possibly tending to make many predictions based on personal experience and intuitions. If the findings they are delivering tend to give people the impression that “Why did you need research? I could have told you that”, that ease of inference might cause people to give psychology less credit as a science.

“We go to the moon because it is hard, making physics a real science”

The other standout reason why psychology might strike people as “soft” is that, on top of trying to understand other people’s psychological goings-on, we also try to manipulate them. It’s not just that we want to understand why people support or oppose gay marriage, for instance; it’s that we might also want to change their points of view. Accordingly, findings from psychology tend to speak more directly to issues people care a good deal about (like sex, drugs, and moral goals; most people don’t seem to argue over the latest implications of chemistry research), which might make people either (a) relatively resistant to the findings or (b) relatively accepting of them, contingent more on one’s personal views and less on the scientific quality of the work itself. This means that, in addition to many people having a reaction of “that is obvious” with respect to a good deal of psychological work, they also have the reaction of “that is obviously wrong”, neither of which makes psychology look terribly important.

It seems likely to me that many of these issues could be remedied by the addition of a core to psychology. If results need to fit into theory, various statistical manipulations might become somewhat easier to spot. If students were learning how to think about psychology, rather than to memorize lists of findings which they often feel are trivial or obviously wrong, they might come away with a better impression of the field. Now if only a core could be found…

References: Fanelli D (2010). “Positive” results increase down the Hierarchy of the Sciences. PloS one, 5 (4) PMID: 20383332

Do People “Really” Have Priors?

As of late, I’ve been dipping my toes ever deeper into the conceptual world of statistics. If one aspires towards understanding precisely what one is seeing when it comes to research in psychology, understanding statistics can go a long way. Unfortunately, the world of statistics is a contentious one, and the concepts involved in many of these discussions can be easily misinterpreted, so I’ve been attempting to be as cautious as possible in figuring the mess out. Most recently, I’ve been trying to decipher whether the hype over Bayesian methods is to be believed. There are some people who seem to feel that there’s a dividing line between Bayesian and Frequentist philosophies over which one must choose sides (Dienes, 2011), while others seem to suggest that such divisions are basically pointless and the field has moved beyond them (Gelman, 2008; Kass, 2011). One of the major points which has been bothering me about the Bayesian side of things is the conceptualization of a “prior” (though I feel such priors can easily be incorporated in Frequentist analyses as well, so this question applies to any statistician). Like many concepts in statistics, this one seems to be useful in certain situations and able to easily lead one astray in others. Today I’d like to consider a thought experiment dealing with the latter cases.

Thankfully, thought experiments are far cheaper than real ones

First, a quick overview of what a prior is and why they can be important. Here’s an example that I discussed previously:

say that you’re a doctor trying to treat an infection that has broken out among a specific population of people. You happen to know that 5% of the people in this population are actually infected and you’re trying to figure out who those people are so you can at least quarantine them. Luckily for you, you happen to have a device that can test for the presence of this infection. If you use this device to test an individual who actually has the disease, it will come back positive 95% of the time; if the individual does not have the disease, it will come back positive 5% of the time. Given that an individual has tested positive for the disease, what is the probability that they actually have it? The answer, unintuitive to most, is 50%.

In this example, your prior (bolded) is the percent of people who have the disease. The prior is, roughly, what beliefs or uncertainties you come to your data with. Bayesian analysis requires one to explicitly state one’s prior beliefs, regardless of what those priors are, as they will eventually play a role in determining your conclusions. Like in the example above, priors can be exceptionally useful when they’re known values.
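For anyone who wants to check the arithmetic behind that unintuitive 50%, here’s a minimal sketch of the calculation; the function name and structure are mine, but the numbers are the ones from the example above:

```python
def posterior(prevalence, sensitivity, false_positive_rate):
    """Bayes' theorem: P(infected | positive test)."""
    # Total probability of a positive test: true positives plus false positives.
    p_positive = (prevalence * sensitivity
                  + (1 - prevalence) * false_positive_rate)
    return prevalence * sensitivity / p_positive

# 5% prevalence (the prior), 95% sensitivity, 5% false-positive rate:
print(posterior(0.05, 0.95, 0.05))  # 0.5: only half the positives are real
```

Notice how sensitive the answer is to the prior: run the same test in a population where half the people are infected and the same positive result becomes about 95% trustworthy.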

In the world of research it’s not always (or even generally) the case that priors are objectively known: in fact, they’re basically what we’re trying to figure out in the first place. More specifically, people are actually trying to derive posteriors (prior beliefs that have been revised by the data), but one man’s posteriors are another man’s priors, and the line between the two is more or less artificial. In the previous example, the 5% prevalence in the population was taken as a given; if you didn’t know that value and only had the results of your 95%-effective test, figuring out how many of your positives were likely false positives and, conversely, how many of your negatives were likely false negatives, would be impossible (except by luck). If the prevalence of the disease in the population is very low, you’ll have many false positives; if the prevalence is very high, you’ll likely have many false negatives. Accordingly, what prior beliefs you bring to your results will have a substantial effect on how they’re interpreted.

This is a fairly common point discussed when it comes to Bayesian analysis: the frequent subjectivity of priors. Your belief about whether a disease is common or not doesn’t change its actual prevalence; just how you will eventually look at your data. This means that researchers with the same data can reach radically different conclusions on the basis of different priors. So, if one is given free rein over which priors to use, this could allow confirmation bias to run wild and a lot of disagreeable data to be all but disregarded. As this is a fairly common point in the debate over Bayesian statistics, there’s already been a lot of ink (virtual and actual) spilled over it, so I don’t want to continue on with it.

There is, however, another issue concerning priors that, to the best of my knowledge, has not been thoroughly addressed. That question is: to what extent can we consider people to have prior beliefs in the first place? Clearly, we feel that some things are more likely than others: I think it’s more likely that I won’t win the lottery than that I will. No doubt you could immediately provide a list of things you think are more or less probable than others with ease. That these feelings can be so intuitive and automatically generated helps to mask an underlying problem with them: strictly speaking, it seems we ought to either not update our priors at all or not say that we “really” have any. A shocking assertion, no doubt (and maybe a bit hyperbolic), but I want to explore it and see where it takes us.

Whether it’s to a new world or to our deaths, I’ll still be famous for it.

We can begin to explore this intuition with another thought experiment involving flipping a coin, which will be our stand-in for a random-outcome generator. This coin is slightly biased in a way that results in 60% of the flips coming up heads and the remaining 40% coming up tails. The first researcher has his entire belief centered 100% on the coin being 60% biased towards heads and, since there is no belief left to assign, thinks that all other states of bias are impossible. Rather than having a distribution of beliefs, this researcher has a single point. This first researcher will never update his belief about the bias of the coin no matter what outcomes he observes; he’s certain the coin is biased in a particular way. Because he just so happens to be right about the bias, he can’t get any better, and this lack of updating is a good thing (if you’re looking to make accurate predictions, that is).

Now let’s consider a second researcher. This researcher comes to the coin with a different set of priors: he thinks the coin is likely fair, say 50% certain, and then distributes the rest of his belief equally between two additional potential biases (say, 25% sure that the coin is 60% biased towards heads and 25% sure that the coin is similarly biased towards tails). The precise distribution of these beliefs doesn’t matter terribly; it could come in the form of two points or an infinite number of them. All that matters is that, because this researcher’s belief is distributed in such a way that it doesn’t lie on a single point, it is capable of being updated by the data from the coin flips. Researcher two, like a good Bayesian, will then update his priors to posteriors on the basis of the observed flips, then turn those posteriors into new priors, and continue on updating for as long as he’s getting new data.
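Researcher two’s procedure can be sketched in a few lines of code. This is just an illustration under the assumptions above (a three-point prior: 50% on a fair coin, 25% on each direction of 60/40 bias); the function is mine, not anything from the literature:

```python
def update(priors, heads):
    """One Bayesian update of a discrete prior over hypothesized coin biases.

    priors maps each hypothesized P(heads) to the belief assigned to it.
    """
    # Likelihood of the observed flip under each hypothesis, times the prior...
    unnormalized = {p: belief * (p if heads else 1 - p)
                    for p, belief in priors.items()}
    # ...renormalized so the beliefs again sum to 1.
    total = sum(unnormalized.values())
    return {p: w / total for p, w in unnormalized.items()}

# Researcher two's starting beliefs:
beliefs = {0.4: 0.25, 0.5: 0.50, 0.6: 0.25}

# A run of flips from the actual 60%-heads coin (True = heads) shifts
# belief towards the 0.6 hypothesis and away from the 0.4 one.
for flip in [True, True, False, True, True, False, True]:
    beliefs = update(beliefs, flip)
```

Researcher one’s point prior is the degenerate case: with all belief on 0.6, the update returns the same distribution no matter what is flipped.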

On the surface, then, the major difference between the two is that researcher one refuses to update his priors while researcher two is willing to do so. This implies something rather interesting about the latter: researcher two has some degree of uncertainty about his priors. After all, if he were already sure he had the right priors, he wouldn’t update, since he would think he could do no better in terms of predictive accuracy. If researcher two is uncertain about his priors, then, shouldn’t that degree of uncertainty similarly be reflected somehow?

For instance, one could say that researcher two is 90% certain that he got the correct priors and 10% certain that he did not. That would represent his priors about his priors. He would presumably need to have some prior belief about the distribution he initially chose, as he was selecting from an infinite number of other possible distributions. His prior about his priors, however, must have its own set of priors as well. One can quickly see that this leads to an infinite regress: at some point, researcher two will basically have to admit complete uncertainty about his priors (or at least uncertainty about how they ought to be updated, as how one updates one’s priors depends upon the priors one is using, and there are an infinite number of possible distributions of priors), or admit complete certainty in them. If researcher two ends up admitting to complete uncertainty, this will give him a flat set of priors that ought to be updated very little (he will be able to rule out 100% bias towards heads or tails, contingent on observing either a tails or a heads, but not much beyond that). On the other hand, if researcher two ends up stating one of his priors with 100% certainty, the rest of the priors ought to collapse into 100% certainty as well, resulting in an unwillingness to update.

Then again, math has never been my specialty. I’m about 70% sure it isn’t, and about 30% sure of that estimate…

It is not immediately apparent how we can reconcile these two stances with each other. On the one hand, researcher one has a prior that cannot be updated; on the other, researcher two has a potentially infinite number of priors with almost no idea how to update them. While we certainly could say that researcher one has a prior, he would have no need for Bayesian analysis. Given that people seem to have prior beliefs about things (like how likely some candidate is to win an election), and these beliefs seem to be updated from time to time (once most of the votes have been tallied), this suggests that something about the above analysis might be wrong. It’s just difficult to place precisely what that thing is.

One way of ducking the dilemma might be to suggest that, at any given point in time, people are 100% certain of their priors, but that the point they’re certain about changes over time. Such a stance, however, suggests that priors aren’t updated so much as they simply change, and I’m not sure that such semantics can save us here. Another suggestion that was offered to me is that we could just forget the whole thing, as priors don’t need to themselves have priors. A prior is a belief distribution about probability, and probability is not a “real” thing (that is, the biased coin doesn’t come up 60% heads and 40% tails per flip; each result will either be a heads or a tails). For what it’s worth, I don’t think such a suggestion helps us out. It would essentially seem to be saying that, out of the infinite number of beliefs one could start with, any subset of those beliefs is as good as any other, even if they lead to mutually exclusive or contradictory results, and that we can’t think about why some of them are better than others. Though my prior on people having priors might have been high, my posteriors about them aren’t looking so hot at the moment.

References: Dienes, Z. (2011). Bayesian Versus Orthodox Statistics: Which Side Are You On? Perspectives on Psychological Science, 6 (3), 274-290 DOI: 10.1177/1745691611406920

Gelman, A. (2008). Rejoinder. Bayesian Analysis, 3, 467-478.

Kass, R. (2011). Statistical inference: The big picture. Statistical Science, 26, 1-9.

Statistical Issues In Psychology And What Not To Do About Them

As I’ve discussed previously, there are a number of theoretical and practical issues that plague psychological research in terms of statistical testing. On the theoretical end of things, if you collect enough subjects, you’re all but guaranteed to find some statistically significant result, no matter how small or unimportant it might be. On the practical end of things, even if researchers are given a random set of data, they can end up finding a statistically significant (though not actually significant) result more often than not by exercising certain “researcher degrees of freedom”. These degrees of freedom can take many forms, from breaking the data down into different sections, such as by sex, or by high, medium, and low values of the variable of interest, to peeking at the data ahead of time and using that information to decide when to stop collecting subjects, among other methods. At the heart of many of these practical issues is the idea that the more statistical tests you run, the better your chances of finding something significant. Even if the false-positive rate for any one test is low, with enough tests, the chance of a false-positive result rises dramatically. For instance, running 20 tests with an alpha of 0.05 on random data would result in at least one false positive around 64% of the time.
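That 64% figure just comes from compounding the per-test alpha across independent tests; a quick sketch of the arithmetic:

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

# A single test at alpha = 0.05 has a 5% false-positive rate, but across
# 20 independent tests on random data the chance of at least one rises to:
print(round(familywise_error_rate(0.05, 20), 2))  # 0.64
```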

“Hey every body, we got one; call off the data analysis and write it up!”

In attempts to banish false positives from the published literature, some have advocated the use of what are known as Bonferroni corrections. The logic here seems simple enough: the more tests you run, the greater the likelihood that you’ll find something by chance, so, to better avoid fluke results, you raise the evidentiary bar for each statistical test you run (or, more precisely, lower your alpha level). So, if you were to run the same 20 tests on random data as before, you can maintain an experiment-wide false-positive rate of 5% (instead of 64%) by adjusting your per-test alpha to 0.25% (instead of 5%). The correction, then, makes each test you do more conservative as a function of the total number of tests you run. Problem solved, right? Well, no; not exactly. According to Perneger (1998), these corrections not only fail to solve the initial problem we were interested in, but also create a series of new problems that we’re better off avoiding.
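The correction itself is nothing more than a division; a sketch of how it holds the experiment-wide rate near 5%:

```python
# Bonferroni: divide the desired experiment-wide alpha by the number of tests.
alpha, n_tests = 0.05, 20
per_test_alpha = alpha / n_tests
print(per_test_alpha)  # 0.0025, i.e. each test is now run at 0.25%

# The chance of at least one false positive across all 20 tests stays near 5%:
familywise = 1 - (1 - per_test_alpha) ** n_tests
print(round(familywise, 3))  # 0.049
```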

Taking these two issues in order, the first is that the Bonferroni correction will only serve to keep the experiment-wide false-positive rate a constant. While it might do a fine job at that, people very rarely care about that number. That is, we don’t care about whether there is a false-positive finding; we care about whether a specific finding is a false positive, and these two values are far from the same thing. To understand why, let’s return to our researcher who was running 20 independent hypothesis tests. Let’s say that, hypothetically, out of those 20 tests, 4 come back as significant at the 0.05 level. Now we know that the probability of making at least one type 1 error (false-positives) is 64%; what we don’t know is (a) whether any of our positive results are false-positives or, assuming at least one of them is, (b) which result(s) that happens to be. The most viable solution to this problem, in my mind, is not to raise the evidentiary bar across all tests, threatening to make all the results insignificant on account of the fact that one of them might just be a fluke.

There are two major reasons for not doing this: the first is that it will dramatically boost our type 2 error rate (failing to find an effect when one actually exists) and, even though this error rate is not the one that many conservative statisticians are predominantly interested in, these are still errors all the same. Even more worryingly, though, it doesn’t seem to make much sense to deem a result significant or not contingent on what other results you were examining. Consider two experimenters: one collects data on three variables of interest from the same group of subjects, while a second researcher collects data on those same three variables, but from three different groups. Both researchers are thus running three hypothesis tests, but they’re running them either together or separately. If the two researchers were using a Bonferroni correction contingent on the number of tests they ran per experiment, the results might be significant in the latter case but not in the former, even though the two researchers got identical sets of results. This lack of consistency in terms of which results get counted as “real” will only add to the confusion in the psychological literature.

“My results would have been significant, if it wasn’t for those other meddling tests!”

The full scale of the last issue might not have been captured by the two-researcher example, so let’s consider another, single-researcher example. Here, a researcher is giving a test to a group of subjects with the same 20 variables of interest, looking for differences between men and women. Among these variables, there is one hypothesis that we’ll call a “real” hypothesis: women will be shorter than men. The other 19 variables are being used to test “fake” hypotheses: things like whether men or women have a preference for drinking out of blue cups or whether they prefer green pens. A Bonferroni correction would, essentially, treat the results of the “fake” hypotheses as equally likely to generate a false positive as the “real” hypothesis. In other words, Bonferroni corrections are theory-independent. Given that some differences between groups are more likely to be real than others, applying a uniform correction to all those tests seems to miss the mark.

To build on that point, as I initially mentioned, any difference between groups, no matter how small, could be considered statistically significant if your sample size is large enough due to the way that significance is calculated; this is one of the major theoretical criticisms of null hypothesis testing. Conversely, however, any difference, no matter how large, could be considered statistically insignificant if you run enough additional irrelevant tests and apply a Bonferroni correction. Granted, in many cases that might require a vast number of additional tests, but the precise number of tests is not the point. The point is that, on a theoretical level, the correction doesn’t make much sense.

While some might claim that the Bonferroni correction guards against researchers making excessive, unwarranted claims, there are better ways of guarding against this issue. As Perneger (1998) suggests, if researchers simply describe what they did (“we ran 40 tests and 3 were significant, but just barely”), that can generally be enough to help readers figure out whether the results were likely the chance outcomes of a fishing expedition. The issue with this potential safeguard is that it would require researchers to accurately report all their failed manipulations as well as their successful ones, which, for their own good, many don’t seem to do. One guard that Perneger (1998) does not explicitly mention, which can get around that reporting issue, is the importance of theory in interpreting the results. As most psychological literature currently stands, results are simply redescribed, rather than explained. In this world of observations standing in for explanations and theory, there is little way to separate out the meaningful significant results from the meaningless ones, especially when publication bias generally keeps the failed experiments from making it into print.

What failures-to-replicate are you talking about?

So long as people continue to be impressed by statistically significant results, even when those results cannot be adequately explained or placed into some larger theoretical context, these statistical problems will persist. Applying statistical corrections will not solve, or likely even stem, the research issues in the way psychological research is currently conducted. Even if such corrections were honestly and consistently applied, they would likely only change the way psychological research is conducted, with researchers turning to altogether less-efficient means in order to compensate for the reduced power (running one hypothesis per experiment, for instance). Rather than demanding a higher standard of evidence for fishing expeditions, one might instead focus on reducing the prevalence of these fishing expeditions in the first place.

References: Perneger TV (1998). What’s wrong with Bonferroni adjustments? BMJ (Clinical research ed.), 316 (7139), 1236-8 PMID: 9553006

Are Associations Attitudes?

If there’s one phrase that people discussing the results of experiments have heard more than any other, a good candidate might be “correlation does not equal causation”. Correlations can often get mistaken for (at least implying) causation, especially if the results are congenial to a preferred conclusion or interpretation. This is a relatively uncontroversial matter which has been discussed to death, so there’s little need to continue on with it. There is, however, a related reasoning error people also tend to make with regard to correlation; one that is less discussed than the former. This mistake is to assume that a lack of correlation (or a very low one) means no causation. Here are two reasons one might find no correlation despite an underlying relationship: in the first case, no correlation could result from something as simple as there being no linear relationship between the two variables. As correlations only measure linear relationships, a relationship shaped like a bell curve (an inverted U) would tend to yield a correlation equal to zero.
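To make the first case concrete, here’s a minimal sketch: y is completely determined by x, yet because the relationship is an inverted U rather than a line, the Pearson correlation between them is exactly zero.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's r, which only picks up the linear component of a relationship."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs)
                      * sum((y - my) ** 2 for y in ys))

xs = [-2, -1, 0, 1, 2]
ys = [-(x ** 2) for x in xs]  # perfect inverted-U: y is a function of x
print(pearson(xs, ys))  # 0.0: total dependence, zero linear correlation
```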

For the second case, consider the following example: event A causes event B, but only in the absence of variable C. If variable C randomly varies (it’s present half the time and absent the other half), [EDIT: H/T Jeff Goldberg] you might end up with no correlation, or at least a very reduced one, despite direct causation. This example becomes immediately more understandable if you relabel “A” as heterosexual intercourse, “B” as pregnancy, and “C” as contraceptives (ovulation works too, provided you also replace “absence” with “presence”). That said, even if contraceptives aren’t in the picture, the correlation between sexual intercourse and pregnancy is still pretty low.
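The moderator case can be simulated just as easily. In this sketch (a toy setup of my own, with invented probabilities), B occurs exactly when A occurs and C is absent; within the C-absent half the correlation is perfect, but pooling across both halves dilutes it considerably:

```python
import numpy as np

# A causes B, but only when moderator C is absent; C is present half the time.
rng = np.random.default_rng(1)
n = 10_000
a = rng.integers(0, 2, n)  # event A occurs or not
c = rng.integers(0, 2, n)  # moderator C present or not
b = a * (1 - c)            # B happens only when A occurs and C is absent

pooled = np.corrcoef(a, b)[0, 1]                    # correlation ignoring C
stratified = np.corrcoef(a[c == 0], b[c == 0])[0, 1]  # correlation when C is absent
print(round(pooled, 2), round(stratified, 2))
```

When C is absent the correlation is perfect, but the pooled figure is substantially reduced; with a weaker A-to-B link or a more common C, it could shrink toward zero, which is the scenario described above.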

And just in case you find that correlation reaching significance, there’s always this.

So why all this talk about correlation and causation? Two reasons: first, this is my website and I find the matter pretty neat. More importantly, though, I’d like to discuss the IAT (implicit association test) today; specifically, I’d like to address the matter of how well the racial IAT correlates (or rather, fails to correlate) with other measures of racial prejudice, and how we ought to interpret that result. While I have touched on this test very briefly before, it was in the context of discussing modularity, not dissecting the test itself. Since the IAT has recently crossed my academic path on more than one occasion, I feel it’s time for a more complete engagement with it. I’ll start by discussing what the IAT is, what many people seem to think it measures, and finally what I feel it actually assesses.

The IAT was introduced by Greenwald et al. in 1998. As its name suggests, the test was ostensibly designed to do something it would appear to do fairly well: measure the relative strengths of initial, automatic cognitive associations between two concepts. If you’d like to see how this test works firsthand, feel free to follow the link above, but, just in case you don’t feel like going through the hassle, here’s the basic design (using the race version of the test): subjects are asked to respond as quickly as possible to a number of stimuli. In the first phase, subjects view pictures of black and white faces flashed on the screen and are asked to press one key if the face is black and another if it’s white. In the second phase, subjects do the same task, but this time they press one key if the word that flashes on the screen is positive and another if it’s negative. Finally, these two tasks are combined, with subjects asked to press one key if the face is white or the word is positive, and another key if the face is black or the word is negative (these conditions then flip). Differences in reaction times on this test are taken as measures of implicit cognitive associations. So, if you’re faster to categorize black faces with positive words, you’re said to have a more positive association toward black people.
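For illustration only, here’s a toy sketch of the logic behind the measure: the quantity of interest is just a difference in average response latency between the two combined blocks. All the numbers below are made up, and the real IAT scoring procedure involves additional steps (error handling, standardization, and so on):

```python
import statistics

# Hypothetical reaction times in milliseconds for the two combined blocks.
congruent = [620, 580, 640, 600, 610]    # e.g., white+positive / black+negative pairing
incongruent = [720, 690, 750, 700, 710]  # the pairings flipped

# A positive differential means faster responses in the "congruent" block,
# which is what gets interpreted as a stronger white/positive association.
diff = statistics.mean(incongruent) - statistics.mean(congruent)
print(diff)
```

Everything downstream in this post concerns how to interpret that differential, not whether it can be computed; the arithmetic itself is trivial.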

Having demonstrated that many people show a stronger association between white faces and positive concepts, the natural question arises of how to interpret these results. Unfortunately, many psychological researchers and laypeople alike have taken an unwarranted conceptual leap: they assume that these differential association strengths imply implicit racist attitudes. This assumption happens to meet with an unfortunate snag, however: these implicit associations tend to have very weak to no correlations with explicit measures of racial prejudice (even if those measures themselves, like the Modern Racism Scale, are of questionable validity to begin with). Indeed, as reviewed by Arkes & Tetlock (2004), whereas the vast majority of undergraduates tested manifest exceedingly low levels of “modern racism”, almost all of them display a stronger association between white faces and positivity. Faced with this lack of correlation, many people have gone on to make a second assumption to account for it: that the implicit measure taps some “truer” prejudiced attitude that the explicit measures are less able to tease out. I can’t help but wonder, though, what those same people would have had to say if positive correlations had turned up…

“Correlations or no, there’s literally no data that could possibly prove us wrong”

Arkes & Tetlock (2004) put forth three convincing reasons not to make that conceptual jump from implicit associations to implicit attitudes. Since I don’t have the space to cover all their objections, I’ll focus on their key points. The first is one that I feel ought to be fairly obvious: quicker associations between whites and positive concepts can be generated by merely being aware of racial stereotypes, irrespective of whether one endorses them on any level, conscious or not. Indeed, even African American subjects were found to manifest pro-white biases on these tests. One could take those results as indicating that the black subjects were implicitly racist against their own ethnic group, though it would seem to make more sense to interpret them as reflecting the black subjects’ awareness of stereotypes they did not endorse. The latter interpretation also goes a long way toward explaining the small and inconsistent correlations between the explicit and implicit measures: the IAT is measuring a different concept (knowledge of stereotypes) than the explicit measures (endorsement of stereotypes).

In order to appreciate the next criticism of this conceptual leap, there’s an important point worth bearing in mind about the IAT: the test doesn’t measure whether two concepts are associated in any absolute sense; it merely measures the relative strengths of those associations (for example, “bread” might be more strongly associated with “butter” than with “banana”, though it might be more associated with both than with “wall”). The importance of this point is that the results of the IAT do not test whether there is a negative association toward any one group; just whether one group is rated more positively than another. While whites might have a stronger association with positive concepts than blacks, it does not follow that blacks have a negative association overall, nor that whites have a particularly positive one. Both groups could be held in high or low regard overall, with one being slightly favored. In much the same way, I might enjoy eating both pizza and turkey sandwiches, but I would tend to enjoy eating pizza more. Since the IAT does not track whether these response time differentials are due to hostility, its results do not automatically map onto most definitions of prejudice.

Finally, the authors make the (perhaps politically incorrect) point that noticing behavioral differences between groups – racial or otherwise – and altering behavior accordingly is not, de facto, evidence of an irrational racial bias; it could well represent the proper use of Bayesian inference, passing correspondence benchmarks for rational behavior. If one group, A, happens to perform behavior X more often than group B, it would be peculiar to ignore this information when trying to predict the behavior of an individual from one of those groups. In fact, when people fail to do as much in other situations, that failure tends to get called a bias or an error. However, given that race is a touchy political subject, people tend to condemn others for using what Arkes & Tetlock (2004) call “forbidden base rates”. Indeed, the authors report that previous research found subjects willing to condemn an insurance company for using base rate data on the likelihood of property damage in certain neighborhoods when that base rate also happened to correlate with the racial makeup of those neighborhoods (but not when those racial correlates were absent).
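That base-rate point can be made concrete with a toy application of Bayes’ theorem (all numbers below are invented purely for illustration): if behavior X is genuinely more common in group A than in group B, then observing X rationally shifts one’s prediction, and a predictor who refuses to update is the one deviating from the normative standard:

```python
# Hypothetical base rates, chosen only to illustrate the arithmetic.
p_x_given_a = 0.30  # rate of behavior X in group A
p_x_given_b = 0.10  # rate of behavior X in group B
p_a = 0.5           # the two groups are equally common

# Bayes' theorem: probability an individual belongs to group A, given X observed.
p_a_given_x = (p_x_given_a * p_a) / (p_x_given_a * p_a + p_x_given_b * (1 - p_a))
print(round(p_a_given_x, 2))  # 0.75: up from the prior of 0.5
```

Whether one *ought* to use such information in socially charged domains is exactly the “forbidden base rates” question; the calculation only shows that using it is what standard Bayesian updating prescribes.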

A result which fits nicely with other theory I’ve written about, so subscribe now and don’t miss any more exciting updates!

To end this on a lighter, (possibly) less politically charged note, a final point worth considering is that this test measures the automaticity of activation, not necessarily the pattern of activation that will eventually obtain. While my immediate reaction toward a brownie within the first 200 milliseconds might be “eat that”, that doesn’t mean I will eventually end up eating said brownie, nor would it make me implicitly opposed to the idea of dieting. It would seem that, in spite of these implicit associations, society as a whole has been getting less overtly racist. The need for researchers to dig this deep to try and study racism could be taken as heartening, given that we “now attempt to gauge prejudice not by what people do, or by what people say, but rather by millisecs of response facilitation or inhibition in implicit association paradigms” (p.275). While I’m sure there are still many people who will make much of these reaction time differentials for reasons that aren’t entirely free of their personal politics, it’s nice to know just how much progress our culture seems to have made toward eliminating racism.

References: Arkes, H.R., & Tetlock, P.E. (2004). Attributions of implicit prejudice, or “Would Jesse Jackson ‘fail’ the implicit association test?” Psychological Inquiry, 15, 257-278

Greenwald, A.G., McGhee, D.E., & Schwartz, J.L.K. (1998). Measuring individual differences in implicit cognition: The implicit association test. Journal of Personality and Social Psychology, 74, 1464-1480