“Couldn’t-Even-Possibly-Be-So Stories”: Just-World Theory

While I was reading over a recent paper by Callan et al (2012) about the effects that a victim’s age has on people’s moral judgments, I came across something that’s particularly – and disappointingly – rare in most of the psychological literature: the authors explicitly thinking about possible adaptive functions of our psychology. That is to say, the authors were considering what adaptive problem(s) some particular aspect of human psychology might be designed to solve. In that regard, I would praise the authors for giving these important matters some thought. Their major stumbling point, however, is that the theory the authors reference, just-world theory, suggests an implausible function; one that couldn’t even potentially be correct.

Just-world theory, as presented by Hafer (2000), is a very strange kind of theory. It begins with the premise that people have a need to believe in a just or fair world, so people think that others shouldn’t suffer or gain unless they did something to deserve it. More precisely, “good” people are supposed to be rewarded and “bad” people are supposed to be punished, or something like that, anyway. When innocent people suffer, then, this belief is supposedly “threatened”, so, in order to remove the threat and maintain their just-world belief, people derogate the victim. This makes the victim seem less innocent and more deserving of their suffering, so the world can again be viewed as just.

I’ll bet that guy made Santa’s “naughty” list.

Phrased in terms of adaptationist reasoning, just-world theory would go something like this: humans face the adaptive problem of maintaining a belief in a just world in the face of contradictory evidence. People solve this problem with cognitive mechanisms that function to alter that contradictory evidence into confirmatory evidence. The several problems with this suggestion ought to jump out clearly at this point, but let’s take them one at a time and examine them in some more depth. The first issue is that the adaptive problem being posited here isn’t one; indeed, it couldn’t be. Holding a belief, regardless of whether that belief is true or not, is a lot like “feeling good”, in that neither of them, on their own, actually does anything evolutionarily useful. Sure, beliefs (such as “Jon is going to attack me”) might motivate you to execute certain behaviors (running away from Jon), but it is those behaviors which are potentially useful; not the beliefs per se. Natural selection can only “see” what you do; not what you believe or how you feel. Accordingly, an adaptive problem could not even potentially be the maintaining of a belief.

But let’s assume for the moment that maintaining a belief could be a possible adaptive problem. Even granting this, just-world theory runs directly into a second issue: why would contradictory evidence “threaten” that belief in the first place? It seems perfectly plausible that an individual could simply believe whatever it was important to believe and be done with it, rather than trying to rationalize that belief to ensure it’s consistent with other beliefs or accurate. For instance, say, for whatever reason, it’s adaptively important for people to believe that anyone who leaves their house at night will die. Then someone who believes this observes their friend Max leaving the house at night and returning very much alive. The observer in this case could, it seems, go right along believing that anyone who leaves their house at night will die without also needing to believe either that (a) Max didn’t leave his house at night or (b) Max isn’t alive. While the observer might also believe one or both of those things, it would seem irrelevant whether or not they did.

On a related note, it’s also worth noting that just-world theory seems to imply that the adaptive goal here is to hold an incorrect belief – that “the world” is just. Now there’s nothing implausible about the suggestion that an organism can be designed to be strategically wrong in certain contexts; when it comes to persuading others, for instance, being wrong can be an asset at times. When you aren’t trying to persuade others of something, however, being wrong will be, at best, neutral and, at worst, exceedingly maladaptive. So what does Hafer (2000) suggest the function of such incorrect beliefs might be?

By [dissociating from an innocent victim], observers can at least be comforted that although some people are unjustly victimized in life, all is right with their own world and their own investments in the future (emphasis mine)

As I mentioned before, this explanation couldn’t even possibly work, as “feeling good” isn’t one of those things that does anything useful by itself. As such, maintaining an incorrect belief for the purposes of feeling good fails profoundly as a proper explanation for any behavior.

Not only is the world not just, it isn’t tuned into your thought frequencies either, no matter how strongly you incorrectly believe it is.

On top of all the aforementioned problems, there’s also a major experimental problem: just-world theory only seems to have been tested in one direction. Without getting too much into the methodological details of her studies, Hafer (2000) found that when a victim was “innocent”, subjects who were primed to think about their long-term plans were slightly more likely to blame the victim for their negative life outcome, derogate them, and dissociate from them (i.e., “they should have been more cautious” and “what happened to them is not likely to happen to me”), relative to subjects who were not primed for the long term. Hafer’s interpretation of these results was that, at least in the long-term condition, the innocent victim threatened the just-world belief, so people in turn perceived the victim as less innocent.

While the innocent-victims-being-blamed angle was examined, Hafer (2000) did not examine the opposite context: that of the undeserving recipient. Let’s say there was someone you really didn’t like, and you found out that this someone recently came into a large sum of money through an inheritance. Presumably, this state of affairs would also “threaten” your just-world belief; after all, bad people are supposed to suffer, not benefit, so you’d be left with a belief-threatening inconsistency. If we presented subjects with a similar scenario, would we expect them to “protect” their just-world belief by reframing their disliked recipient as a likable and deserving one? While I admittedly have no data bearing on that point, my intuitive answer to the question would be a resounding “probably not”; they’d probably just view their rival as a richer pain in their ass after receiving the cash. It’s not as if intuitions about who’s innocent and guilty seem to shift simply on the basis of received benefits and harms; the picture is substantially more nuanced.

“If he was such a bad guy, why was he pepper-spraying people instead of getting sprayed?”

To reiterate, I’m happy to see psychologists thinking about functions when developing their research; while such a focus is by no means sufficient for generating good research or sensibly interpreting results (as we’ve just seen), I think it’s an important step in the right direction. The next major step would be for psychological researchers to better learn how to differentiate plausible and non-plausible functions, and for that they need evolutionary theory. Without evolutionary theory, ostensible explanations like “feeling good” and “protecting beliefs” can be viewed as acceptable and, in some cases, even as useful, despite them being anything but.

References: Callan, M., Dawtry, R., & Olson, J. (2012). Justice motive effects in ageism: The effects of a victim’s age on observer perceptions of injustice and punishment judgments. Journal of Experimental Social Psychology, 48 (6), 1343-1349. DOI: 10.1016/j.jesp.2012.07.003

Hafer, C. (2000). Investment in long-term goals and commitment to just means drive the need to believe in a just world. Personality and Social Psychology Bulletin, 26, 1059-1073

The Tension Between Theory And Reality

“In theory, theory and practice are the same. In practice, they are not.”

There is a relatively famous quote attributed to Michelangelo, who was discussing his process of carving a statue: “I saw the angel in the marble and carved until I set him free”. Martin Nowak, in his book SuperCooperators (2011), uses that quote to talk about his admiration for using mathematical models to study cooperation. By stripping away the “noise” in the world, one can end up with some interesting conclusions. For instance, it was this stripping away of the noise that led to the now-famous programming competition that showed us how successful a tit-for-tat strategy can be. There’s just one hitch, and it’s expressed in another relatively famous quote attributed to Einstein: “Whether you can observe a thing or not depends on the theory which you use. It is the theory which decides what can be observed.” Imagine instead that Michelangelo had not seen an angel in the marble, but rather a snake: he would have “released” the snake from the marble instead. That Michelangelo “saw” the angel in the first place seemed to preclude his seeing the snake – or any number of other possible images – that the marble might also have represented. I should probably also add that neither the snake nor the angel were actually “in” the marble in the first place…

“You see a medium for high art; I see new kitchen countertops”

The reason I bring up Nowak’s use of the Michelangelo quote is that, both in his book and in a recent paper (Nowak, 2012), Nowak (a) stresses the importance of using mathematical models to reveal underlying truths by stripping away noise from the world, and (b) advocates for the re-addition of that noise, or at least some of it, to make the models better at predicting real-world outcomes. The necessity of this latter point is demonstrated neatly by the finding that, as the rules of the models designed to assess cooperation shifted slightly, the tit-for-tat strategy no longer emerged as victorious. When new variables – ones previously treated as noise – are introduced to these games, new strategies can best tit-for-tat handily. Sometimes the dominant strategy won’t even remain static over time, shifting between patterns of near universal cooperation, universal defection, and almost anything in between. That new pattern of results doesn’t mean that a tit-for-tat strategy isn’t useful on some level; just that its usefulness is restricted to certain contexts, and those contexts may or may not be represented in any specific model.
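
To make the kind of model under discussion concrete, here is a minimal sketch of an Axelrod-style round-robin tournament in Python. Everything in it – the payoff values, the handful of strategies, the number of rounds – is an illustrative assumption of mine rather than anything taken from Nowak’s models, but it shows why tit-for-tat “wins” under one particular set of rules, and why changing those rules can change the winner.

```python
# A minimal Axelrod-style round-robin tournament. The payoff values, the
# strategies included, and the number of rounds are illustrative assumptions;
# they are not taken from Nowak's models.
from itertools import combinations_with_replacement

PAYOFFS = {('C', 'C'): (3, 3),   # reward for mutual cooperation
           ('C', 'D'): (0, 5),   # sucker's payoff vs. temptation to defect
           ('D', 'C'): (5, 0),
           ('D', 'D'): (1, 1)}   # punishment for mutual defection

def always_cooperate(my_history, their_history):
    return 'C'

def always_defect(my_history, their_history):
    return 'D'

def tit_for_tat(my_history, their_history):
    # Cooperate on the first move, then copy the partner's previous move.
    return 'C' if not their_history else their_history[-1]

def play_match(strat_a, strat_b, rounds=200):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a = strat_a(hist_a, hist_b)
        move_b = strat_b(hist_b, hist_a)
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        score_a, score_b = score_a + pay_a, score_b + pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b

strategies = [always_cooperate, always_defect, tit_for_tat]
totals = {s.__name__: 0 for s in strategies}
for a, b in combinations_with_replacement(strategies, 2):
    score_a, score_b = play_match(a, b)
    totals[a.__name__] += score_a
    totals[b.__name__] += score_b

print(totals)  # tit_for_tat comes out on top under exactly these rules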

Like Michelangelo, then, these theoretical models can “see” any number of outcomes (as determined by the initial state of the program and its governing rules); like Einstein, these models can also only “see” what they are programmed to see. Herein lies the tension: these models can be excellent for demonstrating many things (like that group selection can work), but many of those things which can be demonstrated in the theoretical realm are not applicable to the reality that we happen to live in (also like group selection). The extent to which those demonstrations are applicable to the real world relies on the extent to which the modeller happened to get things right. For example, let’s say we actually had a slab of marble with something inside it and it’s our goal to figure out what that something is: a metaphorical description of doing science. Did Michelangelo demonstrate that this something was the specific angel he had in mind by removing everything that wasn’t that angel from an entirely different slab of marble? Not very convincingly; no. He might have been correct, but there’s no way to tell without actually examining the slab with that something inside of it directly. Because of this, mathematical models do not serve as a replacement for experimentation or theory in any sense.

On top of that concern, a further problem is that, in the realm of the theoretical, any abstract concept (like “the group”) can be granted as much substance as any other, regardless of whether those concepts can be said to exist in reality; one has a fresh slab of marble in which one can “see” anything, constrained only by one’s imagination and programming skills. I could, with the proper technical know-how, create a mathematical model that demonstrates that people with ESP have a fitness advantage over those without this ability. By contrast, I could create a similar model that demonstrates that people without ESP have a fitness advantage over those with the ability. Which outcome eventually obtains depends entirely on the ways in which I game my model in favor of one conclusion or the other. Placed in that light (“we defined some strategy as working and concluded that it worked”), the results of mathematical modeling seem profoundly less impressive. More to the point, however, the outcome of my model says nothing about whether or not people actually have these theoretical ESP abilities in the first place. If they don’t, all the creative math and programming in the world wouldn’t change that fact.
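
To put that tautology in code form, here is a deliberately rigged toy model; every number in it, including the fictional ESP trait and its fitness bonus, is an arbitrary assumption of mine. Whether “ESP” spreads through the simulated population or vanishes from it is decided entirely by the sign of the bonus I type in, which is exactly the problem.

```python
# A deliberately rigged toy model: whether the (entirely fictional) ESP trait
# spreads or disappears is decided by the esp_fitness_bonus I choose up front.
import random

def run_toy_model(esp_fitness_bonus, generations=500, pop_size=1000):
    # Start with half the population carrying the fictional ESP trait.
    pop = ['ESP'] * (pop_size // 2) + ['no_ESP'] * (pop_size // 2)
    for _ in range(generations):
        weights = [1.0 + esp_fitness_bonus if ind == 'ESP' else 1.0 for ind in pop]
        # Each new generation is sampled in proportion to these fitness weights.
        pop = random.choices(pop, weights=weights, k=pop_size)
    return pop.count('ESP') / pop_size

print(run_toy_model(+0.05))  # "demonstrates" that ESP is adaptive (frequency near 1)
print(run_toy_model(-0.05))  # "demonstrates" that ESP is maladaptive (frequency near 0)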

Because, eventually, Keanu Reeves will stop you.

As you can no doubt guess by this point, I don’t hold mathematical modeling in the same high esteem that Nowak seems to. While its theoretical utility is boundless, its practical utility seems extremely limited, hinging on the extent to which the assumptions of the programmer approach reality. With that in mind, I’d like to suggest a few other details that do not yet seem to have been included in these models of cooperation. That’s not to say that the inclusion of these variables would allow a model to derive some new and profound truths – as these models can only see what they are told to see and how they are told to see it – just that these variables might help the models, to whatever degree, better reflect reality.

The first of these issues is that these cooperation games seem to be played using an identical dilemma between rounds; that is to say, there’s only one game in town, and the payoff matrices for cooperation and defection remain static. This, of course, is not the way reality works: cooperation is sometimes mutually beneficial, other times mutually detrimental, and at still other times beneficial for only one of the parties involved, and all of that changes the game substantially. Yes, this means we aren’t strictly dealing with cooperative dilemmas anymore, but reality is not made up of strictly cooperative dilemmas, and that matters if we’re trying to draw conclusions about reality. Adding this consideration into the models would make it unlikely that behavioral strategies would ever cycle between “always cooperate” and “always defect”, as Nowak (2012) found that they did in his models. Such strategies are too simple-minded and underspecified to be practically useful.

A second issue involves the relative costs and benefits of cooperation and defection even within the same game. Sometimes defecting may lead to great benefits for the defector; at other times, defecting may only lead to small benefits. A similar situation holds for how much of a benefit cooperation will bring to one’s partner. A tit-for-tat strategy could be fooled, so to speak, by this change of rules (i.e., I could defect on you when the benefits to me are great and reestablish cooperation only when the costs of cooperating are low). Just as cooperation will not yield identical payoffs over time, it will also not yield identical payoffs between specific individuals. This would make some people more valuable to have as cooperative partners than others and, given that cooperation takes some amount of limited time and energy, this means competition for those valuable partners. Similarly, this competition can also mean that cooperating with one person entails simultaneously defecting against another (cooperation here is zero-sum; there’s only so much to go around). Competition for these more valuable individuals can lead to all sorts of interesting outcomes: people being willing to suffer defection for the privilege of certain other associations; people actively defecting on or punishing others to prevent those others from gaining said associations; people avoiding even trying to compete for these high-value players, as their odds of achieving such associations are vanishingly low. Basically, all sorts of politically-wise behaviors we see from the characters in Game of Thrones that don’t yet find themselves represented in these mathematical models.
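
Here is a rough sketch of that kind of “fooling”, using made-up payoff values and an arbitrary defection threshold: a strategy that defects only when the temptation payoff happens to be large ends up out-scoring the tit-for-tat partner it plays against.

```python
# A sketch of "fooling" tit-for-tat when the temptation payoff varies between
# rounds; the payoff values and defection threshold are made-up assumptions.
import random

R, S, P = 3, 0, 1   # reward, sucker's payoff, punishment (held constant here)

def tit_for_tat(their_history, temptation):
    return 'C' if not their_history else their_history[-1]

def opportunist(their_history, temptation):
    # Cooperate by default, but defect whenever this round's temptation is big.
    return 'D' if temptation > 8 else 'C'

def play(strat_a, strat_b, rounds=10_000, seed=1):
    rng = random.Random(seed)
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        temptation = rng.uniform(3.5, 10)  # how much a defector gains this round
        move_a = strat_a(hist_b, temptation)
        move_b = strat_b(hist_a, temptation)
        payoffs = {('C', 'C'): (R, R), ('C', 'D'): (S, temptation),
                   ('D', 'C'): (temptation, S), ('D', 'D'): (P, P)}
        pay_a, pay_b = payoffs[(move_a, move_b)]
        score_a, score_b = score_a + pay_a, score_b + pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b

print(play(opportunist, tit_for_tat))  # the opportunist out-scores its partner
print(play(tit_for_tat, tit_for_tat))  # baseline: two tit-for-tat players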

We might also want to add a stipulation for in-game beheadings.

A final issue is the information that individuals in these games are exposed to: it’s all true information. In the non-theoretical realm, it’s not always clear whether someone you’ve been interacting with cooperated or defected, or how much effort they put into the venture even if they were on the cooperating side of the equation. If individuals in these games could reap the benefits of defecting while simultaneously convincing others that they had cooperated, that’s another game-changer. Modeling all of this is, no doubt, a lot of work, but potentially doable. It would lead to a whole new set of findings about which strategies worked and which ones didn’t, and how, and when, and why. The larger point, however, is that the results of these mathematical models aren’t exactly findings; they’re restatements of our initial intuitions in mathematical form. Whether those intuitions are poorly developed and vastly simplified or thoroughly developed and conceptually rich is an entirely separate matter, as they’re all precisely as “real” in the theoretical domain.

References: Nowak, M. (2011). SuperCooperators: Altruism, evolution, and why we need each other to succeed. New York: Free Press

Nowak, M. (2012). Evolving cooperation. Journal of Theoretical Biology, 299, 1-8. DOI: 10.1016/j.jtbi.2012.01.014

The Fight Over Mankind’s Essence

All traits of biological organisms require some combination and interaction of genetic and non-genetic factors to develop. As Tooby and Cosmides put it in their primer:

Evolutionary psychology is not just another swing of the nature/nurture pendulum. A defining characteristic of the field is the explicit rejection of the usual nature/nurture dichotomies — instinct vs. reasoning, innate vs. learned, biological vs. cultural. What effect the environment will have on an organism depends critically on the details of its evolved cognitive architecture.

The details of that cognitive architecture are, to some extent, what people seem to be referring to when they use the word “innate”, and figuring out the details of that architecture is a monumental task indeed. For some reason, this task of figuring out what’s “innate” also draws some degree of what I feel is unwarranted hostility, and precisely why it does so is a matter of great interest. One might posit that some of this hostility is due to the term itself. “Innate” seems to be a terribly problematic term for the same two reasons that most other contentious terms are: people can’t seem to agree on a clear definition for the word or a context to apply it in, but they still use it fairly often despite that. Because of this, interpersonal communication can get rather messy, much like two teams trying to play a sport in which each is playing the game under a different set of rules; a philosophical game of Calvinball. I’m most certainly not going to be able to step into this debate and provide the definition for “innate” that all parties will come to intuitively agree upon and use consistently in the future. Instead, my goal is to review two recent papers that examined the contexts in which people’s views of innateness vary.

“Just add environment!” (Warning: chicken outcome will vary with environment)

Anyone with a passing familiarity with the debates that tend to surround evolutionary psychology will likely have noticed that most of these debates tend to revolve around issues of sex differences. Further, this pattern tends to hold whether it’s a particular study being criticized or the field more generally; research on sex differences just seems to catch a disproportionate amount of the criticism, relative to most other topics, and that criticism can often get leveled at the entire field by association (even if the research is not published in an evolutionary psychology journal, and even if the research is not conducted by people using an evolutionary framework). While this particular observation of mine is only an anecdote, it seems that I’m not alone in noticing it. The first of the two studies on attitudes towards innateness was conducted by Geher & Gambacorta (2010) on just this topic. They sought to determine the extent to which attitudes about sex differences might be driving opposition to evolutionary psychology and, more specifically, the degree to which those attitudes might be correlated with being an academic, being a parent, or being politically liberal.

Towards examining this issue, Geher & Gambacorta (2010) created questions aimed at assessing people’s attitudes in five domains: (1) human sex differences in adulthood, (2) human sex differences in childhood, (3) behavioral sex differences in chickens, (4) non-sex-related human universals, and (5) behavioral differences between dogs and cats. Specifically, the authors asked about the extent to which these differences were due to nature or nurture. As mentioned in the introduction, this nature/nurture dichotomy is explicitly rejected in the conceptual foundations of evolutionary psychology and is similarly rejected as useful by the authors. This dimension was merely used in order to capture the more common attitudes about the nature of biological and environmental causation, where the two are often seen as fighting for explanatory power in some zero-sum struggle.

Of the roughly 270 subjects who began the survey, not all completed every section. Nevertheless, the initial sample included 111 parents and 160 non-parents, 89 people in academic careers and 182 non-academics, and the sample as a whole was roughly 40 years old and mildly politically liberal, on average. The study found that political orientation was correlated with judgments of whether sex differences in humans (children and adults) were due to nature or environment, but not with the other three domains (cats/dogs, chickens, or human universals): specifically, those with more politically liberal leanings were also more likely to endorse environmental explanations for human sex differences. Across other domains there were some relatively small and somewhat inconsistent effects, so I wouldn’t make much of them just yet (though I will mention that subjects from women’s studies and sociology seemed consistently more inclined to chalk each domain – excepting the differences between cats and dogs – up to nurture, relative to other fields; I’ll also mention their sample was small). There was, however, a clear effect that was not discussed in the paper: subjects were more likely to chalk non-human animal behavior up to nature, relative to human behavior, and this effect seemed more pronounced with regard to sex differences specifically. With these findings in mind, I would echo the conclusion of the paper that there appears to be some political or, more specifically, moral dimension to these judgments of the relative roles of nature and nurture. As animal behavior tends to fall outside of the traditional human moral domain, chalking their behavior up to nature seemed less unpalatable to the subjects.

See? Men and women can both do the same thing on the skin of a lesser beast.

The next paper is a new release from Knobe & Samuels (2013). You might remember Knobe from his other work in asking people slightly different questions and getting vastly different responses, and it’s good to see he’s continuing on with that proud tradition. Knobe & Samuels begin by asking the reader to imagine how they’d react to the following hypothetical proposition:

Suppose that a scientist announced: ‘I have a new theory about the nature of intention. According to this theory, the only way to know whether someone intended to bring about a particular effect is to decide whether this effect truly is morally good or morally bad.’

The authors predict that most people would reject this piece of folk psychology made explicit; value judgments are supposed to be a different matter entirely from tasks like assessing intentionality or innateness, yet these judgments do not appear to truly be independent of each other in practice. Morally negative outcomes are rated as being more intentional than morally positive ones, even if both are brought about as a byproduct of another goal. Knobe & Samuels (2013) sought to extend this line of research into the realm of attitudes about innateness.

In their first experiment, Knobe & Samuels asked subjects to consider an infant born with a rare genetic condition. This condition ensures that if the baby breastfeeds in the first two weeks of life, it will either have extraordinarily good math abilities (condition one) or exceedingly poor math skills (condition two). While the parents could opt to give the infant baby formula that would ensure the baby turned out normal with regard to its math abilities, in all cases the parents were said to have opted to breastfeed, and the child developed accordingly. When asked how “innate” the child’s subsequent math ability was, subjects seemed to feel that the baby’s abilities were more innate (4.7 out of 7) when they were good, relative to when those abilities were poor (3.4). In both cases, the trait depended on the interaction of genes and environment, and for the same reason, yet when the outcome was negative it was seen as less of an innate characteristic. This was followed up by a second experiment in which a new group of subjects were presented with a vignette describing a fake finding about human genes: if people experienced decent treatment (condition one) or poor treatment (condition two) by their parents at least sometimes, then a trait would reliably develop. Since almost all people do experience decent or poor treatment from their parents on at least some occasions, just about everyone in the population comes to develop this trait. When asked how innate this trait was, again, the means through which it developed mattered: traits resulting from decent treatment were rated as more innate (4.6) than traits resulting from poor treatment (2.7).

Skipping two other experiments in the paper, the final study presented these cases either individually, with each participant seeing only one vignette as before, or jointly, with some subjects seeing both versions of the questions (good/poor math abilities, decent/poor treatment) one immediately after the other, with the relevant differences highlighted. When subjects saw the conditions independently, the previous effects were pretty much replicated, if a bit weakened. However, even seeing these cases side-by-side did not completely eliminate the effect of morality on innateness judgments: when the breastfeeding resulted in worse math abilities, this was still seen as being less innate (4.3) than the better math abilities (4.6) and, similarly, when poor treatment led to a trait developing, it was viewed as less innate (3.8) than when it resulted from better treatment (3.9). Now, these differences only reached significance because of the large sample size in the final study, as they were very, very small, so I again wouldn’t make much of them, but I do still find it somewhat surprising that there were any small differences left to talk about at all.

Remember: if you’re talking small effects, you’re talking psychology.

While these papers are by no means the last word on the subject, they represent an important first step in understanding the way that scientists and laypeople alike represent claims about human nature. Extrapolating these results a bit, it would seem that strong opinions about research in evolutionary psychology are held, at least to some extent, for reasons that have little to do with the field per se. This isn’t terribly surprising, as it’s been frequently noted that many critics of evolutionary psychology have a difficult time correctly articulating the theoretical commitments of the field. Both studies do seem to suggest that moral concerns play some role in the debate, but precisely why the moral dimension seems to find itself represented in the debate over innateness is certainly an interesting matter that neither paper really gets into. My guess is that it has something to do with the perception that innate behaviors are less morally condemnable than non-innate ones (hinting at an argumentative function), but that really just pushes the question back a step without answering it. I look forward to future research on this topic – and research on explanations, more generally – to help fill in the gaps of our understanding of this rather strange phenomenon.

References: Geher, G., & Gambacorta, D. (2010). Evolution is not relevant to sex differences in humans because I want it that way! Evidence for the politicization of human evolutionary psychology. EvoS: The Journal of the Evolutionary Studies Consortium, 2, 32-47

Knobe, J., & Samuels, R. (2013). Thinking like a scientist: Innateness as a case study. Cognition, 126 (1), 72-86. DOI: 10.1016/j.cognition.2012.09.003

The Drifting Nose And Other “Just-So”s

In my last post dealing with PZ Myers’ praise for the adaptationist paradigm, which was confusingly dressed up as criticism, PZ suggested the following hypothesis about variation in nose shape:

Most of the obvious phenotypic variation we see in people, for instance, is not a product of selection: your nose does not have the shape it does, which differs from my nose, which differs from Barack Obama’s nose, which differs from George Takei’s nose, because we independently descend from populations which had intensely differing patterns of natural and sexual selection for nose shape; no, what we’re seeing are chance variations amplified in frequency by drift in different populations.

Today’s post will be a follow-up on this point. Now, as I said before, I currently have no strong hypotheses about what past selection pressures (or lack thereof) might have been at work shaping the phenotypic variation found in noses; noses which differ noticeably in shape and size from those of chimps, gorillas, orangutans, and bonobos. The cross-species consideration, of course, is a separate matter from phenotypic variation within our species, but these comparisons might at least make one wonder why the human nose looks the way it does compared to other apes. If that reason (or reasons) could be discerned, it might also tell us something about the current types of variation we see in modern human populations. The reason why noses vary between species might indeed be “genetic drift” or “developmental constraint” rather than “selection”, just as the answer to within-species variation of that trait might be as well. Before simply accepting those conclusions as “obviously true” on the basis of intuition alone, though, it might do us some good to give them a deeper consideration.

Follow your nose; it always knows! Alternatively, following evidence can work too!

One of the concerns I raised about PZ’s hypothesis is that it does not immediately appear to make anything resembling a novel or useful prediction. This concern itself is far from new, with a similar (and more eloquently stated) point being raised by Tooby and Cosmides in 1997 [H/T to one of the commenters for providing the link]:

Modern selectionist theories are used to generate rich and specific prior predictions about new design features and mechanisms that no one would have thought to look in the absence of these theories, which is why they appeal so strongly to the empirically minded….It is exactly this issue of predictive utility, and not “dogma”, that leads adaptationists to use selectionist theories more often than they do Gould’s favorites, such as drift and historical contingency. We are embarrassed to be forced, Gould-style, to state such a palpably obvious thing, but random walks and historical contingency do not, for the most part, make tight or useful prior predictions about the unknown design features of any single species.

That’s not to say that one could not, in principle, derive a useful or novel prediction from the drift hypothesis; just that one doesn’t immediately jump out at me in this case, nor does PZ explicitly mention any specific predictions he had in mind. Without any specific predictions, PZ’s suggestion about variation in nose shape – which may well be true to some small or large degree (his language chalks most to all of the variation up to drift rather than selection, so it’s unclear precisely what proportion he had in mind) – runs the risk of falling prey to the label of “just-so story“.

Since PZ appears to be really concerned that evolutionary psychologists do not make use of drift in their research as often as he’d like, this, it seems, would have been the perfect opportunity for him to show us how things ought to be done: he could have derived a number of useful and novel predictions from the drift hypothesis and/or shown how drift might better account for some aspects of the data on nose variation he had in mind, relative to current competing adaptationist theories of nose variation. I’m not even that particular about the topic of noses, really; PZ might prefer to examine a psychological phenomenon instead, as this is evolutionary psychology he’s aiming his criticisms at. This isn’t just a mindless swipe at PZ’s apparent lack of a testable hypothesis, either: if predictions derived from a drift hypothesis led to some interesting research, that would be a welcome addition to any field.

Let’s move on from the prediction point to the matter of whether selection played a role in determining current nose variation. In this area, there is another concern of mine about the drift hypothesis that reaches beyond the pragmatic one. As PZ mentions in his post, selection pressures are generally “blind” to very small fitness benefits or detriments. If your nose differs in size from mine by 1/10 of a millimeter, that probably won’t have much of an effect on eventual fitness outcomes, so that variation might stick around in the next generation relatively unperturbed. The next generation will, in turn, introduce new variation into the population due to sexual recombination and mutation. If the average difference in nose shape was 1/10 of a millimeter in the previous generation, that difference may now grow to, say, 2/10 of a millimeter. Since that difference still isn’t large enough to make much of a difference, it sticks around into the next generation, which introduces new variation that isn’t selected for or against, and so on. These increases in average variation, while insignificant when considered in isolation, can become the target of selection as they accumulate and their fitness costs and benefits become non-negligible. In this hypothetical example, nose shape and size might come under stabilizing selection, where the more extreme variants are weeded out of the population, perhaps because they’re viewed as less sexually appealing or are less functional than other, less extreme variants (the factors that PZ singled out as not being important).
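
That verbal argument can be cartooned in a few lines of code; the population size, mutation size, and strength of stabilizing selection below are all arbitrary assumptions of mine. Under mutation and drift alone, the spread of the trait keeps growing over the generations; once stabilizing selection is switched on, the extremes get trimmed and the spread levels off.

```python
# A cartoon of the argument above: trait variation accumulates under mutation
# and drift alone, but gets reined in once stabilizing selection starts
# removing the extremes. Every number here is an arbitrary assumption.
import math
import random
import statistics

def simulate(stabilizing, pop_size=500, generations=200,
             mutation_sd=0.1, selection_width=2.0, seed=42):
    rng = random.Random(seed)
    population = [0.0] * pop_size          # everyone starts with the same trait value
    for _ in range(generations):
        offspring = []
        while len(offspring) < pop_size:
            parent = rng.choice(population)
            child = parent + rng.gauss(0, mutation_sd)   # a small heritable wobble
            if stabilizing:
                # Survival odds fall off as the trait strays from the optimum (0).
                survival = math.exp(-(child ** 2) / (2 * selection_width ** 2))
                if rng.random() > survival:
                    continue
            offspring.append(child)
        population = offspring
    return statistics.stdev(population)

print(simulate(stabilizing=False))  # the spread keeps growing under drift alone
print(simulate(stabilizing=True))   # the spread levels off under stabilizing selection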

Ladies; start your engines…

So let’s say one were to apply an adaptationist research paradigm to nose variation and compare it to PZ’s drift hypothesis (falsely assuming, for the moment, that an adaptationist research paradigm is somehow supposed to be opposed to a drift one). A researcher might begin by wondering what functions nose shape could serve. Note that these need not be definitive conclusions; merely plausible alternatives. Once our researcher has generated some possible functions, they would begin to figure out ways of testing these candidate alternatives. Noback et al (2011), for instance, postulated that the nasal cavity might function, in part, to warm and humidify incoming air before it reaches the lungs and, accordingly, predicted that nasal cavities ought to vary contingent on the requirements of warming and humidifying air across varying climates.

This adaptationist research paradigm generated six novel predictions, which is a good start compared to PZ’s zero. Noback et al (2011) then tested these predictions against 100 skulls from 10 different populations spanning 5 different climates. The results indicated significant correlations between nearly every climate factor (temperature and humidity) and nasal cavity shape. Further, the authors managed to disconfirm more than one of their initial hypotheses, and were also able to suggest that these variations in nasal cavity shape were not due solely to allometric effects. They also mention that plenty of variation is left unexplained, and that some of the nasal cavity variation might be due to tradeoffs between warming and humidifying incoming air and other functions of the nose (such as olfaction).

So, just to recap, this adaptationist research concerning nose variation yielded a number of testable predictions (it’s useful), found evidence consistent with them in some cases but not others (it’s falsifiable), tested alternative explanations (variation not solely due to allometry), mentioned tradeoffs between functions, and left plenty of variation unexplained (it did not assume every feature was an adaptation). This is compared to PZ’s drift hypothesis, which made no explicit predictions, cited no data, made no mention of function (presumably because it would postulate there isn’t one), and would seem unable to account well for this pattern of results. Perhaps PZ might note that this research deals primarily with internal features of the nose, not external ones, and that the external features are what he had in mind when he proposed the drift hypothesis. As he’s not explicit about which parts of nose shape were supposed to be under discussion, it’s hard to say whether he feels results like these would pose any problems for his drift hypothesis.

Moving targets can be notoriously difficult to hit

While I still remain agnostic about the precise degree to which variation in nose shape has been the target of selection, as I’m by no means an expert on the subject, the larger point here is how useful adaptationist research can be. It’s not enough to just declare that variation in a trait is obviously the product of drift and not selection and leave it at that, in much the same way that one can’t just assume a trait is an adaptation. As far as I see it, neither drift nor adaptation ought to be the null hypothesis in this case. Predictions need to be developed and tested against the available data, and the adaptationist paradigm is very helpful in generating those predictions and figuring out what data might be worth testing. That’s most certainly not to say those predictions will always be right, or that the research flowing from someone using that framework will always be good. The point is just that adaptationism itself is not the problem PZ seems to think it is.

References: Noback, M., Harvati, K., & Spoor, F. (2011). Climate-related variation of the human nasal cavity. American Journal of Physical Anthropology, 145 (4), 599-614. DOI: 10.1002/ajpa.21523

PZ Myers: Missing The Mark

As this year winds itself to a close, I’ve decided to treat myself to writing another post that allows me to engage more fully in my debating habit. The last post I did along these lines dealt with the apparent moral objections some people have to taking money from the wrong person, and why I felt they were misplaced. Today, the subject will be PZ Myers, who, as normally seems to be the case, appears to still have a dim view of evolutionary psychology. In this recent post, PZ suggested that evolutionary psychology is rotten right down to its theoretical core because of an apparently fatal misconception: adaptationism. Confusingly, PZ begins his attack on this fatal misconception by affirming that selection is an important mechanism, one essential for fully understanding evolution and one that ought not be ignored by researchers. In essence, PZ’s starting point is that the fatal flaw of evolutionary psychology is, in addition to not being a flaw, a vital conceptual paradigm.

Take that?

If you’re looking for anything in this post about why adaptationism per se is problematic, or a comparison demonstrating that research in psychology that makes use of adaptationism is generally inferior to research conducted without that paradigm, you’re liable to be disappointed by PZ’s latest offering. This is probably because very little of his post actually discusses adaptationism besides his praise of it; you know, that thing that’s supposed to be a seriously flawed foundation. So given that PZ doesn’t appear to actually be talking about adaptationism itself being a problem, what is he talking about? His main concern would seem to be that other evolutionary mechanisms – specifically, genetic drift and chance – are not as appreciated as explanatory factors as he would prefer. He’s more than welcome to his perception of whether or not some factors are under-appreciated. In fact, he’s even willing to share an example:

Most of the obvious phenotypic variation we see in people, for instance, is not a product of selection: your nose does not have the shape it does, which differs from my nose, which differs from Barack Obama’s nose, which differs from George Takei’s nose, because we independently descend from populations which had intensely differing patterns of natural and sexual selection for nose shape; no, what we’re seeing are chance variations amplified in frequency by drift in different populations.

While I currently have no strong hypotheses one way or another about past selection on nose shape, PZ certainly seems to: he feels that current variation in nose shape is obviously due to genetic drift. Now, I know it might seem like PZ is advancing a claim about past selection pressures with absolutely no evidence; it also might seem like his claim makes no readily apparent testable predictions, making it more of a just-so story; it might even seem that these sorts of claims are the kind that are relatively less likely to ever see publication for the former two reasons. In all fairness, though, all of that only seems that way because all those things happen to also be true.

Moving on to his next point, PZ notes that chance factors are very important in determining the direction evolution will take when selection coefficients are small and the alleles in question aren’t well-represented in the gene pool. In other words, there will be some deleterious mutations that happen to linger around in populations because they aren’t bad enough to be weeded out by selection, and some variations that would be advantageous but never end up being selected. This is a fine point, really; it just has very little to do with adaptationism. It has even less to do with his next point, which involves whether color preference has any functional design. Apparently, as an evolutionary psychologist, I’m supposed to have some kind of feelings about the matter of color preference by association, and these feelings are supposed to be obviously wrong. (If I’m interpreting PZ properly, that is. Of course, if I’m not supposed to share some opinion about color preference, it would be strange indeed for him to bring that example up…)

“Well, I guess I can’t argue with that logic…”

Unfortunately, PZ doesn’t get his fill of the Pop Anti-Evolutionary Psychology Game in this first go of collective guilt by association, so he takes another pass at it by asserting that evolutionary psychologists begin doing research by assuming that what they’re studying is a functional adaptation. For those unwilling to click through the link:

…[T]he “Pop Anti-Evolutionary Psychology Game.” Anyone can play…First, assert something that evolutionary psychologists think. These assertions can come in any of a number of flavors, the only requirement being that it has to be something that is obviously false, obviously stupid, or both…hyper-adaptationism is always a good option, that evolutionary psychologists assume that all traits are adaptations…The second part of the game should be obvious. Once you’ve baldly asserted what evolutionary psychologists believe…point out the blindingly obvious opposite of the view you’ve hung on evolutionary psychology.

This is, I think, supposed to be the problem that PZ was implying he had with evolutionary psychology more generally and adaptationism specifically. If this was supposed to be his point all along, he really should have put it at the beginning. In fact, had he simply written “not all traits and variations of those traits are adaptations” he could have saved a lot of time and been met with agreement from, of all people, evolutionary psychologists.

Breaking with tradition, PZ does mention that there have been some evolutionary psychology papers that he likes. I can only suppose their foundational concept was somehow different from that of the ones he doesn’t like. Confusingly, however, PZ also goes on to say that he tends to like evolutionary psychology papers more as they get away from the “psychology” part of things (the quotes are his, and I have no idea what they are supposed to mean) and focus more on genetics, which makes me wonder whether he’s actually reading papers in the field he thinks he is…

“No; I’m not lost, and no, I won’t stop and ask for directions”

Finally, PZ ends his rather strange post by asserting that we can’t learn anything of importance evolutionarily from studying undergraduates (which isn’t a novel claim for him). I’m most certainly in favor of research with more diverse cross-cultural samples, and moving beyond the subject pool is a good thing for all researchers in psychology to do. The assertion that we can’t learn anything of value from this sample of people strikes me as rather strange, though. It would be nice, I suppose, if PZ could helpfully inform us as to which types of people we could potentially learn important psychological things from, what kind of important things those might be, and why those things are specific to those samples, but I suspect he’s saving that wisdom up for another day.

Are Associations Attitudes?

If there’s one phrase that people discussing the results of experiments have heard more than any other, a good candidate might be “correlation does not equal causation”. Correlations can often get mistaken for (at least implying) causation, especially if the results are congenial to a preferred conclusion or interpretation. This is a relatively uncontroversial matter which has been discussed to death, so there’s little need to continue on with it. There is, however, a related reasoning error people also tend to make with regard to correlation; one that is less discussed than the former. This mistake is to assume that a lack of correlation (or a very low one) means no causation. Here are two reasons one might find no correlation despite an underlying relationship: in the first case, no correlation could result from something as simple as there being no linear relationship between two variables. As correlations only capture linear relationships, a relationship shaped like a bell curve (an inverted U) would tend to yield a correlation of zero.
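
A quick illustration of this first case: below, Y depends on X perfectly, but because the relationship is an inverted U, the (linear) Pearson correlation comes out near zero.

```python
# The first case: Y depends on X perfectly, but the relationship is an
# inverted U, so the (linear) Pearson correlation is approximately zero.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=10_000)
y = -(x ** 2)                      # a deterministic, bell-shaped relationship

print(np.corrcoef(x, y)[0, 1])     # prints a value very close to zero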

For the second case, consider the following example: event A causes event B, but only in the absence of variable C. If variable C randomly varies (it’s present half the time and absent the other half), [EDIT: H/T Jeff Goldberg] you might end up with no correlation, or at least a much reduced one, despite direct causation. This example becomes immediately more understandable if you relabel “A” as heterosexual intercourse, “B” as pregnancy, and “C” as contraceptives (ovulation works too, provided you also replace “absence” with “presence”). That said, even if contraceptives aren’t in the picture, the correlation between sexual intercourse and pregnancy is still pretty low.
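
And the second case in code, using the made-up 50/50 moderator described above: A directly causes B whenever C is absent, yet the A–B correlation comes out well short of perfect.

```python
# The second case: A causes B only when C is absent, and C is present half
# the time, so the A-B correlation is well below the 1.0 seen without C.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
a = rng.integers(0, 2, size=n)     # event A happens or it doesn't
c = rng.integers(0, 2, size=n)     # variable C is randomly present or absent
b = a * (1 - c)                    # B occurs only when A occurs and C is absent

print(np.corrcoef(a, b)[0, 1])     # roughly 0.58, despite direct causation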

And just in case you find that correlation reaching significance, there’s always this.

So why all this talk about correlation and causation? Two reasons: first, this is my website and I find the matter pretty neat. More importantly, though, I’d like to discuss the IAT (implicit association test) today; specifically, I’d like to address the matter of how well the racial IAT correlates (or rather, fails to correlate) with other measures of racial prejudice, and how we ought to interpret that result. While I have touched on this test very briefly before, it was in the context of discussing modularity; not dissecting the test itself. Since the IAT has recently crossed my academic path again on more than one occasion, I feel it’s time for a more complete engagement with it. I’ll start by discussing what the IAT is, what many people seem to think it measures, and finally what I feel it actually assesses.

The IAT was introduced by Greenwald et al in 1998. As its name suggests, the test was ostensibly designed to do something it would appear to do fairly well: measure the relative strengths of initial, automatic cognitive associations between two concepts. If you’d like to see how this test works firsthand, feel free to follow the link above, but, just in case you don’t feel like going through the hassle, here’s the basic design (using the race version of the test): subjects are asked to respond as quickly as possible to a number of stimuli. In the first phase, subjects view pictures of black and white faces flashed on the screen and are asked to press one key if the face is black and another if it’s white. In the second phase, subjects do the same task, but this time they press one key if the word that flashes on the screen is positive and another if it’s negative. Finally, these two tasks are combined, with subjects asked to press one key if the face is white or the word is positive, and another key if the face is black or the word is negative (these conditions then flip). Differences in reaction times on this test are taken to be measures of implicit cognitive associations. So, if you’re faster to categorize black faces with positive words, you’re said to have a more positive association towards black people.
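
For the curious, here is a simplified sketch of the scoring idea, using fabricated reaction times; the published scoring procedures (e.g., Greenwald and colleagues’ D-score) involve additional steps for handling error trials and outliers, but the core of it is just comparing speed across the two combined blocks.

```python
# A simplified sketch of the scoring idea: compare how quickly subjects
# respond in the two combined blocks. The reaction times are fabricated, and
# real scoring procedures include extra steps for error trials and outliers.
import statistics

# Milliseconds per trial in each combined block (made-up numbers).
white_positive_black_negative = [620, 580, 640, 600, 610, 590, 630, 605]
black_positive_white_negative = [700, 660, 720, 680, 690, 670, 710, 685]

def association_score(block_a, block_b):
    # Positive values mean faster responding under block_a's pairing;
    # the mean difference is scaled by the pooled spread of all trials.
    diff = statistics.mean(block_b) - statistics.mean(block_a)
    pooled_sd = statistics.stdev(block_a + block_b)
    return diff / pooled_sd

print(association_score(white_positive_black_negative,
                        black_positive_white_negative))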

Having demonstrated that many people seem to show a stronger association between white faces and positive concepts, the natural question arises of how to interpret these results. Unfortunately, many psychological researchers and laypeople alike have taken an unwarranted conceptual leap: they assume that these differential association strengths imply implicit racist attitudes. This assumption happens to meet with an unfortunate snag, however, which is that these implicit associations tend to have very weak to no correlations with explicit measures of racial prejudice (even if the measures themselves, like the Modern Racism Scale, are of questionable validity to begin with). Indeed, as reviewed by Arkes & Tetlock (2004), whereas the vast majority of undergraduates tested manifest exceedingly low levels of “modern racism”, almost all of them display a stronger association between white faces and positivity. Faced with this lack of correlation, many people have gone on to make a second assumption to account for it: that the implicit measure is able to tap some “truer” prejudiced attitude that the explicit measures are not as able to tease out. I can’t help but wonder, though, what those same people would have had to say if positive correlations had turned up…

“Correlations or no, there’s literally no data that could possibly prove us wrong”

Arkes & Tetlock (2004) put forth three convincing reasons not to make that conceptual jump from implicit associations to implicit attitudes. Since I don’t have the space to cover all their objections, I’ll focus on the key points. The first is one that I feel ought to be fairly obvious: quicker associations between whites and positive concepts can be generated merely by being aware of racial stereotypes, irrespective of whether one endorses them on any level, conscious or not. Indeed, even African American subjects were found to manifest pro-white biases on these tests. One could take those results as indicating that black subjects are implicitly racist against their own ethnic group, though it would seem to make more sense to interpret those results in terms of the black subjects being aware of stereotypes they did not endorse. The latter interpretation also goes a long way towards explaining the small and inconsistent correlations between the explicit and implicit measures; the IAT is measuring a different concept (knowledge of stereotypes) than the explicit measures (endorsement of stereotypes).

In order to appreciate the next criticism of this conceptual leap, there’s an important point worth bearing in mind concerning the IAT: the test doesn’t measure whether two concepts are associated in any absolute sense; it merely measures the relative strengths of these associations (for example, “bread” might be more strongly associated with “butter” than it is with “banana”, though it might be more associated with both than with “wall”). The importance of this point is that the results of the IAT do not tell us whether there is a negative association towards any one group; just whether one group is rated more positively than another. While whites might have a stronger association with positive concepts than blacks, it does not follow that blacks have a negative association overall, nor that whites have a particularly positive one. Both groups could be held in high or low regard overall, with one being slightly favored. In much the same way, I might enjoy eating both pizza and turkey sandwiches, but I would tend to enjoy eating pizza more. Since the IAT does not track whether these response time differentials are due to hostility, these results do not automatically apply well to most definitions of prejudice.

Finally, the authors make the (perhaps politically incorrect) point that noticing behavioral differences between groups – racial or otherwise – and altering behavior accordingly is not, de facto, evidence of an irrational racial bias; it could well represent the proper use of Bayesian inference, passing correspondence benchmarks for rational behavior. If one group, A, happens to perform behavior X more than group B, it would be peculiar to ignore this information if you’re trying to predict the behavior of an individual from one of those groups. In fact, when people fail to do as much in other situations, we tend to call that failure a bias or an error. However, given that race is a touchy political subject, people tend to condemn others for using what Arkes & Tetlock (2004) call “forbidden base rates”. Indeed, the authors report that previous research found subjects were willing to condemn an insurance company for using base rate data on the likelihood of property damage in certain neighborhoods when that base rate also happened to correlate with the racial makeup of those neighborhoods (but not when those racial correlates were absent).

A result which fits nicely with other theory I’ve written about, so subscribe now and don’t miss any more exciting updates!

To end this on a lighter, (possibly) less politically charged note, a final point worth considering is that this test measures the automaticity of activation; not necessarily the pattern of activation which will eventually obtain. While my immediate reaction towards a brownie within the first 200 milliseconds might be “eat that”, that doesn’t mean that I will eventually end up eating said brownie, nor would it make me implicitly opposed to the idea of dieting. It would seem that, in spite of these implicit associations, society as a whole has been getting less overtly racist. The need for researchers to dig this deep to try and study racism could be taken as heartening, given that we “now attempt to gauge prejudice not by what people do, or by what people say, but rather by millisecs of response facilitation or inhibition in implicit association paradigms” (p.275). While I’m sure there are still many people who will make a lot out of these reaction time differentials for reasons that aren’t entirely free from their personal politics, it’s nice to know just how much progress our culture seems to have made towards eliminating racism.

References: Arkes, H.R., & Tetlock, P.E. (2004). Attributions of implicit prejudice, or “Would Jesse Jackson ‘fail’ the implicit association test?” Psychological Inquiry, 15, 257-278.

Greenwald, A.G., McGhee, D.E., & Schwartz, J.L.K. (1998). Measuring individual differences in implicit cognition: The implicit association test. Journal of Personality and Social Psychology, 74, 1464-1480.

Does “Statistical Significance” Imply “Actually Significant”?

P-values below 0.05: the finding and reporting of these values might be considered the backbone of most psychological research. Conceptually, these values are supposed to represent the notion that, if the null hypothesis were true, the probability of observing results at least as extreme as the ones obtained would be under 5%. As such, if one observes a result unlikely to be obtained by chance, this would seem to carry the implication that the null hypothesis is unlikely to be true and that there are likely real differences between the group means under examination. Despite null hypothesis significance testing becoming the standard means of statistical testing in psychology, the method is not without its flaws, on both the conceptual and practical levels. According to a paper by Simmons et al (2011), on the practical end of things, some of the ways in which researchers are able to selectively collect and analyze data can dramatically inflate the odds of obtaining a statistically significant result.
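
As a quick illustration of what that 5% is supposed to mean in practice, here’s a minimal simulation sketch (nothing from Simmons et al yet; just two groups repeatedly drawn from the same distribution):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two groups repeatedly drawn from the SAME normal distribution,
# so the null hypothesis is true in every single "study".
n_sims, n_per_group = 10_000, 20
p_values = np.array([
    stats.ttest_ind(rng.normal(size=n_per_group),
                    rng.normal(size=n_per_group)).pvalue
    for _ in range(n_sims)
])

# By construction, roughly 5% of these tests should come out "significant".
print(f"Proportion of p < .05 under a true null: {np.mean(p_values < 0.05):.3f}")
```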

Don’t worry though; it probably won’t blow up in your face until much later in your career.

Before getting to their paper, it’s worth covering some of the conceptual issues inherent in null hypothesis significance testing, as the practical issues can be said to apply just as well to other kinds of statistical testing. Brandstaetter (1999) raises two large concerns about null hypothesis significance testing, though really they’re more like two parts of the same concern, and, ironically enough, they almost sound as if they’re opposing points. The first part of this concern is that classic significance testing does not tell us whether the results we observed came from a sample with a mean that was actually different from the null hypothesis. In other words, a statistically significant result does not tell us that the null hypothesis is false; in fact, it doesn’t even tell us the null hypothesis is unlikely. According to Brandstaetter (1999), this is due to the logic underlying significance testing being invalid. The specific example that Brandstaetter uses references the rolling of dice: if you roll a twenty-sided die, it’s unlikely (5%) that you will observe a 1; however, if you observe a 1, it doesn’t follow that it’s unlikely you rolled the die.
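
To put a rough number on Brandstaetter’s logical point, here’s a small sketch; the assumptions that only 10% of tested hypotheses are true and that studies of true effects have 50% power are purely illustrative, not figures taken from anyone’s paper:

```python
# How often is the null actually true, given a "significant" result?
# Illustrative assumptions (not from any paper): 10% of tested effects
# are real, alpha = .05, and studies of real effects have 50% power.
p_real, alpha, power = 0.10, 0.05, 0.50

p_sig = power * p_real + alpha * (1 - p_real)
p_null_given_sig = alpha * (1 - p_real) / p_sig

print(f"P(null is true | p < .05) = {p_null_given_sig:.2f}")  # about 0.47
```

Under those made-up numbers, nearly half of all “significant” results would come from true nulls; a far cry from being able to say the null is unlikely.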

While Brandstaetter’s die example addresses null hypothesis testing at a strictly logical level, this objection can be dealt with fairly easily, I feel: the hypothesis that one would actually be testing is not “the die was rolled”, so that specific example seems a bit strange. If you were comparing the heights of two different groups (say, men and women), and you found that one group was, in your sample, an average of six inches taller, it might be reasonable to conclude that it’s unlikely that the population means that the two samples come from are the same. This is where the second part of the criticism comes into play: in reality, the means of different groups are almost guaranteed to be different in some way, no matter how small or large that difference is. This means that, strictly speaking, the null hypothesis (there is no mean difference) is pretty much always false; the matter then becomes whether your test has enough power to reach statistical significance, and increasing your sample size can generally do the trick in that regard. So, in addition to not telling us whether the null hypothesis is true or false, the best that this kind of significance testing can do is tell us a specific value that a population mean is not. However, since there are an infinite number of possible values that a population mean could hypothetically take, the value of this information may be minimal.
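
A quick sketch of that second point; the “true” difference of 0.01 standard deviations below is an arbitrary, practically meaningless value chosen for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# A "true" mean difference so small as to be practically meaningless...
tiny_difference = 0.01

for n in (100, 10_000, 1_000_000):
    a = rng.normal(loc=0.0, size=n)
    b = rng.normal(loc=tiny_difference, size=n)
    print(f"n per group = {n:>9}: p = {stats.ttest_ind(a, b).pvalue:.4f}")

# ...still reliably becomes "statistically significant" once the sample
# gets large enough.
```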

Even in the best of times, then, significance testing has some rather definite conceptual concerns. These two conceptual issues, however, seem to be overshadowed in importance by the practical issues that arise during the conduct of research; what Simmons et al (2011) call “researcher degrees of freedom”. This term is designed to capture some of the various decisions that researchers might make over the course of collecting and analyzing data while hunting for statistically significant results capable of being published. As publications are important for any researcher’s career, and statistically significant results are the kind most likely to be published (or so I’ve been told), this combination of pressures can lead to researchers making choices – albeit not typically malicious ones – that increase their chances of finding such results.

“There’s a significant p-value in this mountain of data somewhere, I tell you!”

Simmons et al (2011) began by generating random samples, all pulled from the same normal distribution, across 15,000 independent simulations. Since no true effects existed in these data, the rate at which statistically significant “effects” were found should not tend to exceed 5% under classic significance testing. When there were two dependent measures capable of being analyzed (in their example, these were willingness to pay and liking), the ability to analyze these two measures separately or in combination nearly doubled the chances of finding a statistically significant “effect” at the 0.05 level. That is to say, the odds of finding an effect by chance were no longer 5%, but closer to 10%. A similar inflation was found when researchers had the choice of whether or not to control for gender. This makes intuitive sense, as it’s basically the same manipulation as the former two-measure case, just with a different label.
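
Here’s a minimal sketch of that first degree of freedom. This is not the authors’ actual simulation code; the 0.5 correlation between the two measures and the 20 observations per cell are my own illustrative choices:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def one_null_study(n=20, r=0.5):
    """One two-condition 'study' in which the null is true, but two
    correlated dependent measures (e.g., willingness to pay and liking)
    are available for analysis."""
    cov = [[1.0, r], [r, 1.0]]
    a = rng.multivariate_normal([0, 0], cov, size=n)
    b = rng.multivariate_normal([0, 0], cov, size=n)

    p1 = stats.ttest_ind(a[:, 0], b[:, 0]).pvalue      # first measure alone
    p2 = stats.ttest_ind(a[:, 1], b[:, 1]).pvalue      # second measure alone
    p3 = stats.ttest_ind(a.mean(1), b.mean(1)).pvalue  # their average
    return min(p1, p2, p3) < 0.05                      # report whichever "works"

n_sims = 15_000
rate = sum(one_null_study() for _ in range(n_sims)) / n_sims
print(f"False-positive rate with flexible measure choice: {rate:.3f}")
# A single pre-specified measure would hover around .05; the flexibility
# pushes the rate noticeably closer to .10.
```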

There’s similar bad news for the peek-and-test method that some researchers make use of with their data. In these cases, a researcher will collect some number of subjects for each condition – say 20 – and conduct a test to see if they found an effect. If an effect is found, the researcher will stop collecting data; if the effect isn’t found, the researcher will then collect another number of observations per condition – say another 10 – and then retest for significance. A researcher’s ability to peek at their data in this way increased the odds of finding an effect by chance up to about 8%. Finally, if the researcher decides to run multiple levels of a condition (Simmons et al’s example concerned splitting the sample into low, medium, and high conditions), the ability to selectively compare these conditions to each other brought the false positive rate up to 12.6%. Worryingly, if these four degrees of researcher freedom were combined, the odds of finding a false positive were as high as 60%; that is, the odds are better that you would find some effect strictly by chance than that you wouldn’t. While these results might have been statistically significant, they are not actually significant. This is a fine example of Brandstaetter’s (1999) initial point: significance testing cannot tell us whether the null hypothesis is true or likely, even though, in every one of these simulated cases, it was true.
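
And a similar sketch for the peek-and-test issue, again with made-up but representative numbers rather than anything taken from the paper:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def peek_and_test(n_initial=20, n_extra=10, alpha=0.05):
    """One null 'study' with peeking: test after n_initial observations
    per group and, if that misses, add n_extra per group and test again."""
    a = rng.normal(size=n_initial)
    b = rng.normal(size=n_initial)
    if stats.ttest_ind(a, b).pvalue < alpha:
        return True
    a = np.concatenate([a, rng.normal(size=n_extra)])
    b = np.concatenate([b, rng.normal(size=n_extra)])
    return stats.ttest_ind(a, b).pvalue < alpha

n_sims = 15_000
rate = sum(peek_and_test() for _ in range(n_sims)) / n_sims
print(f"False-positive rate with one peek: {rate:.3f}")  # noticeably above .05
```

Combine enough of these liberties, as the authors did in their own simulations, and the rate climbs far higher still.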

As Simmons et al (2011) also note, this rate of false positives might even be conservative, given that there are other, unconsidered liberties that researchers can take. Making matters even worse, there’s the aforementioned publication bias, in that, at least as far as I’ve been led to believe, journals tend to favor publications that (a) find statistically significant results and (b) are novel in their design (i.e. journals tend not to publish replications). This means that when false positives are found, they’re both more likely to make their way into journals and less likely to be subsequently corrected. In turn, those false positives could lead to poor research outcomes, such as researchers wasting time and money chasing effects that are unlikely to be found again, or even reinforcing the initial false positive in the event that someone chases after it, happens to find it again by chance, and publishes it a second time.

“With such a solid foundation, it’s difficult to see how this could have happened”

Simmons et al (2011) do put forth some suggestions as to how these problems could begin to be remedied. While I think their suggestions are all, in the abstract, good ideas, they would likely also generate a good deal more paperwork for researchers to deal with, and I don’t know a researcher alive who craves more paperwork. While there might be some tradeoff, in this case, between some amount of paperwork and eventual research quality, there is one point that Simmons et al (2011) do not discuss when it comes to remedying this issue, and that’s the matter I have been writing about for some time: the inclusion of theory in research. In my experience, a typical paper in psychology will give one of two explicit reasons for its being conducted: (1) an effect was found previously, so the researchers are looking to either find it again (or not find it), or (2) the authors have a hunch they will find an effect. Without a real theoretical framework surrounding these research projects, there is little need to make sense of or actually explain a finding; one can simply say they discovered a “bias” or a “cognitive blindness” and leave it at that. While I can’t say how much of the false-positive problem could be dealt with by requiring the inclusion of some theoretical framework for understanding one’s results when submitting a manuscript, if any, I feel some theory requirement would still go a long way towards improving the quality of research that ends up getting published. It would encourage researchers to think more deeply about why they’re doing what they’re doing, as well as help readers to understand (and critique) the results they end up seeing. While dealing with false positives should certainly be a concern, merely cutting down on their appearance is not enough to help research quality in psychology progress appreciably.

References: Brandstaetter, E. (1999). Confidence intervals as an alternative to significance testing. Methods of Psychological Research Online, 4.

Simmons, J., Nelson, L., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366. DOI: 10.1177/0956797611417632

A Frequentist And A Bayesian Walk Into Infinity…

I’m going to preface this post by stating that statistics is not my primary area of expertise. Admittedly, this might not be the best way of generating interest, but non-expertise hasn’t seemed to stop many a teacher or writer, so I’m hoping it won’t be too much of a problem here. This non-expertise, however, has apparently also not stopped me from stumbling upon an interesting question concerning Bayesian statistics. Whether this conceptual problem I’ve been mulling over would actually prove to be a problem in real-world data collection is another matter entirely. Then again, there doesn’t appear to be a required link between academia and reality, so I won’t worry too much about that while I indulge in the pleasure of a little bit of philosophical play time.

The link between academia and reality is about as strong as the link between my degree and a good job.

So first, let’s run through a quick problem using Bayesian statistics. This is the classic example by which I was introduced to the idea: say that you’re a doctor trying to treat an infection that has broken out among a specific population of people. You happen to know that 5% of the people in this population are actually infected and you’re trying to figure out who those people are so you can at least quarantine them. Luckily for you, you happen to have a device that can test for the presence of this infection. If you use this device to test an individual who actually has the disease, it will come back positive 95% of the time; if the individual does not have the disease, it will come back positive 5% of the time. Given that an individual has tested positive for the disease, what is the probability that they actually have it? The answer, unintuitive to most, is 50%.
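
For what it’s worth, the arithmetic behind that figure is short enough to write out directly:

```python
# Bayes' rule for the infection example, using the numbers from above
prevalence = 0.05           # P(infected)
sensitivity = 0.95          # P(positive test | infected)
false_positive_rate = 0.05  # P(positive test | not infected)

p_positive = (sensitivity * prevalence
              + false_positive_rate * (1 - prevalence))
p_infected_given_positive = sensitivity * prevalence / p_positive

print(f"P(infected | positive test) = {p_infected_given_positive:.2f}")  # 0.50
```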

Though the odds of someone testing positive if they have the disease are high (95%), very few people actually have the disease (5%). So 5% of the 95% of people who don’t have the infection will test positive, and 95% of the 5% of people who do have the infection also will; those two groups of positive testers end up being exactly the same size. In case that example ran by too quickly, here’s another brief video example, using hipsters drinking beer rather than doctors treating infections. This method of statistical testing would seem to have some distinct benefits: for example, it will tell you the probability of your hypothesis given your data (which, I’m told, is what most people actually want to be calculating), rather than the probability of your data given your hypothesis. That said, I see two (possibly major) conceptual issues with this type of statistical analysis. If anyone more versed in these matters feels they have good answers to them, I’d be happy to hear them in the comments section.

The first issue was raised by Gelman (2008), who was discussing the usefulness of our prior knowledge. In the above examples, we know some information ahead of time (the prevalence of an infection or of hipsters); in real life, we frequently don’t know this information; in fact, it’s often what we’re trying to estimate when we’re doing our hypothesis tests. This puts us in something of a bind when it comes to using Bayes’ formula. Lacking objective knowledge, one could use what are called subjective priors, which represent your own set of preexisting beliefs about how likely certain hypotheses are. Of course, subjective priors have two issues: first, they’re unlikely to be shared uniformly between people, and if your subjective beliefs are not my subjective beliefs, we’ll end up coming to two different conclusions given the same set of data. It’s also probably worth mentioning that subjective beliefs do not, to the best of my knowledge, actually affect the goings-on in the world: that I believe it’s highly probable it won’t rain tomorrow doesn’t matter; it either will or it won’t, and no amount of belief will change that. The second issue concerns the point of the hypothesis test: if you already have a strong prior belief about the truth of a hypothesis, for whatever reason you do, that would seem to suggest there’s little need for you to actually collect any new data.

On the plus side, doing research just got way easier!

One could attempt to get around this problem by using a subjective, but uninformative, prior; that is, distributing your belief uniformly over your set of possible outcomes, or entering into your data analysis with no preconceptions about how it’ll turn out. This might seem like a good solution to the problem, but it would also seem to make your priors all but useless: if you’re multiplying every hypothesis by the same constant, you can just drop that constant from your analysis without changing anything. So it would seem that, in both cases, priors don’t do you a lot of good: they’re either strong, in which case you don’t need to collect more data, or uninformative, in which case they’re pointless to include in the analysis. Now perhaps there are good arguments to be made for subjective priors, but that’s not the primary point I hoped to address; my main criticism involves what’s known as the gambler’s fallacy.
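
Before turning to that fallacy, here’s a quick sketch of the “uniform prior drops out” point, using a simple grid of candidate biases for a coin (the 7-heads-in-10-flips data are made up for illustration):

```python
import numpy as np

# A grid of candidate values for a coin's bias
theta = np.linspace(0.01, 0.99, 99)

# Likelihood of some made-up data (7 heads in 10 flips) for each candidate
heads, flips = 7, 10
likelihood = theta**heads * (1 - theta)**(flips - heads)

# A flat ("uninformative") prior over the grid
uniform_prior = np.ones_like(theta) / len(theta)

posterior = likelihood * uniform_prior
posterior /= posterior.sum()

# With a flat prior, the posterior is just the normalized likelihood;
# the prior contributes nothing beyond a constant that cancels out.
print(np.allclose(posterior, likelihood / likelihood.sum()))  # True
```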

This logical fallacy can be demonstrated with the following example: say you’re flipping a fair coin; given that this coin has come up heads 10 times in a row, how likely is a tails outcome on the next flip? The answer, of course, is 50%, as a fair coin is one that is unbiased with respect to which outcome will obtain when you flip it; a heads outcome using this coin is always exactly as likely as a tails outcome. However, someone committing the gambler’s fallacy will suggest that the coin is more likely to come up tails, as all the heads outcomes make a tails outcome feel more likely; as if a tails outcome is “due” to come up. This is incorrect, as each flip of this coin is independent of the other flips, so knowing what the previous outcomes of this coin have been tells you nothing about what the future outcomes of the coin will be, or, as others have put it, the coin has no memory. As I see it, Bayesian analysis could lead one to engage in this fallacy (or, more precisely, something like the reverse gambler’s fallacy).

Here’s the example I’ve been thinking about: consider that you have a fair coin and an infinite stretch of time over which you’ll be flipping it. Long strings of heads or tails outcomes (say 10,000 in a row, or even 1,000,000 and beyond) are certainly improbable but, given an infinite amount of time, they become an inevitability; outcomes that will obtain eventually. Now, if you’re a good Bayesian, you’ll update your posterior beliefs following each outcome. In essence, after a coin comes up heads, you’ll be more likely to think that it will come up heads on the subsequent flip; since heads have been coming up, more heads are due to come up. Essentially, you’ll be suggesting that these independent events are not actually independent of each other, at least with respect to your posterior beliefs. Given the long strings of heads and tails which will inevitably crop up, over time you will swing from believing the coin is fair, to believing it is nearly completely biased towards heads at some points and towards tails at others, and back again.
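
Here’s a minimal sketch of the kind of updating I have in mind, using a standard Beta-Bernoulli model; whether this behavior counts as something like a reverse gambler’s fallacy or as perfectly reasonable updating under uncertainty about the coin’s bias is, of course, exactly the question at issue:

```python
import numpy as np

rng = np.random.default_rng(1)

# Start with a uniform Beta(1, 1) prior on the coin's bias and update
# after every flip of a genuinely fair coin.
alpha, beta = 1.0, 1.0
for flip in rng.integers(0, 2, size=1000):  # 1 = heads, 0 = tails
    alpha += flip
    beta += 1 - flip

# Posterior predictive probability of heads on the next flip: ~0.50
print(f"P(heads next) after 1,000 fair flips: {alpha / (alpha + beta):.3f}")

# Now the improbable-but-inevitable-given-infinite-time run of heads
for _ in range(10_000):
    alpha += 1
print(f"P(heads next) after a 10,000-heads run: {alpha / (alpha + beta):.3f}")
# The posterior now treats the fair coin as strongly biased towards heads.
```

Whether runs like that would ever actually show up in any data you could realistically collect is, of course, a fair question; I return to it below.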

Though your beliefs about the world can never have enough pairs of flip-flops…

It seems to me, then, that if you want to more accurately estimate the parameter (in this case, the fairness of the coin), you want some statistical test that will, to some extent, try to take into account data that you did not obtain but might have: what might have happened if I had flipped the coin another X number of times. This is, generally speaking, anathema to Bayesian statistics as I understand it, which concerns itself only with the data that were actually collected. Of course, that does raise the question of how one can accurately predict what data they might have obtained, but did not, for which I don’t have a good answer. There’s also the matter of precisely how large of a problem this hypothetical example poses for Bayesian statistics when you’re not dealing with an infinite number of random observations; in the real world, this conceptual problem might not be much of one, as these events are highly improbable, so it’s rare that anyone would actually end up making this kind of mistake. That said, it is generally a good thing to be as conceptually aware of possible problems as we can be if we want any hope of fixing them.

References: Gelman, A. (2008). Objections to Bayesian statistics. Bayesian Analysis, 3, 445-450. DOI: 10.1214/08-BA318

Differentiating Between Effects And Functions

A few days ago, I had the misfortune of forgetting my iPod when I got to the gym. As it turns out, I hadn’t actually forgotten it; it had merely fallen out of my bag in the car and I hadn’t noticed, but the point is that I didn’t have it on me. Without the music that normally accompanies my workout, I found the experience to be far less enjoyable than it normally is; I would even go so far as to say that it was more difficult to lift what I normally do without much problem. When I mentioned the incident to a friend of mine, she expressed surprise that I actually managed to stick around to finish my workout without it; in fact, on the rare occasions I end up arriving at the gym without any source of music, I typically don’t end up working out at all, which demonstrates the point nicely.

“If you didn’t want that bar to be crushing your windpipe, you probably shouldn’t have forgotten your headphones…”

In my experience, listening to music most certainly has the effect of allowing me to enjoy my workout more and push myself harder. The question remains, however, as to whether such effects are part of the function of music; that is to ask, do we have some cognitive adaptation(s) designed to generate that outcome from certain given inputs? On a somewhat related note, I recently got around to reading George C. Williams’ book, Adaptation and Natural Selection (1966). While I had already been familiar with most of what he talked about, it never hurts to actually go back and read the classics. Throughout the book, Williams makes a lot of the above distinction between effects and functions; what we might also label byproducts and adaptations, respectively. A simple example demonstrates the point: while a pile of dung might serve as a valuable resource for certain species of insects, the animals which produce such dung are not doing so because it benefits the insects; the effect in this case (benefiting insects) is not the function of the behavior (excreting wastes).

This is an important theoretical point; one which Williams repeatedly brings to bear against the group selection arguments that people were putting forth at the time he was writing. Just because populations of organisms tend to have relatively stable sizes – largely by virtue of available resources and predation – that effect does not imply there is a functional group-size-regulation adaptation actively generating that outcome. While effects might be suggestive of functions, or at least preliminary requirements for demonstrating function, they are not, on their own, sufficient evidence for them. Adapted functionality itself is often a difficult thing to demonstrate conclusively, which is why Williams offered his now famous quote about adaptation being an onerous concept.

This finally brings us to a recent paper by Dunbar et al (2012) in which the authors find an effect of performing music on pain tolerance; specifically, it’s the performance of music per se, not the act of passively listening to it, that results in an increased pain tolerance. While it’s certainly a neat effect, effects are a dime a dozen; the question of relevance would seem to be whether this effect bears on a possible function for music. While Dunbar et al (2012) seem to think it does, or at least that it might, I find myself disagreeing with that suggestion rather strongly; what they found strikes me more as an effect without any major theoretical implications.

If that criticism stings too much, might I recommend some vigorous singing?

First, a quick overview of the paper: subjects were tested twice for their pain tolerance (as measured by the time people could stand the application of increasing pressure or holding cold objects), both before and after a situation in which they either performed music (singing, drumming, dancing, or practicing) or listened to it (varying the tempo of the music). In most cases it was the active performance of music which led to a subsequent increase in pain tolerance, rather than listening. The exception to that set of findings was that the groups that were simply practicing in a band setting did not show this increase, a finding which Dunbar et al (2012) suggest has to do with the vigor, likely the physical kind, with which the musicians were engaged in their task, not the performance of music per se.

Admittedly, that last point is rather strange from the point of view of trying to build a functional account for music. If it’s the physical activity that causes an increase in pain tolerance, that would not make the performance of music special relative to any other kind of physical activity. In other words, one might be able to make a functional account for pain sensitivity, but it would be orthogonal to music. For example, in their discussion, the authors also note that laughter can lead to an increase in pain tolerance as well. So really there isn’t much in this study that can speak to a function of music specifically. Taking this point further, Dunbar et al (2012) also fail to provide a good theoretical account as to how one goes from an increased pain tolerance following music production to increases in reproductive success. From my point of view, I’m still unclear as to why they bothered to examine the link between music production and pain in the first place (or, for that matter, why they included dancing, since, while dancing can accompany music, it is not itself a form of music, just as my exercise can accompany music without itself being music-related).

Dunbar et al (2012) also mention in passing at the end of their paper that music might help entrain synchronized behavior, which in turn could lead to increases in group cooperation which, presumably, they feel would be a good thing, adaptively speaking, for the individuals involved in said group. Why this is in the paper is also a bit confusing to me, since it appears to have nothing to do with anything they were talking about or researching up to that point. While it would appear to be, at least on the face of it, a possible theoretical account for a function of music (or at least a more plausible one than their non-existent reason for examining pain tolerance), nothing in the paper seems to directly or indirectly speak to it.

And believe you me, I know a thing or two about not being spoken to…

While this paper serves as an excellent example of some of the difficulties in going from effect to function, another point worth bearing in mind is how little gets added to this account by sketching out the underlying physical substrates through which this effect is generated. Large sections of the Dunbar et al paper are dedicated to these physiological outlines of the effect without much apparent payoff. Don’t get me wrong: I’m not saying that exploring the physiological pathways through which adaptations act is a useless endeavor; it’s just that such sketches do not add anything to an account that’s already deficient in the first place. They’re the icing on top of the cake; not its substance. Physiological accounts, while they can be neat if they’re your thing, are not sufficient for demonstrating functionality for exactly the same reasons that effects aren’t; all physiological accounts are, essentially, detailed accounts of effects, and byproducts and adaptations alike both have effects.

While this review of the paper itself might have been cursory, there are some valuable lessons to learn from it: (1) always try and start your research with some clearly stated theoretical basis, (2) finding effects does not mean you’ve found a function, (3) sketching effects in greater detail at a physiological level does not always help for developing a functional account, and (4) try and make sure the research you’re doing maps onto your theoretical basis, as tacking on an unrelated functional account at the end of your paper is not good policy; that account should come first, not as an afterthought.

References: Dunbar, R.I., Kaskatis, K., Macdonald, I., & Barra, V. (2012). Performance of music elevates pain threshold and positive affect: Implications for the evolutionary function of music. Evolutionary Psychology, 10(4), 688-702. PMID: 23089077

Williams, G.C. (1966). Adaptation and natural selection: A critique of some current evolutionary thought. Princeton, NJ: Princeton University Press.

No, Really; Domain General Mechanisms Don’t Work (Either)

Let’s entertain a hypothetical situation in which your life path had led you down the road to becoming a plumber. Being a plumber, your livelihood depends on both knowing how to fix certain plumbing-related problems and having the right tools for getting the job done: these tools would include a plunger, a snake, and a pair of clothes you don’t mind not wearing again. Now let’s contrast being a plumber with being an electrician. Being an electrician also involves specific knowledge and the right tools, but those sets do not overlap well with those of the plumber (I think, anyway; I don’t know too much about either profession, but you get the idea). A plumber that shows up for their job with a soldering iron and wire-strippers is going to be seriously disadvantaged at getting that job done, just as a plunger and a snake are going to be relatively ineffective at helping you wire up the circuits in a house. The same can be said for your knowledge bases as well: knowing how to fix a clogged drain will not tell you much about how to wire a circuit, and vice versa.

Given that these two jobs make very different demands, it would be surprising indeed to find a set of tools and knowledge that worked equally well for both. If you wanted to branch out from being a plumber to also being an electrician, you would subsequently need new additional tools and training.

And/Or a very forgiving homeowner’s insurance policy…

Of course, there is not always, or even often, a 1-to-1 relationship between the intended function of a tool and the applications towards which it can be put. For example, if your job involves driving in a screw and you happen to not have a screwdriver handy, you could improvise and use, say, a knife’s blade to turn the screw as well. That a knife can be used in such a fashion, however, does not mean it would be preferable to do away with screwdrivers altogether and just carry knives instead. As anyone who has ever attempted such a stunt can attest, this is because knives often do not make doing the job very quick or easy; they’re generally inefficient at achieving that goal, given their design features, relative to a more functionally-specific tool. While a knife might work well as a cutting tool and less well as a screwdriver, it would function even worse still if used as a hammer. What we see here is that as tools become more efficient at one type of task, they often become less efficient at others, to the extent that those tasks do not overlap in terms of their demands. This is why it’s basically impossible to design a tool that simply “does useful things”; the request is massively underspecified, and the demands of one task often do not correlate highly with the demands of another. You first need to narrow the request by defining what those useful things are that you’re trying to do, and then figure out ways of effectively achieving your more specific goals.

It should have been apparent well before this point that my interest is not in jobs and tools per se, but rather in how these examples can be used to understand the functional design of the mind. I previously touched briefly on why it would be a mistake to assume that domain-general mechanisms would lead to plasticity in behavior. Today I hope to expand on that point and explain why we should not expect domain-general mechanisms – cognitive tools that are supposed to be jacks-of-all-trades and masters of none – to even exist. This will largely be accomplished by pointing out some of the ways that Chiappe & MacDonald (2005) err in their analysis of domain-general and domain-specific modules. While there is a lot wrong with their paper, I will only focus on certain key conceptual issues, the first of which involves the idea, again, that domain-specific mechanisms are incapable of dealing with novelty (in much the same way that a butter knife is clearly incapable of doing anything that doesn’t involve cutting and spreading butter).

Chiappe & MacDonald claim that a modular design in the mind should imply inflexibility: specifically, that organisms with modular minds should be unable to solve novel problems or solve non-novel problems in novel ways. A major problem that Chiappe & MacDonald’s account encounters is a failure to recognize that all problems organisms face are novel, strictly speaking. To clarify that point, consider a predator/prey relationship: while rabbits might be adapted for avoiding being killed by foxes, generally speaking, no rabbit alive today is adapted to avoid being killed by any contemporary fox. These predator-avoidance systems were all designed by selection pressures on past rabbit populations. Each fox that a rabbit encounters in its life is a novel fox, and each situation that fox is encountered in is a novel situation. However, since there are statistical similarities between past foxes and contemporary ones, as well as between the situations in which they’re encountered, these systems can still respond to novel stimuli effectively. This evaporates the novelty concern rather quickly; domain-specific modules can, in fact, only solve novel problems, since novel problems are the only kinds of problems that an organism will encounter. How well they will solve those problems will depend in large part on how much overlap there is between past and current scenarios.

Swing and a miss, novelty problem…

A second large problem in the account involves the lack of distinction on the part of Chiappe and MacDonald between the specificity of inputs and the specificity of functions. For example, the authors suggest that our abilities for working memory should be classified as domain-general because many different kinds of information can be stored in working memory. This strikes me as a rather silly argument, as it could be used to classify all cognitive mechanisms as domain-general. Let’s return to our knife example; a knife can be used for cutting all sorts of items: bread, fabric, wood, bodies, hair, paper, and so on. From this, we could conclude that a knife is a domain-general tool, since its function can be applied to a wide variety of problems that all involve cutting. On the other hand, as mentioned previously, the list of things a knife can do efficiently is far shorter than the list of things it can’t: knives are awful hammers, fire extinguishers, water purifiers, and information-storage devices. The knife has a relatively specific function which can be effectively applied to many problems that all require the same general solution – cutting (provided, of course, the materials are able to be cut by the knife itself. That I might wish to cut through a steel door does not mean my kitchen knife is up to the task). To tie this back to working memory, our cognitive systems that dabble in working memory might be efficient at holding many different sorts of information in short-term memory, but they’d be worthless at doing things like regulating breathing, perceiving the world, deciphering meaning, or almost any other task. While the system can accept a certain range of different kinds of inputs, its function remains constant and domain-specific.

Finally, there is the largest issue their model encounters. I’ll let Chiappe & MacDonald spell it out themselves:

A basic problem [with domain-general modules] is that there are no problems that the system was designed to solve. The system has no preset goals and no way to determine when goals are achieved, an example of the frame problem discussed by cognitive scientists…This is the problem of relevance – the problem of determining which problems are relevant and what actions are relevant for solving them. (p.7)

Though they mention this problem in the beginning of their paper, the authors never actually take any steps to address that series of rather large issues. No part of their account deals with how their hypothetical domain-general mechanisms generate solutions to novel problems. As far as I can tell, you could replace the processes by which their domain-general mechanisms identify problems, figure out which information is and isn’t useful in solving said problems, figure out how to use that information to solve the problems, and figure out when the problem has been solved, with the phrase “by magic” and not really affect the quality of their account much. Perhaps “replace” is the wrong word, however, as they don’t actually put forth any specifics as to how these tasks are accomplished under their perspective. The closest they seem to come is when they write things along the lines of “learning happens” or “information is combined and manipulated” or “solutions are generated”. Unfortunately for their model, leaving it at that is not good enough.

A lesson that I thought South Park taught us long time ago.

In summary, their novelty problem isn’t one, their “domain-general” systems are not general-purpose at the functional level at all, and the ever-present framing problem is ignored, rather than addressed. That does not leave much of an account left. While, as the authors suggest, being able to adaptively respond to non-recurrent features in our environment would probably be, well, adaptive, so would the ability to allow our lungs to become more “general-purpose” in the event we found ourselves having to breathe underwater. Just because such abilities would be adaptive, however, does not mean that they will exist.

As the classic quote goes, there are far more ways of being dead than there are of being alive. Similarly, there are far more ways of not generating adaptive behavior than there are of behaving adaptively. Domain-general information processors that don’t “know” what to do with the information they receive will tend to get things wrong far more often than they’ll get them right on those simple statistical grounds. Sure, domain-specific information processors won’t always get the right answer either, but the pressing question is, “compared to what?”. If that comparison is made to a general-purpose mechanism, then there wouldn’t appear to be much of a contest.

References: Chiappe, D., & MacDonald, K. (2005). The evolution of domain-general mechanisms in intelligence and learning. The Journal of General Psychology, 132(1), 5-40. DOI: 10.3200/GENP.132.1.5-40