The Sometimes Significant Effects Of Sexism

Earlier this week I got an email from a reader, Patrick, who recommended I review a paper entitled, “More than “just a joke”: The prejudice-releasing function of sexist humor” by Ford et al (2007). As I happen to find discussions of sexism quite interesting and this article reviewable, I’m happy to share some of my thoughts about it. I would like to start by noting that the title of the article is a bit off-putting for me due to the authors’ use of the word “function”. While I’m all for more psychological research taking a functionalist perspective, their phrasing would, at least in my academic circle, carry the implication that the use of sexist humor evolved because of its role in releasing prejudice, and I’m fairly certain that is not the conclusion that Ford et al (2007) intended to convey. Though Ford et al (2007) manage to come close to something resembling a functionalist account (there is some mention of costs and the avoiding of them), it’s far from close enough for the coveted label of “function” to be applied. Until their account has been improved in that regard, I feel the urge to defend my academic semantic turf.

So grab your knife and best dancing shoes.

Concerning the study itself, Ford et al (2007) sought to demonstrate that sexist humor would lead men who were high in “hostile sexism” to act in a discriminatory fashion towards women. Broadly speaking, the authors suggest that people who hold sexist beliefs often try to suppress the expression of those beliefs in the hopes of avoiding condemnation by others who are less sexist; however, when condemnation is perceived to be unlikely, those who hold sexist beliefs will stop suppressing them, at least to some extent. The authors further suggest that humor can serve to create an atmosphere where condemnation of socially unpopular views is perceived as less likely. Stringing this all together, we end up with the conclusion that sexist humor can create an environment that sexist men will perceive as more welcoming for their sexist attitudes, which they will subsequently be more likely to express.

Across two studies, Ford et al (2007) found support for this conclusion. In the first study, male subjects’ hostile sexism scores were assessed through the “Ambivalent Sexism Inventory” 2 to 4 weeks prior to the study. The study itself involved presenting the males with one of three vignettes that included sexist jokes, sexist statements, or neutral jokes, followed by asking them how much they would hypothetically be willing to donate to a woman’s organization. The results showed that while measures of hostile sexism alone did not predict how much men were willing to donate, when confronted with sexist humor, those men who scored higher on the sexism measure tended to donate less of their hypothetical $20 to a woman’s group. Further, neither the sexist statements nor the neutral joke condition had any effect on a man’s willingness to donate, regardless of his sexism score. In fact, though it was not significant, men who scored higher in hostile sexism tended to donate more to a woman’s group relative to those who scored low, following the sexist statements.

There are two important points to consider regarding this first set of findings. The first of these points relates to the sexist statement condition: if the mechanism through which Ford et al (2007) are proposing hostile sexism becomes acted upon is the perception of tolerance of sexist beliefs, the sexist statements condition is rather strange. In that condition, it would appear rather unambiguous that there is a local norm for the acceptance of sexism against women, yet the men high in sexism don’t seem to “release” theirs. This is a point that the authors don’t engage with, and that seems like a rather major oversight. My guess is that the point isn’t engaged with because it would be rather difficult for the authors’ model to account for, but perhaps they had some other reason for not considering it (even though it could have easily and profitably been examined in their second study). The second point that I wanted to deal with concerns the way in which Ford et al (2007) seem to write about sexism. Like many people (presumably) concerned about sexism, they only appear concerned with one specific type of sexism: the type where men appear biased against women.

“If he didn’t want to get hit, he shouldn’t have provoked her!”

For instance, in the first study, the authors report, truthfully, that men higher in hostile sexism tended to donate less to a woman’s group than men lower in hostile sexism did. What they do not explicitly mention is how those two groups compare to a control: the neutral joke condition. Regardless of one’s sexism score, people donated equally in the neutral condition. In the sexist joke condition, those men high in hostile sexism donated less, relative to the neutral condition; on the other hand, those men low in hostile sexism donated more in the sexist humor condition, relative to the neutral control. While the former is taken as an indication of a bias against women, the latter is not discussed as a bias in favor of women. As I’ve written about before, only discussing one set of biases does not appear uncommon when it comes to sexism research, and I happen to find that peculiar. This entire study is dedicated towards looking at the result of ostensibly sexist attitudes held by men against women; there is no condition where women’s biases (either towards men or women) are assessed before and after hearing sexist jokes about men. Again, this is a rather odd omission if Ford et al (2007) are seeking to study gender biases (that is, unless the costs for expressing sexist beliefs are lower for women, though this point is never talked about either). The authors do at least mention in a postscript that women’s results on the hostile sexism scale don’t predict their attitudes towards other women, which kind of calls into question what this measure of “hostile sexism” is actually supposed to be measuring (but more on that later).

The second study had a few (only 15 per condition) male subjects watching sexist or non-sexist comedic skits in small groups, after which they were asked to fill out a measure concerning how they would allocate a 20% budget cut among 5 different school groups, one of which was a woman’s group (the others were an African American, Jewish, study abroad, and Safe Arrival for Everyone). Following this, subjects were asked how people in their group might approve of their budget cuts, as well as how others outside of the group might approve of their cuts. As before, those who were higher in hostile sexism were more likely to reduce more of the budget of the woman’s group, but only in the sexist joke condition. Those with higher hostile sexism scores were also more likely to deduct more money from the African American group as well, but only in the neutral humor condition, though little is said about this effect (the authors do mention it is unlikely to be driven by the same mechanism; I think it might just reflect chance). Those in the high sexism, sexist humor group were also likely to believe that others in their condition would approve of their budget reductions to the woman’s group, though they were no more likely to think students at large would approve of their budget cuts than other groups.

The sample size for this second study is a bit concerning. There were 30 subjects total, across two groups, and the sample was further divided by sexism scores. If we assume there were equal numbers of subjects with high and low sexism scores, we’re looking at only about 7 or 8 subjects per condition, and that’s only if we divide the sample by sexism scores above and below the mid-point. I can’t think of a good reason for collecting such a small sample, and I have some concerns that it might reflect data-peeking, though I have no evidence that it does. Nevertheless, the authors make a lot of the idea that subjects higher in sexism felt there was more consensus about the local approval rating of their budget cuts, but only in the sexist humor condition; that is, they might have perceived the local norms about sexism to be less condemning of sexist behavior following the sexist jokes. As I mentioned before, it would have been a good idea for them to test their mechanism using other, non-humor conditions, such as the sexist statements they used initially and subsequently dropped. There’s not much more to say about that except that Ford et al (2007) mention in their introduction that statement manipulations seemed to work as a releaser for racial bias, without mentioning why they didn’t work for gender.

So it might be safer to direct any derogatory comments you have to the right…

I would like to talk a bit more about the Ambivalent Sexism Inventory before finishing up. I took the test online in order to see what items were being used as research for this post (and you can as well, by following the link above) and I have some reservations as to what precisely it’s measuring. Rather than measuring sexism, per se, the hostile portion of the inventory appears to deal, at least in part, with whether or not one agrees with certain feminist ideas. For instance, two questions which stand out as being explicit about this are (1) “feminists are making entirely reasonable demands of men”, and (2) “feminists are not seeking for women to have more power than men”. Not only do such questions not necessarily reflect one’s views of women more generally (provided one can be said to have a view of women more generally), but they are so hopelessly vague in their wording that they can be interpreted to have an unacceptably wide range of meanings. As the two previous studies and the footnote demonstrate, there doesn’t seem to be a consistent main effect of one’s score on this test, so I have reservations as to whether it’s really tapping sexism per se.

The other part of the sexism inventory involves what is known as “benevolent sexism” – essentially the notion that men ought to do things for women, like gain their affection or protect and provide for them, or that women are in some respects “better” than men. As the website with the survey helpfully informs us, men and women don’t seem to differ substantially in this type of sexism. However, this type of benevolent sexism is also framed as sexism against women that could turn “ugly” for them; not as sexism directed against men, which I find curious, given certain questions (such as, “Women, compared to men, tend to have a superior moral sensibility.” or “Men should be willing to sacrifice their own well being in order to provide financially for the women in their lives.”). Since this is already getting long, I’ll just wonder aloud why no data from the measures of benevolent sexism appear in this paper anywhere, given that the authors appear to have collected it.

References: Ford, T., Boxer, C., Armstrong, J., & Edel, J. (2007). More Than “Just a Joke”: The Prejudice-Releasing Function of Sexist Humor. Personality and Social Psychology Bulletin, 34 (2), 159-170. DOI: 10.1177/0146167207310022


Does “Statistical Significance” Imply “Actually Significant”?

P-values below 0.05; the finding and reporting of these values might be considered the backbone of most psychological research. Conceptually, these values are supposed to represent the notion that, if the null hypothesis is true, the odds of observing some set of results are under 5%. As such, if one observes a result unlikely to be obtained by chance, this would seem to carry the implication that the null hypothesis is unlikely to be true and there are likely real differences between the group means under examination. Despite null hypothesis significance testing becoming the standard means of statistical testing in psychology, the method is not without its flaws, both on the conceptual and practical levels. According to a paper by Simmons et al (2011), on the practical end of things, some of the ways in which researchers are able to selectively collect and analyze data can dramatically inflate the odds of obtaining a statistically significant result.

Don’t worry though; it probably won’t blow up in your face until much later in your career.

Before getting to their paper, it’s worth covering some of the conceptual issues inherent with null hypothesis significance testing, as the practical issues can be said to apply just as well to other kinds of statistical testing. Brandstaetter (1999) raises two large concerns about null hypothesis significance testing, though really they’re more like two parts of the same concern, and, ironically enough, almost sound as if they’re opposing points. The first part of this concern is that classic significance testing does not tell us whether the results we observed came from a sample with a mean that was actually different from the null hypothesis. In other words, a statistically significant result does not tell us that the null hypothesis is false; in fact, it doesn’t even tell us the null hypothesis is unlikely. According to Brandstaetter (1999), this is due to the logic underlying significance testing being invalid. The specific example that Brandstaetter uses references the rolling of dice: if you roll a twenty-sided die, it’s unlikely (5%) that you will observe a 1; however, if you observe a 1, it doesn’t follow that it’s unlikely you rolled the die.

While that example addresses null hypothesis testing at a strictly logical level, this objection can be dealt with fairly easily, I feel: in Brandstaetter’s example, the hypothesis that one would be testing is not “the die was rolled”, so that specific example seems a bit strange. If you were comparing the heights of two different groups (say, men and women), and you found one group was, in your sample, an average of six inches taller than the other, it might be reasonable to conclude that it’s unlikely that the population means that the two samples come from are the same. This is where the second part of the criticism comes into play: in reality, the means of different groups are almost guaranteed to be different in some way, no matter how small or large that difference is. This means that, strictly speaking, the null hypothesis (there is no mean difference) is pretty much always false; the matter then becomes whether your test has enough power to reach statistical significance, and increasing your sample size can generally do the trick in that regard. So, in addition to not telling us about whether the null hypothesis is true or false, the best that this kind of significance testing can do is tell us a specific value that a population mean is not. However, since there are an infinite number of possible values that a population mean could hypothetically take, the value of this information may be minimal.
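To make the sample-size point concrete, here is a minimal sketch rather than anything taken from Brandstaetter (1999). It assumes a simple two-group z-test with a known, shared standard deviation, and it treats a hypothetical observed difference of 0.1 inches (with a standard deviation of 3 inches) as fixed while the number of people per group grows. The function name and all of the numbers are illustrative assumptions.

```python
import math


def two_sided_p(mean_diff, sd, n_per_group):
    """Two-sided p-value for a z-test comparing two group means.

    Assumes both groups share a known standard deviation `sd` and contain
    `n_per_group` observations each (a simplification for illustration).
    """
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical numbers: the same tiny 0.1 inch difference at three sample sizes.
for n in (100, 10_000, 1_000_000):
    print(n, two_sided_p(0.1, 3.0, n))
```

The difference itself never changes in this sketch; only the sample size does, and with it the p-value shrinks.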

Even in the best of times, then, significance testing has some rather definite conceptual concerns. These two conceptual issues, however, seem to be overshadowed in importance by the practical issues that arise during the conducting of research; what Simmons et al (2011) call “researcher degrees of freedom”. This term is designed to capture some of the various decisions that researchers might make over the course of collecting and analyzing data while hunting for statistically significant results capable of being published. As publications are important for any researcher’s career, and statistically significant results are the kind that are most likely to be published (or so I’ve been told), this combination of pressures can lead to researchers making choices – albeit not typically malicious ones – that increase their chances of finding such results.

“There’s a significant p-value in this mountain of data somewhere, I tell you!”

Simmons et al (2011) began by generating random samples all pulled from a normal distribution across 15,000 independent simulations. Since they were testing for how often statistically significant effects were found, if they were using classic significance testing, that rate should not tend to exceed 5%. When there were two dependent measures capable of being analyzed (in their example, these were willingness to pay and liking), the ability to analyze these two measures separately or in combination nearly doubled the chances of finding a statistically significant “effect” at the 0.05 level. That is to say, the odds of finding an effect by chance were no longer 5%, but closer to 10%. A similar effect was found when the researchers had the option of controlling for gender. This makes intuitive sense, as it’s basically the same manipulation as the former two-measure case, just with a different label.
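The two-measure case can be sketched with a small simulation. This is not Simmons et al’s code; it is a rough illustration under assumptions of my own choosing: 20 subjects per group, two measures that correlate at 0.5, an equal-variance t-test, and 4,000 simulated studies rather than 15,000 to keep it quick. A “flexible” researcher counts a study as a success if the first measure, the second measure, or their average reaches significance.

```python
import math
import random

T_CRIT_DF38 = 2.024  # two-tailed .05 critical t value for 38 degrees of freedom


def t_stat(a, b):
    """Equal-variance two-sample t statistic for two equal-sized groups."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / (n - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n - 1)
    pooled_var = (var_a + var_b) / 2.0
    return (mean_a - mean_b) / math.sqrt(pooled_var * 2.0 / n)


def correlated_pair(rng, r):
    """Draw two standard-normal measures that correlate at r."""
    x = rng.gauss(0, 1)
    y = r * x + math.sqrt(1 - r * r) * rng.gauss(0, 1)
    return x, y


def simulate(n_sims=4000, n=20, r=0.5, seed=1):
    """Return (false-positive rate for one measure, rate for any of three tests)."""
    rng = random.Random(seed)
    single_hits = 0
    flexible_hits = 0
    for _ in range(n_sims):
        # Both groups come from the same population, so the null is true.
        g1 = [correlated_pair(rng, r) for _ in range(n)]
        g2 = [correlated_pair(rng, r) for _ in range(n)]
        picks = (
            lambda p: p[0],               # first measure only
            lambda p: p[1],               # second measure only
            lambda p: (p[0] + p[1]) / 2,  # average of the two
        )
        significant = [
            abs(t_stat([pick(p) for p in g1], [pick(p) for p in g2])) > T_CRIT_DF38
            for pick in picks
        ]
        single_hits += significant[0]
        flexible_hits += any(significant)
    return single_hits / n_sims, flexible_hits / n_sims


single_rate, flexible_rate = simulate()
print("one measure:", single_rate, "best of three analyses:", flexible_rate)
```

Because the first measure is one of the three options, the flexible rate can never be lower than the single-measure rate; the only question the simulation answers is how much higher it gets.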

There’s similar bad news for the peek-and-test method that some researchers make use of with their data. In these cases, a researcher will collect some number of subjects for each condition – say 20 – and conduct a test to see if they found an effect. If an effect is found, the researcher will stop collecting data; if the effect isn’t found, the researcher will then collect another number of observations per condition – say another 10 – and then retest for significance. A researcher’s ability to peek at their data increased the odds of finding an effect by chance up to about 8%. Finally, if the researcher decides to run multiple levels of a condition (Simmons et al’s example concerned splitting the sample into low, medium, and high conditions), the ability to selectively compare these conditions to each other brought the false positive rate up to 12.6%. Worryingly, if these four degrees of researcher freedom were combined, the odds of finding a false positive were as high as 60%; that is, the odds are better that you would find some effect strictly by chance than that you wouldn’t. While these results might have been statistically significant, they are not actually significant. This is a fine example of Brandstaetter’s (1999) initial point: a statistically significant result does not tell us that the null hypothesis is false or even unlikely, as the null hypothesis was true in all these cases.
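The peek-and-test procedure described above (test at 20 per condition, and add 10 more only if the first test fails) can be sketched the same way. Again, this is an illustrative simulation of my own and not the authors’ code; the function names, the seed, and the choice of 4,000 simulated studies are assumptions, and the critical values are hard-coded for exactly these two sample sizes.

```python
import math
import random

# Two-tailed .05 critical t values, keyed by observations per condition:
# 20 per condition gives df = 38, and 30 per condition gives df = 58.
T_CRIT = {20: 2.024, 30: 2.002}


def t_stat(a, b):
    """Equal-variance two-sample t statistic for two equal-sized groups."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / (n - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n - 1)
    return (mean_a - mean_b) / math.sqrt((var_a + var_b) / n)


def peeking_false_positive_rate(n_sims=4000, seed=1):
    """Share of null-effect studies declared significant when peeking once."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # Both conditions are drawn from the same distribution (null is true).
        a = [rng.gauss(0, 1) for _ in range(20)]
        b = [rng.gauss(0, 1) for _ in range(20)]
        if abs(t_stat(a, b)) > T_CRIT[20]:
            hits += 1  # "effect" found at the first look, so stop collecting
            continue
        # No effect yet: collect 10 more per condition and test again.
        a += [rng.gauss(0, 1) for _ in range(10)]
        b += [rng.gauss(0, 1) for _ in range(10)]
        if abs(t_stat(a, b)) > T_CRIT[30]:
            hits += 1
    return hits / n_sims


print(peeking_false_positive_rate())
```

Each individual test still uses the nominal 5% threshold; the inflation comes entirely from getting a second chance whenever the first look comes up empty.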

As Simmons et al (2011) also note, this rate of false positives might even be conservative, given that there are other, unconsidered liberties that researchers can take. Making matters even worse, there’s the aforementioned publication bias, in that, at least as far as I’ve been led to believe, journals tend to favor publications that (a) find statistically significant results and (b) are novel in their design (i.e. journals tend to not publish replications). This means that when false positives are found, they’re both more likely to make their way into journals and less likely to subsequently be corrected. In turn, those false positives could lead to poor research outcomes, such as researchers wasting time and money chasing effects that are unlikely to be found again, or actually reinforcing the initial false positive in the event that they go chasing after it, find it again by chance, and publish it once more.

“With such a solid foundation, it’s difficult to see how this could have happened”

Simmons et al (2011) do put forth some suggestions as to how these problems could begin to be remedied. While I think their suggestions are all, in the abstract, good ideas, they would likely also generate a good deal more paperwork for researchers to deal with, and I don’t know a researcher alive who craves more paperwork. While there might be some tradeoff, in this case, between some amount of paperwork and eventual research quality, there is one point that Simmons et al (2011) do not discuss when it comes to remedying this issue, and that’s the matter I have been writing about for some time: the inclusion of theory in research. In my experience, a typical paper in psychology will give one of two explicit reasons for its being conducted: (1) an effect was found previously, so the researchers are looking to either find it again (or not find it), or (2) the authors have a hunch they will find an effect. Without a real theoretical framework surrounding these research projects, there is little need to make sense of or actually explain a finding; one can simply say they discovered a “bias” or a “cognitive blindness” and leave it at that. While I can’t say how much of the false-positive problem could be dealt with by requiring the inclusion of some theoretical framework for understanding one’s results when submitting a manuscript, if any, I feel some theory requirement would still go a long way towards improving the quality of research that ends up getting published. It would encourage researchers to think more deeply about why they’re doing what they’re doing, as well as help readers to understand (and critique) the results they end up seeing. While dealing with false positives should certainly be a concern, merely cutting down on their appearance will not be enough to help research quality in psychology progress appreciably.

References: Brandstaetter (1999). Confidence intervals as an alternative to significance testing. Methods of Psychological Research Online, 4.

Simmons, J., Nelson, L., & Simonsohn, U. (2011). False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science, 22 (11), 1359-1366. DOI: 10.1177/0956797611417632

A Frequentist And A Bayesian Walk Into Infinity…

I’m going to preface this post by stating that statistics is not my primary area of expertise. Admittedly, this might not be the best way of generating interest, but non-expertise hasn’t seemed to stop many a teacher or writer, so I’m hoping it won’t be too much of a problem here. This non-expertise, however, has apparently also not stopped me from stumbling upon an interesting question concerning Bayesian statistics. Whether this conceptual problem I’ve been mulling over would actually prove to be a problem in real-world data collection is another matter entirely. Then again, there doesn’t appear to be a required link between academia and reality, so I won’t worry too much about that while I indulge in the pleasure of a little bit of philosophical play time.

The link between academia and reality is about as strong as the link between my degree and a good job.

So first, let’s run through a quick problem using Bayesian statistics. This is the classic example by which I was introduced to the idea: say that you’re a doctor trying to treat an infection that has broken out among a specific population of people. You happen to know that 5% of the people in this population are actually infected and you’re trying to figure out who those people are so you can at least quarantine them. Luckily for you, you happen to have a device that can test for the presence of this infection. If you use this device to test an individual who actually has the disease, it will come back positive 95% of the time; if the individual does not have the disease, it will come back positive 5% of the time. Given that an individual has tested positive for the disease, what is the probability that they actually have it? The answer, unintuitive to most, is 50%.

Though the odds of someone testing positive if they have the disease are high (95%), very few people actually have the disease (5%). So 5% of the 95% of the people who don’t have an infection will test positive and 95% of the 5% of people who do have an infection also will. Those two groups are the same size (each is 4.75% of the population), so a positive result is equally likely to have come from either one. In case that example ran by too quickly, here’s another brief video example using hipsters drinking beer over treating infection. This method of statistical testing would seem to have some distinct benefits: for example, it will tell you the probability of your hypothesis, given your data, rather than the probability of your data, given your hypothesis (which, I’m told, is what most people actually want to be calculating). That said, I see two (possibly major) conceptual issues with this type of statistical analysis. If anyone more versed in these matters feels they have good answers to them, I’d be happy to hear them in the comments section.
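For anyone who would rather check the arithmetic than take it on faith, the infection example can be worked through in a few lines. This is just Bayes’ formula applied to the numbers given above; the function and parameter names are my own labels, not anything standard.

```python
def posterior_given_positive(prior, true_positive_rate, false_positive_rate):
    """Probability of infection given a positive test, via Bayes' formula."""
    # Share of the population that is infected AND tests positive.
    infected_and_positive = prior * true_positive_rate
    # Share of the population that is healthy AND tests positive anyway.
    healthy_and_positive = (1 - prior) * false_positive_rate
    return infected_and_positive / (infected_and_positive + healthy_and_positive)


# Numbers from the example: 5% infected, 95% hit rate, 5% false alarms.
print(posterior_given_positive(0.05, 0.95, 0.05))
```

Changing the prior is the quickest way to see how much the answer depends on it: with the same device and a 50% infection rate, a positive test becomes far more convincing.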

The first issue was raised by Gelman (2008), who was discussing the usefulness of our prior knowledge. In the above examples, we know some information ahead of time (the prevalence of an infection or hipsters); in real life, we frequently don’t know this information; in fact, it’s often what we’re trying to estimate when we’re doing our hypothesis tests. This puts us in something of a bind when it comes to using Bayes’ formula. Lacking objective knowledge, one could use what are called subjective priors, which represent your own set of preexisting beliefs about how likely certain hypotheses are. Of course, subjective priors have two issues: first, they’re unlikely to be shared uniformly between people, and if your subjective beliefs are not my subjective beliefs, we’ll end up coming to two different conclusions given the same set of data. It’s also probably worth mentioning that subjective beliefs do not, to the best of my knowledge, actually affect the goings-on in the world: that I believe it’s highly probable it won’t rain tomorrow doesn’t matter; it either will or it won’t, and no amount of belief will change that. The second issue concerns the point of the hypothesis test; if you already have a strong prior belief about the truth of a hypothesis, for whatever reason you do, that would seem to suggest there’s little need for you to actually collect any new data.

On the plus side, doing research just got way easier!

One could attempt to get around this problem by using a subjective, but uninformative prior; that is, distribute your belief uniformly over your set of possible outcomes, or to enter into your data analysis with no preconceptions about how it’ll turn out. This might seem like a good solution to the problem, but it would also seem to make your priors all but useless. If you’re multiplying by the same constant, you can just drop it from your analysis. So it would seem in both cases, priors don’t do you a lot of good: they’re either strong, in which case you don’t need to collect more data, or uninformative, in which case they’re pointless to include in the analysis. Now perhaps there are good arguments to be made for subjective priors, but that’s not the primary point I hoped to address; my main criticism involves what’s known as the gambler’s fallacy.

This logical fallacy can be demonstrated with the following example: say you’re flipping a fair coin; given that this coin has come up heads 10 times in a row, what is the probability of a tails outcome on the next flip? The answer, of course, is 50%, as a fair coin is one that is unbiased with respect to which outcome will obtain when you flip it; the probability of a heads outcome using this coin is always as likely as a tails outcome. However, someone making the gambler’s fallacy will suggest that the coin is more likely to come up tails, as all the heads outcomes make the tails outcome feel more likely; as if a tails outcome is “due” to come up. This is incorrect, as each flip of this coin is independent of the other flips, so knowing what the previous outcomes of this coin have been tells you nothing about what the future outcomes of the coin will be, or, as others have put it, the coin has no memory. As I see it, Bayesian analysis could lead one to engaging in this fallacy (or, more precisely, something like the reverse gambler’s fallacy).
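The “coin has no memory” claim is easy to check by simulation. The sketch below is an illustration under my own assumptions: it uses Python’s pseudo-random generator as the fair coin, and it looks at streaks of 5 heads rather than 10 simply so that enough streaks show up in 200,000 flips to be worth counting.

```python
import random


def tails_rate_after_streak(n_flips=200_000, streak=5, seed=1):
    """Share of tails among flips that immediately follow `streak` or more heads."""
    rng = random.Random(seed)
    run = 0            # current number of consecutive heads
    opportunities = 0  # flips made right after a long-enough run of heads
    tails = 0          # how many of those flips came up tails
    for _ in range(n_flips):
        is_heads = rng.random() < 0.5
        if run >= streak:
            opportunities += 1
            tails += not is_heads
        run = run + 1 if is_heads else 0
    return tails / opportunities


print(tails_rate_after_streak())
```

If tails really were “due” after a run of heads, this rate would sit noticeably above one half.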

Here’s the example I’ve been thinking about: consider that you have a fair coin and an infinite stretch of time over which you’ll be flipping it. Long strings of heads or tails outcomes (say 10,000 in a row, or even 1,000,000 and beyond in a row) are certainly improbable, but given an infinite amount of time, they become inevitable outcomes that will obtain eventually. Now, if you’re a good Bayesian, you’ll update your posterior beliefs following each outcome. In essence, after a coin comes up heads, you’ll be more likely to think that it will come up heads on the subsequent flip; since heads have been coming up, more heads are due to come up. Essentially, you’ll be suggesting that these independent events are not actually independent of each other, at least with respect to your posterior beliefs. Given these long strings of heads and tails which will inevitably crop up, over time you will go from believing the coin is fair, to believing that it is nearly completely biased towards heads (or towards tails), and back again.
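To see what “updating after each outcome” looks like in numbers, here is a minimal sketch. It assumes one common textbook setup that the post itself does not specify: a Beta prior over the coin’s probability of heads, where the prior acts like a count of imaginary heads and tails already seen. The function name and the example counts are hypothetical.

```python
def posterior_mean_heads(prior_heads, prior_tails, heads, tails):
    """Posterior mean of P(heads) under a Beta(prior_heads, prior_tails) prior.

    Assumes a Beta-Binomial model: observed flips are simply added to the
    prior's pseudo-counts.
    """
    return (prior_heads + heads) / (prior_heads + prior_tails + heads + tails)


# Uniform prior, Beta(1, 1), before any flips have been seen.
print(posterior_mean_heads(1, 1, 0, 0))

# The same prior after a run of 10 heads and no tails.
print(posterior_mean_heads(1, 1, 10, 0))

# The same run of 10 heads arriving after 500 heads and 500 tails.
print(posterior_mean_heads(1, 1, 510, 500))
```

Under this particular model, how far a streak moves the estimate depends on how many flips came before it, which bears on how large the streaks in the thought experiment would need to be.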

Though your beliefs about the world can never have enough pairs of flip-flops…

It seems to me, then, that if you want to more accurately estimate the parameter (in this case, the fairness of the coin), you want some statistical test that will, to some extent, try and take into account data that you did not obtain, but might have: what might have happened if I flipped the coin another X number of times? This is, generally speaking, anathema to Bayesian statisticians as I understand them, who only concern themselves with the data that was collected. Of course, that does raise the question of how one can accurately predict what data they might have obtained, but did not, for which I don’t have a good answer. There’s also the matter of precisely how large of a problem this hypothetical example poses for Bayesian statistics when you’re not dealing with an infinite number of random observations; in the real world, this conceptual problem might not be much of one as these events are highly improbable, so it’s rare that anyone will actually end up making this kind of mistake. That said, it is generally a good thing to be as conceptually aware of possible problems as we can be if we want any hope of fixing them.

References: Gelman, A. (2008). Objections to Bayesian statistics. Bayesian Analysis, 3, 445-450. DOI: 10.1214/08-BA318

(Not So) Simple Jury Persuasion: Beauty And Guilt

It should come as no shock to anyone, really, that people have all sorts of interesting cognitive biases. Finding and describing these biases would seem to make up a healthy portion of the research in psychology, and one can really make a name for themselves if the cognitive bias they find happens to be particularly cute. Despite this well-accepted description of the goings-on in the human mind (it’s frequently biased), most research in the field of psychology tends to overlook, explicitly or implicitly, those ever-important “why” questions concerning said biases; the paper by Herrera et al (2012) that I’ll be writing about today (and that the Jury Room covered recently) is no exception, but we’ll deal with that in a minute. Before I get to this paper, I would like to talk briefly about why we should expect cognitive biases in the most general terms.

Hypothesis 1: Haters gonna hate?

When it comes to the way our mind perceives and processes information, one might consider two possible goals for those perceptions: (1) being accurate – i.e. perceiving the world in an “objective” or “correct” way – or (2) doing (evolutionarily) useful things. A point worth bearing in mind is that the latter goal is the only possible route by which any cognitive adaptation could evolve; a cognitive mechanism that did not eventually result in a reproductive advantage would, unsurprisingly, not be likely to spread throughout the population. That’s most certainly not to say that accuracy doesn’t matter; it does, without question. However, accuracy is only important insomuch as it leads to doing useful things. Accuracy for accuracy’s sake is not even a potential selection pressure that could shape our psychology. While, generally speaking, having accurate perceptions can often lead towards adaptive ends, when those two goals are in conflict, we should expect doing useful things to win every time, and, when that happens, we should see a cognitive bias as the result.

A quick example can drive this point home: your very good friend finds himself in conflict with a complete stranger. You have arrived late to the scene, so you only have your friend’s word and the word of the stranger as to what’s going on. If you were an objectively accurate type, you might take the time to listen to both of their stories carefully, do your best to figure out how credible each party is, find out who was harmed and how much, and find the “real” victim in the altercation. Then, you might decide whether or not to get involved on the basis of that information. Now that may sound all well and good, but if you opt for this route you also run the risk of jeopardizing your friendship to help out a stranger, and losing the benefits of that friendship is a cost. Suffering that cost would, all things considered, be an evolutionarily “bad” thing, even if uninvolved parties might consider it to be the morally correct action (skirting for the moment the possibility of costs that other parties might impose, though avoiding those could easily be fit in the “doing useful things” side of the equation). This suggests that, all else being equal, there should be some bias that pushes people towards siding with their friends, as siding against them is a costlier alternative.

So where all this leads us is the conclusion that when you see someone proposing that a cognitive bias exists, they are, implicitly or explicitly, suggesting that there is a conflict between accuracy and some cost of that accuracy, be that a conflict over behaving in a way that generates an adaptive outcome, a trade-off between the cognitive costs of computation and accuracy, or anything else. With that out of the way, we can now consider the paper by Herrera et al (2012) that purports to find a strange cognitive bias in the interaction of (a) perceptions of women’s credibility, responsibility, and control of the situation in cases of domestic violence, (b) those women’s physical attractiveness, and (c) their prototypicality as victims. According to their results, attractiveness might not always be a good thing.

Though, let’s face it, attractiveness is, on the whole, typically a good thing.

In their study, Herrera et al (2012) recruited a sample of 169 police officers (153 of whom were men) from various regions of Spain. They were divided into four groups, each of which read a different vignette about a hypothetical woman who had filed a self-defense plea for killing her husband by stabbing him in the back several times, citing a history of domestic abuse and a fear that he would have killed her during an argument. The woman in these stories – Maria – was described as either attractive or unattractive (no pictures were actually included) along the following lines: thick versus thin lips, smooth features versus stern and jarring ones, straight blonde hair versus dark bundled hair, and slender versus non-slender appearance. In terms of whether Maria was a prototypical battered woman, she was described either as having 2 children, no job with an income, hiding her face during the trial, being poorly dressed, and being timid in answering questions, or as having no children, a well-paying job, being well dressed, and being resolute in her interactions.

Working under the assumption that these manipulations are valid (I feel they would have done better to have used actual pictures of women rather than brief written descriptions, but they didn’t), the authors found an interesting interaction: when Maria was attractive and prototypical, she was rated as being more credible than when she was unattractive and prototypical (4.18 vs 3.30 out of 7). The opposite pattern held when Maria was not prototypical; here, attractive Maria was rated as being less credible than her unattractive counterpart (3.72 vs 3.85). So, whether attractiveness was a good or a bad thing for Maria’s credibility depended on how well she otherwise met some criteria for your typical victim of domestic abuse. On the other hand, more responsibility was attributed to Maria for the purported abuse when she was attractive overall (5.42 for attractive, 5.99 for unattractive).
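For anyone who wants to see the shape of that credibility interaction laid out, here is a minimal sketch that just does the arithmetic on the four means quoted above. The function and variable names are my own, made up for illustration; nothing here comes from the authors’ analysis, and these are raw differences between means that say nothing about statistical significance.

```python
# Mean credibility ratings (7-point scale) as quoted in the text above.
# Keys are (prototypicality, attractiveness); the labels are illustrative.
credibility = {
    ("prototypical", "attractive"): 4.18,
    ("prototypical", "unattractive"): 3.30,
    ("non-prototypical", "attractive"): 3.72,
    ("non-prototypical", "unattractive"): 3.85,
}


def attractiveness_effect(means, prototypicality):
    """Attractive minus unattractive mean within one prototypicality condition."""
    return (means[(prototypicality, "attractive")]
            - means[(prototypicality, "unattractive")])


def interaction_contrast(means):
    """Difference of differences: how much the attractiveness effect
    changes between the prototypical and non-prototypical conditions."""
    return (attractiveness_effect(means, "prototypical")
            - attractiveness_effect(means, "non-prototypical"))


if __name__ == "__main__":
    for condition in ("prototypical", "non-prototypical"):
        effect = attractiveness_effect(credibility, condition)
        print(f"{condition}: attractiveness effect = {effect:+.2f}")
    print(f"interaction contrast = {interaction_contrast(credibility):+.2f}")
```

The sign of the attractiveness effect flips between the two conditions (positive for prototypical Maria, slightly negative for non-prototypical Maria), which is what makes this an interaction rather than a simple main effect of attractiveness.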

Herrera et al (2012) attempt to explain the attractiveness portion of their results by suggesting that attractiveness might not fit in with the prototypical picture of a female victim of domestic abuse, which results in less lenient judgments of her behavior. It seems to me this explanation could have been tested with the data they collected, but they either failed to do so or did and did not find significant results. More to the point, this explanation is admittedly strange, given that attractive women were also rated as more credible when they were otherwise prototypical, and the authors’ proximate explanation should, it seems, predict precisely the opposite pattern in that regard. Perhaps they might have ended up with a more convincing explanation for their results had their research been guided by some theory as to why we should see these biases with regard to attractiveness (i.e. what the conflict in perception is being driven by), but it was not.

I mean, it seems like a handicap to me, but maybe you’ll find something worthwhile…

There was one final comment in the paper I would like to briefly consider, concerning what the authors regard as two fundamental due process requirements in cases of women’s domestic abuse: (1) the presumption of innocence on the part of the woman making the claim of abuse and (2) the woman’s right to a fair hearing without the risk of revictimization; revictimization, in this case, referring to instances where the woman’s claims are doubted and her motives are called into question. What is interesting about that claim is that it would seem to set up an apparently unnoticed or unmentioned double standard: it would seem to imply that women making claims of abuse are supposed to be, by default, believed; this would seem to do violence to the right that the potential perpetrator is supposed to have with regard to their own presumption of innocence. Given that part of the focus of this research is on the matter of credibility, this unmentioned double standard seems out of place. This apparent oversight might have to do with the fact that this research was only examining moral claims made by a hypothetical woman, rather than also examining a competing claim made by a man, but it’s hard to say for sure.

References: Herrera, A., Valor-Segura, I., & Expósito, F. (2012). Is Miss Sympathy a Credible Defendant Alleging Intimate Partner Violence in a Trial for Murder? The European Journal of Psychology Applied to Legal Context, 4, 179-196.