More About Psychology Research Replicating

By now, many of you have no doubt heard about the Reproducibility Project, in which 100 published psychological findings were subjected to replication attempts. In case you’re not familiar with it, the results of the project were less than a ringing endorsement of research in the field: of the 89 replications that would be expected if every original effect were true, only 37 were obtained, and the average size of the effects fell dramatically; social psychology research in particular seemed uniquely bad in this regard. This suggests that, in many cases, one would be well served by taking psychological findings with a couple grains of salt. Naturally, this leads many people to wonder whether there’s any way they might become more confident that an effect is real, so to speak. One possible means through which your confidence might be bolstered is whether or not the research in question contains conceptual replications. This refers to cases where the authors of a manuscript report the results of several different studies purporting to measure the same underlying thing with varying methods; that is, they study topic A with methods X, Y, and Z. If all of these turn up positive, you ought to be more confident that the effect is real. Indeed, I have had a paper rejected more than once for containing only a single experiment. Journals often want to see several studies in one paper, and that is likely part of the reason why: a single experiment is surely less reliable than multiple ones.

It doesn’t go anywhere, but at least it does so reliably

According to the unknown-moderator account of replication failure, psychological research findings are, in essence, often fickle. Some findings might depend on the time of day measurements were taken, the country of the sample, some particular detail of the stimulus material, whether the experimenter was a man or a woman; you name it. In other words, it is possible that these published effects are real but occur only in rather specific contexts of which we are not adequately aware; that is to say, they are moderated by unknown variables. If that’s the case, it is no surprise that some replication efforts fail, as it is quite unlikely that all of the unique, unknown, and unappreciated moderators will be replicated as well. This is where conceptual replications come in: if a paper contains two, three, or more different attempts at studying the same topic, we should expect that the effect it turns up is more likely to extend beyond a very limited set of contexts and should replicate more readily.

That’s a flattering hypothesis for explaining these replication failures: there’s just not enough replication going on pre-publication, so limited findings are getting published as if they were more generalizable. The less flattering hypothesis is that many researchers are, for lack of a better word, cheating by employing dishonest research tactics. These tactics can include hypothesizing after the data have been collected, collecting participants only until the data say what the researchers want and then stopping, splitting samples up into different groups until differences are discovered, and so on. There’s also the notorious issue of journals publishing only positive results rather than negative ones (creating a large incentive to cheat, as punishment for doing so is all but non-existent so long as you aren’t just making up the data). It is for these reasons that requiring the pre-registration of research – explicitly stating what you’re going to look at ahead of time – markedly reduces the rate of positive findings. If research is failing to replicate because the system is being cheated, more internal replications (those from the same authors) don’t really help much when it comes to predicting external replications (those conducted by outside parties). Internal replications just provide researchers the ability to report multiple attempts at cheating.
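One of those tactics – peeking at the data and stopping the moment significance appears – is easy to demonstrate numerically. Here is a minimal simulation sketch; the sample sizes, testing scheme, and use of a simple z-test are my own assumptions for illustration, not details drawn from any of the studies discussed:

```python
import random
import statistics
from statistics import NormalDist

def optional_stopping_trial(max_n=100, start_n=10, alpha=0.05):
    """Simulate one 'experiment' on a true null effect: draw subjects one at a
    time and run a two-tailed z-test after each, stopping at the first p < alpha."""
    data = [random.gauss(0, 1) for _ in range(start_n)]
    while len(data) <= max_n:
        mean = statistics.fmean(data)
        se = statistics.stdev(data) / len(data) ** 0.5
        p = 2 * (1 - NormalDist().cdf(abs(mean / se)))
        if p < alpha:
            return True  # 'significant', despite there being no real effect
        data.append(random.gauss(0, 1))
    return False

random.seed(1)
trials = 1000
hits = sum(optional_stopping_trial() for _ in range(trials))
# The false-positive rate comes out well above the nominal 5%,
# which is the whole appeal of the tactic.
print(f"false-positive rate with peeking: {hits / trials:.0%}")
```

Even though every simulated effect is null, testing after every added subject lets a substantial fraction of these fake experiments reach "significance" at some point along the way.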

These two hypotheses make different predictions concerning the data from the aforementioned reproducibility project: specifically, research containing internal replications ought to be more likely to successfully replicate if the unknown moderator hypothesis is accurate. It certainly would be a strange state of affairs from a “this finding is true” perspective if multiple conceptual replications were no more likely to prove reproducible than single-study papers. It would be similar to saying that effects which have been replicated are no more likely to subsequently replicate than effects which have not. By contrast, the cheating hypothesis (or, more politely, questionable research practices hypothesis) has no problem at all with the idea that internal replications might prove to be as externally replicable as single-study papers; cheating a finding out three times doesn’t mean it’s more likely to be true than cheating it out once.

It’s not cheating; it’s just a “questionable testing strategy”

This brings me to a new paper by Kunert (2016), who reexamined some of the data from the Reproducibility Project. Of the 100 original papers, 44 contained internal replications: 20 contained just one, 10 contained two, 9 contained three, and 5 contained more than three. These were compared against the 56 papers which did not contain internal replications to see which would subsequently replicate better (as measured by achieving statistical significance). As it turned out, papers with internal replications externally replicated about 30% of the time, whereas papers without internal replications externally replicated about 40% of the time. Not only were the internally-replicated papers not substantially better, they were actually slightly worse in that regard. A similar conclusion was reached regarding effect size: the replicated effects from papers with internal replications were no larger, on average, than those from papers without such replications.
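For concreteness, the headline comparison can be reconstructed from the figures quoted above. The counts below are my own back-calculation from the reported percentages (roughly 30% of 44 and 40% of 56), so treat them as illustrative rather than as Kunert's exact numbers:

```python
# Approximate counts back-calculated from the percentages quoted in the text.
groups = {
    "internal replications": {"n": 44, "successes": 13},      # ~30% external success
    "no internal replications": {"n": 56, "successes": 22},   # ~40% external success
}

def external_success_rate(group):
    """Share of papers whose key effect reached significance when replicated."""
    return group["successes"] / group["n"]

rates = {name: external_success_rate(g) for name, g in groups.items()}

# The unknown-moderator account predicts the first rate should exceed the
# second; the reported data run the other way.
for name, rate in rates.items():
    print(f"{name}: {rate:.1%}")
```

With samples this small the gap is not itself decisive; the point is simply that the direction of the difference is the opposite of what the unknown-moderator account predicts.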

It is possible, of course, that papers containing internal replications differ from papers which do not contain such replications. This means it might be possible that internal replications are actually a good thing, but their positive effects are being outweighed by other, negative factors. For example, someone proposing a particularly novel hypothesis might be inclined to include more internal replications in their paper than someone studying an established one; the latter researcher doesn’t need more replications to get published because the effect has already been replicated in other work. To examine this point, Kunert (2016) made use of the seven reproducibility predictors identified by the Open Science Collaboration – field of study, effect type, original p-value, original effect size, replication power, surprisingness of the original effect, and the challenge of conducting the replication – to assess whether internally-replicated work differed in any notable way from the non-internally-replicated sample. As it turns out, the two samples were pretty similar overall on all the factors except one: field of study. Internally-replicated effects came from social psychology more frequently (70% of them) than did non-internally-replicated effects (54%). As I mentioned before, social psychology papers did tend to replicate less often. However, the unknown-moderator account was not particularly well supported for either field when examined individually.

In summary, then, papers containing internal replications were no more likely to do well when it came to external replications, which, to my mind, suggests that something is going wrong in the process somewhere. Perhaps researchers are making use of their freedom to analyze and collect data as they see fit in order to deliver the conclusions they want to see; perhaps journals are preferentially publishing the findings of people who got lucky, relative to those who got it right. These possibilities, of course, are not mutually exclusive. Now I suppose one could continue to argue that, “papers that contain conceptual replications are more likely to be doing something else different, relative to papers with only a single study,” which could potentially explain the lack of predictive strength provided by internal replications, and that whatever that “something” is might not be directly tapped by the variables considered in the current paper. In essence, such an argument would suggest that there are unknown moderators all the way down.

“…and that turtle stands on the shell of an even larger turtle…”

While it’s true enough that such an explanation is not ruled out by the current results, it should not be taken as any kind of default stance on why this research is failing to replicate. The “researchers are cheating” explanation strikes me as a bit more plausible at this stage, given that there aren’t many other obvious explanations for why ostensibly replicated papers are no better at replicating. As Kunert (2016) plainly puts it:

This report suggests that, without widespread changes to psychological science, it will become difficult to distinguish it from informal observations, anecdotes and guess work.

This brings us to the matter of what might be done about the issue. There are procedural ways of attempting to address the problem – such as Kunert’s (2016) recommendation that journals publish papers independent of their results – but my focus has been, and continues to be, on the theoretical aspects of publication. Too many papers in psychology get published without any apparent need for the researchers to explain their findings in any meaningful sense; instead, they usually just restate and label their findings, or they posit some biologically-implausible function for what they found. Without the serious and consistent application of evolutionary theory to psychological research, implausible effects will continue to be published and subsequently fail to replicate because there’s otherwise little way to tell whether a finding makes sense. By contrast, I find it plausible that unlikely effects can be more readily spotted – by reviewers, readers, and replicators – if they are all couched within the same theoretical framework; even better, problems in design can be more easily identified and rectified by considering the underlying functional logic, leading to productive future research.

References: Kunert, R. (2016). Internal conceptual replications do not increase independent replication success. Psychonomic Bulletin & Review. DOI: 10.3758/s13423-016-1030-9

When Intuitions Meet Reality

Let’s talk research ethics for a moment.

Would you rather have someone actually take $20 from your payment for taking part in a research project, or would you rather be told – incorrectly – that someone had taken $20, only to later (almost immediately, in fact) find out that your money is safely intact and that the other person who supposedly took it doesn’t actually exist? I have no data on that question, but I suspect most people would prefer the second option; after all, not losing money tends to be preferable to losing money, and the lie is relatively benign. To use a pop culture example, Jimmy Kimmel has aired a segment where parents lie to their children about having eaten all their Halloween candy. The children are naturally upset for a moment and their reactions are captured so people can laugh at them, only to later have their candy returned and the lie exposed (I would hope). Would it be more ethical, then, for parents to actually eat their children’s candy so as to avoid lying to their children? Would children prefer that outcome?

“I wasn’t actually going to eat your candy, but I wanted to be ethical”

I happen to think the answer is, “no; it’s better to lie about eating the candy than to actually do it,” if you are primarily looking out for the children’s welfare (there is obviously the argument to be made that it’s neither OK to eat the candy nor to lie about it, but that’s a separate discussion). That sounds simple enough, but according to some arguments I have heard, it is unethical to design research that, basically, mimics the lying outcome. The costs being suffered by participants need to be real in order for research on suffering costs to be ethically acceptable. Well, sort of; more precisely, what I’ve been told is that it’s OK to lie to my subjects (deceive them) about little matters, but only in the context of using participants drawn from undergraduate research pools. By contrast, it’s wrong for me to deceive participants I’ve recruited from online crowd-sourcing sites, like MTurk. Why is that the case? Because, as the logic continues, many researchers rely on MTurk for their participants, and my deception is bad for those researchers because it means participants may not take future research seriously. If I lied to them, perhaps other researchers would too, and I have poisoned the well, so to speak. In comparison, lying to undergraduates is acceptable because, once I’m done with them, they probably won’t be taking part in many future experiments, so their trust in future research is less relevant (at least, they won’t take part in many research projects once they get out of the introductory courses that require them to do so. Forcing undergraduates to take part in research for the sake of their grade is, of course, perfectly ethical).

This scenario, it seems, creates a rather interesting ethical tension. What I think is happening here is that a conflict has been created between looking out for the welfare of research participants (in common research pools; not undergraduates) and looking out for the welfare of researchers. On the one hand, it’s probably better for participants’ welfare to briefly think they lost money, rather than to let them actually lose money; at least I’m fairly confident that is the option subjects would select if given the choice. On the other hand, it’s better for researchers if those participants actually lose money, rather than briefly hold the false belief that they did, so that participants continue to take their other projects seriously. An ethical dilemma indeed, balancing the interests of the participants against those of the researchers.

I am sympathetic to the concerns here; don’t get me wrong. I find it plausible that if, say, 80% of researchers outright deceived their participants about something important, people taking this kind of research over and over again would likely come to assume some parts of it were unlikely to be true. Would this affect the answers participants provide to these surveys in any consistent manner? Possibly, but I can’t say with any confidence if or how it would. There also seem to be workarounds for this poisoning-the-well problem; perhaps honest researchers could write in big, bold letters, “the following research does not contain the use of deception,” and research that did use deception would be prohibited from attaching that bit by the various institutional review boards that need to approve these projects. Barring the use of deception across the board would, of course, create its own set of problems too. For instance, many participants taking part in research are likely curious as to what the goals of the project are. If researchers were required to be honest and transparent about their purposes upfront so as to allow their participants to make informed decisions regarding their desire to participate (e.g., “I am studying X…”), this could lead to all sorts of interesting results being due to demand characteristics – where participants behave in unusual manners as a result of their knowledge about the purpose of the experiment – rather than the natural responses of the subjects to the experimental materials. One could argue (and many have) that not telling participants about the real purpose of the study is fine, since it’s not a lie as much as an omission.
Other consequences of explicitly barring deception exist as well, though, including the loss of control over experimental stimuli during interactions between participants and the inability to feasibly test some hypotheses at all (such as whether people prefer the tastes of identical foods, contingent on whether those foods are labeled in non-identical ways).

Something tells me this one might be a knock off

Now this debate is all well and good to have in the abstract sense, but it’s important to bring some evidence to the matter if you want to move the discussion forward. After all, it’s not terribly difficult for people to come up with plausible-sounding, but ultimately incorrect, lines of reasoning as to why some research practice is or is not ethical. For example, some review boards have raised concerns about psychologists asking people to take surveys on “sensitive topics,” under the fear that answering questions about things like sexual histories might send students into an abyss of anxiety. As it turns out, such concerns were ultimately empirically unfounded, but that does not always prevent them from holding up otherwise interesting or valuable research. So let’s take a quick break from thinking about how deception might be harmful in the abstract to see what effects it has (or doesn’t have) empirically.

Drawn to the debate between economists (who tend to think deception is bad) and social scientists (who tend to think it’s fine), Barrera & Simpson (2012) conducted two experiments to examine how deceiving participants affected their future behavior. The first of these studies tested the direct effects of deception: did deceiving a participant make them behave differently in a subsequent experiment? In this study, participants were recruited as part of a two-phase experiment from introductory undergraduate courses (so as to minimize their previous exposure to research deception, the story goes; it just so happens they’re likely also the easiest sample to get). In the first phase of this experiment, 150 participants played a prisoner’s dilemma game which involved cooperating with or defecting on another player; a decision which would affect both players’ payments. Once the decisions had been made, half the participants were told (correctly) that they had been interacting with another real person in the other room; the other half were told they had been deceived, and that no other player was actually present. Everyone was paid and sent home.

Two to three weeks later, 140 of these participants returned for phase two. Here, they played four rounds of similar economic games: two rounds of dictator games and two rounds of trust games. In the dictator games, subjects could divide $20 between themselves and their partner; in the trust games, subjects could send some amount of $10 to the other player, that amount would be tripled, and the other player could then keep it all or send some of it back. The question of interest, then, was whether the previously-deceived subjects would behave any differently, contingent on their doubts as to whether they were being deceived again. The thinking here is that if you don’t believe you’re interacting with another real person, you might as well be more selfish than you otherwise would. The results showed that while the previously-deceived participants believed social science researchers used deception somewhat more regularly than did the non-deceived participants, their behavior was actually no different. Not only were the amounts of money sent to others no different (participants gave $5.75 on average in the dictator games and trusted $3.29 in the trust games when they had not previously been deceived, and gave $5.52 and trusted $3.92 when they had been), but the behavior was no more erratic either. The deceived participants behaved just like the non-deceived ones.
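The payoff structure of these two games is simple enough to spell out. Below is a minimal sketch; the function names and the example amounts are mine, while the $20/$10 endowments and the tripling rule come from the study description above:

```python
def dictator_payoffs(keep, endowment=20):
    """Dictator game: one player unilaterally splits the endowment."""
    assert 0 <= keep <= endowment
    return keep, endowment - keep

def trust_payoffs(sent, returned_fraction, endowment=10, multiplier=3):
    """Trust game: the truster sends some amount, it is tripled in transit,
    and the trustee returns whatever fraction of the tripled pot they like."""
    assert 0 <= sent <= endowment
    assert 0 <= returned_fraction <= 1
    pot = sent * multiplier
    returned = pot * returned_fraction
    truster = endowment - sent + returned
    trustee = pot - returned
    return truster, trustee

# E.g., sending the study's average trusted amount (~$3.29) with half returned:
truster, trustee = trust_payoffs(3.29, 0.5)
```

The selfishness prediction falls out of the structure directly: if there is no real trustee, every dollar sent (or given) is simply lost, so a participant who doubts the other player exists should keep everything.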

In the second study, the indirect effects of deception were examined. One hundred six participants first completed the same dictator and trust games as above. They were then assigned to read about an experiment that either did or did not make use of deception; a deception which included the simulation of non-existent participants. They then played another round of dictator and trust games immediately afterwards to see if their behavior would differ, contingent on knowing how researchers might deceive them. As in the first study, no behavioral differences emerged. Neither directly deceiving participants about the presence of others in the experiment nor providing them with information that deception does take place in such research seemed to have any noticeable effect on subsequent behavior.

“Fool me once, shame on me; Fool me twice? Sure, go ahead”

Now it is possible that the lack of any effect in the present research had to do with the fact that participants were only deceived once. It is certainly possible that repeated exposure to deception, if frequent enough, would begin to have a lasting effect, one not limited to the researcher employing the deception; in essence, some spillover between experimenters might occur over time. However, this is something that needs to be demonstrated, not just assumed. Ironically, as Barrera & Simpson (2012) note, demonstrating such a spillover effect can be difficult in some instances, as designing non-deceptive control conditions to test against the deceptive ones is not always a straightforward task. In other words, as I mentioned before, some research is quite difficult – if not impossible – to conduct without the use of deception. Accordingly, some control conditions might require that you deceive participants about deceiving them, which is awfully meta. Barrera & Simpson (2012) also mention research findings reporting that, even when no deception is used, participants who repeatedly take part in these kinds of economic experiments tend to get less cooperative over time. If that finding holds true, then the effects of repeated deception need to be filtered out from the effects of repeated participation in general. In any case, there does not appear to be any good evidence that minor deceptions are doing harm to participants or other researchers. They might still be doing harm, but I’d like to see it demonstrated before I accept that they do.

References: Barrera, D. & Simpson, B. (2012). Much ado about deception: Consequences of deceiving research participants in the social sciences. Sociological Methods & Research, 41, 383-413.

Health Food Nazis

“Hitler was a vegetarian. Just goes to show, vegetarianism, not always a good thing. Can, in some extreme cases, lead to genocide.” – Bill Bailey

There’s a burgeoning new field of research in psychology known as health licensing*. Health licensing is the idea that once people do something health-promoting, they subsequently give themselves psychological license to do other, unhealthy things. A classic example of this kind of research might go something like this: an experimenter will give participants a chance to do something healthy, like go on a jog or eat a nutritious lunch. After participants engage in this healthy behavior, they are then given a chance to do something unhealthy, like break their own legs. Typical results show that once people have engaged in these otherwise healthy behaviors, they are significantly more likely to engage in self-destructive ones, like leg-breaking, in order to achieve a balance between their healthy and unhealthy behaviors. This is just one more cognitive quirk to add to the ever-lengthening list of human psychological foibles.

Now that you’ve engaged in hospital-visiting behavior, feel free to burn yourself to even it out.

Now many of you are probably thinking one or both of two things: “that sounds strange” and “that’s not true”. If you are thinking those things, I’m happy that we’re on the same page so far. The problems with the above hypothetical area of research are clear. First, it seems strange that people would go do something unhealthy and harmful because they had previously done something which was good for them; it’s not like healthy and unhealthy behaviors need to be intrinsically balanced out for any reason, at least not one that readily comes to mind. Second, it seems strange that people would want to engage in the harmful behaviors at all. Just because an option to do something unhealthy is presented, it doesn’t mean people are going to want to take it, as it might have little appeal to them. When people typically engage in behaviors which are deemed harmful in the long-term – such as smoking, overeating junk food, or other such acts which are said to be psychologically ‘licensed’ by healthy behaviors – they do so because of the perceived short-term benefits of such things. People certainly don’t drink for the hangover; they drink for the pleasant feelings induced by the booze.

So, with that in mind, what are we to make of a study that suggests doing something healthy can give people a psychological license to adopt immoral political stances? In case that sounds too abstract, the research on the table today examines whether drinking sauerkraut juice makes people more likely to endorse Nazi-like politics, and no; I’m not kidding (as much as I wish I was). The paper (Messner & Brugger, 2015) itself leans heavily on moral licensing: the idea that engaging in moral behaviors activates compensating psychological mechanisms that encourage the actor to engage in immoral ones. So, if you told the truth today, you get to lie tomorrow to balance things out. Before moving further into the details of the paper, it’s worth mentioning that the authors have already bumped up against one of the problems from my initial example: I cannot think of a reason that ‘moral’ and ‘immoral’ behaviors need to be “balanced out” psychologically (whatever that even means), and none is provided. Indeed, as some people continuously refrain from immoral (or unhealthy) behaviors, whereas others continuously indulge in them, compensation or balance doesn’t seem to factor into the equation in the same way (or at all) for everyone.

Messner & Brugger (2015) try to draw on a banking analogy, whereby moral behavior gives one “credit” in an account that can be “spent” on immoral behavior. However, this analogy is largely unhelpful, as you cannot spend money you do not have, but you can engage in immoral behaviors even if you have no morally-good “credit”. It’s also unhelpful in that it presumes immoral behavior is something one wants to spend their moral credit on; the type of immoral behavior seems to be beside the point, as we will soon see. Much like my leg-breaking example, this too seems to make little sense: people don’t seem to want to engage in immoral behavior because it is immoral. As the bank account analogy is not at all helpful for understanding the phenomenon in question, it seems better to drop it altogether, since it’s only likely to sow confusion in the minds of anyone trying to really figure out what’s going on here. Then again, perhaps the confusion is only present in the paper to compensate for all the useful understanding the researchers are going to provide us later.

“We broke half the lights to compensate for the fact that the other half work”

Moving forward, the authors argue that, because health-relevant behavior is moralized, engaging in some kind of health-promoting behavior – in this case, drinking sauerkraut juice (high in fiber and vitamin C, we are told) – ought to give people good moral “credit” which they will subsequently spend on immoral behavior (in much the same way buying eco-friendly products leads to people giving themselves a moral license to steal, we are also told). Accordingly, the authors first asked 128 Swiss students to indicate who was more moral: someone who drinks sauerkraut juice or someone who drinks Nestea. As predicted, 78% agreed that the sauerkraut-juice drinker was more moral, though whether a “neither, and this question is silly” option existed is not mentioned. The students also indicated how morally acceptable and right wing a number of attitudes were; statements which related to, according to the authors, a number of nasty topics like devaluing the culture of others (i.e., seeing a woman wearing a burka making someone uncomfortable), devaluing other nations (viewing foreign nationals as a burden on the state), affirming antisemitism (disliking some aspects of Israeli politics), devaluing the humanity of others (not agreeing that all public buildings ought to be modified for handicapped access), and a few others. Now all of these statements were rated as immoral by the students, but whether they represent what the authors think they do (Nazi-like politics) is up for interpretation.

In any case, another 111 participants were then collected and assigned to drink sauerkraut juice, Nestea, or nothing. Those who drank the sauerkraut juice rated it as healthier than those who drank the Nestea and, correspondingly, were also more likely to endorse the Nazi-like statements (M = 4.46 on a 10-point scale) than those who drank Nestea (M = 3.82) or nothing (M = 3.73). Neat. There are, however, a few other major issues to address. The first of these is that, depending on whom you sample, you’re going to get different answers to the “are these attitudes morally acceptable?” questions. Since it’s Swiss students being assessed in both cases, I’ll let that issue slide for the more pressing, theoretical one: the authors’ interpretation of the results would imply that the students who indicated that such attitudes are immoral also wished to express them. That is to say, because they just did something healthy (drank sauerkraut juice), they now want to engage in immoral behavior. They don’t seem too picky about what immoral behavior they engage in, either, as they’re apparently more willing to adopt political stances they would otherwise oppose, were it not for the disgusting, yet healthy, sauerkraut juice.

This strikes me very much as the kind of metaphorical leg-breaking I mentioned earlier. When people engage in immoral (or unhealthy) behaviors, they typically do so because of some associated benefit: stealing grants you access to resources you otherwise wouldn’t obtain; eating that Twinkie gives you a pleasant taste and a quick burst of calories, even if eating too many makes you fat. What benefits are being obtained by the Swiss students who are now (slightly) more likely to endorse right-wing, Nazi-like politics? None are made clear in the paper, and I’m having a hard time thinking up any myself. This seems to be a case of immoral behavior for its own sake, which could only arise from a rather strange psychology. Perhaps there is something worth noting going on here that isn’t being highlighted well; perhaps the authors just stumbled on a statistical fluke (which does happen regularly). In either case, the idea of moral licensing doesn’t seem to help us understand what’s happening at all, and the banking metaphors and references to “balancing” and “compensation” seem similarly impotent to move us forward.

“Just give him the money; he eats well, so it’s OK”

The moral licensing idea is even worse than all that, though, as it doesn’t engage with the main adaptive reason people avoid self-beneficial but immoral behaviors: other people will punish you for them. If I steal from someone else, they or their allies might well take revenge on me; assuring them of my healthy diet will create little to no effective deterrence against the punishment I would soon receive. If that is the case – and I suspect it is – then this self-granted “moral license” would be about as useful as my simply believing that stealing from others isn’t wrong and won’t be punished (which is to say, not at all). Any type of moral license needs to be granted by potential condemners in order to be of any practical use in that regard, and the current research does not assess whether that is the case. This limited focus on conscience rather than condemnation, complete with the suggestion that people are likely to adopt social politics they would otherwise oppose for the sake of achieving some kind of moral balance after drinking 100 ml of gross sauerkraut juice, makes for a very strange paper indeed.

References: Messner, C. & Brugger, A. (2015). Nazis by Kraut: A playful application of moral self-licensing. Psychology, 6, http://dx.doi.org/10.4236/psych.2015.69112

*This statement has not been evaluated by the FDA or any such governmental body; the field doesn’t actually exist to the best of my knowledge, but I’ll tell you it does anyway.

 

Real Diversity Means Disagreement

Diversity is one of the big buzzwords of the recent decades. Institutions, both public and private, often take great pains to emphasize their inclusive stances and a colorful cast of staff. I have long found the displays of diversity to be rather queer in one major respect, however: they almost always focus on diversity in the realms of race and gender. The underlying message behind such displays would seem to suggest that men and women, or members of different ethnic groups, are, in some relevant psychological respects, different from one another. What’s strange about that idea is that, as many of the same people might also like to point out, there’s less diversity between those groups than within them, while others are entirely uncomfortable with the claim of sex or racial differences from the start. The ambivalent feelings many people have surrounding such a message were captured well by Principal Skinner on The Simpsons:

It’s the differences…of which…there are none…that make the sameness… exceptional

Regardless of how one feels about such a premise, the fact remains that diversity in race or gender per se is not what people are seeking to maximize in many cases; they’re trying to increase diversity of thought (or, as Maddox put it many years ago: “people who look different must think different because of it; otherwise, why the hell embrace anything? Why not just assume that diversity comes from within, regardless of their skin color, sex, age or religion?”).

Renting that wheelchair was a nice touch, but it’s time to get up and return it before we lose the deposit

If diversity in perspective is what most people are after when they talk about seeking diversity, it seems like it would be a reasonable step to assess people’s perspectives directly, rather than trying to use proxies for it, like race and gender (or clothing, or hair styles, or musical tastes, or…). If, for instance, one was hiring a number of people for a job involving problem solving, it’s quite possible for the person doing the hiring to select a group of men and women from different races who all end up thinking about things in pretty much the same way: not only would the hires likely have the same kinds of educational background, but they’d probably also have comparable interests, since they applied for the same job. On top of that initial similarity, the person doing the hiring might be partial towards those who hold agreeable points of view. After all, why would you hire someone who holds a perspective you don’t agree with? It sounds as if that decision would make work that much more unpleasant during the day-to-day operations of the company, even if it was irrelevant to the work they do.

Speaking of areas in which diversity of thought seem to be lacking in certain respects, an interesting new paper from Duarte et al (2015) puts forth the proposition that social psychology – as a field – isn’t all that politically diverse, and that’s probably something of a problem for research quality. For example, if social psychologists can be said to be a rather politically homogeneous bunch, this could result in particular (and important) questions not being asked as a result of how that answer might pan out for the images of liberals and their political rivals. After all, if the conclusions of psychology research, by some happy coincidence, tend to demonstrate that liberals (and, by extension, the liberal researchers conducting it) happen to have a firm grasp on reality, whereas their more conservative counterparts are hopelessly biased and delusional, all the better for the liberal group’s public image; all the worse for the truth value of psychological research, however, if those results are obtained by only asking about scenarios in which conservatives, but not liberals, are likely to look biased. If some liberal assumptions about what is right or good are shaping their research to point in certain directions, we’re going to end up making a number of unwarranted interpretative conclusions.

The problems could mount further if the research purporting to deliver conclusions counter to certain liberal interests is reviewed with disproportionate amounts of scrutiny, whereas research supporting those interests is given a pass when their methods are equivalent or worse. Indeed, Duarte et al (2015) discuss some good reasons to think this might be the state of affairs in psychology, not least of which is that quite a number of social psychologists will explicitly admit they would discriminate against those who do not share their beliefs. When surveyed about their self-assessed probability of voting either for or against a known conservative job applicant (when both alternatives are equally qualified for the job), about 82% of social psychologists indicated they would be at least a little more likely to vote against the conservative hire, with about 43% indicating a fairly high degree of certainty they would (above the midpoint of the scale). These kinds of attitudes might well dissuade more conservatives from wanting to enter the field, especially given that the liberals likely to discriminate against them outnumber the conservatives by about 10-to-1.

“Don’t worry, buddy; you can take ‘em”

Not to put too fine of a point on it, but if these ratios were discovered elsewhere – say, a 10:1 ratio of men to women in a field, and about half of the men explicitly say they would vote against hiring women – I imagine that many social psychologists would be tripping over themselves to try and inject some justice and moral outrage into the mix. Compared with some other explicitly racist tendencies (4% of respondents wouldn’t vote for a black presidential candidate), or sexist ones (5% wouldn’t vote for a woman), there’s a bit of a gulf in discrimination. While the way the question is asked is not quite the same, social psychologists might be about as likely to want to vote for the conservative job candidate as Americans are to vote for a Muslim or an atheist, if we assumed equivalence (which is to say “not very”).

It is at least promising, then, to see that the reactions to this paper were fairly universal in at least recognizing that there might be something of a political diversity problem in psychology, both in terms of its existence and possible consequences. There was more disagreement with respect to the cause of this diversity problem and whether including more conservative minds would increase research quality, but that’s to be expected. I – like the authors – am happy enough that even social psychologists, by and large, seem to accept that social psychology is not all that politically diverse and that such a state of affairs is likely – or at least potentially – harmful to research in some respects (yet another example where stereotypes seem to track reality well).

That said, there is another point to which I want to draw attention. As I mentioned initially, seeking diversity for diversity’s sake is a pointless endeavor, and one that is certainly not guaranteed to improve the quality of work produced. This is the case regardless of the criteria on which candidates are selected, be they physical, political, or something else. For example, psychology departments could strive to hire people from a variety of different cultural or ethnic groups, but unless those new hires are better at doing psychology, this diversity won’t improve their products. Similarly, psychology departments could strive to hire people with degrees in other fields, like computer science, chemistry, and fine arts; that would likely increase the diversity of thought in psychology, but since there are many more ways of doing poor psychology than there are of doing good psychology, this diversity in backgrounds wouldn’t necessarily be desirable.

Say “Hello” to your new collaborators

Put bluntly, I wouldn’t want people to begin hiring those from non-liberal groups in greater numbers and believe this will, de facto, improve the quality of their research. More specifically, while greater political diversity might, to some extent, reduce the number of bad research projects by diluting or checking existing liberal biases, I don’t know that it would increase the number of good papers substantially; the relative numbers might change, but I’m more concerned with the absolutes, as a field which fails to produce quality research in sufficient quantities is not demonstrating much value (just like how the guy without a particular failing doesn’t necessarily offer much as a dating prospect). In my humble (and no doubt biased, but not necessarily incorrect) view, there is an important dimension of thought along which I do not wish psychologists to differ, and that is in their application of evolutionary theory as a guiding foundation for their work. Evolutionary theory not only allows one to find previously unappreciated aspects of psychological functioning by considerations of adaptive value, but also allows for building on previous research in a meaningful way and for the effective rooting out of problematic underlying assumptions. In that sense, even failed research projects can contribute in a more meaningful way when framed in an evolutionary perspective, relative to failed projects lacking one.

Evolutionary theory is by no means a cure-all for the bias problem; people will still sometimes get caught up trying to rationalize behaviors or preferences they morally approve of – like homosexuality – as adaptive, for example. In spite of that, I do not particularly hope to see a diversity of perspectives in psychology regarding the theoretical language we all ought to speak by this point. There are many more ways to think about psychology unproductively than there are of doing it well, and more diversity in those respects will make for a much weaker science.

References: Duarte, J., Crawford, J., Stern, C., Haidt, J., Jussim, L., & Tetlock, P. (2015). Political diversity will improve social psychological science. Behavioral & Brain Sciences, 38, 1-58.

Replicating Failures To Replicate

There are moments from my education that have stuck with me over time. One such moment involved a professor teaching his class about what might be considered a “classic” paper in social psychology. I happened to have been aware of this particular paper for two reasons: first, it was a consistent feature in many of my previous psychology classes and, second, because the news had broken recently that when people tried to replicate the effect, they had failed to find it. Now a failure to replicate does not necessarily mean that the findings of the original study were a fluke or the result of experimental demand characteristics (I happen to think they are), but that’s not even why this moment in my education stood out to me. What made this moment stand out is that when I emailed the professor after class to let him know the finding had recently failed to replicate, his response was that he was aware of the failure. This seemed somewhat peculiar to me; if he knew the study had failed to replicate, why didn’t he at least mention that to his students? It seems like rather important information for the students to have and, frankly, a responsibility of the person teaching the material, since ignorance was no excuse in this case.

“It was true when I was an undergrad, and that’s how it will remain in my class”

Stories of failures to replicate have been making the rounds again lately, thanks to a massive effort on the part of hundreds of researchers to try and replicate 100 published effects from three psychology journals. These researchers worked with the original authors, used the original materials, were open about their methods, pre-registered their analyses, and archived all their data. Of these 100 published papers, 97 reported their effect as being statistically significant, with the other three right on the borderline of significance and interpreted as positive effects. Now there is debate over the value of using these kinds of statistical tests in the first place, but, when the researchers tried to replicate these 100 effects using the statistically significant criterion, only 37 even managed to cross the bar (given that 89 were expected to replicate if the effects were real, 37 falls quite short of that goal).
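Just how short 37 falls of 89 can be made concrete with a quick back-of-the-envelope calculation. The sketch below is my own illustration, not part of the project’s analyses: it assumes each of the 100 effects was real and had the expected 89% chance of replicating, then asks how probable 37 or fewer successes would be under that assumption.

```python
from math import comb

# Assumed setup: 100 independent replication attempts, each with an
# 89% chance of success if the original effect were real.
n, p, observed = 100, 0.89, 37

# P(X <= 37) under Binomial(n=100, p=0.89): the chance of seeing a
# result this bad if every original effect were real and well-powered.
prob = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(observed + 1))
print(f"P(<= {observed} replications | all effects real): {prob:.3g}")
```

The probability comes out vanishingly small (the expected count is 89 with a standard deviation of about 3), which is why the shortfall can’t be waved away as bad luck in the replication attempts alone.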

There are other ways to assess these replications, though. One method is to examine the differences in effect size. The 100 original papers reported an average effect size of about 0.4; the attempted replications saw this average drop to about 0.2. A full 82% of the original papers showed a stronger effect size than the attempted replications. While there was a positive correlation (about r = 0.5) between the two – the stronger the original effect, the stronger the replication effect tended to be – this still represents an important decrease in the estimated size of these effects, in addition to their statistical existence. Another method of measuring replication success – unreliable as it might be – is to get the researchers’ subjective opinions about whether the results seemed to replicate. On that front, the researchers felt about 39 of the original 100 findings replicated; quite in line with the above statistical data. Finally, and perhaps worth noting, social psychology research tended to replicate less often than cognitive research (25% and 50%, respectively), and interaction effects replicated less often than simple effects (22% and 47%, respectively).

The scope of the problem may be a bit larger than that, however. In this case, the 100 papers upon which replication efforts were undertaken were drawn from three of the top journals in psychology. Assuming a positive correlation exists between journal quality (as measured by impact factor) and the quality of research they publish, the failures to replicate here should, in fact, be an underestimate of the actual replication issue across the whole field. If over 60% of papers failing to replicate is putting the problem a bit mildly, there’s likely quite a bit to be concerned about when it comes to psychology research. Noting the problem is only one step in the process towards correction, though; if we want to do something about it, we’re going to need to know why it happens.

So come join in my armchair for some speculation

There are some problems people already suspect as being important culprits. First, there are biases in the publication process itself. One such problem is that journals seem to overwhelmingly prefer to report positive findings; very few people want to read about a bad experiment which didn’t work out well. A related problem, however, is that many journals like to publish surprising, or counter-intuitive, findings. Again, this can be attributed to the idea that people don’t want to read about things they already believe are true: most people perceive the sky as blue, and research confirming this intuition won’t make many waves. However, I would also reckon that counter-intuitive findings are surprising to people precisely because they are also more likely to be inaccurate descriptions of reality. If that’s the case, then a preference on the part of journal editors for publishing positive, counter-intuitive findings might set them up to publish a lot of statistical flukes.

There’s also the problem I’ve written about before, concerning what are known as “researcher degrees of freedom”; more colloquially, we might consider this a form of data manipulation. In cases like these, researchers are looking for positive effects, so they test 20 people in each group and peek at the data. If they find an effect, they stop and publish it; if they don’t, they add a few more people and peek again, continuing until they find what they want or run out of resources. They might also split the data up into various groups and permutations until they find a set of data that “works”, so to speak (break it down by male/female, or high/medium/low, etc). While they are not directly faking the data (though some researchers do that as well), they are being rather selective about how they analyze it. Such methods inflate the possibility of finding an effect through statistical brute force, even if the effect doesn’t actually exist.
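The test-peek-and-add-subjects procedure described above can be simulated directly. The sketch below is my own illustration, not anything from the papers under discussion: it assumes batches of 20 subjects per group, up to five peeks, a simple two-sample z-test approximation, and no true effect at all, then checks how often a “significant” result gets declared anyway.

```python
import random
import statistics as stats
from math import sqrt, erf

random.seed(1)  # fixed seed so the simulation is reproducible

def z_test_p(a, b):
    """Two-tailed p-value from a two-sample z approximation."""
    se = sqrt(stats.pvariance(a) / len(a) + stats.pvariance(b) / len(b))
    z = abs(stats.mean(a) - stats.mean(b)) / se
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))  # 2 * (1 - Phi(|z|))

def peeking_study(batches=5, batch_size=20):
    """One 'study' with zero true effect, peeking after every batch."""
    a, b = [], []
    for _ in range(batches):
        a += [random.gauss(0, 1) for _ in range(batch_size)]
        b += [random.gauss(0, 1) for _ in range(batch_size)]
        if z_test_p(a, b) < 0.05:
            return True   # "significant" -- stop collecting and publish
    return False          # never reached significance; file-drawered

runs = 2000
false_positives = sum(peeking_study() for _ in range(runs)) / runs
print(f"False-positive rate with peeking: {false_positives:.1%}")
```

Even though every single test uses the nominal 5% threshold, giving yourself five chances to stop on a significant result pushes the overall false-positive rate well above 5%; the exact figure depends on the batch size and number of peeks assumed.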

This problem is not unique to psychology, either. A recent paper by Kaplan & Irvin (2015) examined research from 1970-2012 that was looking at the effectiveness of various drugs and dietary supplements for preventing or treating cardiovascular disease. There were 55 trials that met the authors’ inclusion criteria. What’s important to note about these trials is that, prior to the year 2000, none of the papers were pre-registered with respect to what variables they were interested in assessing; after 2000, every such study was pre-registered. Registering this research is important, as it prevents the researchers from later conducting a selective set of analyses on their data. Sure enough, prior to 2000, 57% of trials reported statistically significant effects; after 2000, that number dropped to 8%. Indeed, about half the papers published after 2000 did report some statistically significant effects, but only for variables other than the primary outcomes they registered. While this finding is not necessarily a failure to replicate per se, it certainly does make one wonder about the reliability of those non-registered findings.

And some of those trials were studying death as an outcome, so that’s not good…

There is one last problem I would like to mention; one I’ve been beating the drum about for the past several years. Assuming that pre-registering research in psychology would help weed out false positives (it likely would), we would still be faced with the problem that most psychology research would not find anything of value, if the above data are any indication. In the most polite way possible, this would lead me to ask a question along the lines of, “why are so many psychology researchers bad at generating good hypotheses?” Pre-registering a bad idea does not suddenly make it a good one, even if it makes data analysis a little less problematic. This leads me to my suggestion for improving research in psychology: the requirement of actual theory for guiding research. In psychology, most theories are not theories, but rather restatements of a finding. However, when psychologists begin to take an evolutionary approach to their work, the quality of research (in my obviously-biased mind) tends to improve dramatically. Even if the theory is wrong, making it explicit allows problems to be more easily discussed, discovered, and corrected (provided, of course, that one understands how to evaluate and test such theories, which many people unfortunately do not). Without guiding/foundational theories, the only things you’re left with when it comes to generating hypotheses are the existing data and your intuitions, which, again, don’t seem to be good guides for conducting quality research.

References: Kaplan, R. & Irvin, V. (2015). Likelihood of null effects of large NHLBI clinical trials has increased over time. PLoS One, doi:10.1371/journal.pone.0132382

Stereotyping Stereotypes

I’ve attended a number of talks on stereotypes; I’ve read many more papers in which the word was used; I’ve seen still more instances where the term has been used outside of academic settings in discussions or articles. Though I have no data on hand, I would wager that the weight of this academic and non-academic literature leans heavily towards the idea that stereotypes are, by and large, inaccurate. In fact, I would go a bit farther than that: the notion that stereotypes are inaccurate seems to be so common that people often see little need in ensuring any checks were put into place to test for their accuracy in the first place. Indeed, one of my major complaints about the talks on stereotypes I’ve attended is just that: speakers never mention the possibility that people’s beliefs about other groups happen to, on the whole, match up to reality fairly well in many cases (sometimes they have mentioned this point as an afterthought but, from what I’ve seen, that rarely translates into later going out and testing for accuracy). To use a non-controversial example, I expect that many people believe men are taller than women, on average, because men do, in fact, happen to be taller.

Pictured above: not a perceptual bias or an illusory correlation

This naturally raises the question of how accurate stereotypes – when defined as beliefs about social groups – tend to be. It should go without saying that there will not be a single answer to that question: accuracy is not an either/or type of matter. If I happen to think it’s about 75 degrees out when the temperature is actually 80, I’m more accurate in my belief than if the temperature was 90. Similarly, the degree of that accuracy should be expected to vary with the nature of the stereotype in question; a matter to which I’ll return later. That said, as I mentioned before, quite a bit of the exposure I’ve had to the subject of stereotypes suggests rather strongly and frequently that they’re inaccurate. Much of the writing about stereotypes I’ve encountered focuses on notions like “tearing them down”, “busting myths”, or about how people are unfairly discriminated against because of them; comparatively little of that work has focused on instances in which they’re accurate which, one would think, would represent the first step in attempting to understand them.

According to some research reviewed by Jussim et al (2009), however, that latter point is rather unfortunate, as stereotypes often seem to be quite accurate, at least by the standards set by other research in psychology. In order to test for the accuracy of stereotypes, Jussim et al (2009) report on some empirical studies that met two key criteria: first, the research had to compare people’s beliefs about a group to what that group was actually like; that much is a fairly basic requirement. Second, the research had to use an appropriate sample to determine what that group was actually like. For example, if someone was interested in people’s beliefs about some difference between men and women in general, but only tested these beliefs against data from a convenience sample (like men and women attending the local college), this could pose something of a problem to the extent that the convenience sample differs from the reference group of people holding the stereotypes. If people, by and large, hold accurate stereotypes, researchers would never know it if they tested those beliefs against a non-representative reference group.

Within the realm of racial stereotypes, Jussim et al (2009) summarized the results of four papers that met these criteria. The majority of the results fell within what the authors consider the “accurate” range (as defined by being 0-10% off from the criteria values) or near-misses (those between 10-20% off). Indeed, the average correlations between the stereotypes and criteria measures ranged from .53 to .93, which are very high, relative to the average correlation uncovered by psychological research. Even the personal stereotypes, while not as high, were appreciably accurate, ranging from .36 to .69. Further, while people weren’t perfectly accurate in their beliefs, those who overestimated differences between racial groups tended to be balanced out by those who underestimated those differences in most instances. Interestingly enough, people’s stereotypes about group differences tended to be a bit more accurate than their within-group stereotypes.

“Ha! Look at all that inaccurate shooting. Didn’t even come close”

The same procedure was used to review research on gender stereotypes as well, yielding 7 papers with larger sample sizes. A similar set of results emerged: the average stereotype was rather accurate, with correlations ranging between .34 to .98, most of which hovered around .7. Individual stereotypes were again less accurate, but most were still heading in the right direction. To put those numbers in perspective, Jussim et al (2009) summarized a meta-analysis examining the average correlation found in psychological research. According to that data, only 24% of social psychology effects represent correlations larger than .3 and a mere 5% exceeded a correlation of .5; the corresponding numbers for averaged stereotypes were 100% of the reviewed work meeting the .3 threshold, and about 89% of the correlations exceeding the .5 threshold (personal stereotypes at 81% and 36%, respectively).

Now neither Jussim et al (2009) nor I would claim that all stereotypes are accurate (or at least reasonably close); no one I’m aware of has. This brings us to the matter of when we should expect stereotypes to be accurate and when we should expect them to fall short of that point. As an initial note, we should always expect some degree of inaccuracy in stereotypes – indeed, in all beliefs about the world – to the extent that gathering information takes time and improving accuracy is not always worth that investment in the adaptive sense. To use a non-biological example, spending an extra three hours studying to improve one’s grade on a test from a 70 to a 90 might seem worth it, but the same amount of time used to improve from a 90 to a 92 might not. Similarly, if one lacks access to reliable information about the behavior of others in the first place, stereotypes should also tend to be relatively inaccurate. For this reason, Jussim et al (2009) note that cross-cultural stereotypes about national personalities tend to be among the most inaccurate, as people from, say, India, might have relatively little exposure to information about people from South Africa, and vice versa.

The second point to make on accuracy is that, to the extent that beliefs guide behavior and that behavior carries costs or benefits, we should expect beliefs to tend towards accuracy (again, regardless of whether they’re about social groups or the world more generally). If you believe, incorrectly, that group A is as likely to assault you as group B (the example that Jussim et al (2009) use involves biker gang members and ballerinas), you’ll either end up avoiding one group more than you need to, not being wary enough around one, or miss in both directions, all of which involve social and physical costs. One of the only cases in which being wrong might reliably carry benefits is when one’s inaccurate beliefs modify the behavior of other people. In other words, stereotypes can be expected to be inaccurate in the realm of persuasion. Jussim et al (2009) make nods toward this possibility, noting that political stereotypes are among the least accurate ones out there, and that certain stereotypes might have been crafted specifically with the intent of maligning a particular group.

For instance…

While I do suspect that some stereotypes exist specifically to malign a particular group, that possibility does raise another interesting question: namely, why would anyone, let alone large groups of people, be persuaded to accept inaccurate stereotypes? For the same reason that people should prefer accurate information over inaccurate information when guiding their own behaviors, they should also be relatively resistant to adopting stereotypes which are inaccurate, just as they should be when it comes to applying them to individuals when they don’t fit. To the extent that a stereotype is of this sort (inaccurate), then, we should not expect it to be widely held, except in a few particular contexts.

Indeed, Jussim et al (2009) also review evidence that suggests people do not inflexibly make use of stereotypes, preferring individuating information when it’s available: according to the meta-analyses reviewed, the average influence of stereotypes on judgments hangs around r = .1 (which does not, in many instances, have anything to say about the accuracy of the stereotype; just the extent of its effect); by contrast, individuating information had an average effect of about .7 which, again, is much larger than the average psychology effect. Once individuating information is controlled for, stereotypes tend to have next to zero impact on people’s judgments of others. People appear to rely on personal information to a much higher degree than stereotypes, and often jettison ill-fitting stereotypes in favor of personal information. In other words, the knowledge that men tend to be taller than women does not have much of an influence on whether I think a particular woman is taller than a particular man.

When should we expect that people will make the greatest use of stereotypes, then? Likely when they have access to the least amount of individuating information. This has been the case in a lot of the previous research on gender bias where very little information is provided about the target individual beyond their sex (see here for an example). In these cases, stereotypes represent an individual doing the best they can with limited information. In some cases, however, people express moral opposition to making use of that limited information, contingent on the group(s) it benefits or disadvantages. It is in such cases that, ironically, stereotypes might be stereotyped as inaccurate (or at least insufficiently accurate) to the greatest degree.

References: Jussim, L., Cain, T., Crawford, J., Harber, K., & Cohen, F. (2009). The unbearable accuracy of stereotypes. In Nelson, T., The Handbook of Prejudice, Stereotyping, and Discrimination (pp. 199-227). NY: Psychology Press.

A Curious Case Of Welfare Considerations In Morality

There was a stage in my life, several years back, where I was a bit of a chronic internet debater. As anyone who has engaged in such debates – online or off, for that matter – can attest to, progress can be quite slow if any is observed at all. Owing to the snail’s pace of such disputes, I found myself investing more time in them than I probably should have. In order to free up my time while still allowing me to express my thoughts, I created my own site (this one) where I could write about topics that interested me, express my viewpoints, and then be done with them, freeing me from the quagmire of debate. Happily, this is a tactic that has not only proven to be effective, but I like to think that it has produced some positive externalities for my readers in the form of several years’ worth of posts that, I am told, some people enjoy. Occasionally, however, I do still wander back into a debate here and there, since I find them fun and engaging. Sharing ideas and trading intellectual blows is nice recreation.

My other hobbies follow a similar theme

In the wake of the recent shooting in Charleston, the debate I found myself engaged in concerned the arguments for the moral and legal removal of guns from polite society, and I wanted to write a bit about it here, serving both the purposes of cleansing it from my mind and, hopefully, making an interesting point about our moral psychology in the process. The discussion itself centered around a clip from one of my favorite comedians, Jim Jefferies, who happens to not be a fan of guns himself. While I recommend watching the full clip and associated stand-up because Jim is a funny man, for those not interested in investing the time and itching to get to the moral controversy, here’s the gist of Jim’s views about guns:

“There’s one argument and one argument alone for having a gun, and this is the argument: Fuck off; I like guns”

While Jim notes that there’s nothing wrong with saying, “I like something; don’t take it away from me”, the rest of the routine goes through various discussions of how other arguments for the owning of guns are, in Jim’s words, bullshit (including owning guns for self-defense or for overthrowing an oppressive government; for a different comedic perspective, see Bill Burr).

Laying my cards on the table, I happen to be one of those people who enjoys shooting recreationally (just target practice; I don’t get fancy with it and I have no interest in hunting). That said, I’m not writing today to argue with any of Jim’s points; in fact, I’m quite sympathetic to many of the concerns and comments he makes: on the whole, I feel that the expected value of guns, in general, is a net cost for society. I further feel that if guns were voluntarily abandoned by the population, there would probably be many aggregate welfare benefits, including reduced rates of suicide, homicide, and accidental injury (owing to the possibility that many such conflicts are heat-of-the-moment issues, and lacking the momentary ability to employ deadly force might mean it’s never used at all later). I’m even going to grant the point of his I quoted above: the best justification for owning a gun is recreational in nature. I don’t ask that you agree or disagree with all this; just that you follow the logical form of what’s to come.

Taking all of that together, the argument for enacting some kind of legal ban of guns – or at the very least the moral condemnation of the ability to own them – goes something like this: because the only real benefit to having a gun is that you get to have some fun with it, and because the expected costs to all those guns being around tend to be quite high, we ought to do away with the guns. The welfare balance just shifts away from having lots of deadly weapons around. Jim even notes that while most gun owners will never use their weapons intentionally or accidentally to inflict costs on others or themselves, the law nevertheless needs to cater to the 1% or so of people who would do such things. So, this thing – X – generates welfare costs for others which far outstrip its welfare benefits, and therefore should be removed. The important point of this argument, then, would seem to focus on these welfare concerns.

Coincidentally, owning a gun may make people put a greater emphasis on your concerns

The interesting portion of this debate is that the logical form of the argument can be applied to many other topics, yet it will not carry the same moral weight; a point I tried to make over the course of the discussion with a very limited degree of success. Ideas die one person at a time, the saying goes, and this debate did not carry on to the point of anyone losing their life.

In this case, we can try to apply the above logic to the very legal, condoned, and often celebrated topic of alcohol. On the whole, I would expect that the availability of alcohol is a net cost for society: drunk driving deaths in the US yield about 10,000 bodies (a comparable number to homicides committed with a firearm), which directly inflict costs on non-drinkers. While it’s more difficult to put numbers on other costs, there are a few non-trivial matters to consider, such as the number of suicides, assaults, and non-traffic accidents encouraged by the use of alcohol, the number of unintended pregnancies and STIs spread through more casual and risky drunk sex, as well as the number of alcohol-related illnesses and liver damage. Broken homes, abused and neglected children, spirals of poverty, infidelity, and missed work could also factor into these calculations somewhere. Both of these products – guns and booze – tend to inflict costs on individuals other than the actor when they’re available, and these costs appear to be substantial.

So, in the face of all those costs, what’s the argument in favor of alcohol being approved of, legally or morally? Well, the best and most common argument seems to be, as Jim might say, “Fuck off; I like drinking”. Now, of course, there are some notable differences between drinking and owning guns, the main one being that people don’t often drink to inflict costs on others, while many people do use guns to intentionally do harm. While the point is well taken, it’s worth bearing in mind that the arguments against guns are not the same arguments against murder. The argument as it pertains to guns seemed to be, as I noted above, that regular people should not be allowed to own guns because some small portion of the population that does have one around will do something reprehensible or stupid with it, and that these concerns trump the ability of the responsible owners to do what they enjoy. Well, presumably, we could say the same thing about booze: even if most people who drink don’t drive while drunk, and even if not all drunk drivers end up killing someone, our morals and laws need to cater to that percentage of people who do.

(As an aside, I spent the past few years at New Mexico State University. One day, while standing outside a classroom in the hall, I noticed a poster about drunk driving. The intended purpose of the flyer seemed to be to inform students that most people don’t drive drunk; in fact, about 75% of students reported not driving under the influence, if I recall correctly. That does mean, of course, that about 1 in 4 students did at some point, which is a worrying figure; perhaps enough to make a solid argument for welfare concerns)

There is also the matter of enforcement: making alcohol illegal didn’t work out well in the past; making guns illegal could arguably be more successful on a logistical level. While such a point is worth thinking about, it is also a bit of a red herring from the heart of the issue: that is, most people are not opposed to the banning of alcohol because it’s difficult in practice, but otherwise supportive of the measure on principle; instead, people seem as if they would oppose the idea even if it could be implemented efficiently. People’s moral judgments can be quite independent of enforcement capacity. Computationally, it seems like judgments concerning whether something is worth condemning in the first place ought to precede judgments about whether condemnation could be carried out feasibly, simply because the latter estimation is useless without the former. Spending time thinking about what one could punish effectively without any interest in following through would be like cataloging all the things one could physically chew and swallow despite having no desire to eat them.

Plenty of fiber…and there’s lots of it…

There are two points to bear in mind from this discussion to try and tie it back to understanding our own moral psychology and making a productive point. The first is that there is some degree of variance in moral judgments that is not being determined by welfare concerns. Just because something ends up resulting in harm to others, people are not necessarily going to be willing to condemn it. We might (not) accept a line of reasoning for condemning a particular act because we have some vested interest in (encouraging) preventing it, while categorically (accepting) rejecting that same line in other cases where our strategic interests run in the opposite direction; interests which we might not even be consciously aware of in many cases. This much, I suspect, will come as no surprise to anyone, especially because other people in debates are known for being so clearly biased to you, the dispassionate observer. Strategic interests lead us to prioritize our own concerns.

The other point worth considering, though, is that people raise or deny these welfare concerns in the interests of being persuasive to others. The welfare of other people appears to have some impact on our moral judgments; if welfare concerns were not used as inputs, it would seem rather strange that so many arguments about morality lean so heavily and explicitly upon them. I don’t argue that you should accept my moral argument because it’s Sunday, as that fact seems to have little bearing on my moral mechanisms. While this too might seem obvious to people (“of course other people’s suffering matters to me!”), understanding why the welfare of others matters to our moral judgments is a much trickier explanatory issue than understanding why our own welfare matters to us. Both of these are matters that any complete theory of morality needs to deal with.

Privilege And The Nature Of Inequality

Recently, there’s been a new comic floating around my social news feeds claiming that it will forever change the way I think about something. It’s not as if there’s ever a shortage of such articles on my feeds, really, but I decided this one would provide me with the opportunity to examine some research I’ve wanted to write about for some time. In the case of this mind-blowing comic, the concept of privilege is explained through a short story. The concept itself is not a hard one to understand: privilege here refers to cases in which an individual goes through their life with certain advantages they did not earn. The comic in question looks at an economic privilege: two children are born, but one has parents with lots of money and social connections. As expected, the one with the privilege ends up doing fairly well for himself, as many burdens of life have been removed, while the one without ends up working a series of low-paying jobs, eventually in service to the privileged one. The privileged individual declares that nothing has ever been handed to him in life as he is literally being handed some food on a silver platter by the underprivileged individual, apparently oblivious to what his parents’ wealth and connections have brought him.

Stupid, rich baby…

In the interests of laying my cards on the table at the outset, I would count myself among those born into privilege. While my family is not rich or well-connected the way people typically think about those things, I have never wanted for any of life’s necessities; I have even had access to many additional luxuries that others have not. Having those burdens removed is something I am quite grateful for, and it has allowed me to invest my time in ways other people could not. I have the hard work and responsibility of my parents to thank for these advantages. These are not advantages I earned, but they are certainly not advantages which just fell from the sky; if my parents had made different choices, things likely would have worked out differently for me. I want to acknowledge my advantages without downplaying their efforts at all.

That last part raises a rather interesting question that pertains to the privilege debate, however. In the aforementioned comic, the implication seems to be – unless I’m misunderstanding it – that things likely would have turned out equally well for both children if they had been given access to the same advantages in their life. Some of the differences that each child starts with seem to be the result of their parents’ work, while other parts of that difference are the result of happenstance. The comic appears to suggest the differences in that case were just due to chance: both sets of parents love their children, but one set seems to have better jobs. Luck of the draw, I suppose. However, is that the case for life more generally; you know, the thing about which the comic intends to make a point?

For instance, if one set of parents happen to be more short-term oriented – interested in taking rewards now rather than foregoing them for possibly larger rewards in the future, i.e., not really savers – we could expect that their children will, to some extent, inherit those short-term psychological tendencies; they will also inherit a more meager amount of cash. Similarly, the child of the parents who are more long-term focused should inherit their proclivities as well, in addition to the benefits those psychologies eventually accrued.

Provided that happened to be the case, what would become of these two children if they both started life in the same position? Should we expect that they both end up at similar places? Putting the question another way, let’s imagine that, all of a sudden, the wealth of this world was evenly distributed among the population; no one had more or less than anyone else. In this imaginary world, how long would that state of relative equality last? I can’t say for certain, but my expectation is that it wouldn’t last very long at all. While the money might be equally distributed in the population, the psychological predispositions for spending, saving, earning, investing, and so on are unlikely to be. Over time, inequalities will again begin to assert themselves as those psychological differences – be they slight or large – accumulate from decision after decision.

Clearly, this is an experiment that couldn’t be run in real life – people are quite attached to their money – but there are naturally occurring versions of it in everyday life. If you want to find a context in which people might randomly come into possession of a sum of money, look no further than the lottery. Winning the lottery (both whether one wins at all and how much one wins) is as close to randomly determined as we’re going to get. If the differences between the families in the mind-blowing comic are due to chance factors, we would predict that people who win more money in the lottery should, subsequently, be doing better in life, relative to those who won smaller amounts. By contrast, if chance factors are relatively unimportant, then the amount won should be less important: whether they win large or small amounts, they might spend it (or waste it) at similar rates.

Nothing quite like a dose of privilege to turn your life around

This was precisely what was examined by Hankins et al (2010): the authors sought to assess the relationship between the amount of money won in a lottery and the probability of the winner filing for bankruptcy within a five year period of their win. Rather than removing inequalities and seeing how things shake out, then, this research took the opposite approach: examining a process that generated inequalities and seeing how long it took for them to dissipate.

The primary sample for this research was the Fantasy 5 winners in Florida from April 1993 to November 2002 who had won $600 or more: approximately 35,000 of them after certain screening measures had been implemented. These lottery winners were grouped into those who won between $10,000 and $50,000, and those who won between $50,000 and $150,000 (subsequent analyses would examine those who won $10,000 or less as well, leading to small, medium, and large winner groups).

Of those 35,000 winners, about 2,000 were linked to a bankruptcy filing within five years of their win, meaning that a little more than 1% of winners were filing each year on average; a rate comparable to the broader Florida population. The first step was to examine whether the large winners were doing comparable amounts of bankruptcy filing prior to their win, relative to the small winners, which, thankfully, they were. In pretty much all respects, those who won a lot of money did not differ from those who won less before their win (including race, gender, marital status, educational attainment, and nine other demographic variables). That’s what one would expect from the lottery, after all.
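That per-year rate follows directly from the counts reported above; as a quick sanity check (a sketch using the rounded figures quoted here, not the study’s exact counts):

```python
# Rough sanity check on the average annual bankruptcy-filing rate among winners.
# The figures are the rounded values quoted above; the paper's exact counts differ slightly.
winners = 35_000   # approximate Fantasy 5 winners in the sample
filings = 2_000    # approximate winners linked to a bankruptcy filing within 5 years
years = 5

annual_rate = filings / winners / years
print(f"Average annual filing rate: {annual_rate:.2%}")  # ~1.14% per year
```

Which lands, as the text says, at a little more than 1% of winners filing per year on average.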

Turning to what happened after their win, within the first two years, those who won larger sums of money were less likely to file for bankruptcy than smaller winners; however, in years 3 through 5 that pattern reversed itself, with larger winners becoming more likely to file. The end result of this shifting pattern was that, in five years’ time, large winners were equally likely to have filed for bankruptcy, relative to smaller winners. As Hankins et al (2010) put it, large cash payments did not prevent bankruptcy; they only postponed it. This result was consistently obtained after attempting a number of different analyses, suggesting that the finding is fairly robust. In fact, when the winners eventually did file for bankruptcy, the big winners didn’t have much more to show for it than small winners: those who won between $25,000 and $150,000 only had about $8,000 more in assets than those who had won less than $1,500, and the two groups had comparable debts.

Not much of an ROI on making it rain these days, it seems

At least when it came to one of the most severe forms of financial distress, large sums of cash did not appear to stop people from falling back into poverty in the long term, suggesting that there’s more going on in the world than just poor luck and unearned privilege. Whatever this money was being spent on, it did not appear to be sound investments. Maybe people were making more of their luck than they realized.

It should be noted that this natural experiment does pose certain confounds, perhaps the most important of which is that not everyone plays the lottery. In fact, given that the lottery itself is quite a bad investment, we are likely looking at a non-random sample of people who choose to play it in the first place; people who already aren’t prone to making wise, long-term decisions. Perhaps these results would look different if everyone played the lottery but, as it stands, thinking about these results in the context of the initial comic about privilege, I would have to say that my mind remains un-blown. Unsurprisingly, deep truths about social life can be difficult to sum up in a short comic.

References: Hankins, S., Hoekstra, M., & Skiba, P. (2010). The ticket to easy street? The financial consequences of winning the lottery. Vanderbilt Law and Economics Research Paper, 10-12.

Relaxing With Some Silly Research

In psychology, there is a lot of bad research out there by all estimates. The poor quality of this research can be attributed to concerns about ideology-driven research agendas, research bias, demand characteristics, lack of any real theory guiding the research itself, p-hacking, file-drawer effects, failures to replicate, small sample sizes, and reliance on undergraduate samples, among others. Arguably, there is more bad (or at least inaccurate) research than good research floating around as, in principle, there are many more ways of being wrong about the human mind than there are of being right about it (even given our familiarity with it); a problem made worse by the fact that being (or appearing) wrong or reporting null findings does not tend to garner one social status in the world of academia. If many of the incentives reside in finding particular kinds of results – and those kinds are not necessarily accurate – the predictable result is a lot of misleading papers. Determining what parts of the existing psychological literature are an accurate description of human psychology can be something of a burden, however, owing to the obscure nature of some of these issues: it’s not always readily apparent that a paper found a fluke result or that certain shady research practices have been employed. Thankfully, it doesn’t take a lot of effort to see why some particular pieces of psychological research are silly; criticizing that stuff can be as relaxing as a day off at the beach.

Kind of like this, but indoors and with fewer women

The last time I remember coming across some of the research that can easily be recognized as silly was when one brave set of researchers asked if leaning to the left made the Eiffel tower look smaller. The theory behind that initial bit of research is called, I think, number line theory, though I’m not positive on that. Regardless of the name, the gist of the idea seems to be that people - and chickens, apparently - associate smaller numbers with a relative leftwardly direction and larger numbers with a rightwardly one. For humans, such a mental representation might make sense in light of our using certain systems of writing; for nonhumans, this finding would seem to make zero sense. To understand why this finding makes no sense, try to place it within a functional framework by asking (a) why might humans and chickens (and perhaps other animals as well) represent smaller quantities with their left, and (b) why might leaning to the left be expected to bias one’s estimate of size? Personally, I’m coming up with a blank on the answer to those questions, especially because biasing one’s estimate of size on the basis of how one is leaning is unlikely to yield more accurate estimates. A decrease in accuracy seems like it could only carry costs in this case; not benefits. So, at best, we’re left calling those findings a developmental byproduct for humans and likely a fluke for the chickens. In all likelihood, the human finding is probably a fluke as well.

Thankfully, for the sake of entertainment, silly research is not to be deterred. One of the more recent tests of this number line hypothesis (Anelli et al, 2014) makes an even bolder prediction than the Eiffel tower paper: people will actually get better at performing certain mathematical operations when they’re traveling to the left or the right: specifically, going right will make you better at addition and going left better at subtraction. Why? Because smaller numbers are associated with the left? How does that make one better at subtraction? I don’t know and the paper doesn’t really go into that part. On the face of it, this seems like a great example of what I have nicknamed “dire straits thinking”. Named after the band’s song “Money for Nothing”, this type of thinking leads people to hypothesize that others can get better (or worse) at tasks without any associated costs. The problem with this kind of thinking is that if people did possess the cognitive capacities to be better at certain tasks, one might wonder why people ever perform worse than they could. This would lead me to pose questions like, “why do I have to be traveling right to be better at addition; why not just be better all the time?” Some kind of trade-off needs to be referenced to explain that apparent detriment/bonus to performance, but none ever is in dire straits thinking.

In any case, let’s look at the details of the experiment, which was quite simple. Anelli et al, (2014) had a total of 48 participants walk with an experimenter (one at a time; not all 48 at once). The pair would walk together for 20 seconds in a straight line, at which point the experimenter would call out a three-digit number, tell the participants to add or subtract from it by 3 aloud for 22 seconds, give them a direction to turn (right or left), and tell them to begin. At that point, the participant would turn and start doing the math. Each participant completed four trials: two congruent (right/addition or left/subtraction) and two incongruent (right/subtraction or left/addition). The researchers hoped to uncover a congruency effect, such that more correct calculations would be performed in the congruent, relative to incongruent, trials.

Now put the data into the “I’m right” program and it’s ready to publish

Indeed, just such an effect was found: when participants were moving in a congruent direction as their mathematical operations, they performed more correct calculations on average (M = 10.1), relative to when they were traveling in an incongruent direction (M = 9.6). However, when this effect was broken down by direction, it turns out that the effect only exists when participants were doing addition (M = 11.1 when going right, 10.2 when going left); there was no difference for subtraction (M = 9.0 and 9.1, respectively). Why was there no effect for subtraction? Well, the authors postulate a number of possibilities – one of which being that perhaps participants needed to be walking backwards – though none of them include the possibility of the addition finding being a statistical fluke. It’s strange how infrequently this possibility is ever mentioned in published work, especially in the face of inconsistent findings.
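The congruency effect being reported is simply a within-participant difference in mean correct calculations between the two trial types. As an illustration of how such an effect would be computed (the per-participant scores below are made up for demonstration; they are not Anelli et al’s raw data):

```python
# Illustrative computation of a congruency effect from per-participant scores.
# HYPOTHETICAL data for six participants; NOT the study's actual numbers.
congruent = [11, 10, 12, 9, 10, 11]     # correct calculations on congruent trials
incongruent = [10, 9, 11, 9, 10, 10]    # correct calculations on incongruent trials

# Within-participant differences, averaged across participants
diffs = [c - i for c, i in zip(congruent, incongruent)]
effect = sum(diffs) / len(diffs)
print(f"Mean congruency effect: {effect:.2f} correct calculations")  # 0.67 here
```

With real data, one would then test whether that average difference reliably exceeds zero; note how small the reported difference (10.1 vs. 9.6) is on this scale.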

Now one obvious criticism of this research is that the participants were never traveling right or left; they were walking straight ahead in all cases. Right or left, unlike East or West, depends on perspective. When I am facing my computer, I feel I am facing ahead; when I turn around to walk to the bathroom, I don’t feel like I’m walking behind me. The current research would thus rely on the effects of a momentary turn affecting participants’ math abilities for about half a minute. Accordingly, participants shouldn’t even have needed to be walking; asking them to turn and stand in place should be expected to have precisely the same effect. If the researchers wanted to measure walking to the right or left, they should have had participants moving to the side by sliding, rather than turning and walking forward.

Other obvious criticisms of the research could include the small sample size, the small effect size, the inconsistency of the effect (works for addition but not subtraction and is inconsistent with other research they cite which was itself inconsistent – people being better at addition when going up in an elevator but not walking up stairs, if I understand correctly), or the complete lack of anything resembling a real theory guiding the research. But let’s say for a moment that my impression of these results as silly is incorrect; let’s assume that these results accurately describe the workings of human mind in some respect. What are the implications of that finding? What, in other words, happens to be at stake here? Why would this research be published, relative to the other submissions received by Frontiers in Psychology? Even if it’s a true effect – which already seems unlikely, given the aforementioned issues – it doesn’t seem particularly noteworthy. Should people be turning to the right and left while taking their GREs? Do people need to be doing jumping jacks to improve their multiplication skills so as to make their body look more like the multiplication symbol? If so, how could you manage to do them while you’re supposed to be sitting down quietly while taking your GREs without getting kicked out of the testing site? Perhaps someone more informed on the topic could lend a suggestion, because I’m having trouble seeing the importance of it.

Maybe the insignificance of the results is supposed to make the reader feel more important

Without wanting to make a mountain out of a mole hill, this paper was authored by five researchers and presumably made it past an editor and several reviewers before it saw publication. At a minimum, that’s probably about 8 to 10 people. That seems like a remarkable feat, given how strange the paper happens to look on its face. I’m not just mindlessly poking fun at the paper, though: I’m bringing attention to it because it seems to highlight a variety of problems in the world of psychological research. There are, of course, many suggestions as to how these problems might be ferreted out, though many of those I have seen focus more on statistical solutions or combating researcher degrees of freedom. While such measures might reduce the quantity of bad research (like pre-registering studies), they will be unlikely to increase the absolute quality of good work (since one can pre-register silly ideas like this), which I think is an equally valuable goal. For my money, the requirement of some theoretical functional grounding for research would likely be the strongest candidate for improving work in psychology. I imagine many people would find it harder to propose such an idea in the first place if they needed to include some kind of functional considerations as to why turning right makes you better at addition. Even if such a feat were accomplished, it seems those considerations would make the rationale for the paper even easier to pick apart by reviewers and readers.

Instead of asking for silly research to be conducted on larger, more diverse samples, it seems better to ask that silly research not be conducted at all.

References: Anelli, F., Lugli, L., Baroni, G., Borghi, A., & Nicoletti, R. (2014). Walking boosts your performance in making additions and subtractions. Frontiers in Psychology, 5. doi: 10.3389/fpsyg.2014.01459

(Some Of) My Teaching Philosophy

Over the course of my time at various public schools and universities I have encountered a great many teachers. Some of my teachers were quite good. I would credit my interest in evolutionary psychology to one particularly excellent teacher – Gordon Gallup. Not only was the material itself unlike anything I had previously been presented with in other psychology courses, but the way Gordon taught his classes was unparalleled. Each day he would show up and, without the aid of any PowerPoints or any apparent notes, just lecture. On occasion we would get some graphs or charts drawn on the board, but that was about it. What struck me about this teaching style is what it communicated about the speaker: this is someone who knows what he’s talking about. His command of the material was so impressive I actually sat through his course again for no credit in the following years to transcribe his lectures (and the similarity from year to year was remarkable, given that lack of notes). It was just a pleasure listening to him do what he did best.

A feat I was recently recognized for

That I say Gordon was outstanding is to say he was exceptional, relative to his peers (even if many of those peers, mistakenly, believe they are exceptional as well). The converse to that praise, then, is that I have encountered many more professors who were either not particularly good at what they did or downright awful at it (subjectively speaking, of course). I’ve had some professors who acted, more or less, as an audio guide to the textbook and who, when questioned, didn’t seem to really understand the material they were teaching; I’ve had another tell his class “now, we know this isn’t true, but maybe it’s useful” as he reviewed Maslow’s hierarchy of needs for what must have been the tenth time in my psychology education – a statement which promptly turned off my attention for the day. The number of examples I could provide likely outnumber my fingers and toes, so there’s no need to detail each one. In fact, just about everyone who has attended school has had experiences like this. Are these subjective evaluations of teachers that we have all made accurate representations of their teaching ability, though?

According to some research by Braga et al (2011), that answer is “yes”, but in a rather perverse sense: teacher evaluations tend to be negatively predictive of actual teaching effectiveness. In other words, at the end of a semester when a teacher receives evaluations from their students, the better these evaluations, the less effective the teacher tends to be. As someone who received fairly high evaluations from my own students, this should either be cause for some reflection as to my methods (since I am interested in my students learning; not just their being satisfied with my course) or a hunt for why the research in question must be wrong to make me feel better about my good reviews. In the interests of prioritizing my self-esteem, let’s start by considering the research and seeing if any holes can be poked in it.

“Don’t worry; I’m sure those good reviews will still reflect well on you”

Braga et al (2011) analyzed data from a private Italian university offering programs in economics, business, and law in 1998/9. The students in these programs had to take a fixed course of classes with fixed sets of materials and the same examinations. Additionally, students were randomly assigned to professors, making this one of the most controlled academic settings for this kind of research I could imagine. At the end of the terms, students provided evaluations of their instructors, allowing their ratings of instructors to be correlated – at the classroom level, as the evaluations were anonymous – with those instructors’ actual effectiveness as teachers.

Teaching effectiveness was measured by examining how students did in subsequent courses (controlling for a variety of non-teacher factors, like class size), the assumption being that students with better professors in the first course would do better in future courses, owing to their more proficient grasping of the material. These non-teacher factors accounted for about 57% of the variance in future course grades, leaving plenty of room for teacher effects. The effect of teachers was appreciable, with an increase of one standard deviation in effectiveness leading to a gain of about 0.17 standard deviations of grade in future classes (about a 2.3% bump up). Given the standardized materials and the gulf which could exist between the best and worst teachers, it seems there’s plenty of room for teacher effectiveness to matter. Certainly no students want to end up at a disadvantage because of a poor teacher; I know I wouldn’t.
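To see how a standardized effect like 0.17 standard deviations could cash out as roughly a 2.3% bump, here is a sketch; the grade-scale figures are my assumptions for illustration (Italian grades run 0 to 30, and a spread of around 4 points is plausible), not numbers reported by Braga et al:

```python
# Converting a standardized effect size into grade points (illustrative only).
# ASSUMPTIONS: grades on the Italian 0-30 scale with a standard deviation of ~4 points;
# the paper reports the standardized effect, not this particular mapping.
effect_sd = 0.17    # gain in future grades, in standard deviations (from the paper)
grade_sd = 4.0      # assumed standard deviation of grades, in points
scale_max = 30.0    # top of the Italian grading scale

gain_points = effect_sd * grade_sd    # 0.68 grade points under these assumptions
gain_pct = gain_points / scale_max    # ~2.3% of the full scale
print(f"Gain: {gain_points:.2f} points (~{gain_pct:.1%} of the scale)")
```

Different assumed grade distributions would shift these numbers, but the exercise shows the order of magnitude involved: a real but modest effect.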

When it came to the main research question, the results showed that the teachers who were least effective at producing future success for their students tended to receive the highest evaluations. This effect was sizable as well: for each standard deviation increase in teaching effectiveness, student evaluation ratings dropped by about 40% of a standard deviation. Perhaps unsurprisingly, grades were correlated with teaching evaluations too: the better the grades students received, the better the evaluations they tended to give their professors. Interestingly, this effect did not exist in classes in which the top students (as measured by their cognitive entrance exams) made up 25% or more of the enrollment; the evaluations from those classes were simply not predictive of effectiveness.

That negative relationship between teacher evaluations and future performance is the finding most everyone will cite. What fewer people seem to do when referencing it is consider why the relationship exists and then use that answer to inform their teaching styles (I get the sense this finding will quite often be cited to excuse otherwise lackluster evaluations, rather than to change anything). The authors of the paper posit two main possibilities for explaining the effect: (1) that some teachers make class time more entertaining at the expense of learning, and/or (2) that some teachers might “teach to the test”, even if they do so at the expense of “true learning”. While neither possibility is directly tested in the paper, the latter strikes me as most plausible: students in the “teaching to the test” classes might simply focus on the particular chunks of information relevant to them at the moment, rather than engaging with the material as a whole and understanding the subject more broadly.

In other words, vague expectations encourage cramming with a greater scope

With that research in mind, I would like to present a piece of my own philosophy when it comes to teaching and assessment. A question I have given much thought to is this: what, precisely, are grades meant to achieve? For many professors – indeed, I’d say the bulk of them – grades serve the end of assessment: they are used to tell people – students and others – how well the students understood the material come test time. My answer is a bit different, however: as an instructor, I had no particular interest in the assessment of students per se; my interest was in their learning. I only wanted to assess my students as a means of pushing them toward that end. As a word of caution, my method of assessment demands substantially more effort from those doing the assessing, be it a teacher or an assistant, than is typical. It’s an investment of time many might be unwilling to make.

My assessments were all short-essay questions, asking students to apply the theories they had learned to novel questions we did not cover directly in class; there were no multiple-choice questions. According to the speculations of Braga et al (2011), this would put me firmly in the “real teaching” camp, rather than the “teaching to the test” one. There are a few reasons for this decision. First, multiple-choice questions don’t let you see what students were thinking when they answered. Just because someone gets an answer correct on a multiple-choice exam does not mean they got it correct for the right reasons. For my method to be effective, however, someone needs to read the exams in depth instead of just feeding them through a scantron machine, and that reading takes time. Second, essay exams force students to confront what they do and do not know. Having spent many years as a writer (and even more as a student), I’ve found that ideas which seem crystal clear in my head do not always translate readily to text. The feeling of understanding can exist in the absence of actual understanding. If students find they cannot explain an idea as readily as they felt they understood it, that feeling might be effectively challenged, yielding a new round of engagement with the material.

The essay format also allowed me, after seeing where students were going wrong, to make notes on their work and hand it back for revisions – something else you can’t do very well with multiple-choice questions. Once the students had my comments, they were free to revise their work and hand it back in to me. The grade they got on the revision became their new grade: no averaging of the two or anything of the sort. The process would then begin again, with revisions made on revisions, until the students were happy with their grade or stopped trying. If you expect learning to be ongoing, then assessment needs to be ongoing as well for it to serve that end. If it is not, students have little incentive to fix their mistakes; they’ll simply look at their grade and toss the test in the trash, as many of them do. After all, why would they bother putting in the effort to figure out where they went wrong and how to go right if doing so would have no impact whatsoever on the one thing they take from the class that other people will see?

Make no mistake: they’re here for a grade. Educations are much cheaper than college.

I should also add that my students were allowed to use any resource they wanted on their exams, be that their notes, the textbook, outside sources, or even other students. I wanted them to engage with the material and think about it while they worked, and I didn’t expect them to have it all memorized already. In many ways, this format mirrors the way academics function in the world outside the classroom: when writing our papers, we are allowed to access our notes and references whenever we want; we are allowed to collaborate with others; we are allowed – and in many cases, required – to make revisions to our work. If academics were forced to do their jobs without access to these resources, I suspect the quality of that work would drop precipitously. If these things all improve the quality of our work and help us learn and retain material, asking students to discard all of them come test time seems like a poor idea. It does require that test questions have some thought put into their construction, though, and that means another investment of time.

Some might worry that my method makes things too easy on the students: all that access to different materials means they could just get an easy “A”, and that’s why my evaluations were good. Perhaps that’s true, but just as my interest is not in assessment, my interest is also not in making a course “easy” or “challenging”; it’s in learning, and tests should be as easy or as hard as that requires. As I recall, the class average for each test started at about a 75; by the end of the revisions, the average for each test had risen to about a 90. You can decide from those numbers whether or not my exams were too easy.

Now, I don’t have the outcome measures that Braga et al (2011) did for my own teaching success. Perhaps my methods were a rousing failure when it came to getting students to learn, despite the high evaluations they earned me (in the Braga et al sample, the average teacher rating was 7 out of 10 with a standard deviation of 0.9; my average rating would be around a 9 on that scale, placing my evaluations about two standard deviations above the mean); perhaps this entire post reflects a defensiveness on my part when it comes to, ironically, having to justify my positive evaluations, just as I suspect people who cite this paper might use its results to justify relatively poor evaluations. In light of the current results, I think both I and others have cause for concern: just because I received good evaluations does not mean my teaching method was effective; but just because you received poor evaluations does not mean your teaching method was effective either. Just as students can get the right answer for the wrong reason, they can give a teacher a good or bad evaluation for the right or wrong reasons. Good reviews should not make teachers complacent, just as poor reviews should not be brushed aside. The important point is that we both think about how to improve our effectiveness as teachers.
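For anyone who wants to check the “two standard deviations” arithmetic above, here is a minimal sketch of the z-score calculation. The 7.0 mean and 0.9 standard deviation come from the Braga et al sample; the 9.0 is my own rough estimate of my average rating, so the exact result is only illustrative.

```python
# z-score: how many standard deviations a rating sits above the sample mean
sample_mean = 7.0   # average teacher rating in the Braga et al sample
sample_sd = 0.9     # standard deviation of ratings in that sample
my_rating = 9.0     # my own (approximate) average evaluation

z = (my_rating - sample_mean) / sample_sd
print(round(z, 2))  # → 2.22, i.e. roughly two standard deviations above the mean
```

The same formula, run in reverse, is how the paper’s “40% of a standard deviation” figures translate back into raw rating points (0.4 × 0.9 ≈ 0.36 points on the 10-point scale).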

References: Braga, M., Paccagnella, M., & Pellizzari, M. (2011). Evaluating students’ evaluations of professors. Economics of Education Review, 41, 71-88.