As of late, I’ve been dipping my toes ever-deeper into the conceptual world of statistics. If one aspires to understand precisely what they’re seeing when it comes to research in psychology, understanding statistics can go a long way. Unfortunately, the world of statistics is a contentious one, and the concepts involved in many of these discussions can be easily misinterpreted, so I’ve been attempting to be as cautious as possible in figuring the mess out. Most recently, I’ve been trying to decipher whether the hype over Bayesian methods is to be believed. Some people seem to feel that there’s a dividing line between Bayesian and Frequentist philosophies that one must choose sides over (Dienes, 2011), while others suggest that such divisions are basically pointless and the field has moved beyond them (Gelman, 2008; Kass, 2011). One of the major points that has been bothering me about the Bayesian side of things is the conceptualization of a “prior” (though I feel such priors can easily be incorporated into Frequentist analyses as well, so this question applies to any statistician). Like many concepts in statistics, this one seems both useful in certain situations and able to easily lead one astray in others. Today I’d like to consider a thought experiment dealing with the latter cases.

Thankfully, thought experiments are far cheaper than real ones

First, a quick overview of what a prior is and why they can be important. Here’s an example that I discussed previously:

say that you’re a doctor trying to treat an infection that has broken out among a specific population of people.

You happen to know that 5% of the people in this population are actually infected, and you’re trying to figure out who those people are so you can at least quarantine them. Luckily for you, you happen to have a device that can test for the presence of this infection. If you use this device to test an individual who actually has the disease, it will come back positive 95% of the time; if the individual does not have the disease, it will come back positive 5% of the time. Given that an individual has tested positive for the disease, what is the probability that they actually have it? The answer, unintuitive to most, is 50%.

In this example, your prior (bolded) is the percent of people who have the disease. The prior is, roughly, what beliefs or uncertainties you come to your data with. Bayesian analysis requires one to explicitly state one’s prior beliefs, regardless of what those priors are, as they will eventually play a role in determining your conclusions. Like in the example above, priors can be exceptionally useful when they’re known values.
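The arithmetic behind that 50% answer is a one-line application of Bayes’ theorem. A minimal sketch (the function name is mine, not from the original example):

```python
# Sketch of the doctor example: Bayes' theorem applied to the 5%
# prevalence prior and the 95%-accurate test described above.

def posterior_given_positive(prevalence, sensitivity, false_positive_rate):
    """P(disease | positive test) via Bayes' theorem."""
    p_positive = (prevalence * sensitivity
                  + (1 - prevalence) * false_positive_rate)
    return prevalence * sensitivity / p_positive

# 5% prior, 95% true-positive rate, 5% false-positive rate
print(posterior_given_positive(0.05, 0.95, 0.05))  # -> 0.5
```

The true positives (5% of people, caught 95% of the time) and the false positives (95% of people, flagged 5% of the time) are exactly equal in number, which is why the answer lands on 50%.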

In the world of research, it’s not always (or even generally) the case that priors are objectively known: in fact, they’re basically what we’re trying to figure out in the first place. More specifically, people are actually trying to derive *posteriors* (prior beliefs that have been revised by the data), but one man’s posteriors are another man’s priors, and the line between the two is more or less artificial. In the previous example, the 5% prevalence in the population was taken as a given; if you *didn’t know* that value and only had the results of your 95% effective test, you would have no way to accurately estimate how many of your positives were likely false positives and, conversely, how many of your negatives were likely false negatives (except if you got lucky). If the prevalence of the disease in the population is very low, you’ll have many false positives; if the prevalence is very high, you’ll likely have many false negatives. Accordingly, what prior beliefs you bring to your results will have a substantial effect on how they’re interpreted.
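The dependence on the assumed prevalence can be made concrete. A short sketch (the prevalence values other than the original 5% are illustrative, chosen just to show the spread):

```python
# How the interpretation of the same positive test swings with the
# assumed prevalence. Test characteristics are from the example above.

def positive_predictive_value(prevalence, sensitivity=0.95, fpr=0.05):
    """P(disease | positive) under an assumed prevalence."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * fpr
    return true_pos / (true_pos + false_pos)

for prev in (0.01, 0.05, 0.50, 0.90):
    print(f"assumed prevalence {prev:.0%}: "
          f"P(disease | +) = {positive_predictive_value(prev):.3f}")
```

With a 1% assumed prevalence the same positive test means only about a 16% chance of disease; at 90% prevalence it means over 99%. Same data, very different conclusions.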

This is a fairly common point discussed when it comes to Bayesian analysis: the frequent subjectivity of priors. Your *belief* about whether a disease is common or not doesn’t change the *actual* prevalence of it; just how you will eventually look at your data. This means that researchers with the same data can reach radically different conclusions on the basis of different priors. So, if one is given free rein over which priors they want to use, this could allow confirmation bias to run wild and a lot of disagreeable data to be all but disregarded. As this is a fairly common point in the debate over Bayesian statistics, there’s already been a lot of ink (virtual and actual) spilled over it, so I don’t want to continue on with it.

There is, however, another issue concerning priors that, to the best of my knowledge, has not been thoroughly addressed. That question is this: to what extent can we consider people to have prior beliefs in the first place? Clearly, we feel that some things are more likely than others: I think it’s more likely that I won’t win the lottery than that I will. No doubt you could immediately provide a list of things you think are more or less probable than others with ease. That these feelings can be so intuitive and automatically generated helps to mask an underlying problem with them: strictly speaking, it seems we ought to either not update our priors at all or not say that we “really” have any. A shocking assertion, no doubt (and maybe a bit hyperbolic), but I want to explore it and see where it takes us.

Whether it’s to a new world or to our deaths, I’ll still be famous for it.

We can begin to explore this intuition with another thought experiment involving flipping a coin, which will be our stand-in for a random-outcome generator. This coin is slightly biased in a way that results in 60% of the flips coming up heads and the remaining 40% coming up tails. The first researcher has his entire belief centered 100% on the coin being 60% biased towards heads and, since there is no belief left to assign, thinks that all other states of bias are impossible. Rather than having a distribution of beliefs, this researcher has a single point. This first researcher will never update his belief about the bias of the coin no matter what outcomes he observes; he’s certain the coin is biased in a particular way. Because he just so happens to be right about the bias, he can’t get any better, and this lack of updating his priors is a good thing (if you’re looking to make accurate predictions, that is).

Now let’s consider a second researcher. This researcher comes to the coin with a different set of priors: he thinks that the coin is likely fair, say 50% certain, and then distributes the rest of his belief equally between two additional potential values of the coin being unfair (say, 25% sure that the coin is 60% biased towards heads and 25% sure that the coin is similarly biased towards tails). The precise distribution of these beliefs doesn’t matter terribly; it could come in the form of two or an infinite number of points. All that matters is that, because this researcher’s belief is distributed in such a way that it doesn’t lie on a single point, his beliefs are capable of being updated by the data from the coin flips. Researcher two, like a good Bayesian, will then update his priors to posteriors on the basis of the observed flips, then turn those posteriors into new priors and continue updating for as long as he’s getting new data.
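Researcher two’s procedure can be sketched concretely. The flip sequence below is illustrative (roughly 60% heads, as this coin would tend to produce), not data from any real coin:

```python
# Sketch of researcher two's updating: a discrete prior over three
# candidate biases, updated flip by flip with Bayes' theorem.

def update(prior, flip):
    """One Bayesian update; prior maps bias -> belief, flip is 'H' or 'T'."""
    likelihood = {b: (b if flip == 'H' else 1 - b) for b in prior}
    unnormalized = {b: prior[b] * likelihood[b] for b in prior}
    total = sum(unnormalized.values())
    return {b: p / total for b, p in unnormalized.items()}

# 50% on fair, 25% each on the two biased hypotheses
beliefs = {0.4: 0.25, 0.5: 0.50, 0.6: 0.25}
for flip in "HHTHHHTHTH":        # illustrative sequence: 7 heads, 3 tails
    beliefs = update(beliefs, flip)

print(beliefs)   # belief in 0.6 has grown, belief in 0.4 has shrunk
```

After ten flips the 0.6 hypothesis has gained belief at the expense of 0.4; with more flips from the biased coin, the posterior would keep drifting toward 0.6.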

On the surface, then, the major difference between the two is that researcher one refuses to update his priors and researcher two is willing to do so. This implies something rather interesting about the latter researcher: researcher two has some degree of *uncertainty about his priors*. After all, if he were already sure he had the right priors, he wouldn’t update, since he would think he could do no better in terms of predictive accuracy. If researcher two *is* uncertain about his priors, then, shouldn’t *that* degree of uncertainty similarly be reflected somehow?

For instance, one could say that researcher two is 90% certain that he got the correct priors and 10% certain that he did not. That would represent his priors about his priors. He would presumably need to have *some* prior belief about the distribution he initially chose, as he was selecting from an infinite number of other possible distributions. His prior *about his priors*, however, must have its own set of priors as well. One can quickly see that this leads to an infinite regress: at some point, researcher two will basically have to admit complete uncertainty about his priors (or at least uncertainty about how they ought to be updated, as how one updates one’s priors depends upon the priors one is using, and there are an infinite number of possible distributions of priors), or admit complete certainty in them. If researcher two ends up admitting to complete uncertainty, this will give him a flat set of priors that ought to be updated very little (he will be able to rule out 100% bias towards heads or tails, contingent on observing either a heads or a tails, but not much beyond that). On the other hand, if researcher two ends up stating one of his priors with 100% certainty, the rest of the priors ought to collapse to 100% as well, resulting in an unwillingness to update.

Then again, math has never been my specialty. I’m about 70% sure it isn’t, and about 30% sure of that estimate…

It is not immediately apparent how we can reconcile these two stances with each other. On the one hand, researcher one has a prior that cannot be updated; on the other, researcher two has a potentially infinite number of priors with almost no idea how to update them. While we certainly could say that researcher one has a prior, he would have no need for Bayesian analysis. Given that people *seem* to have prior beliefs about things (like how likely some candidate is to win an election), and these beliefs *seem* to be updated from time to time (once most of the votes have been tallied), this suggests that something about the above analysis might be wrong. It’s just difficult to place precisely what that thing is.

One way of ducking the dilemma might be to suggest that, at any given point in time, people are 100% certain of their priors, but which point they’re certain about changes over time. Such a stance, however, suggests that priors aren’t *updated* so much as priors just *change*, and I’m not sure that such semantics can save us here. Another suggestion that was offered to me is that we could just forget the whole thing, as priors don’t themselves need to have priors. A prior is a belief distribution about probability, and probability is not a “real” thing (that is, the biased coin doesn’t come up 60% heads and 40% tails *per flip*; the result will be either a heads or a tails). For what it’s worth, I don’t think such a suggestion helps us out. It would essentially seem to be saying that, out of the infinite number of beliefs one *could* start with, any subset of those beliefs is as good as any other, even if they lead to mutually exclusive or contradictory results, and we can’t think about why some of them are better than others. Though my prior on people having priors might have been high, my posteriors about them aren’t looking so hot at the moment.

**References:** Dienes, Z. (2011). Bayesian versus orthodox statistics: Which side are you on? *Perspectives on Psychological Science, 6*(3), 274-290. DOI: 10.1177/1745691611406920

Gelman, A. (2008). Rejoinder. *Bayesian Analysis, 3,* 467-478.

Kass, R. (2011). Statistical inference: The big picture. *Statistical Science, 26,* 1-9.

But if you’re 100.00000000000% sure that a coin has a 60% bias, you’re an idiot.

http://xkcd.com/1132/

Cheers

Jim

I’ve seen the comic before and understanding what the author of it gets wrong goes a long way.

There is a series of misconceptions in your reasoning about Bayesian (or even frequentist) inference that I wanted to clear up.

When you are doing Bayesian inference and you define some hypothesis class, your prior needs to have full support on that space. In this degenerate case you are simply not doing inference, since your approach is equivalent to restricting your language so severely that it allows you to generate only one hypothesis.

This is only a challenge to inference in so much as Zeno’s paradox is a challenge to motion. If you have a prior over priors, then you can just compute a single prior that encodes your prior over priors. The easiest example is to look at the following two priors: (A) I believe with 90% certainty that the coin will land heads 40% of the time and 10% certainty that it will land heads 60% of the time, and (B) I believe with 90% certainty that the coin will land heads 60% of the time and 10% certainty that it will land heads 40% of the time. My prior over priors is uniform: 50% I expect (A) and 50% I expect (B). The information in the prior over priors can be collapsed into a single prior of “I am 50% certain the coin will land heads 40% of the time and 50% certain the coin will land heads 60% of the time”; all the Bayesian inference and predictions will be the same whether you do inference over priors over priors or just over the combined prior. Of course, you can repeat this procedure to collapse 3 or 4 or arbitrarily many levels of priors. In fact, you can collapse infinitely many levels of priors (although obviously the calculation gets more and more difficult with more and more levels).
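The collapse can be checked directly. A sketch using the (A)/(B) example above (variable names are mine):

```python
# Collapsing a prior over priors into a single prior, and checking
# that predictions come out the same either way.

prior_A = {0.4: 0.9, 0.6: 0.1}   # (A): 90% on heads-prob 0.4
prior_B = {0.4: 0.1, 0.6: 0.9}   # (B): 90% on heads-prob 0.6
meta = {'A': 0.5, 'B': 0.5}      # uniform prior over the two priors

# Collapse: marginalize the meta-level out.
collapsed = {b: meta['A'] * prior_A[b] + meta['B'] * prior_B[b]
             for b in prior_A}
print(collapsed)   # {0.4: 0.5, 0.6: 0.5}

def prob_heads(prior):
    """Predictive probability of heads under a discrete prior."""
    return sum(belief * b for b, belief in prior.items())

# Two-level prediction: average the per-prior predictions by the meta-prior.
two_level = meta['A'] * prob_heads(prior_A) + meta['B'] * prob_heads(prior_B)
print(two_level, prob_heads(collapsed))   # identical
```

The same equality holds after conditioning on any sequence of flips, which is why the hierarchy carries no extra information beyond its marginal.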

In fact, this approach is regularly used in non-toy models when you know that you want to model your process as a simple distribution (say, a Gaussian) but don’t know the parameters of the distribution (say, the variance). Instead you might have some prior over those parameters (or priors over priors all the way down, as long as they are all computable). If you want a real example, then the first sentence of this paragraph is actually a description of what is done in financial modeling of stock returns.
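A toy version of that setup, with a discrete prior over a few candidate variances (the candidate values and the data are made up for illustration):

```python
# Sketch: data assumed Gaussian with known mean 0 but unknown variance,
# and a flat prior over three candidate variances, updated by the data.
import math

def gaussian_pdf(x, mean, var):
    """Density of a Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

candidate_vars = [0.5, 1.0, 2.0]
prior = {v: 1 / 3 for v in candidate_vars}     # flat over the candidates

data = [0.1, -0.4, 0.3, 0.2, -0.1]             # tightly clustered around 0

posterior = {v: prior[v] * math.prod(gaussian_pdf(x, 0.0, v) for x in data)
             for v in candidate_vars}
total = sum(posterior.values())
posterior = {v: p / total for v, p in posterior.items()}
print(posterior)   # most belief lands on the smallest candidate variance
```

Because the observations cluster tightly, the likelihood favors the small-variance hypothesis; with more spread-out data the belief would migrate toward the larger candidates.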

In general, though, people don’t just arbitrarily pick a prior, or an infinite hierarchy of priors that they then need to collapse down. When doing the ‘best’ inference, a modeler determines what they mean by ‘best’ by defining an error function that they wish their learner to minimize (with respect to the ‘real distribution’ they are learning; in your case, how often the coin comes up heads). This error function can then be used to calculate the best possible prior to start with.

This is just plain wrong. The researcher will in fact have a well-defined maximum likelihood estimate for the probability of heads (a blog post that guides you through it). In the error-minimization scheme we described, it would be equivalent to minimizing the squared-error loss function.
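A minimal sketch of the commenter’s point (the flip counts are illustrative):

```python
# The maximum likelihood estimate of the heads probability is just the
# observed proportion of heads; it is well defined with no prior at all.

def mle_heads(num_heads, num_flips):
    """p maximizing the binomial likelihood: the sample proportion.
    The same value minimizes the squared error to the 0/1-coded flips."""
    return num_heads / num_flips

print(mle_heads(57, 100))   # -> 0.57
```

With enough flips from the 60%-biased coin, this estimate converges on 0.6 regardless of any starting beliefs.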

In general, however, these are not relevant points for understanding how Bayesian models capture human inference or whether they make a good model. The above is just a clarification of technical difficulties with statistics. If you want to look at actual modeling questions, then consider the following Cognitive Sciences StackExchange questions:

What are some of the drawbacks to probabilistic models of cognition?

How can the success of Bayesian models be reconciled with demonstrations of heuristic and biased reasoning?

I think that’s right, except that your priors about your priors also need priors, so that leads to another recomputation of what priors you’re using. Then, go again for your priors about your priors about your priors. If this continues on, it will either hit a point where you are 100% confident in your prior (or your distribution of priors), meaning you have assigned a 0% belief state to all other states of affairs (which means you cannot update your priors if you’re using Bayes’ theorem), or you continue on doing that infinitely, meaning you’ll have (a) a uniform distribution with (b) no idea how it should be updated. I don’t see a way out of that issue.

I know something *sounds* wrong about that analysis in the same way that something *sounds* wrong about Zeno’s paradox. Zeno’s paradox seems to be easily resolved, though, in that it’s missing a time component that is part of the calculation of motion; the same cannot be said of Bayes’ theorem as it stands. If you still have belief in other possible distributions, you need to recalculate your priors to reflect that; if you don’t have any belief left, you can’t update.

Clearly, beliefs *are* sometimes updated on the basis of evidence; of that there can be little doubt. It’s what makes this example seem so strange. How that updating is done, however, doesn’t seem to be through the use of Bayes’ theorem.

[EDIT] I feel it would be worthwhile to add an example: let’s return to the doctor example I raised in the post. Here, the doctor is starting with a given 5% prior about the prevalence of the disease. When the results of his test come back, let’s assume that, on the basis of that evidence, the doctor recalculates his prior belief about the prevalence of the disease: now he thinks it’s more common than it was beforehand. So one could say he updated his prior about the disease, but, if he did so, he would need to recalculate the results of his initial test with that new prior. Given that his prior is now higher, he might come to think that there were fewer false positives than previously imagined. This, however, makes the disease seem *even more* prevalent, given the same evidence. In other words, every time his priors change, his interpretation of the data changes, and every time the interpretation changes, so too should his priors, and so on.

Bayes’ theorem works in the initial example because the priors are being used as a given to compute an unknown value. When priors are not taken as a given, however, Bayes’ theorem no longer works. The same data points could be used, it seems, to recalculate one’s priors, which would recalculate one’s likelihoods, which would recalculate one’s priors, and so on. Unless I’m missing something, like some stopping rule for doing so?