This is a follow-up to an incredibly insightful post by Charles Petzold. You really should read it first, it’s quite excellent.

He examines statistical noise and shows that the *per capita* rate of mass shootings is a stupid measure because there simply isn’t enough data.
He also demonstrates that the US would need to have roughly twice the population it does for it to have a representative portion of the mass shootings in the dataset.

This is a different look at the same dataset (obtained here, but trimmed down a bit). We’re going to look at how Bayesian statistics can answer the following questions for us:

Does the US have an unnaturally high number of mass shootings relative to the other countries in the dataset?

If so (it is), how bad is it?

## First Blush

So the assumption we want to verify is that the US has the expected number of mass shootings given its population, relative to the rest of the dataset. We can load the data and take a look at these numbers pretty easily.

**Warning**: what follows will probably make you sad.
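The code behind these numbers isn't shown in the post, but it's easy to reconstruct; a minimal sketch (assuming the dataset has already been collapsed to the two incident counts, with the population fractions hard-coded from the post's output) might look like:

```python
# Reduced form of the dataset: just the two incident counts.
# The population fractions are taken from the post's own output.
us_shootings, non_us_shootings = 38, 23
us_pop_fraction = 0.25

total = us_shootings + non_us_shootings
print(f"US fraction of total population: {us_pop_fraction}")
print(f"Non-US fraction of total population: {1 - us_pop_fraction}")
print(f"US total mass shootings: {us_shootings} ({us_shootings / total:.2f})")
print(f"Non-US total mass shootings: {non_us_shootings} ({non_us_shootings / total:.2f})")
```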

```
US fraction of total population: 0.25
Non-US fraction of total population: 0.75
US total mass shootings: 38 (0.62)
Non-US total mass shootings: 23 (0.38)
```

So, a country with a quarter of the population is responsible for well over half of the total number of rampage shootings from the countries measured.
Seems suspicious.
What is the probability that we’re looking at noise?
Well, that question is answered with the *binomial test*.

If we make the assumption that the rampage shootings are evenly distributed across the *total population*, then the US’s share should be centered around 0.25.
If we write this as a Binomial distribution with 61 trials (38 + 23) and 38 “successes” with a probability of 0.25, not only do we get a morbidly disturbing statistical experiment, but a nice number that tells us just how likely it is that we *don’t* have a mass shooting problem.

```
1.5947056420909307e-10
```
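A number of this flavor plausibly comes from a one-sided binomial test; with `scipy` the computation might look like the following (the exact tail convention used in the original is an assumption on my part):

```python
from scipy import stats

# Probability of seeing 38 or more of 61 shootings in the US,
# if each shooting independently has a 0.25 chance of being in the US.
p_value = stats.binom.sf(37, 61, 0.25)  # survival function: P(X >= 38)
print(p_value)
```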

That is a laughably small number. Or it would be if it wasn’t so sad.

What this doesn’t tell us is what the probability *should* be, or how certain we are of it.
For that, we’re going to bring in the 800 pound gorilla of Bayesian inference: Markov Chain Monte Carlo.

## Bayesian Statistics

Before we get into MCMC, let’s talk just a little bit about Bayesian inference without complicated equations and other BS. First we need to frame what we’re specifically looking for. It starts with the data: we’ve reduced our dataset to two numbers: the number of mass shootings in the US, and the number of shootings outside the US.

What we’re attempting to infer is, *given a mass shooting incident, what is the probability that it occurs in the United States*?
If that probability is significantly different from the proportion of the population the US represents, we have an “above average” problem here (hint: we’ve already answered this, and the answer is yes).
What Bayesian inference allows us to do is specify what we *think* the probability of a mass shooting being in the US is, and modify that based on the data presented.

So, as a random (not really random) example, we can specify that we think the probability of a mass shooting being in the US is a flat line between zero and one.
This represents the minimal assumption about the value.
Or we could encode our population information about it, and state that it is our *opinion* it should be about 0.25, with some uncertainty.
The assumption about the probability we’re trying to determine is called a *prior* because it represents our knowledge of the system before any evidence (data) is observed.
When we incorporate that evidence, we get a *posterior* on the quantity.

**ASIDE**: Note that we’re talking about probability distributions of a *probability itself*.
It’s a little Inceptiony, but what this basically boils down to is that we’re uncertain about what that probability should be.
The 0.62 share we expect to see comes from the data, which is itself noisy.
The uncertainty is expressed by returning a distribution.

So the answer we get is not a single value like the binomial test gives. We can always roll up the distribution with some measure of central tendency like the mean or median when we need it, though.

## Markov Chain Monte Carlo

Markov Chain Monte Carlo (MCMC - we won’t discuss what’s Markov-chainy here, it’s kind of complicated) is a sampling technique that can be used to perform inference on really complicated distributions, even distributions that are made up of other distributions like ours. It does this by simulating values from the individual distributions and converging on a solution through what is basically an optimization scheme. Instead of finessing through some complicated closed-form calculation, which probably isn’t possible in the first place, it uses brute computational power.

Luckily our distribution is pretty simple. So simple, in fact, it can be analytically solved. But, then you’d be looking at a bunch of equations and not-as-pretty pictures.

### The Model

So our model is this: we choose a uniform prior between zero and one. There are a few ways to write this, and for reasons that will become clear later, we’ll write this with a beta distribution, \(p \sim Beta(1,1)\). Betas sit between zero and one, and can take a huge variety of shapes, including the flat uniform. They’re handy.

The actual shooting count is a binomial whose success probability is drawn from the prior, with the number of trials equal to the total number of incidents and the number of “successes” being the number of shootings in the US: \(n_{us} \sim Binom(p, n_{total})\).
Viewed as a function of the parameter with the data held fixed, \(L(p) = Binom(38 ~|~ p, 61)\) is called the *likelihood*, in case you’re wondering.
It’s basically a pdf that’s a function of the parameter instead of the observations.
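The post doesn’t show the sampler itself, and the original was likely written with an MCMC library like PyMC; but a self-contained Metropolis sketch of the same model (flat \(Beta(1,1)\) prior, binomial likelihood) fits in a few lines:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_us, n_total = 38, 61

def log_posterior(p):
    """Log of (flat Beta(1,1) prior) x (binomial likelihood), up to a constant."""
    if not 0.0 < p < 1.0:
        return -np.inf
    return stats.binom.logpmf(n_us, n_total, p)

samples = []
p = 0.5  # arbitrary starting point
for _ in range(25_000):
    proposal = p + rng.normal(0, 0.05)  # random-walk proposal
    # Metropolis acceptance: always move uphill, sometimes move downhill
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(p):
        p = proposal
    samples.append(p)

trace = np.array(samples[5_000:])  # drop burn-in
print(f"posterior mean ~ {trace.mean():.3f}, sd ~ {trace.std():.3f}")
```

A real library adds tuning, diagnostics, and better samplers, but the core loop really is this small.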

So we’ve drawn the samples - now we can take a look at what the distribution of \(p\) looks like.

So the p distribution is centered around about 0.62, which is the fraction of mass shooting incidents indicated by the data. The distribution tells us more than that though (otherwise what’s the point?). It tells us the uncertainty of the probability value. In this case it’s pretty wide. I would reasonably believe the “true” value could be anywhere from 0.45 to 0.75, which is like 25% on either side of 0.6. The reason for this spread is the amount of data: 38 out of 61 isn’t a huge number of points, not nearly as much as, say, 100 out of 200 or 1000 out of 2000. That said, I want to be clear this is one case where I’m kind of glad I don’t have more data.

Let’s look at a couple of summary statistics: mean, standard deviation, and inter-quartile range.

```
mean : 0.618652
sd : 0.060390
iq range : 0.082724
```

One more question before moving on: under this posterior, how probable is it that the US’s true share of mass shootings is 0.25 or less?

Remember, 0.25 is the number that says, *“We’re not any worse off than other countries”*.

Well that’s shocking.

It’s not though.
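Because the Beta prior is conjugate to the binomial likelihood, the exact posterior here is \(Beta(1 + 38, 1 + 23) = Beta(39, 24)\), so the probability in question can be checked analytically (a sketch, not the post’s actual code):

```python
from scipy import stats

# Conjugate update: Beta(1, 1) prior + 38 "successes" in 61 trials
posterior = stats.beta(1 + 38, 1 + 23)
print(posterior.cdf(0.25))  # P(p <= 0.25) under the posterior
```

The result is vanishingly small, which is why the annotated point sits flat against the axis.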

So what if we *did* want to bias the initial values for p?
What if we were really convinced that the US was just as bad off as everyone else?
We could encode that position in the prior, and re-run the simulation.
How do we choose the ideal parameters?
We could pick a mean and standard deviation and solve for them like a real math person, or we could just plot it and pick something that looks decent.

The mean of a \(Beta(\alpha,\beta)\) is \(\frac{\alpha}{\alpha + \beta}\). If we want that mean to be 0.25, we have several options:

- \(\alpha = 0.25, \beta = 0.75\)
- \(\alpha = 1, \beta = 3\)
- \(\alpha = 2, \beta = 6\)

… and so forth.
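Each of these does satisfy \(\frac{\alpha}{\alpha + \beta} = 0.25\), which is easy to verify:

```python
from scipy import stats

# All three candidate priors share the same mean, 0.25
for a, b in [(0.25, 0.75), (1, 3), (2, 6)]:
    m = stats.beta.mean(a, b)
    print(f"Beta({a}, {b}) mean = {m}")
    assert abs(m - 0.25) < 1e-12
```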

Our other requirement is that the prior drop *off* away from the mean.
As you’ll see in a minute, that’s not necessarily true for all of these priors.

None of these are great, but \(Beta(2,6)\) looks like it’s headed the right direction. Let’s look at one more: \(Beta(4,12)\). I’ll put \(Beta(2,6)\) up there too to show the difference.

Now we’re getting somewhere. We can use \(Beta(4,12)\) as a prior and see how much it gets moved by the data.

Keep in mind that this means we are actually biased towards p = 0.25, i.e. biased toward the position that the US *isn’t* worse off than the other nations in the dataset when it comes to the number of mass shootings.
I’ll annotate the 0.25 point again as well.

So while the posterior as a whole is definitely pulled to the left by the prior, 0.25 is a bit of a stretch. And by a bit, I mean pretty much not possible.

```
mean : 0.545115
sd : 0.056879
iq range : 0.076839
```
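Conjugacy gives a quick sanity check on these numbers: with a \(Beta(4,12)\) prior the exact posterior is \(Beta(4 + 38, 12 + 23) = Beta(42, 35)\), whose mean and sd land right next to the sampled values above (a check, not the post’s code):

```python
from scipy import stats

# Exact conjugate posterior for a Beta(4, 12) prior and 38/61 data
posterior = stats.beta(4 + 38, 12 + 23)
print(f"mean: {posterior.mean():.4f}, sd: {posterior.std():.4f}")
```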

So let’s really ratchet up the prior. Let’s go to \(Beta(8,24)\) and see what the probability of a 0.25 mass shooting probability is. I’ll plot it with \(Beta(4,12)\) to show their relative strengths.

That’s pretty substantial bias. Let’s run the simulation one more time and see what we get. The results should not surprise you.

Oh look, we’re almost touching the line.

```
mean : 0.492897
sd : 0.051340
iq range : 0.070358
```

Are there even any samples equal to or less than 0.25?

```
Number of samples supporting "not a problem": 0
```
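The analytic posterior backs this up. Because the Beta prior is conjugate to the binomial likelihood, the exact posterior here is \(Beta(8 + 38, 24 + 23) = Beta(46, 47)\), and its lower tail shows that zero hits out of ten thousand samples is exactly what you’d expect (a sketch, not the post’s actual code):

```python
from scipy import stats

posterior = stats.beta(8 + 38, 24 + 23)  # exact posterior: Beta(46, 47)
tail = posterior.cdf(0.25)               # P(p <= 0.25)
print(f"P(p <= 0.25) = {tail:.2e}")
print(f"expected such samples out of 10,000: {10_000 * tail:.4f}")
```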

Even if we very strongly bias our initial estimate of a mass shooting’s chances of coming from the US towards “it’s not worse than anywhere else” we still get a really small chance of that actually being the case. The evidence against it is staggering. Out of ten thousand samples, not even one is less than or equal to 0.25.

So the next time someone says the US doesn’t have a mass shooting problem, this dataset says there’s a more than 99.99% chance they’re full of shit.