Introduction

What is this course? It is a crash course in brms: a brief and practical introduction to fitting Bayesian multilevel models in R and Stan using the brms package (Bayesian Regression Models using Stan). Because this is intended as a practical skills course, I will only give a quick and surface-level description of Bayesian analysis and how it can be used for multilevel regression models.

Download the worksheet for this lesson here.

What should you know coming into this course?

This course is designed for practicing researchers who have some experience with statistical analysis and coding in R. Don’t worry if you have zero experience with Bayesian statistics — that’s what I’m here to teach you!

Minimal prerequisites

  • You should have at least heard of a mixed model or multilevel model.
  • You should have at least tried to do stats and/or data science in R before.

Advanced prerequisites

  • If you have experience with the R package lme4, that’s great because the syntax of brms is modeled after lme4.
  • If you have experience with the tidyverse packages, including ggplot2, for manipulating and plotting data, that’s also great because we will be using them in this course.

If you don’t know what those packages are, don’t sweat it because you can just follow along by copying the code.

What will you learn from this course?

Conceptual learning objectives

At the end of this course, you will understand …

  • The very basics of Bayesian inference
  • What a prior, likelihood, and posterior are in the context of a Bayesian model
  • The very basics of how Markov Chain Monte Carlo works
  • What a credible interval is

Practical skills

At the end of this course, you will be able to …

  • Write the code to specify and sample a multilevel model with random intercepts and random slopes in brms
  • Generate plots and statistics to make sure your model converged
  • Interpret the summary output of a multilevel model in brms
  • Compare different models with leave-one-out information criteria
  • Use Bayes factors to assess strength of evidence for effects
  • Make plots of model parameters and predictions, with credible intervals to show uncertainty

What is Bayesian inference?

[Image: portrait presumed to be Thomas Bayes. This may or may not actually be a portrait of Thomas Bayes.]

In its simplest form, Bayesian inference is a method of statistical inference that allows you to use information you already know to assign a prior probability to a hypothesis, then update the probability of that hypothesis as you get more information. If that seems like common sense to you, then you’re starting to understand the appeal of the Bayesian way of doing things!

It is a very powerful tool that has been applied across many fields of human endeavor. Today we are only going to look at its application in estimating the parameters of multilevel statistical models to analyze scientific data, but that’s only one of the many things we can use Bayesian inference for. Bayesian methods are only growing in popularity, thanks in large part to the rapid growth of user-friendly, open-source, computationally powerful software – like Stan and its companion R package brms that we are going to learn about today!

Bayesian inference is named after Reverend Thomas Bayes, an English clergyman, philosopher, and yes, statistician, who wrote a scholarly work on probability (published posthumously in 1763). His essay included an algorithm that uses evidence to find the distribution of the probability parameter of a binomial distribution … using what we now call Bayes’ Theorem! However, much of the philosophical and mathematical foundations of Bayesian inference were really developed by Pierre-Simon Laplace, independently of Bayes and at roughly the same time. Give credit where it’s due!

Bayes’ Theorem

[Image: photo of a neon sign of Bayes’ Theorem]

Bayes’ Theorem is the basis of Bayesian inference. It is a rule that describes how likely an event is to happen (its probability), based on our prior knowledge about conditions related to that event. In other words, it describes the conditional probability of an event A occurring, given that another event B has occurred.

The mathematical statement of Bayes’ theorem is:

\[P(A|B) = \frac{P(B|A)P(A)}{P(B)}\]

This means the probability of A being true given that B is true (\(P(A|B)\)) is equal to the probability that B is true given that A is true (\(P(B|A)\)) times the ratio of probabilities that A and B are true (\(\frac{P(A)}{P(B)}\)).
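If it helps to see the arithmetic spelled out, here is a tiny R snippet that plugs some completely made-up probabilities (my own numbers, chosen only for illustration) into the formula:

```r
# Made-up probabilities, just to show the arithmetic of Bayes' theorem
p_A <- 0.10          # P(A): prior probability that A is true
p_B_given_A <- 0.80  # P(B|A): probability of observing B if A is true
p_B <- 0.25          # P(B): overall probability of observing B

p_A_given_B <- p_B_given_A * p_A / p_B
p_A_given_B  # 0.32: observing B raised the probability of A from 0.10 to 0.32
```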

This may not make sense yet but some more concrete examples might help.

How Bayes’ theorem relates to statistical inference

Let’s say \(A\) is a statistical model. A statistical model is a hypothesis about the world. We want to know the probability that our hypothesis is true. We start out with some prior knowledge about the probability that our model \(A\) is true. So \(P(A)\) is called the prior probability. We go out there and get some data \(B\) to test our hypothesis. We can calculate the probability of observing that data \(B\), given that \(A\) is true: \(P(B|A)\). This is also known as the likelihood of the model \(A\), in light of our observations \(B\). We use this to update our estimate of the probability that \(A\) is true. We now have \(P(A|B)\), the probability our model, or hypothesis, is true, given the data we just observed.

For now we can ignore the denominator, \(P(B)\), the overall probability of observing the data. This is called the marginal probability. Usually the best we can do is to compare the relative probability of different models, given the data. We can’t really say (nor should we say) what the absolute probability of a model is. (“All models are wrong.”) If we are comparing two models \(A_1\) and \(A_2\), both fit using the same data \(B\), and we take the ratio \(\frac{P(A_1|B)}{P(A_2|B)} = \frac{P(B|A_1)P(A_1)/P(B)}{P(B|A_2)P(A_2)/P(B)}\), the two \(P(B)\)s cancel out. So we can temporarily ignore the denominator and write Bayes’ theorem as

\[P(model|data) \propto P(data|model)P(model)\]

or

\[posterior = likelihood \times prior\]

or, flipping the left and right sides around,

what we believed before about the world (prior) × how much our new data changes our beliefs (likelihood) = what we believe now about the world (posterior)

That’s pretty cool!

A “real-life” example of Bayes’ theorem

Let’s say we pick up a coin off the street. We have a very strong prior belief that the true probability of flipping heads on any given flip is 0.5. Let’s say we flipped the coin 10 times and got 8 heads. We probably wouldn’t change that belief very much, because 10 flips simply isn’t enough evidence to sway us. The likelihood of getting 8 heads out of 10, given that the true probability is 0.5, is high enough that our posterior estimate of the probability stays very close to 0.5.

But let’s say a shady character on the street corner approaches us, shows us a coin, and offers to flip it 10 times, paying us $1 for every tails if we pay him $1 for every heads. Before we even see any new evidence, we’d probably suspect that the true probability of heads is not exactly 0.5. Or at least we would hold a more skeptical prior belief and be willing to entertain the possibility that the true probability of heads is higher than 0.5. So now if we observe 8 heads, we might update our estimate of the true probability of heads to something higher, maybe 0.7.

[Image: photo of a magician who can control coin flips. Our shady character might’ve studied this YouTube video.]
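To make the updating concrete, here is a quick sketch in R. The Beta priors are my own choices for illustration (the coin-flip story doesn’t dictate them), but they show how the same 8-out-of-10 evidence barely moves a strong prior and moves a weak prior quite a lot:

```r
# Bayesian updating for the coin-flip example, using Beta priors (my choice).
# With a binomial likelihood, a Beta(a, b) prior updates to a
# Beta(a + heads, b + tails) posterior.
heads <- 8
flips <- 10

posterior_mean <- function(prior_a, prior_b) {
  (prior_a + heads) / (prior_a + prior_b + flips)
}

# Street coin: strong prior belief that P(heads) is very close to 0.5
posterior_mean(100, 100)  # ~0.51: 8/10 heads barely budges a strong prior

# Shady character's coin: weak, skeptical prior
posterior_mean(2, 2)      # ~0.71: the same data moves a weak prior a lot
```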

In a classical, “frequentist” analysis, there is no formal way to incorporate prior information into a statistical model. In both cases (the coin we found on the street and the shady character’s coin), we have no choice but to report the same point estimate of the true probability of heads: 8/10 = 0.8.

The critical point here is that probability, in the Bayesian interpretation, incorporates our uncertain knowledge about an event. For example we might say that the probability of the stock market going up tomorrow is 55%. In the frequentist interpretation, either the stock market will go up tomorrow or it will not. Frequentist probability is just a frequency: if we hypothetically repeated “tomorrow” under the same conditions many times, 55% of the time the stock market would go up. But Bayesian probability can handle the uncertainty. When you calculate a Bayesian probability you are giving a number to a belief, a number that best reflects the state of your knowledge.

Why is Bayes so computationally intensive?

The coin flip example was pretty easy — you can do it in your head! Why do we need such advanced computational tools to do Bayesian inference?

The main reason that Bayesian statistics were not widely adopted initially is probably a practical one. It boils down to \(P(data)\), the marginal probability. We need it to estimate \(P(model|data)\), the posterior. You may have heard of “maximum likelihood” methods. Maximum likelihood algorithms, which are used to fit frequentist models, try to find the single set of parameter values that makes the data most probable. In the Bayesian world, we are still trying to do that, but instead of getting a single point estimate of the most likely parameters, we are trying to estimate a joint distribution over all possible combinations of values the parameters could take, then normalize it by dividing by the marginal probability. We end up with a joint distribution over all the parameters, given the data we observed: the posterior.

If our “model” is just a single parameter like the probability of a coin flip being heads, and our “data” is just a count of heads out of 10, the calculations are trivial. But what if our model is a multilevel regression model with a bunch of fixed effects, a bunch of random intercepts at multiple levels, and variance-covariance parameters for all the errors (like we are about to fit in this very lesson)? We might have hundreds of parameters. To estimate \(P(data)\), we have to integrate a function across n-dimensional space, where n is the total number of parameters in the model!

If you remember even trying to do two-dimensional integration on paper from calculus class, imagine doing 100-dimensional integration or even more! There just isn’t enough time in the universe to get a precise solution for an integral with that many dimensions. In the time of Thomas Bayes, and for hundreds of years after that, Bayesian statistics simply couldn’t be applied to real-world problems because of this computational limitation.
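A little arithmetic shows how quickly brute force fails. Suppose we tried to approximate the integral by evaluating the unnormalized posterior on a grid with 100 points per parameter (a toy calculation of mine, not part of the lesson):

```r
# Number of grid cells needed to evaluate the posterior on a grid with
# 100 points per parameter, for models of increasing size
grid_points <- 100
n_params <- c(1, 2, 5, 10, 100)
grid_points ^ n_params
# 1e+02 1e+04 1e+10 1e+20 1e+200: hopeless long before 100 parameters
```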

Markov Chain Monte Carlo

That’s where Markov Chain Monte Carlo (MCMC), aided by super-fast modern microprocessors, comes in. What is MCMC? It’s a class of algorithms for drawing samples from a probability distribution. The longer the algorithm runs, the more closely its samples approximate the true distribution. I won’t describe it in detail here, but the best beginner-level description of MCMC I’ve ever read is in Richard McElreath’s book Statistical Rethinking.
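Just to demystify the idea of “drawing samples from a distribution,” here is a bare-bones Metropolis sampler (the oldest and simplest flavor of MCMC, and not the algorithm Stan uses) applied to the coin-flip posterior from earlier. This is purely illustrative; McElreath’s book explains the idea far better:

```r
# A minimal Metropolis sampler for P(heads) in the coin-flip example.
# NOT the algorithm Stan uses; just the simplest illustration of MCMC.
set.seed(1)
heads <- 8
flips <- 10

# Unnormalized log posterior: binomial likelihood with a flat prior on (0, 1)
log_post <- function(p) {
  if (p <= 0 || p >= 1) return(-Inf)
  dbinom(heads, flips, p, log = TRUE)
}

n_samples <- 5000
samples <- numeric(n_samples)
current <- 0.5
for (i in seq_len(n_samples)) {
  proposal <- current + rnorm(1, sd = 0.1)          # take a small random step
  accept_prob <- exp(log_post(proposal) - log_post(current))
  if (runif(1) < accept_prob) current <- proposal   # accept or stay put
  samples[i] <- current
}

mean(samples)   # ~0.75: posterior mean of P(heads) under a flat prior
hist(samples)   # the samples trace out the shape of the posterior
```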

Because fitting complicated Bayesian models requires you to characterize a very high-dimensional probability distribution (the posterior), most Bayesian inference uses MCMC (although there is an alternative way to approximate Bayesian inference). Usually, in practice, you run multiple Markov chains for a prespecified number of samples, discard some samples from the beginning of each chain (the warm-up period, when the chain was still searching for the correct solution), and retain only the samples that you think characterize the true posterior distribution. Running multiple chains that start at different initial values is a good idea: if more than one chain “converges” on the same distribution even though they started at different points, that’s a good indication that you’ve found the correct distribution.
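Here is roughly what those settings look like in a brms model call. The formula and data frame are placeholders (we will fit real models later in this lesson), but chains, iter, and warmup are the actual brm() arguments that control the sampling just described:

```r
library(brms)

# Placeholder formula and data; the MCMC settings are the point here.
fit <- brm(
  y ~ x + (1 | group),  # hypothetical model formula
  data = dat,           # hypothetical data frame
  chains = 4,           # four independent chains, different starting points
  iter = 2000,          # 2000 total iterations per chain ...
  warmup = 1000,        # ... with the first 1000 discarded as warm-up
  seed = 1              # for reproducibility
)
```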

What is Stan?

[Image: Stan software logo]

MCMC is a class of algorithms — there’s more than one way to do it. In 1987, a group of physicists proposed a new MCMC algorithm called Hamiltonian Monte Carlo (HMC) to do calculations in quantum physics. I won’t describe HMC here, except to say that it really works! Older MCMC algorithms are not nearly as efficient as HMC, which can often converge on a solution ten times faster, if not more. That makes Bayesian inference much more practical. More recently, Andrew Gelman and colleagues developed Stan, a software platform for Bayesian statistical modeling that uses HMC. The models we will fit with brms in this lesson use Stan for the back-end computation.
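You never have to write Stan code yourself in this course, but if you’re curious you can peek at the Stan program that brms writes for you behind the scenes. For example (with a hypothetical formula and data frame):

```r
library(brms)

# Print the Stan code brms would generate for a simple multilevel model
make_stancode(y ~ x + (1 | group), data = dat)
```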