A crash course in Bayesian mixed models with brms (Lesson 1)
What is this class?
A brief and practical introduction to fitting Bayesian multilevel models in R and Stan
Using brms (Bayesian Regression Models using Stan)
Quick intro to Bayesian inference
Mostly practical skills
Minimal prerequisites
Know what a mixed-effects or multilevel model is
A little experience with stats and/or data science in R
Vague knowledge of what Bayesian stats are
Advanced prerequisites
Knowing about the lme4 package will help
Knowing about tidyverse and ggplot2 will help
How to follow the course
Slides and text version of lessons are online
Fill in code in the worksheet (replace ... with code)
You can always copy and paste code from text version of lesson if you fall behind
Conceptual learning objectives
At the end of this course, you will understand …
The basics of Bayesian inference
What a prior, likelihood, and posterior are
The basics of how Markov Chain Monte Carlo works
What a credible interval is
Practical learning objectives
At the end of this course, you will be able to …
Write brms code to fit a multilevel model with random intercepts and random slopes
Diagnose and deal with convergence problems
Interpret brms output
Compare models with LOO information criteria
Use Bayes factors and “Bayesian p-values” to assess strength of evidence for effects
Make plots of model parameters and predictions with credible intervals
What is Bayesian inference?
What is Bayesian inference?
A method of statistical inference that allows you to use information you already know to assign a prior probability to a hypothesis, then update the probability of that hypothesis as you get more information
Used in many disciplines and fields
We’re going to look at how to use it to estimate parameters of statistical models to analyze scientific data
Powerful, user-friendly, open-source software is making it easier for everyone to go Bayesian
Bayes’ Theorem
Thomas Bayes, 1763
Pierre-Simon Laplace, 1774
Bayes’ Theorem
\[P(A|B) = \frac{P(B|A)P(A)}{P(B)}\]
How likely an event is to happen based on our prior knowledge about conditions related to that event
The conditional probability of an event A occurring, given that another event B has occurred
Bayes’ Theorem
\[P(A|B) = \frac{P(B|A)P(A)}{P(B)}\]
The probability of A being true given that B is true (\(P(A|B)\))
is equal to the probability that B is true given that A is true (\(P(B|A)\))
times the ratio of probabilities that A and B are true (\(\frac{P(A)}{P(B)}\))
Bayes’ theorem and statistical inference
Let’s say \(A\) is a statistical model (a hypothesis about the world)
How probable is it that our hypothesis is true?
\(P(A)\): prior probability that we assign based on our subjective knowledge before we get any data
Bayes’ theorem and statistical inference
We go out and get some data \(B\)
\(P(B|A)\): the likelihood, the probability of observing that data if our model \(A\) is true
Use the likelihood to update our estimate of probability of our model
\(P(A|B)\): posterior probability that model \(A\) is true, given that we observed \(B\).
Bayes’ theorem and statistical inference
\[P(A|B) = \frac{P(B|A)P(A)}{P(B)}\]
What about \(P(B)\)?
marginal probability, the probability of the data
Basically just a normalizing constant
If we are comparing two models with the same data, the two \(P(B)\)s cancel out
Restating Bayes’ theorem
\[P(model|data) \propto P(data|model)P(model)\]
\[posterior \propto likelihood \times prior\]
what we believed before about the world (prior) × how much our new data changes our beliefs (likelihood) = what we believe now about the world (posterior)
Example
Find a coin on the street. What is our prior estimate of the probability of flipping heads?
Now we flip 10 times and get 8 heads. What is our belief now?
Our belief probably doesn’t change much, because we have a strong prior that the coin is fair and 8 heads out of 10 is still reasonably likely if the true probability is 0.5
Shady character on the street shows us a coin and offers to flip it. He will pay $1 for each tails if we pay $1 for each heads
What is our prior estimate of the probability?
He flips 10 times and gets 8 heads. What’s our belief now?
In classical “frequentist” analysis we cannot incorporate prior information into the analysis
In each case our point estimate of the probability of heads would be 0.8 (compare the Bayesian sketch below)
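Here is a minimal sketch of this coin example in R, using a conjugate Beta prior so the posterior can be computed exactly; the prior parameter values are made up for illustration and are not part of the lesson code.
# Beta(a, b) prior plus 8 heads and 2 tails gives a Beta(a + 8, b + 2) posterior
strong_prior <- c(a = 50, b = 50)  # found coin: we firmly believe it is close to fair
weak_prior <- c(a = 1, b = 1)      # shady character's coin: flat prior, no prior knowledge
heads <- 8
tails <- 2

# Posterior mean of the probability of heads under each prior
(strong_prior['a'] + heads) / (sum(strong_prior) + heads + tails)  # about 0.53: belief barely moves
(weak_prior['a'] + heads) / (sum(weak_prior) + heads + tails)      # 0.75: close to the frequentist 0.8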
Bayesian vs. frequentist probability
Probability, in the Bayesian interpretation, includes how uncertain our knowledge of an event is
Example: Before the 2016 Olympics I said “The probability that Usain Bolt will win the gold medal in the men’s 100 meter dash is 75%.”
In frequentist analysis, one single event does not have a probability. Either Bolt wins or Bolt loses
In frequentist analysis, probability is a long-run frequency: we could only say that if the 2016 men’s 100 m final were repeated many times, Bolt would win 75% of them
But Bayesian probability sees the single event as an uncertain outcome, given our imperfect knowledge
Calculating Bayesian probability = giving a number to a belief that best reflects the state of your knowledge
Bayes is computationally intensive
We need to calculate an integral to find \(P(data)\), which we need to get \(P(model|data)\), the posterior
But the “model” is not just one parameter, it might be 100s or 1000s of parameters
Need to calculate an integral with 100s or 1000s of dimensions
For many years, this was computationally not possible
Markov Chain Monte Carlo (MCMC)
Class of algorithms for sampling from probability distributions
The longer the chain runs, the closer the distribution of its samples gets to the target distribution
In Bayesian inference, we run multiple Markov chains for a preset number of samples
Discard the initial samples (warmup)
What remains is our estimate of the posterior distribution (a toy sampler is sketched below)
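As a toy illustration (not part of the lesson code, and not the algorithm Stan actually uses), here is a minimal Metropolis sampler in R that draws from a standard normal target distribution and discards a warmup period.
set.seed(1)
n_iter <- 5000
chain <- numeric(n_iter)
chain[1] <- 10  # deliberately poor starting value

for (i in 2:n_iter) {
  proposal <- chain[i - 1] + rnorm(1, 0, 1)  # propose a random jump
  # Accept the jump with probability equal to the ratio of target densities
  if (runif(1) < dnorm(proposal) / dnorm(chain[i - 1])) {
    chain[i] <- proposal
  } else {
    chain[i] <- chain[i - 1]
  }
}

posterior_sample <- chain[-(1:1000)]  # discard the first 1000 iterations as warmup
hist(posterior_sample)                # histogram approximates the target distribution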
Hamiltonian Monte Carlo (HMC) and Stan
HMC is among the fastest and most efficient MCMC algorithms developed to date
It’s implemented in software called Stan
What is brms?
An easy way to fit Bayesian mixed models using Stan in R
Syntax of brms models is just like lme4 (see the side-by-side sketch below)
Runs a Stan model behind the scenes
Automatically assigns sensible priors and does lots of tricks to speed up HMC convergence
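For example, a random-intercept model translates almost directly from lme4 to brms; the data frame dat and the variables y, x, and group below are hypothetical placeholders.
library(lme4)
library(brms)

fit_lme4 <- lmer(y ~ x + (1 | group), data = dat)  # frequentist fit with lme4
fit_brms <- brm(y ~ x + (1 | group), data = dat)   # Bayesian fit with brms, run through Stan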
Bayes Myths: Busted!
Myth 1. Bayes is confusing
Myth 2. Bayes is subjective
Myth 3. Bayes takes too long
Myth 4. You can’t be both Bayesian and frequentist
Myth 1: Bayes is confusing — BUSTED!
Yes, it can be confusing at first
Largely because frequentist methods were, and still often are, the only ones taught, so Bayesian methods are unfamiliar
Frequentist methods can be just as confusing (they focus on rejecting null hypotheses and require you to imagine a hypothetical population, which may not even exist, to calculate long-run frequencies of events)
Myth 2: Bayes is subjective — BUSTED!
A very old misconception that dates back at least to R. A. Fisher
Bayesian analysis is no more subjective than frequentist
All statistical approaches make assumptions and require prior knowledge
Incorporating prior knowledge is not a bias to be avoided, but a benefit we should embrace
Selectively ignoring prior information is like selectively cherry-picking which data to include in an analysis
Myth 3: Bayes takes too long — BUSTED! (sort of)
There’s no denying that these models take a lot of computing time and memory
Two ways around it
Fast algorithms for Monte Carlo sampling that quickly converge on the correct solution
Algorithms that approximate the solution of the integral without having to do full MC sampling
brms uses the first approach, check out INLA for the second approach
Myth 4: You can’t be both Bayesian and frequentist — BUSTED!
Each statistical approach is a tool in your toolkit, not a dogma to blindly follow
The differences between Bayesian and frequentist methods are often exaggerated
Model selection, cross-validation, and penalized regression, all used in frequentist analysis, have effects similar to the use of priors in Bayesian analysis
With the rise of machine learning the boundaries between these traditional approaches are blurring
Why use Bayes?
Some models just can’t be fit with frequentist maximum-likelihood methods
Adding priors makes the computational algorithms work better and get the correct answer
Estimate how big effects are instead of yes-or-no framework of rejecting a null hypothesis
We can say “the probability of something being between a and b is 95%” instead of “if we ran this experiment many times, the estimate would be between a and b 95% of the time.”
The second argument, ~ N, tells emmeans() to estimate a marginal mean for each level of N
N_emmeans <- emmeans(fit_fixedN_priors, ~ N)
Differences between each pair of means
Subtract the posterior samples of one mean from another
Done with the contrast() function, specifying method = 'pairwise' to get all pairwise comparisons
contrast(N_emmeans, method = 'pairwise')
Special extensions to ggplot2 (from the tidybayes and ggdist packages) for plotting quantiles of the posterior distribution
gather_emmeans_draws() is used to put posterior samples from emmeans object into a dataframe
post_emm_draws <- gather_emmeans_draws(N_emmeans)

ggplot(post_emm_draws, aes(y = N, x = .value)) +
  stat_halfeye(.width = c(.8, .95)) +
  labs(x = 'estimated marginal mean yield')
ggplot(post_emm_draws, aes(y = N, x = .value)) +
  stat_interval() +
  stat_summary(fun = median, geom = 'point', size = 2) +
  scale_color_brewer(palette = 'Blues') +
  labs(x = 'estimated marginal mean yield')
Model with multiple fixed effects
Fixed-effect part is now 1 + N + P + K
fit_fixedNPK <- brm(
  yield ~ 1 + N + P + K + (1 | site) + (1 | block:site),
  data = stvincent,
  prior = c(prior(normal(0, 10), class = b)),
  chains = 4,
  iter = 2000,
  warmup = 1000,
  seed = 3,
  file = 'fits/fit_fixedNPK'
)
Look at trace plots, posterior predictive check, and model summary.
What do they show?
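For example, a sketch using the standard brms functions for these diagnostics:
plot(fit_fixedNPK)      # trace plots and posterior density plots
pp_check(fit_fixedNPK)  # posterior predictive check
summary(fit_fixedNPK)   # parameter estimates, credible intervals, and R-hat values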
Model with random slopes
So far we’ve assumed any predictor’s effect is the same at every site (only intercept varies, not slope)
Add random slope term to allow both intercept and slope to vary
Specify a random slope by adding the predictor to the left-hand side of the | in the random-effect specification
random intercept only (1 | site)
random intercept and random slope with respect to N (1 + N | site)
random intercept and random slope with respect to N, P, and K (1 + N + P + K | site)
Not enough data for block-level random slopes
fit_randomNPKslopes <- brm(
  yield ~ 1 + N + P + K + (1 + N + P + K | site) + (1 | block:site),
  data = stvincent,
  prior = c(prior(normal(0, 10), class = b)),
  chains = 4,
  iter = 2000,
  warmup = 1000,
  seed = 4,
  file = 'fits/fit_randomNPKslopes'
)
Look at trace plots, pp_check, and model summary
Lots of parameters, because the random intercept and the random slopes for N, P, and K (each a three-level factor) all have their own variances and covariances (see the sketch below)
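One way to inspect those group-level standard deviations and correlations is sketched below; the same information also appears in summary().
VarCorr(fit_randomNPKslopes)  # SDs and correlations of the group-level (random) effects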
Predictions for N response from random slope model
We’re now averaging over random intercepts by site, random slopes by site, block effects within site, and the other fixed effects
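A sketch of how the earlier emmeans workflow carries over to the random-slope model; N_emmeans_slopes is a made-up object name.
N_emmeans_slopes <- emmeans(fit_randomNPKslopes, ~ N)  # posterior marginal means of yield by N level
contrast(N_emmeans_slopes, method = 'pairwise')        # pairwise differences between N levels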