A crash course in Bayesian mixed models with brms (Lesson 1)
What is this class?
A brief and practical introduction to fitting Bayesian multilevel models in R and Stan
Using brms (Bayesian Regression Models using Stan)
Quick intro to Bayesian inference
Mostly practical skills
Minimal prerequisites
Know what a mixed-effects or multilevel model is
A little experience with stats and/or data science in R
Vague knowledge of what Bayesian stats are
Advanced prerequisites
Knowing about the lme4 package will help
Knowing about tidyverse and ggplot2 will help
How to follow the course
Slides and text version of lessons are online
Fill in code in the worksheet (replace ... with code)
You can always copy and paste code from text version of lesson if you fall behind
Conceptual learning objectives
At the end of this course, you will understand …
The basics of Bayesian inference
What a prior, likelihood, and posterior are
The basics of how Markov Chain Monte Carlo works
What a credible interval is
Practical learning objectives
At the end of this course, you will be able to …
Write brms code to fit a multilevel model with random intercepts and random slopes
Diagnose and deal with convergence problems
Interpret brms output
Compare models with LOO information criteria
Use Bayes factors and “Bayesian p-values” to assess strength of evidence for effects
Make plots of model parameters and predictions with credible intervals
What is Bayesian inference?
What is Bayesian inference?
A method of statistical inference that allows you to use information you already know to assign a prior probability to a hypothesis, then update the probability of that hypothesis as you get more information
Used in many disciplines and fields
We’re going to look at how to use it to estimate parameters of statistical models to analyze scientific data
Powerful, user-friendly, open-source software is making it easier for everyone to go Bayesian
Bayes’ Theorem
Thomas Bayes, 1763
Pierre-Simon Laplace, 1774
Bayes’ Theorem
\[P(A|B) = \frac{P(B|A)P(A)}{P(B)}\]
How likely an event is to happen based on our prior knowledge about conditions related to that event
The conditional probability of an event A occurring, given that another event B has occurred
Bayes’ Theorem
\[P(A|B) = \frac{P(B|A)P(A)}{P(B)}\]
The probability of A being true given that B is true (\(P(A|B)\))
is equal to the probability that B is true given that A is true (\(P(B|A)\))
times the ratio of probabilities that A and B are true (\(\frac{P(A)}{P(B)}\))
Bayes’ theorem and statistical inference
Let’s say \(A\) is a statistical model (a hypothesis about the world)
How probable is it that our hypothesis is true?
\(P(A)\): prior probability that we assign based on our subjective knowledge before we get any data
Bayes’ theorem and statistical inference
We go out and get some data \(B\)
\(P(B|A)\): the likelihood, the probability of observing that data if our model \(A\) is true
Use the likelihood to update our estimate of probability of our model
\(P(A|B)\): posterior probability that model \(A\) is true, given that we observed \(B\).
Bayes’ theorem and statistical inference
\[P(A|B) = \frac{P(B|A)P(A)}{P(B)}\]
What about \(P(B)\)?
marginal probability, the probability of the data
Basically just a normalizing constant
If we are comparing two models with the same data, the two \(P(B)\)s cancel out
Restating Bayes’ theorem
\[P(model|data) \propto P(data|model)P(model)\]
\[posterior \propto likelihood \times prior\]
what we believed before about the world (prior) × how much our new data changes our beliefs (likelihood) = what we believe now about the world (posterior)
Example
Find a coin on the street. What is our prior estimate of the probability of flipping heads?
Now we flip 10 times and get 8 heads. What is our belief now?
Our belief probably doesn’t change much, because we have a strong prior that the coin is fair and 8 heads out of 10 is still reasonably likely if the true probability is 0.5
Shady character on the street shows us a coin and offers to flip it. He will pay $1 for each tails if we pay $1 for each heads
What is our prior estimate of the probability?
He flips 10 times and gets 8 heads. What’s our belief now?
In classical “frequentist” analysis we cannot incorporate prior information into the analysis
In each case our point estimate of the probability of heads would be 0.8 (compare the Bayesian sketch below)
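Here is a minimal sketch of this coin example in R, using a conjugate Beta prior so the posterior can be computed exactly; the prior parameter values are made up for illustration and are not part of the lesson code.
# Beta(a, b) prior plus 8 heads and 2 tails gives a Beta(a + 8, b + 2) posterior
strong_prior <- c(a = 50, b = 50)  # found coin: we firmly believe it is close to fair
weak_prior <- c(a = 1, b = 1)      # shady character's coin: flat prior, no prior knowledge
heads <- 8
tails <- 2

# Posterior mean of the probability of heads under each prior
(strong_prior['a'] + heads) / (sum(strong_prior) + heads + tails)  # about 0.53: belief barely moves
(weak_prior['a'] + heads) / (sum(weak_prior) + heads + tails)      # 0.75: close to the frequentist 0.8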
Bayesian vs. frequentist probability
Probability, in the Bayesian interpretation, includes how uncertain our knowledge of an event is
Example: Before the 2016 Olympics I said “The probability that Usain Bolt will win the gold medal in the men’s 100 meter dash is 75%.”
In frequentist analysis, one single event does not have a probability. Either Bolt wins or Bolt loses
In frequentist analysis, probability is a long-run frequency: we could only say that if the 2016 men’s 100 m final were repeated many times, Bolt would win 75% of them
But Bayesian probability sees the single event as an uncertain outcome, given our imperfect knowledge
Calculating Bayesian probability = giving a number to a belief that best reflects the state of your knowledge
Bayes is computationally intensive
We need to calculate an integral to find \(P(data)\), which we need to get \(P(model|data)\), the posterior
But the “model” is not just one parameter, it might be 100s or 1000s of parameters
Need to calculate an integral with 100s or 1000s of dimensions
For many years, this was computationally not possible
Markov Chain Monte Carlo (MCMC)
Class of algorithms for sampling from probability distributions
The longer the chain runs, the closer the distribution of its samples gets to the target distribution
In Bayesian inference, we run multiple Markov chains for a preset number of samples
Discard the initial samples (warmup)
What remains is our estimate of the posterior distribution (a toy sampler is sketched below)
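As a toy illustration (not part of the lesson code, and not the algorithm Stan actually uses), here is a minimal Metropolis sampler in R that draws from a standard normal target distribution and discards a warmup period.
set.seed(1)
n_iter <- 5000
chain <- numeric(n_iter)
chain[1] <- 10  # deliberately poor starting value

for (i in 2:n_iter) {
  proposal <- chain[i - 1] + rnorm(1, 0, 1)  # propose a random jump
  # Accept the jump with probability equal to the ratio of target densities
  if (runif(1) < dnorm(proposal) / dnorm(chain[i - 1])) {
    chain[i] <- proposal
  } else {
    chain[i] <- chain[i - 1]
  }
}

posterior_sample <- chain[-(1:1000)]  # discard the first 1000 iterations as warmup
hist(posterior_sample)                # histogram approximates the target distribution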
Hamiltonian Monte Carlo (HMC) and Stan
HMC is among the fastest and most efficient MCMC algorithms developed to date
It’s implemented in software called Stan
What is brms?
An easy way to fit Bayesian mixed models using Stan in R
Syntax of brms models is just like lme4 (see the side-by-side sketch below)
Runs a Stan model behind the scenes
Automatically assigns sensible priors and does lots of tricks to speed up HMC convergence
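For example, a random-intercept model translates almost directly from lme4 to brms; the data frame dat and the variables y, x, and group below are hypothetical placeholders.
library(lme4)
library(brms)

fit_lme4 <- lmer(y ~ x + (1 | group), data = dat)  # frequentist fit with lme4
fit_brms <- brm(y ~ x + (1 | group), data = dat)   # Bayesian fit with brms, run through Stan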
Bayes Myths: Busted!
Myth 1. Bayes is confusing
Myth 2. Bayes is subjective
Myth 3. Bayes takes too long
Myth 4. You can’t be both Bayesian and frequentist
Myth 1: Bayes is confusing — BUSTED!
Yes, it can be confusing at first
Largely because frequentist methods were, and still often are, the only ones taught, so Bayesian methods are unfamiliar
Frequentist methods can be just as confusing (they focus on rejecting null hypotheses and require you to imagine a hypothetical population, which may not even exist, to calculate long-run frequencies of events)
Myth 2: Bayes is subjective — BUSTED!
A very old misconception that dates back at least to R. A. Fisher
Bayesian analysis is no more subjective than frequentist
All statistical approaches make assumptions and require prior knowledge
Incorporating prior knowledge is not a bias to be avoided, but a benefit we should embrace
Selectively ignoring prior information is like selectively cherry-picking which data to include in an analysis
Myth 3: Bayes takes too long — BUSTED! (sort of)
There’s no denying that these models take a lot of computing time and memory
Two ways around it
Fast algorithms for Monte Carlo sampling that quickly converge on the correct solution
Algorithms that approximate the solution of the integral without having to do full MC sampling
brms uses the first approach, check out INLA for the second approach
Myth 4: You can’t be both Bayesian and frequentist — BUSTED!
Each statistical approach is a tool in your toolkit, not a dogma to blindly follow
The differences between Bayesian and frequentist methods are often exaggerated
Model selection, cross-validation, and penalized regression, all used in frequentist analysis, have effects similar to the use of priors in Bayesian analysis
With the rise of machine learning the boundaries between these traditional approaches are blurring
Why use Bayes?
Some models just can’t be fit with frequentist maximum-likelihood methods
Adding priors makes the computational algorithms work better and get the correct answer
Estimate how big effects are instead of yes-or-no framework of rejecting a null hypothesis
We can say “the probability of something being between a and b is 95%” instead of “if we ran this experiment many times, the estimate would be between a and b 95% of the time.”
The second argument, ~ N, tells emmeans() to estimate a marginal mean for each level of N
N_emmeans <- emmeans(fit_fixedN_priors, ~ N)
Differences between each pair of means
Subtract the posterior samples of one mean from another
Done with the contrast() function, specifying method = 'pairwise' to get all pairwise comparisons
contrast(N_emmeans, method = 'pairwise')
Special extensions to ggplot2 (from the tidybayes and ggdist packages) for plotting quantiles of the posterior distribution
gather_emmeans_draws() is used to put posterior samples from emmeans object into a dataframe
post_emm_draws <- gather_emmeans_draws(N_emmeans)

ggplot(post_emm_draws, aes(y = N, x = .value)) +
  stat_halfeye(.width = c(.8, .95)) +
  labs(x = 'estimated marginal mean yield')
ggplot(post_emm_draws, aes(y = N, x = .value)) +
  stat_interval() +
  stat_summary(fun = median, geom = 'point', size = 2) +
  scale_color_brewer(palette = 'Blues') +
  labs(x = 'estimated marginal mean yield')
Model with multiple fixed effects
Fixed-effect part is now 1 + N + P + K
fit_fixedNPK <- brm(
  yield ~ 1 + N + P + K + (1 | site) + (1 | block:site),
  data = stvincent,
  prior = c(prior(normal(0, 10), class = b)),
  chains = 4,
  iter = 2000,
  warmup = 1000,
  seed = 3,
  file = 'fits/fit_fixedNPK'
)
Look at trace plots, posterior predictive check, and model summary.
What do they show?
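For example, a sketch using the standard brms functions for these diagnostics:
plot(fit_fixedNPK)      # trace plots and posterior density plots
pp_check(fit_fixedNPK)  # posterior predictive check
summary(fit_fixedNPK)   # parameter estimates, credible intervals, and R-hat values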
Model with random slopes
So far we’ve assumed any predictor’s effect is the same at every site (only intercept varies, not slope)
Add random slope term to allow both intercept and slope to vary
Specify a random slope by adding the predictor to the left-hand side of the | in the random-effect specification
random intercept only (1 | site)
random intercept and random slope with respect to N (1 + N | site)
random intercept and random slope with respect to N, P, and K (1 + N + P + K | site)
Not enough data for block-level random slopes
fit_randomNPKslopes <- brm(
  yield ~ 1 + N + P + K + (1 + N + P + K | site) + (1 | block:site),
  data = stvincent,
  prior = c(prior(normal(0, 10), class = b)),
  chains = 4,
  iter = 2000,
  warmup = 1000,
  seed = 4,
  file = 'fits/fit_randomNPKslopes'
)
Look at trace plots, pp_check, and model summary
Lots of parameters, because the random intercept and the random slopes for N, P, and K (each a three-level factor) all have their own variances and covariances (see the sketch below)
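One way to inspect those group-level standard deviations and correlations is sketched below; the same information also appears in summary().
VarCorr(fit_randomNPKslopes)  # SDs and correlations of the group-level (random) effects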
Predictions for N response from random slope model
We’re now averaging over random intercepts by site, random slopes by site, block effects within site, and the other fixed effects
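A sketch of how the earlier emmeans workflow carries over to the random-slope model; N_emmeans_slopes is a made-up object name.
N_emmeans_slopes <- emmeans(fit_randomNPKslopes, ~ N)  # posterior marginal means of yield by N level
contrast(N_emmeans_slopes, method = 'pairwise')        # pairwise differences between N levels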