Introduction

What is this course? It is a crash course in brms: a brief and practical introduction to fitting Bayesian multilevel models in R and Stan using the brms package (Bayesian Regression Models using Stan). Because this is intended as a practical skills course, I will only give a quick and surface-level description of Bayesian analysis and how it can be used for multilevel regression models.

Download the worksheet for this lesson here.

What should you know coming into this course?

This course is designed for practicing researchers who have some experience with statistical analysis and coding in R. Don’t worry if you have zero experience with Bayesian statistics — that’s what I’m here to teach you!

Minimal prerequisites

  • You should have at least heard of a mixed model or multilevel model.
  • You should have at least tried to do stats and/or data science in R before.

Advanced prerequisites

  • If you have experience with the R package lme4, that’s great because the syntax of brms is modeled after lme4.
  • If you have experience with the tidyverse packages, including ggplot2, for manipulating and plotting data, that’s also great because we will be using them in this course.

If you don’t know what those packages are, don’t sweat it because you can just follow along by copying the code.

What will you learn from this course?

Conceptual learning objectives

At the end of this course, you will understand …

  • The very basics of Bayesian inference
  • What a prior, likelihood, and posterior are in the context of a Bayesian model
  • The very basics of how Markov Chain Monte Carlo works
  • What a credible interval is

Practical skills

At the end of this course, you will be able to …

  • Write the code to specify and sample a multilevel model with random intercepts and random slopes in brms
  • Generate plots and statistics to make sure your model converged
  • Interpret the summary output of a multilevel model in brms
  • Compare different models with leave-one-out information criteria
  • Use Bayes factors to assess strength of evidence for effects
  • Make plots of model parameters and predictions, with credible intervals to show uncertainty

What is Bayesian inference?

[Image: portrait presumed to be Thomas Bayes. This may or may not actually be a portrait of Thomas Bayes.]

In its simplest form, Bayesian inference is a method of statistical inference that allows you to use information you already know to assign a prior probability to a hypothesis, then update the probability of that hypothesis as you get more information. If that seems like common sense to you, then you’re starting to understand the appeal of the Bayesian way of doing things!

It is a very powerful tool that has been applied across many fields of human endeavor. Today we are only going to look at its application in estimating the parameters of multilevel statistical models to analyze scientific data, but that’s only one of the many things we can use Bayesian inference for. Bayesian methods are only growing in popularity, thanks in large part to the rapid growth of user-friendly, open-source, computationally powerful software – like Stan and its companion R package brms that we are going to learn about today!

Bayesian inference is named after Reverend Thomas Bayes, an English clergyman, philosopher, and yes, statistician, who wrote a scholarly work on probability (published posthumously in 1763). His essay included an algorithm that uses evidence to find the distribution of the probability parameter of a binomial distribution … using what we now call Bayes’ Theorem! However, much of the philosophical and mathematical foundations of Bayesian inference were really developed by Pierre-Simon Laplace, independently of Bayes and at roughly the same time. Give credit where it’s due!

Bayes’ Theorem

[Image: photo of a neon sign of Bayes’ Theorem]

Bayes’ Theorem is the basis of Bayesian inference. It is a rule that describes how likely an event is to happen (its probability), based on our prior knowledge about conditions related to that event. In other words, it describes the conditional probability of an event A occurring, given that another event B has occurred.

The mathematical statement of Bayes’ theorem is:

\[P(A|B) = \frac{P(B|A)P(A)}{P(B)}\]

This means the probability of A being true given that B is true (\(P(A|B)\)) is equal to the probability that B is true given that A is true (\(P(B|A)\)) times the ratio of probabilities that A and B are true (\(\frac{P(A)}{P(B)}\)).
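If it helps to see the arithmetic spelled out, here is a tiny R snippet that plugs some completely made-up probabilities (my own numbers, chosen only for illustration) into the formula:

```r
# Made-up probabilities, just to show the arithmetic of Bayes' theorem
p_A <- 0.10          # P(A): prior probability that A is true
p_B_given_A <- 0.80  # P(B|A): probability of observing B if A is true
p_B <- 0.25          # P(B): overall probability of observing B

p_A_given_B <- p_B_given_A * p_A / p_B
p_A_given_B  # 0.32: observing B raised the probability of A from 0.10 to 0.32
```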

This may not make sense yet but some more concrete examples might help.

How Bayes’ theorem relates to statistical inference

Let’s say \(A\) is a statistical model. A statistical model is a hypothesis about the world. We want to know the probability that our hypothesis is true. We start out with some prior knowledge about the probability that our model \(A\) is true. So \(P(A)\) is called the prior probability. We go out there and get some data \(B\) to test our hypothesis. We can calculate the probability of observing that data \(B\), given that \(A\) is true: \(P(B|A)\). This is also known as the likelihood of the model \(A\), in light of our observations \(B\). We use this to update our estimate of the probability that \(A\) is true. We now have \(P(A|B)\), the probability our model, or hypothesis, is true, given the data we just observed.

For now we can ignore the denominator, \(P(B)\), the overall probability of observing the data. This is called the marginal probability. Usually the best we can do is to compare the relative probability of different models, given the data. We can’t really say (nor should we say) what the absolute probability of a model is. (“All models are wrong.”) If we are comparing two models \(A_1\) and \(A_2\), both fit using the same data \(B\), and we take the ratio \(\frac{P(A_1|B)}{P(A_2|B)} = \frac{P(B|A_1)P(A_1)/P(B)}{P(B|A_2)P(A_2)/P(B)}\), the two \(P(B)\)s cancel out. So we can temporarily ignore the denominator and write Bayes’ theorem as

\[P(model|data) \propto P(data|model)P(model)\]

or

\[posterior = likelihood \times prior\]

or, flipping the left and right sides around,

what we believed before about the world (prior) × how much our new data changes our beliefs (likelihood) = what we believe now about the world (posterior)

That’s pretty cool!

A “real-life” example of Bayes’ theorem

Let’s say we pick up a coin off the street. We have a very strong prior belief that the true probability of flipping heads on any given flip is 0.5. Let’s say we flipped the coin 10 times and got 8 heads. We probably wouldn’t change that belief very much, because 10 flips simply isn’t enough evidence to sway us. The likelihood of getting 8 heads out of 10, given that the true probability is 0.5, is high enough that our posterior estimate of the probability stays very close to 0.5.

But let’s say a shady character on the street corner approaches us, shows us a coin, and offers to flip it 10 times, paying us $1 for every tails if we pay him $1 for every heads. Before we even see any new evidence, we’d probably suspect that the true probability of heads is not exactly 0.5. Or at least we would hold a more skeptical prior belief and be willing to entertain the possibility that the true probability of heads is higher than 0.5. So now if we observe 8 heads, we might update our estimate of the true probability of heads to something higher, maybe 0.7.

[Image: photo of a magician who can control coin flips. Our shady character might’ve studied this YouTube video.]
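To make the updating concrete, here is a quick sketch in R. The Beta priors are my own choices for illustration (the coin-flip story doesn’t dictate them), but they show how the same 8-out-of-10 evidence barely moves a strong prior and moves a weak prior quite a lot:

```r
# Bayesian updating for the coin-flip example, using Beta priors (my choice).
# With a binomial likelihood, a Beta(a, b) prior updates to a
# Beta(a + heads, b + tails) posterior.
heads <- 8
flips <- 10

posterior_mean <- function(prior_a, prior_b) {
  (prior_a + heads) / (prior_a + prior_b + flips)
}

# Street coin: strong prior belief that P(heads) is very close to 0.5
posterior_mean(100, 100)  # ~0.51: 8/10 heads barely budges a strong prior

# Shady character's coin: weak, skeptical prior
posterior_mean(2, 2)      # ~0.71: the same data moves a weak prior a lot
```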

In a classical, “frequentist” analysis, there is no formal way to incorporate prior information into a statistical model. In both cases (the coin we found on the street and the shady character’s coin), we have no choice but to report the same point estimate of the true probability of heads: 8/10 = 0.8.

The critical point here is that probability, in the Bayesian interpretation, incorporates our uncertain knowledge about an event. For example we might say that the probability of the stock market going up tomorrow is 55%. In the frequentist interpretation, either the stock market will go up tomorrow or it will not. Frequentist probability is just a frequency: if we hypothetically repeated “tomorrow” under the same conditions many times, 55% of the time the stock market would go up. But Bayesian probability can handle the uncertainty. When you calculate a Bayesian probability you are giving a number to a belief, a number that best reflects the state of your knowledge.

Why is Bayes so computationally intensive?

The coin flip example was pretty easy — you can do it in your head! Why do we need such advanced computational tools to do Bayesian inference?

The main reason that Bayesian statistics were not widely adopted initially is probably a practical one. It boils down to \(P(data)\), the marginal probability. We need it to estimate \(P(model|data)\), the posterior. You may have heard of “maximum likelihood” methods. Maximum likelihood algorithms, which are used to fit frequentist models, try to find the single set of parameter values that makes the data most probable. In the Bayesian world, we are still trying to do that, but instead of getting a single point estimate of the most likely parameters, we are trying to estimate a joint distribution over all possible combinations of values the parameters could take, then normalize it by dividing by the marginal probability. We end up with a joint distribution over all the parameters, given the data we observed: the posterior.

If our “model” is just a single parameter like the probability of a coin flip being heads, and our “data” is just a count of heads out of 10, the calculations are trivial. But what if our model is a multilevel regression model with a bunch of fixed effects, a bunch of random intercepts at multiple levels, and variance-covariance parameters for all the errors (like we are about to fit in this very lesson)? We might have hundreds of parameters. To estimate \(P(data)\), we have to integrate a function across n-dimensional space, where n is the total number of parameters in the model!

If you remember even trying to do two-dimensional integration on paper from calculus class, imagine doing 100-dimensional integration or even more! There just isn’t enough time in the universe to get a precise solution for an integral with that many dimensions. In the time of Thomas Bayes, and for hundreds of years after that, Bayesian statistics simply couldn’t be applied to real-world problems because of this computational limitation.
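A little arithmetic shows how quickly brute force fails. Suppose we tried to approximate the integral by evaluating the unnormalized posterior on a grid with 100 points per parameter (a toy calculation of mine, not part of the lesson):

```r
# Number of grid cells needed to evaluate the posterior on a grid with
# 100 points per parameter, for models of increasing size
grid_points <- 100
n_params <- c(1, 2, 5, 10, 100)
grid_points ^ n_params
# 1e+02 1e+04 1e+10 1e+20 1e+200: hopeless long before 100 parameters
```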

Markov Chain Monte Carlo

That’s where Markov Chain Monte Carlo (MCMC), aided by super-fast modern microprocessors, comes in. What is MCMC? It’s a class of algorithms for drawing samples from a probability distribution. The longer the algorithm runs, the more closely its samples approximate the true distribution. I won’t describe it in detail here, but the best beginner-level description of MCMC I’ve ever read is in Richard McElreath’s book Statistical Rethinking.
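Just to demystify the idea of “drawing samples from a distribution,” here is a bare-bones Metropolis sampler (the oldest and simplest flavor of MCMC, and not the algorithm Stan uses) applied to the coin-flip posterior from earlier. This is purely illustrative; McElreath’s book explains the idea far better:

```r
# A minimal Metropolis sampler for P(heads) in the coin-flip example.
# NOT the algorithm Stan uses; just the simplest illustration of MCMC.
set.seed(1)
heads <- 8
flips <- 10

# Unnormalized log posterior: binomial likelihood with a flat prior on (0, 1)
log_post <- function(p) {
  if (p <= 0 || p >= 1) return(-Inf)
  dbinom(heads, flips, p, log = TRUE)
}

n_samples <- 5000
samples <- numeric(n_samples)
current <- 0.5
for (i in seq_len(n_samples)) {
  proposal <- current + rnorm(1, sd = 0.1)          # take a small random step
  accept_prob <- exp(log_post(proposal) - log_post(current))
  if (runif(1) < accept_prob) current <- proposal   # accept or stay put
  samples[i] <- current
}

mean(samples)   # ~0.75: posterior mean of P(heads) under a flat prior
hist(samples)   # the samples trace out the shape of the posterior
```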

Because fitting complicated Bayesian models requires you to characterize a very high-dimensional probability distribution (the posterior), most Bayesian inference uses MCMC (although there is an alternative way to approximate Bayesian inference). Usually, in practice, you run multiple Markov chains for a prespecified number of samples, discard some samples from the beginning of each chain (the warm-up period, when the chain was still searching for the correct solution), and retain only the samples that you think characterize the true posterior distribution. Running multiple chains that start at different initial values is a good idea: if more than one chain “converges” on the same distribution even though they started at different points, that’s a good indication that you’ve found the correct distribution.
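Here is roughly what those settings look like in a brms model call. The formula and data frame are placeholders (we will fit real models later in this lesson), but chains, iter, and warmup are the actual brm() arguments that control the sampling just described:

```r
library(brms)

# Placeholder formula and data; the MCMC settings are the point here.
fit <- brm(
  y ~ x + (1 | group),  # hypothetical model formula
  data = dat,           # hypothetical data frame
  chains = 4,           # four independent chains, different starting points
  iter = 2000,          # 2000 total iterations per chain ...
  warmup = 1000,        # ... with the first 1000 discarded as warm-up
  seed = 1              # for reproducibility
)
```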

What is Stan?

[Image: Stan software logo]

MCMC is a class of algorithms — there’s more than one way to do it. In 1987, a group of physicists proposed a new MCMC algorithm called Hamiltonian Monte Carlo (HMC) to do calculations in quantum physics. I won’t describe HMC here, except to say that it really works! Older MCMC algorithms are not nearly as efficient as HMC, which can often converge on a solution ten times faster, if not more. That makes Bayesian inference much more practical. More recently, Andrew Gelman and colleagues developed Stan, a software platform for Bayesian statistical modeling that uses HMC. The models we will fit with brms in this lesson use Stan for the back-end computation.
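You never have to write Stan code yourself in this course, but if you’re curious you can peek at the Stan program that brms writes for you behind the scenes. For example (with a hypothetical formula and data frame):

```r
library(brms)

# Print the Stan code brms would generate for a simple multilevel model
make_stancode(y ~ x + (1 | group), data = dat)
```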