What is this course? It is a crash course in brms: a brief and practical introduction to fitting Bayesian multilevel models in R and Stan using the brms package (Bayesian Regression Models using Stan). Because this is intended as a practical skills course, I will only give a quick and surface-level description of Bayesian analysis and how it can be used for multilevel regression models.
Download the worksheet for this lesson here.
This course is designed for practicing researchers who have some experience with statistical analysis and coding in R.
If you are not familiar with the packages we will be using, don’t sweat it because you can just follow along by copying the code.
At the end of this course, you will understand …
At the end of this course, you will be able to …
This may or may not actually be a portrait of Thomas Bayes
In its simplest form, Bayesian inference is a method of statistical inference that allows you to use information you already know to assign a prior probability to a hypothesis, then update the probability of that hypothesis as you get more information. If that seems like common sense, it is!
It is a very powerful tool that has been applied across many fields of human endeavor. Today we are only going to look at its application in estimating the parameters of multilevel statistical models to analyze scientific data, but that’s only one of the many things we can use Bayesian inference for. Bayesian methods are only growing in popularity, thanks in large part to the rapid growth of user-friendly, open-source, computationally powerful software – like Stan and its companion R package brms that we are going to learn about today!
Bayesian inference is named after Reverend Thomas Bayes, an English clergyman, philosopher, and yes, statistician, who wrote a scholarly work on probability (published after his death in 1763). His essay included an algorithm to use evidence to find the distribution of the probability parameter of a binomial distribution … using what we now call Bayes’ Theorem! However, much of the foundations of Bayesian inference were really developed by Pierre-Simon Laplace independently of Bayes and at roughly the same time. Give credit where it’s due!
Bayes’ Theorem is the basis of Bayesian inference. It is a rule that describes how likely an event is to happen (its probability), based on our prior knowledge about conditions related to that event. In other words, it describes the conditional probability of an event A occurring, given that another event B has occurred.
The mathematical statement of Bayes’ theorem is:
\[P(A|B) = \frac{P(B|A)P(A)}{P(B)}\]
This means the probability of A being true given that B is true (\(P(A|B)\)) is equal to the probability that B is true given that A is true (\(P(B|A)\)) times the ratio of probabilities that A and B are true (\(\frac{P(A)}{P(B)}\)).
This may not make sense yet but some more concrete examples might help.
Let’s say \(A\) is a statistical model. A statistical model is a hypothesis about the world. We want to know the probability that our hypothesis is true. We start out with some prior knowledge about the probability that our model \(A\) is true. So \(P(A)\) is called the prior probability. We go out there and get some data \(B\) to test our hypothesis. We can calculate the probability of observing that data \(B\), given that \(A\) is true: \(P(B|A)\). This is also known as the likelihood of the model \(A\), in light of our observations \(B\). We use this to update our estimate of the probability that \(A\) is true. We now have \(P(A|B)\), the probability our model, or hypothesis, is true, given the data we just observed.
For now we can ignore the denominator, \(P(B)\), the probability of observing the data. This is called the marginal probability. Usually the best we can do is to compare the relative probability of different models, given the data. We can’t really say (nor should we say) what the absolute probability of a model is. (“All models are wrong.”) If we are comparing two models \(A_1\) and \(A_2\), both fit using the same data \(B\), and we take the ratio \(\frac{P(A_1|B)}{P(A_2|B)} = \frac{P(B|A_1)P(A_1)/P(B)}{P(B|A_2)P(A_2)/P(B)}\), the two \(P(B)\)s cancel out. So we can temporarily ignore the denominator and write Bayes’ theorem as
\[P(model|data) \propto P(data|model)P(model)\]
or
\[posterior = likelihood \times prior\]
or, flipping the left and right sides around,
what we believed before about the world (prior) × how much our new data changes our beliefs (likelihood) = what we believe now about the world (posterior)
That’s pretty cool!
Let’s say we pick up a coin off the street. We have a very strong prior belief that the true probability of flipping heads on any given flip is 0.5. Let’s say we flipped the coin 10 times and got 8 heads. We probably wouldn’t change that belief very much based on that evidence. It simply isn’t enough evidence to sway us off that belief. The likelihood of getting 8 heads, given that the true probability is 0.5, is still high enough that our posterior estimate of the probability is still around 0.5.
But let’s say a shady character on the street corner approaches us, shows us a coin, and offers to flip the coin 10 times. He offers to pay us $1 for every tails if we pay him $1 for every heads. Most likely, before we even got any new evidence, we’d be likely to think that the true probability of heads is not exactly 0.5. Or at least we would have a more skeptical prior belief, and we would be willing to entertain the possibility that the true probability of heads might be higher than 0.5. So now if we observe 8 heads, we might update our estimate of the true probability of heads to something higher, maybe 0.7.
Our shady character might’ve studied this YouTube video.
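If you want to see those two coin scenarios worked out numerically, here is a minimal sketch in base R using a grid approximation. The particular Beta priors below are just illustrative assumptions, not anything canonical:

p_grid <- seq(0.001, 0.999, length.out = 999)      # possible values of P(heads)
likelihood <- dbinom(8, size = 10, prob = p_grid)  # likelihood of 8 heads in 10 flips

strong_prior <- dbeta(p_grid, 100, 100)  # street coin: strong belief that p is near 0.5
weak_prior <- dbeta(p_grid, 2, 2)        # shady coin: open to other values of p

posterior_strong <- likelihood * strong_prior / sum(likelihood * strong_prior)
posterior_weak <- likelihood * weak_prior / sum(likelihood * weak_prior)

# Posterior means: barely budges from 0.5 with the strong prior (~0.51),
# but moves well above 0.5 with the weaker prior (~0.71).
sum(p_grid * posterior_strong)
sum(p_grid * posterior_weak)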
In a classical, “frequentist” analysis, there is no formal way to incorporate prior information into a statistical model. In either case (the coin we found on the street or the shady character’s coin), we have no choice but to say the point estimate of the true probability of heads is 0.8.
The critical point here is that probability, in the Bayesian interpretation, incorporates our uncertain knowledge about an event. For example we might say that the probability of the stock market going up tomorrow is 55%. In the frequentist interpretation, either the stock market will go up tomorrow or it will not. Frequentist probability is just a frequency: if we hypothetically repeated “tomorrow” under the same conditions many times, 55% of the time the stock market would go up. But Bayesian probability can handle the uncertainty. When you calculate a Bayesian probability you are giving a number to a belief, a number that best reflects the state of your knowledge.
The coin flip example was pretty easy — you can do it in your head! Why do we need such advanced computational tools to do Bayesian inference?
The main reason that Bayesian statistics were not widely adopted initially is probably a practical one. It boils down to \(P(data)\), the marginal probability. We need it to estimate \(P(model|data)\), the posterior. You may have heard of “maximum likelihood” methods. Maximum likelihood algorithms are used to fit frequentist models. They try to find the set of parameters that make the data the most probable. In the Bayesian world, we are still trying to do that, but instead of getting a single point estimate of the parameters that are most likely given the data, we are now trying to estimate a joint distribution over all possible combinations of values the parameters could take, then normalize it by dividing by the marginal probability. We end up with a distribution for all the parameters that is most likely given the data we observed.
If our “model” is just a single parameter like the probability of a coin flip being heads, and our “data” is just a count of heads out of 10, the calculations are trivial. But what if our model is a multilevel regression model with a bunch of fixed effects, a bunch of random intercepts at multiple levels, and variance-covariance parameters for all the errors (like we are about to fit in this very lesson)? We might have hundreds of parameters. To estimate \(P(data)\), we have to integrate a function across n-dimensional space, where n is the total number of parameters in the model!
If you remember even trying to do two-dimensional integration on paper from calculus class, imagine doing 100-dimensional integration or even more! There just isn’t enough time in the universe to get a precise solution for an integral with that many dimensions. In the time of Thomas Bayes, and for hundreds of years after that, Bayesian statistics simply couldn’t be applied to real-world problems because of this computational limitation.
That’s where Markov Chain Monte Carlo (MCMC), aided by super-fast modern microprocessors, comes in. What is MCMC? It’s a class of algorithms for drawing samples from a probability distribution. The longer the algorithm runs, the closer it gets to the true distribution. I won’t describe it in detail here but the best beginner-level description of MCMC I’ve ever read is in Richard McElreath’s book Statistical Rethinking.
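To give a flavor of the idea, here is a toy random-walk Metropolis sampler (one of the oldest and simplest MCMC algorithms, and not the algorithm Stan uses) applied to the coin-flip posterior from above. This is purely an illustrative sketch:

set.seed(123)
n_iter <- 5000
p_current <- 0.5                       # starting value for P(heads)
samples <- numeric(n_iter)
log_post <- function(p) {
  if (p <= 0 || p >= 1) return(-Inf)   # outside the parameter space
  dbinom(8, size = 10, prob = p, log = TRUE)  # flat prior, so posterior is proportional to likelihood
}
for (i in seq_len(n_iter)) {
  p_proposed <- p_current + rnorm(1, 0, 0.1)           # propose a small random jump
  accept_prob <- exp(log_post(p_proposed) - log_post(p_current))
  if (runif(1) < accept_prob) p_current <- p_proposed  # accept or reject the jump
  samples[i] <- p_current
}
quantile(samples[-(1:1000)], c(0.05, 0.5, 0.95))       # discard warmup, summarize the posterior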
Because fitting complicated Bayesian models requires you to characterize a very high-dimensional probability distribution (the posterior), most Bayesian inference uses MCMC (although there is an alternative way to approximate Bayesian inference). Usually in practice you run multiple Markov chains for a prespecified number of samples, discard some samples from the beginning of the chains when they were still searching for the correct solution, and retain only the samples that you think characterize the true posterior distribution.
MCMC is a class of algorithms — there’s more than one way to do it. In 1987, some physicists proposed a new MCMC algorithm called Hamiltonian Monte Carlo (HMC) to do calculations in quantum physics. I won’t describe HMC here, except to say that it really works! Older MCMC algorithms are not nearly as efficient as HMC. HMC can converge on a solution ten times faster, if not more, than other algorithms. It makes Bayesian inference so much more practical. Recently, Andrew Gelman and colleagues developed Stan, which is a software platform for Bayesian statistical modeling using HMC. The models we will fit with brms in this lesson use Stan for back-end computation.
Animation of HMC in action exploring a probability distribution
(image by GitHub user chi-feng)
Because HMC is now the state of the art for estimating high-dimensional integrals for Bayesian inference, it is not only implemented in Stan. It is also implemented in SAS PROC BGLIMM and MATLAB, among other stats packages, because it is just so damn good.
One hurdle for getting more people to use Stan is that you have to write quite a lot of code to specify a model in Stan. There are a lot of “tricks” that Stan users have developed to optimize their models and make them run faster. It can be hard to figure out how to implement all those tricks in the context of your statistical model. Luckily, a statistician named Paul Bürkner developed an R package called brms. brms lets you set up regression models in R with very simple syntax. Behind the scenes, it translates your R code into Stan code, complete with all the tricks of scaling parameters, putting sensible prior distributions on parameters, and then runs the Stan model for you.
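For example, you can ask brms to show you the Stan program it writes for a model without fitting anything, using the make_stancode() function. This sketch uses the epilepsy example dataset that ships with brms:

library(brms)

# Peek at the Stan code brms generates for a simple multilevel Poisson model.
# Nothing is compiled or sampled here; we just inspect the generated program.
stan_code <- make_stancode(count ~ zAge + (1 | patient),
                           data = epilepsy, family = poisson())
cat(stan_code)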
Out of the thousands of R packages out there, brms is far and away my favorite. Paul is extremely dedicated to his work. He has been very responsive to any issues I’ve raised and has added new features to brms at my request. Danke, Paul!
We are going to use brms to fit Bayesian linear mixed models today.
Before we get cracking on the code, let’s go over some popular misconceptions about Bayes.
BUSTED! I definitely empathize that some of the Bayesian approach is confusing at first. It took me many repetitions of learning the basic concepts through classes and self-study to understand it, and I’m still working on it. But I think this is really more because frequentist techniques were the only ones taught for many decades. You could even argue that the frequentist approach, which focuses on rejecting null hypotheses and requires you to imagine a hypothetical population that may not exist to calculate long-run frequencies of events, is more confusing and less intuitive than the Bayesian approach.
BUSTED! This is a particularly dangerous misconception. The statistician R. A. Fisher, who invented many frequentist techniques, used accusations of subjectivity to discredit Bayesian statisticians. Because Fisher was so influential, everyone listened to him and Bayesian statistics fell out of favor for decades. Luckily, the pendulum is swinging back the other way.
In fact, Bayesian analysis is no more subjective than frequentist analysis. It’s not possible to be completely objective — we are human beings, not Vulcans! All statistical approaches make assumptions and require prior knowledge. Also, I would argue that incorporating prior information is not a subjective bias to be avoided, but a good thing we should strive for. Ignoring prior information could be just as bad as selectively cherry-picking data to include in an analysis. Either way, you are throwing out data that might not fit your story.
BUSTED … sort of. Yes, there’s really no denying that Bayes is very computationally intensive, requiring lots of computing time and memory. There are two main ways around that. One is to come up with really smart algorithms for Monte Carlo sampling that quickly converge on the correct solution. The other is to come up with other kinds of cool algorithms that approximate the solution of the integral without having to do the full MC sampling. brms uses the first approach, but packages such as INLA follow the second strategy, which is especially useful for fitting complex geospatial models.
BUSTED! It is important not to get too attached to one or the other statistical approach. Both have advantages and disadvantages and it’s best to see them as different tools in your toolkit. If you treat it like dogma, it will harm your science. What is more, the differences between Bayesian and frequentist analysis are a little bit overblown. Techniques like model selection, cross-validation, and penalized regression that are becoming more and more common across all branches of statistics, especially with the rise of machine learning, have a very similar effect to priors in a Bayesian analysis.
After all that, you might be asking yourself, if it’s so hard to do Bayesian inference, why bother? Maximum likelihood frequentist methods work really well! Well, they do, up to a point. There are some kinds of multilevel model that frequentist methods have a lot of trouble estimating the parameters for. Sometimes the maximum likelihood algorithms fail to find a solution. So even if you aren’t completely convinced by the philosophical reasons to use Bayesian inference, it can be very practical from a computational standpoint. Adding prior information helps stabilize model fitting algorithms, so you can estimate models with Bayesian methods that simply can’t be estimated any other way.
Philosophically, there is another big advantage to Bayesian inference. Classical statistical methods are all based around rejecting a null hypothesis. This is very reductionist because it reduces real-world questions about how strong an effect is to a black-and-white yes-or-no answer. Even if you don’t worry about the \(p<0.05\) threshold of significance, you are still testing a null hypothesis rather than the really interesting hypothesis you want to test. In Bayesian inference, we use the posterior distribution to say what we believe the size of the effect is. Bayesian inference lets us directly say “the probability that the effect is between X and Y is 95%” while in a frequentist analysis you have to say “if we ran the experiment again many times, our estimate of the parameter would be between X and Y 95% of the time.”
In summary, Bayesian analysis lets us leverage our prior knowledge, more honestly describe uncertainty, test the hypothesis we really want to test, and more efficiently get correct answers to very complicated statistical questions.
I think I have gone on long enough talking about why to use Bayesian inference. Let’s try to do some Bayesian modeling ourselves with brms to show how this powerful method works in practice!
You should have already installed the R package brms and all its dependencies. We will also be using the tidyverse, emmeans, tidybayes, and easystats packages today. Also note that the logspline package needs to be installed, although we will not call any of its functions directly. Install all those packages if you haven’t yet.
library(brms)
library(tidyverse)
library(emmeans)
library(tidybayes)
library(easystats)
library(logspline)
Set a plotting theme. Create a directory called fits in your current working directory to store fitted model objects, if it does not already exist.
theme_set(theme_minimal())
if (!dir.exists('fits')) dir.create('fits')
Set brms options. The option mc.cores = 4 means we will use four processor cores in parallel to fit the models, meaning we can allow four Markov chains to sample simultaneously. The option brms.file_refit = 'on_change' means that if a model has already been fit and saved to a file, and the code and data are unchanged, the model will be reloaded from the file instead of having to fit it again. If you are running this lesson on the cloud server, there are some pre-fit model objects to save time during the workshop.
options(mc.cores = 4, brms.file_refit = 'on_change')
If you are running this code on the Posit Cloud server, set these options instead of the ones above. This will use the faster and less memory-intensive cmdstanr package to fit the models behind the scenes. Unfortunately, the CmdStan software that cmdstanr uses is difficult to install on USDA laptops, but it can be done. Check out instructions for installing CmdStan here, and if those don’t work, instructions for a workaround to install CmdStan using Anaconda. If you end up wanting to run Bayesian models on your own data after this workshop, another option is to run brms code on the SciNet cluster, where the cmdstanr package can easily be configured.
options(mc.cores = 4, brms.file_refit = 'on_change', brms.backend = 'cmdstanr')
Read the data from a CSV file to a data frame. It is hosted on the GitHub repository for this course. Alternatively, if you are doing this course on the cloud server, you may read the data from the local data directory.
stvincent <- read_csv('https://usda-ree-ars.github.io/SEAStats/brms_crash_course/stvincent.csv')
stvincent <- read_csv('data/stvincent.csv')
View of St. Vincent (image from Wikimedia Commons)
Let’s examine the data. These data come from the agridat package, which has tons of great example datasets from agricultural science. I’ve modified it slightly for this workshop. The data come from a maize fertilization trial on the Caribbean island of St. Vincent. It is a block design repeated at nine sites. At each site, there were four replicates (blocks). Each block contained nine plots, each of which received a combination of nitrogen (N), phosphorus (P), and potassium (K) fertilizer. It is not a complete block design: there were twelve different treatment combinations in total, each block contained only a subset of them, and some combinations were represented in multiple plots within a block. Maize yield was measured in each plot.
glimpse(stvincent)
## Rows: 324
## Columns: 7
## $ site <chr> "CPSV", "CPSV", "CPSV", "CPSV", "CPSV", "CPSV", "CPSV", "CPSV", …
## $ block <chr> "B1", "B1", "B1", "B1", "B1", "B1", "B1", "B1", "B1", "B2", "B2"…
## $ plot <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 1…
## $ N <chr> "0N", "1N", "1N", "0N", "2N", "2N", "1N", "1N", "3N", "1N", "3N"…
## $ P <chr> "2P", "3P", "1P", "0P", "0P", "2P", "1P", "1P", "1P", "1P", "1P"…
## $ K <chr> "2K", "1K", "1K", "0K", "2K", "0K", "1K", "3K", "1K", "1K", "1K"…
## $ yield <dbl> 24.0, 27.1, 26.5, 23.1, 22.1, 24.1, 26.1, 22.5, 23.0, 26.4, 24.8…
Here is a description of the columns:
- site: a character ID for each site (i.e., environment) where responses were measured. There are nine sites.
- block: a character ID for each block, or rep. Each site contains four blocks labeled 'B1' through 'B4'. Note that the IDs are nested, meaning the same ID is reused from site to site. For example 'B1' at the site 'CPSV' is not the same block as 'B1' at 'CASV', though they share a label.
- plot: an integer ID for each plot. The numbering begins at 1 for each site.
- N: a character variable with four discrete levels of nitrogen addition treatment, from 0N to 3N.
- P: a character variable with four discrete levels of phosphorus addition treatment, from 0P to 3P.
- K: a character variable with four discrete levels of potassium addition treatment, from 0K to 3K.
- yield: a continuous variable indicating the maize yield at the plot level. This is the outcome variable we are interested in explaining.

In a study like this, we might expect levels of the different nutrients to affect yield. In addition, we might expect unmeasured variables at both the field level and the block level to affect yield. Further, we might want to see whether there is a difference in the strength of the relationship between, say, nitrogen and yield among the different environments (sites). Thus this is an ideal situation for a multilevel model!
Here we say that block is nested within site, and plot is nested within block. This is because the blocks at each site are unique, as are the plots within each block.
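If you want to verify the nesting structure for yourself, a quick optional check with dplyr (already loaded via the tidyverse) is to count the rows in each site-by-block combination:

# Each of the 9 sites has the same 4 block labels, and each site-block
# combination has 9 plots (9 sites x 4 blocks x 9 plots = 324 rows).
stvincent %>%
  count(site, block)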
Let’s change all the blocking and discrete treatment variables to factors, to make it clear what they represent.
stvincent <- stvincent %>%
mutate(across(c(site, block, plot, N, P, K), factor))
Let’s take a look at the relationship between N fertilization and yield, using box plots. (I won’t explain the ggplot2 code in this lesson.)
(yield_vs_N <- ggplot(data = stvincent, aes(x = N, y = yield)) +
geom_boxplot() +
ggtitle('Yield by N fertilization treatment'))
Right now, this completely ignores the multilevel structure of the dataset. The yield values of plots from the same field are not independent of one another. Ignoring that wrinkle and eyeballing the graph, it looks like there’s a reasonably large effect of N fertilization on yield, especially between adding 0 N and the three fertilization treatments. But there is a lot of variation in yield within each treatment.
Now let’s start to take the nested structure of the data into account. Separate the boxplots by site.
yield_vs_N +
facet_wrap(~ site) +
ggtitle('Yield by N fertilization treatment', 'separately by site')
You can see that the plot that combined all data points from the different sites together didn’t tell the whole story. The effect of N treatment on yield is very different between the sites. Some sites have a very strong response while others don’t have any response. Also, we can see that the “baseline” yield in unfertilized plots is quite variable among the sites. A multilevel model helps us estimate the overall effect while taking into account the variation between sites. That can mean both how the baseline yield varies by site (random intercept) and how the treatment effect varies by site (random slope).
Let’s also make boxplots by site showing the effect of P and K fertilization on yield.
ggplot(data = stvincent, aes(x = P, y = yield)) +
geom_boxplot() + facet_wrap(~ site) +
ggtitle('Yield by P fertilization treatment', 'separately by site')
ggplot(data = stvincent, aes(x = K, y = yield)) +
geom_boxplot() + facet_wrap(~ site) +
ggtitle('Yield by K fertilization treatment', 'separately by site')
Generally, we can see from these plots that the effects of P and K fertilization on yield are a little less pronounced. But we can still see some variation in response among sites.
It would also be good to make some plots to look at the interactions between the different fertilizer treatments. I will leave that to you.
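As a starting point, here is one possible way to visualize the N-by-P interaction (just a sketch; many other plots would work as well):

# Yield by N treatment, with boxes colored by P treatment, one panel per site.
ggplot(data = stvincent, aes(x = N, y = yield, fill = P)) +
  geom_boxplot() +
  facet_wrap(~ site) +
  ggtitle('Yield by N and P fertilization treatment', 'separately by site')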
If you have ever fit a mixed model using lme4, this will look familiar to you:
lmer(yield ~ 1 + (1 | site) + (1 | block:site), data = stvincent)
That is the syntax for fitting a mixed-effects model:
- The response variable (yield) is on the left side of the formula.
- The ~ separates the dependent from the independent variables.
- The only fixed effect is the global intercept (1).
- Each random term, such as (1 | site), has a design side (on the left hand) and group side (on the right hand) separated by |.
- A 1 on the design side means only fit random intercepts and no random slopes.
- site on the group side means each site will get its own random intercept.
- block:site on the group side means that each unique combination of block and site will get its own random intercept.

brms allows you to fit a Bayesian multilevel model with very similar syntax to the frequentist lme4. But instead of just getting maximum likelihood point estimates of the model parameters (the fixed and random effects), we’re going to get posterior distributions for all of them! Get hyped!!!
But to do this we need to give a few extra directions to the Hamiltonian Monte Carlo sampling algorithm: how many Markov chains to run, how many iterations to run each chain for (and how many of those iterations are warmup), and what initial values to use for the parameters.
If we don’t specify any of these, brms will use default values for number of chains and iterations, and it will assign random values for the initial values.
You may notice I have made no mention of priors yet. Another great thing about brms is that if you do not specify priors, it will assign uninformative priors that are at least plausible for each of the parameters. (For example, if the parameter is a variance parameter that must be positive, it will assign a prior distribution that is truncated at zero). This can be a major time saver but it can also be a dangerous thing. We will revisit priors later. For now we will skip specifying the priors and use the defaults.
Without further ado, here is our first Bayesian multilevel model!
Notice we have specified the model formula, the data, and the number of chains (2). Each chain will run for 200 total iterations, of which the first 100 are warmup when the sampler will be determining the optimal “jump” size to make at each iteration. We will get 200 - 100 = 100 draws from the posterior distribution for each chain. Finally, we let the initial values for the parameters all be random between -2 and 2 (default). We set a random seed to ensure that the results are replicable.
fit_interceptonly <- brm(yield ~ 1 + (1 | site) + (1 | block:site),
data = stvincent,
chains = 2,
iter = 200,
warmup = 100,
init = 'random',
seed = 1)
## Running MCMC with 2 chains, at most 4 in parallel...
##
## Chain 2 WARNING: There aren't enough warmup iterations to fit the
## Chain 2 three stages of adaptation as currently configured.
## Chain 2 Reducing each adaptation stage to 15%/75%/10% of
## Chain 2 the given number of warmup iterations:
## Chain 2 init_buffer = 15
## Chain 2 adapt_window = 75
## Chain 2 term_buffer = 10
## Chain 2 Iteration: 1 / 200 [ 0%] (Warmup)
## Chain 1 WARNING: There aren't enough warmup iterations to fit the
## Chain 1 three stages of adaptation as currently configured.
## Chain 1 Reducing each adaptation stage to 15%/75%/10% of
## Chain 1 the given number of warmup iterations:
## Chain 1 init_buffer = 15
## Chain 1 adapt_window = 75
## Chain 1 term_buffer = 10
## Chain 1 Iteration: 1 / 200 [ 0%] (Warmup)
## Chain 1 Iteration: 100 / 200 [ 50%] (Warmup)
## Chain 1 Iteration: 101 / 200 [ 50%] (Sampling)
## Chain 2 Iteration: 100 / 200 [ 50%] (Warmup)
## Chain 2 Iteration: 101 / 200 [ 50%] (Sampling)
## Chain 1 Iteration: 200 / 200 [100%] (Sampling)
## Chain 1 finished in 2.3 seconds.
## Chain 2 Iteration: 200 / 200 [100%] (Sampling)
## Chain 2 finished in 2.6 seconds.
##
## Both chains finished successfully.
## Mean chain execution time: 2.5 seconds.
## Total execution time: 3.5 seconds.
The model takes a minute or two to compile the C++ code but only two or three seconds to sample.
Use the summary() function to see the output.
summary(fit_interceptonly)
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: yield ~ 1 + (1 | site) + (1 | block:site)
## Data: stvincent (Number of observations: 324)
## Draws: 2 chains, each with iter = 200; warmup = 100; thin = 1;
## total post-warmup draws = 200
##
## Multilevel Hyperparameters:
## ~block:site (Number of levels: 36)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 1.11 0.41 0.34 1.95 1.03 69 57
##
## ~site (Number of levels: 9)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 4.86 1.60 2.93 9.01 1.04 48 67
##
## Regression Coefficients:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 15.92 1.89 12.38 20.55 1.09 20 26
##
## Further Distributional Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 3.94 0.18 3.59 4.34 1.02 189 93
##
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Before we can interpret any of the model’s estimates, we must confirm that the Markov chains converged. Convergence happens when all the chains, regardless of where they started, reach the same distribution and stay there. If the chains get stuck in different local maxima of the likelihood, or are still searching around the parameter space to find the highest likelihood, they haven’t converged and we can’t learn anything from the posterior distributions the model gives us.
A good way to check whether the model converged is to look at the maximum \(\hat{R}\), or “R-hat”, statistic for all the parameters of the model.
max(rhat(fit_interceptonly))
## [1] 1.088863
The largest value is greater than 1.05. What does this mean? R-hat is a statistic used to roughly indicate when multiple MCMC chains have converged on the same distribution. As chains approach convergence, the R-hat statistic approaches 1. It is ideal for R-hat values for all parameters of a model to be < 1.01, but at a bare minimum we need to have all R-hat values < 1.05.
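If the maximum R-hat is above the threshold, it can be helpful to see which parameters are the worst offenders. Here is one small sketch using the same rhat() function:

# Sort the R-hat statistics and look at the largest few.
sort(rhat(fit_interceptonly), decreasing = TRUE) %>% head(5)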
We can also plot() the model object to get density plots and trace plots of the posterior distribution (the 100 post-warmup samples for each Markov chain).
plot(fit_interceptonly)
What does this show? We see diagnostics for the four most important parameters of the model:
- The overall fixed-effect intercept (b_Intercept). All fixed effect parameters in brm models begin with b, short for beta.
- The standard deviation of the random intercepts for blocks within sites (sd_block:site__Intercept).
- The standard deviation of the random intercepts for sites (sd_site__Intercept).
- The standard deviation of the model residuals (sigma).

On the left are density plots for the posterior distributions of the parameters (made from the 200 posterior samples, 100 from each of the 2 chains).
On the right are trace plots for the 100 post-warmup posterior samples for each of the chains separately. The HMC algorithm is so good that even with purely default priors and only a very small number of samples, we actually get pretty close to convergence. That would never be possible with older algorithms, which would have required thousands of samples. But they aren’t quite converged. We still see that there are some trends in the trace plots where sometimes a chain is exploring one part of parameter space and sometimes another. Also, the two chains do not agree very well. This is a sign of failure to converge. The more complicated your models get, the more you will see this.
What do we do about this? We can either increase the number of iterations or set different priors. For now, let’s increase the number of iterations to 1000 warmup and 1000 sampling (2000 total). We will get a total of 2000 posterior samples. This is usually the minimum number of warmup iterations and posterior samples you will want; many models may require more.
We can use the update() method to draw more samples from the posterior distribution without having to recompile the model code.
Note: from now on I will not show the output of the brms progress indicator in this tutorial, but you will continue to see it in your R console as you run the code.
fit_interceptonly_moresamples <- update(fit_interceptonly, chains = 2, iter = 2000, warmup = 1000)
Now let’s look at the convergence diagnostic plots.
plot(fit_interceptonly_moresamples)
The posterior distributions’ density plots look smoother than in the previous plot (this is good) and the trace plots look like “hairy caterpillars” — successive samples just move back and forth around a central tendency, with no trend. There is nearly complete overlap between chain 1 and chain 2 — the chains are “mixing” well and so we can say from eyeballing it that we have converged on the solution.
Let’s call summary() again to look at the convergence diagnostic Rhat as well as some information about the parameters.
summary(fit_interceptonly_moresamples)
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: yield ~ 1 + (1 | site) + (1 | block:site)
## Data: stvincent (Number of observations: 324)
## Draws: 2 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 2000
##
## Multilevel Hyperparameters:
## ~block:site (Number of levels: 36)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 1.05 0.44 0.08 1.89 1.00 363 164
##
## ~site (Number of levels: 9)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 4.70 1.34 2.80 7.99 1.00 586 1036
##
## Regression Coefficients:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 16.05 1.61 12.68 19.19 1.00 339 718
##
## Further Distributional Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 3.93 0.16 3.64 4.26 1.00 1921 1080
##
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
We see low numbers for Rhat. Looking at the parameter values, we can see the fixed-effect or “population-level” intercept is ~16, with a 95% credible interval ranging from ~13 to ~19. We can also see the standard deviation of the “group-level” or random intercepts for both block and site (sd(Intercept)) and for the plots within blocks (the standard deviation of the overall model residuals, sigma).
NOTE: The brms output says “group-level” effects for what are commonly called random effects, and “population-level” effects for what are commonly called fixed effects. This makes more sense in a Bayesian context because everything is random to some extent, including the fixed effects. But I will continue to call them random and fixed effects because that is still the most common way to talk about them, and the new terminology hasn’t really caught on yet.
But what is this “credible interval” thing? Doesn’t CI stand for confidence interval? You might be asking yourself this. Well, the posterior distribution for each parameter is a bunch of samples. We can calculate quantiles from this distribution. For example, we are 90% confident that the true value of the overall mean of yield across all fields is somewhere between the 5th and 95th percentile of the posterior distribution of that value. This is much more straightforward than the interpretation of a confidence interval from a frequentist model. But unfortunately someone chose “credible interval” for its name which has the same initials as “confidence interval.” In this lesson we will use quantile-based credible intervals (though there are other kinds) which you sometimes see called QI for quantile interval.
We can pull out all 2000 samples from the posterior distribution of the intercept and calculate or plot any credible interval we want. The function as_draws_df() gets all draws from the posterior distribution for each parameter in a model and puts them into columns of a data frame.
For example here is the median and the 90% quantile credible interval for the intercept.
post_samples <- as_draws_df(fit_interceptonly_moresamples)
post_samples_intercept <- post_samples$b_Intercept
median(post_samples_intercept)
## [1] 16.08135
quantile(post_samples_intercept, c(0.05, 0.95))
## 5% 95%
## 13.50850 18.65957
A useful thing about Bayesian models is that you can get a posterior distribution, and any credible interval you want, for any component of the model: parameter estimates, predicted values, etc. This is much harder to do in a frequentist model. We will see some of that later on in the model — and we will see some really cool ways to visualize the information contained in posterior distributions, including credible intervals.
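As a small preview, here is one way to do that with the posterior_epred() function from brms, which returns posterior draws of the expected (fitted) value for each observation. The 90% interval below is an arbitrary choice for illustration:

# Posterior draws of the expected yield for each observation in the data:
# a matrix with one row per posterior draw and one column per observation.
epred_draws <- posterior_epred(fit_interceptonly_moresamples)
dim(epred_draws)

# 90% credible interval for the expected yield of the first plot in the data.
quantile(epred_draws[, 1], c(0.05, 0.95))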
It is a good idea to examine the proportions of variance at different levels in your nested dataset. In this case we can calculate the ICC (intraclass correlation coefficient) for each random effect level. This is made easy by the function icc() from the performance package (part of easystats).
iccs <- icc(fit_interceptonly_moresamples, by_group = TRUE)
iccs
## # ICC by Group
##
## Group | ICC
## ------------------
## block:site | 0.028
## site | 0.570
This shows that the proportion of variance among blocks within site is not very high, but variance among sites is quite high. This indicates that a multilevel model including site as a random effect is necessary. Actually, if you have a dataset with a grouped structure, I would recommend still using a multilevel model even if the variance decomposition does not necessarily say it’s warranted. It’s better to stick with the model that your study design calls for. Thus we will keep both the block and site random terms in the following models we fit.
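If you want to see where those ICC values come from, you can reproduce them by hand from the posterior draws of the variance components. This is just a sketch; the column names below are the ones I expect as_draws_df() to use for this model:

draws <- as_draws_df(fit_interceptonly_moresamples)

# ICC = variance at a grouping level divided by the total variance.
total_var <- draws$sd_site__Intercept^2 + draws$`sd_block:site__Intercept`^2 + draws$sigma^2
icc_site  <- draws$sd_site__Intercept^2 / total_var
icc_block <- draws$`sd_block:site__Intercept`^2 / total_var

# Posterior medians, which should be close to the icc() output above.
median(icc_site)
median(icc_block)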
The intercept-only model is a good start. But so far all we’ve done is pull out the random variation in mean yield by field. We want to make inference about what factors influence yield at the individual plot level. To do this, we can add a fixed effect predictor (slope) to the model.
We’ll start by adding a fixed effect for the N fertilization treatment. Because the slope (N treatment effect) is fixed and not random, that means we are assuming that the effect of N on yield is a single “global” effect that is the same across all sites, and across all blocks within sites. Also, note we still haven’t specified our own priors, so brms is defaulting to very wide and uninformative priors.
Finally, we are going to use 4 Markov chains from now on. It is good in practice to use at least 3 or 4 because the more chains you use, the more confident you can be that you have the correct answer if they converge.
fit_fixedN <- brm(yield ~ 1 + N + (1 | site) + (1 | block:site),
data = stvincent,
chains = 4,
iter = 2000,
warmup = 1000,
seed = 2,
file = 'fits/fit_fixedN')
Notice the seed and file arguments.

- The seed argument (you can use any integer you want) sets the random seed so that we can reproduce the exact behavior of the MCMC chains again in case we need to refit the model. This is very important for reproducibility!
- The file argument means a file called fit_fixedN.rds is created in a subdirectory called fits in your current working directory. This means that you can reload the fitted model object in a future R session without having to compile and sample the model again. If you run the same brm code again and the data haven’t changed, it will look for the file and load it instead of refitting the model. This can save a lot of time if you have a model that takes hours or days to sample!

summary() shows low R-hat statistics and also shows that most of the mass of the posterior distribution for the three N effects is well above zero. This indicates evidence that the 1N, 2N, and 3N treatments have a higher mean yield than the 0N treatment. We have roughly the same random effect variance parameters as before, but now sigma (the standard deviation of the residuals for individual plots) is lower than before. This makes sense because we have some effects in the model that explain some of the variance and result in lower residuals.
summary(fit_fixedN)
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: yield ~ 1 + N + (1 | site) + (1 | block:site)
## Data: stvincent (Number of observations: 324)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Multilevel Hyperparameters:
## ~block:site (Number of levels: 36)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 1.26 0.36 0.57 2.00 1.00 1398 1403
##
## ~site (Number of levels: 9)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 4.71 1.32 2.80 7.87 1.00 1009 1196
##
## Regression Coefficients:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 13.15 1.58 10.01 16.42 1.00 796 1304
## N1N 3.41 0.51 2.42 4.39 1.00 3135 3097
## N2N 3.95 0.59 2.81 5.09 1.00 3549 2928
## N3N 4.98 0.71 3.60 6.37 1.00 3513 3096
##
## Further Distributional Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 3.52 0.15 3.24 3.83 1.00 4090 3159
##
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Here let’s also inspect what default priors brms chose for us, and show how to modify them to see if there was any effect on the results. prior_summary() shows us the priors used in fitting a model.
prior_summary(fit_fixedN)
## prior class coef group resp dpar nlpar lb ub
## (flat) b
## (flat) b N1N
## (flat) b N2N
## (flat) b N3N
## student_t(3, 15.1, 6.7) Intercept
## student_t(3, 0, 6.7) sd 0
## student_t(3, 0, 6.7) sd block:site 0
## student_t(3, 0, 6.7) sd Intercept block:site 0
## student_t(3, 0, 6.7) sd site 0
## student_t(3, 0, 6.7) sd Intercept site 0
## student_t(3, 0, 6.7) sigma 0
## source
## default
## (vectorized)
## (vectorized)
## (vectorized)
## default
## default
## (vectorized)
## (vectorized)
## (vectorized)
## (vectorized)
## default
This shows us that t-distributions were used for the priors on the intercept, the random effect SD, and the residual SD sigma. The mean of the t-distribution used as the prior on the intercept is 15.1 (obtained by simply calculating the mean of the yield values in the data), which makes sense. The mean of the variance parameters is 0 but the lower bound lb is also 0. This means we have something like “half bell curves” for the variance parameters. The standard deviation of all the t-distribution priors is 6.7. These are sensible defaults and rarely need to be modified.
What about the priors on the fixed effect slopes? They are flat priors. This means we are assigning exactly the same prior probability to any possible value the slope could take. Yes, that’s an uninformative prior, but in practice it doesn’t make sense to do this. Take the prior on 1N for example. We are saying we assume that the prior probability that 0N and 1N have 0 difference in yield is equal to the prior probability that 1N has 10000 more units of yield than 0N, or -12345 units less, and so on. This makes no sense (especially because in our dataset yield only ranges between about 3 and 30)! Luckily, we have a simple model and a decent amount of data so the model converges on a reasonable solution even with this flat prior. But if you have a more complex model, it is often a bad idea to let brms use these weak flat default priors. So even if you philosophically don’t want to “put your thumb on the scale,” it’s a good idea for practical reasons to specify at least a weak prior.
Let’s refit the model with reasonable priors on the fixed effect slopes. A normal distribution with mean 0 and a modest standard deviation, say 10, is a good choice. I like to think of it as a “null hypothesis” because it has equally as much prior probability for positive coefficients as negative. Also it allows for the possibility of large effect sizes: in this context a slope of +10 or -10 would mean that going from one N fertilizer level to another would increase or decrease yield by 10 units, which is a lot for this dataset. And 10 is only one standard deviation so we are still allowing a substantial prior probability that the effect could be much greater than that. Thus we are being very agnostic and not assuming anything much about the effect. But it usually is beneficial for model convergence because there is very little chance the sampler will waste time trying out ridiculous values like 10000. We use the prior argument to brm and specify a vector of objects created with the prior() function. class = b means apply the same prior to all fixed effect slope parameters.
fit_fixedN_priors <- brm(yield ~ 1 + N + (1 | site) + (1 | block:site),
data = stvincent,
prior = c(prior(normal(0, 10), class = b)),
chains = 4,
iter = 2000,
warmup = 1000,
seed = 2,
file = 'fits/fit_fixedN_priors')
summary(fit_fixedN_priors)
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: yield ~ 1 + N + (1 | site) + (1 | block:site)
## Data: stvincent (Number of observations: 324)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Multilevel Hyperparameters:
## ~block:site (Number of levels: 36)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 1.25 0.38 0.52 2.02 1.00 1047 856
##
## ~site (Number of levels: 9)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 4.72 1.30 2.83 7.90 1.00 837 1638
##
## Regression Coefficients:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 13.11 1.59 9.95 16.16 1.00 891 1670
## N1N 3.39 0.50 2.43 4.39 1.00 2791 2798
## N2N 3.93 0.58 2.79 5.10 1.00 3016 3166
## N3N 4.94 0.71 3.53 6.39 1.00 2929 2694
##
## Further Distributional Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 3.53 0.15 3.25 3.84 1.00 3363 3014
##
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
In this case, the new priors have a very modest effect on the slope coefficients. They are a little bit lower than when using the flat default priors. We have formally incorporated information about what we believe is a plausible effect size into the model. For the most part, the data spoke for themselves and the prior had little influence. But it’s a good idea to be mindful of what you are doing and not blindly use default priors all the time. And it goes without saying that you should always report what priors you used in the methods of any publication where you present this kind of analysis.
Whenever you fit a model, another good diagnostic to look at is the posterior predictive check. Use pp_check() to automatically generate a plot.
pp_check(fit_fixedN_priors)
This plot shows a density plot of the observed data (black line) and density plots for the fitted values of the model for (by default) 10 random samples from the posterior distribution of the parameters (thin blue lines). Here, you can see that the agreement is close but not perfect. The observed data has a bimodal distribution, but that is not captured by the posterior samples, which look more or less like bell curves with only one peak. This may indicate we are leaving something important out of our model. However if we fit a more complex model, it is important not to make it too complex and overfit to the data. Later we will look at ways to compare models more formally.
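A couple of variations on the default plot can be useful (a sketch; these options are passed through to the bayesplot package, so check its documentation for more):

# Overlay 50 posterior predictive densities instead of the default 10.
pp_check(fit_fixedN_priors, ndraws = 50)

# Compare the observed median yield to the distribution of medians
# simulated from the posterior predictive distribution.
pp_check(fit_fixedN_priors, type = 'stat', stat = 'median')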
Now let’s look a bit closer at the posterior distributions of the fixed effect slope parameters. summary() only gives us a point estimate and the upper and lower bounds of the 95% quantile-based credible interval. That’s a good start but we might want to explore the posterior distribution in more detail. After all, Bayesian inference gives us a fully characterized uncertainty distribution. We have 4000 posterior samples for each parameter. Let’s use some functions from the tidybayes package to make tables and plots of the slope parameters. tidybayes is an excellent package with lots of cool tricks for manipulating and visualizing the output of a brms model.
First we use gather_draws() to pull out the posterior samples for some variables of our choosing and make a “tidy” data frame out of them.
posterior_slopes <- gather_draws(fit_fixedN_priors, b_Intercept, b_N1N, b_N2N, b_N3N)
posterior_slopes
## # A tibble: 16,000 × 5
## # Groups: .variable [4]
## .chain .iteration .draw .variable .value
## <int> <int> <int> <chr> <dbl>
## 1 1 1 1 b_Intercept 14.3
## 2 1 2 2 b_Intercept 13.8
## 3 1 3 3 b_Intercept 11.7
## 4 1 4 4 b_Intercept 11.5
## 5 1 5 5 b_Intercept 11.5
## 6 1 6 6 b_Intercept 11.9
## 7 1 7 7 b_Intercept 11.5
## 8 1 8 8 b_Intercept 12.1
## 9 1 9 9 b_Intercept 12.3
## 10 1 10 10 b_Intercept 13.7
## # ℹ 15,990 more rows
The result is 16,000 rows (4000 posterior samples for each of the intercept and three slope parameters) and has columns indicating which chain and which iteration each sample comes from. Because the chains are all converged and mixed, we can ignore which chain each sample comes from and pool them all together.
We can calculate summary statistics (quantiles) for each of the parameters with median_qi(). By default it gives 95% intervals but we can specify any width(s).
posterior_slopes %>%
median_qi(.width = c(.66, .95, .99))
## # A tibble: 12 × 7
## .variable .value .lower .upper .width .point .interval
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 b_Intercept 13.1 11.6 14.6 0.66 median qi
## 2 b_N1N 3.39 2.92 3.85 0.66 median qi
## 3 b_N2N 3.93 3.38 4.48 0.66 median qi
## 4 b_N3N 4.94 4.25 5.61 0.66 median qi
## 5 b_Intercept 13.1 9.95 16.2 0.95 median qi
## 6 b_N1N 3.39 2.43 4.39 0.95 median qi
## 7 b_N2N 3.93 2.79 5.10 0.95 median qi
## 8 b_N3N 4.94 3.53 6.39 0.95 median qi
## 9 b_Intercept 13.1 8.69 17.4 0.99 median qi
## 10 b_N1N 3.39 2.13 4.70 0.99 median qi
## 11 b_N2N 3.93 2.43 5.49 0.99 median qi
## 12 b_N3N 4.94 3.14 6.79 0.99 median qi
The default model summary gives us four fixed effect parameters. The intercept is the model’s posterior estimate of mean yield at the reference level (0 added N). The three slopes are the difference between the reference level and the other three treatments. But it would be nice to also know what the model predicts for mean yield in the other three treatments, and to estimate how big of a difference it predicts between other pairs of treatments not including the 0 N treatment. We can estimate marginal means for the treatments in a Bayesian model just like we do in a frequentist model. We do this by working with the posterior distributions. The marginal means are estimated for each draw from the posteriors.
Let’s do an example “by hand” first. To find the mean yield prediction for the 1 N treatment, we add the posterior samples of the intercept (mean in the 0 N treatment) and the posterior samples of the slope for 1 N (difference between 0 N and 1 N treatments).
posterior_mean <- pivot_wider(posterior_slopes, names_from = .variable, values_from = .value) %>%
mutate(mean_1N = b_Intercept + b_N1N)
median_qi(posterior_mean$mean_1N)
## y ymin ymax .width .point .interval
## 1 16.54183 13.32564 19.44724 0.95 median qi
We see that the median of the distribution is ~16 with a 95% credible interval ranging from ~13 to ~20.
We can use the function emmeans() from the emmeans package, which implements all the math of estimating the marginal means for us. This saves us the trouble of writing out the code to sum up all the posterior samples:
N_emmeans <- emmeans(fit_fixedN_priors, ~ N)
N_emmeans
## N emmean lower.HPD upper.HPD
## 0N 13.1 10.1 16.3
## 1N 16.5 13.3 19.5
## 2N 17.1 13.7 20.0
## 3N 18.1 14.7 21.1
##
## Point estimate displayed: median
## HPD interval probability: 0.95
What if we wanted to test hypotheses about differences between specific pairs of means? We can do that with our Bayesian model by subtracting the posterior samples of one mean from the other. The contrast() function in the emmeans package makes that straightforward:
contrast(N_emmeans, method = 'pairwise')
## contrast estimate lower.HPD upper.HPD
## 0N - 1N -3.386 -4.40 -2.443
## 0N - 2N -3.927 -5.05 -2.753
## 0N - 3N -4.945 -6.43 -3.591
## 1N - 2N -0.536 -1.51 0.442
## 1N - 3N -1.558 -2.80 -0.247
## 2N - 3N -1.010 -2.39 0.379
##
## Point estimate displayed: median
## HPD interval probability: 0.95
Instead of trying to reject a null hypothesis that any of these differences is equal to zero, we look at the effect size and credible interval for each difference. We can see that there is evidence that the 0 N treatment has lower yield than the other 3 treatments. For example, the posterior probability is 95% that the difference between 0 N and 1 N is between ~-4.5 and ~-2.5. But when comparing 1 N and 2 N, we see the posterior median of the difference is only ~-0.5 and the posterior probability is 95% that the difference is between ~-1.5 and ~0.5. This is consistent with no difference (but also consistent with a small but nonzero difference)!
There are Bayesian versions of null hypothesis tests, too, which we’ll cover shortly.
We can also use some special ggplot2 extensions to make different kinds of plots that illustrate the shape of the posterior and the credible intervals. Here are two ways of doing it. First we pull out the posterior samples from the emmeans object with the function gather_emmeans_draws().
We can make a density plot of the posterior samples of the estimated marginal means, with the median and some credible intervals shown as error bars.
post_emm_draws <- gather_emmeans_draws(N_emmeans)
ggplot(post_emm_draws, aes(y = N, x = .value)) +
stat_halfeye(.width = c(.8, .95)) +
labs(x = 'estimated marginal mean yield')
I also like to plot the credible intervals as shaded bars, with the median as a point estimate in the middle.
ggplot(post_emm_draws, aes(y = N, x = .value)) +
stat_interval() +
stat_summary(fun = median, geom = 'point', size = 2) +
scale_color_brewer(palette = 'Blues') +
labs(x = 'estimated marginal mean yield')
Because these posteriors are so symmetrical and resemble normal distributions, the credible intervals are a good way of describing the distribution and it isn’t really necessary to show the density plots. But for many skewed posteriors, it is often helpful to illustrate the shape of the distribution.
We will show more examples of how to summarize and visualize predictions later on.
We have ignored the P and K treatments so far. Let’s include those terms in the model as well. We just use the + sign to add additional terms to the fixed effect portion of the model specification. We will continue to use our weakly informative priors on the fixed effect slopes.
fit_fixedNPK <- brm(yield ~ 1 + N + P + K + (1 | site) + (1 | block:site),
data = stvincent,
prior = c(prior(normal(0, 10), class = b)),
chains = 4,
iter = 2000,
warmup = 1000,
seed = 3,
file = 'fits/fit_fixedNPK')
Note: I am not going to include plot(fit) or pp_check(fit) explicitly in this tutorial anymore in the interest of space. But you should always generate those diagnostics for any model you fit as a matter of routine.
Let’s look at the summary of this fit.
summary(fit_fixedNPK)
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: yield ~ 1 + N + P + K + (1 | site) + (1 | block:site)
## Data: stvincent (Number of observations: 324)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Multilevel Hyperparameters:
## ~block:site (Number of levels: 36)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 1.25 0.36 0.56 1.99 1.00 1322 1444
##
## ~site (Number of levels: 9)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 4.73 1.35 2.78 7.87 1.00 1512 1949
##
## Regression Coefficients:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 12.87 1.67 9.57 16.10 1.00 1426 1898
## N1N 0.58 5.84 -10.97 11.66 1.00 2169 2067
## N2N 3.93 0.57 2.80 5.05 1.00 6215 2992
## N3N 1.95 5.86 -9.72 13.26 1.00 2081 2072
## P1P 1.92 5.96 -9.63 13.57 1.00 2533 2460
## P2P 0.75 0.57 -0.37 1.88 1.00 5781 2886
## P3P 0.77 5.96 -10.87 12.49 1.00 2495 2474
## K1K 1.30 5.81 -10.12 12.55 1.00 2760 2126
## K2K -0.37 0.57 -1.51 0.76 1.00 5832 2821
## K3K 1.62 5.82 -9.78 12.84 1.00 2782 2377
##
## Further Distributional Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 3.52 0.15 3.24 3.82 1.00 3885 2962
##
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
We have many more fixed effect parameters than before. Now that we've added more predictors, the variation explained by N addition is smaller and there is less evidence for a difference between the means of the different N treatments. The 95% credible interval of most of the slopes is consistent with no effect, but we still see that most of the probability mass for the 2N term is well above 0.
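If you want to put a number on that last impression, one option is brms's built-in hypothesis() function, which reports the evidence ratio and posterior probability for a directional hypothesis. A minimal sketch, using the N2N coefficient name from the summary above:
# Posterior probability that the 2N effect on yield is greater than zero
hypothesis(fit_fixedNPK, 'N2N > 0')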
Feel free to explore the predictions of this model on your own. Next we will add a bit more complexity to the model by incorporating random slopes.
So far, we have assumed that the effects of N, P, and K are constant across all the sites. We’ve allowed the intercept for yield (representing the baseline yield at each site) to be different from site to site, but not the slope with respect to N (representing the N treatment effect at each site). But if you recall from the plot we made, the response to N fertilization was very different at the different sites. We can add random slopes terms to the model to allow not only the intercept, but also the slope, of the yield-N relationship to vary by site.
To specify random slopes, we modify the design side of the random-effect term. For our random intercept model, we used (1 | site). To fit a random slope for each site with respect to N, P, and K, we will use (1 + N + P + K | site).
We could also do this for blocks within sites, but I would not advise it. That's because there are only a few data points within each block, sometimes only one for each N treatment. That's just barely enough to get a good estimate of average yield by block for the random block intercept, but not enough to get a good estimate of treatment effect within a single block.
fit_randomNPKslopes <- brm(yield ~ 1 + N + P + K + (1 + N + P + K | site) + (1 | block:site),
                           data = stvincent,
                           prior = c(prior(normal(0, 10), class = b)),
                           chains = 4,
                           iter = 2000,
                           warmup = 1000,
                           seed = 4,
                           file = 'fits/fit_randomNPKslopes')
Let’s take a look at this fit.
summary(fit_randomNPKslopes)
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: yield ~ 1 + N + P + K + (1 + N + P + K | site) + (1 | block:site)
## Data: stvincent (Number of observations: 324)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Multilevel Hyperparameters:
## ~block:site (Number of levels: 36)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 1.44 0.32 0.86 2.15 1.00 1707 2478
##
## ~site (Number of levels: 9)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 4.54 1.42 2.50 7.94 1.00 2264 3103
## sd(N1N) 0.86 0.69 0.03 2.57 1.00 3069 2491
## sd(N2N) 3.41 1.03 1.86 5.88 1.00 2355 2695
## sd(N3N) 3.56 1.34 1.48 6.87 1.00 2425 2862
## sd(P1P) 1.16 0.80 0.06 3.00 1.00 2141 1832
## sd(P2P) 1.69 0.92 0.16 3.83 1.00 1736 1302
## sd(P3P) 1.03 0.87 0.03 3.15 1.00 2061 1447
## sd(K1K) 1.42 0.96 0.07 3.66 1.00 1702 1984
## sd(K2K) 1.44 0.85 0.13 3.37 1.00 1933 1832
## sd(K3K) 2.08 1.20 0.22 4.74 1.00 1907 1714
## cor(Intercept,N1N) 0.00 0.31 -0.58 0.58 1.00 7364 2833
## cor(Intercept,N2N) 0.02 0.25 -0.45 0.49 1.00 3727 3252
## cor(N1N,N2N) 0.09 0.31 -0.51 0.65 1.00 1761 2822
## cor(Intercept,N3N) -0.02 0.25 -0.49 0.45 1.00 4796 3507
## cor(N1N,N3N) 0.11 0.30 -0.50 0.66 1.00 2239 2811
## cor(N2N,N3N) 0.40 0.24 -0.14 0.80 1.00 3871 3328
## cor(Intercept,P1P) -0.03 0.29 -0.58 0.53 1.00 6263 2977
## cor(N1N,P1P) -0.03 0.30 -0.60 0.54 1.00 4491 2923
## cor(N2N,P1P) 0.21 0.30 -0.42 0.71 1.00 4457 3616
## cor(N3N,P1P) 0.11 0.30 -0.50 0.65 1.00 4642 3224
## cor(Intercept,P2P) 0.17 0.27 -0.39 0.67 1.00 4861 2786
## cor(N1N,P2P) 0.05 0.31 -0.53 0.62 1.00 3646 3160
## cor(N2N,P2P) -0.07 0.27 -0.57 0.47 1.00 4937 2844
## cor(N3N,P2P) 0.00 0.27 -0.53 0.52 1.00 4194 3552
## cor(P1P,P2P) 0.06 0.30 -0.52 0.62 1.00 2884 3174
## cor(Intercept,P3P) 0.03 0.30 -0.56 0.59 1.00 8537 3324
## cor(N1N,P3P) -0.02 0.30 -0.60 0.55 1.00 6147 2996
## cor(N2N,P3P) 0.02 0.29 -0.54 0.57 1.00 6191 2822
## cor(N3N,P3P) 0.01 0.30 -0.58 0.59 1.00 4966 2884
## cor(P1P,P3P) 0.05 0.30 -0.54 0.60 1.00 3686 3111
## cor(P2P,P3P) -0.00 0.30 -0.58 0.58 1.00 3421 3453
## cor(Intercept,K1K) 0.12 0.28 -0.45 0.63 1.00 5864 2942
## cor(N1N,K1K) -0.06 0.30 -0.62 0.54 1.00 3739 3730
## cor(N2N,K1K) 0.15 0.29 -0.44 0.66 1.00 4673 2986
## cor(N3N,K1K) 0.08 0.28 -0.48 0.61 1.00 5181 3411
## cor(P1P,K1K) 0.00 0.30 -0.56 0.60 1.00 4043 3532
## cor(P2P,K1K) 0.17 0.30 -0.46 0.71 1.00 3124 3685
## cor(P3P,K1K) -0.05 0.31 -0.62 0.55 1.00 3038 3532
## cor(Intercept,K2K) -0.07 0.27 -0.57 0.48 1.00 6403 3346
## cor(N1N,K2K) -0.01 0.30 -0.58 0.57 1.00 3513 2834
## cor(N2N,K2K) -0.24 0.27 -0.71 0.36 1.00 4546 3079
## cor(N3N,K2K) -0.16 0.28 -0.68 0.42 1.00 4305 3034
## cor(P1P,K2K) -0.09 0.29 -0.62 0.50 1.00 3021 3213
## cor(P2P,K2K) -0.09 0.29 -0.64 0.48 1.00 3761 3397
## cor(P3P,K2K) 0.03 0.30 -0.55 0.60 1.00 2940 3403
## cor(K1K,K2K) -0.03 0.29 -0.59 0.54 1.00 3180 3250
## cor(Intercept,K3K) -0.20 0.28 -0.69 0.37 1.00 5782 2947
## cor(N1N,K3K) -0.03 0.31 -0.59 0.56 1.00 4399 3366
## cor(N2N,K3K) 0.15 0.28 -0.40 0.65 1.00 4199 3246
## cor(N3N,K3K) 0.07 0.28 -0.47 0.60 1.00 4316 3603
## cor(P1P,K3K) -0.03 0.29 -0.58 0.53 1.00 3274 3386
## cor(P2P,K3K) -0.11 0.29 -0.64 0.48 1.00 3620 3379
## cor(P3P,K3K) -0.00 0.30 -0.58 0.55 1.00 2995 3025
## cor(K1K,K3K) 0.01 0.30 -0.57 0.56 1.00 2648 3166
## cor(K2K,K3K) -0.04 0.29 -0.59 0.51 1.00 3129 3599
##
## Regression Coefficients:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept 12.92 1.60 9.73 16.21 1.00 1383 2108
## N1N 0.69 5.74 -10.54 11.50 1.00 3132 2836
## N2N 3.88 1.30 1.41 6.44 1.00 2133 2419
## N3N 1.99 5.78 -9.46 12.99 1.00 3253 2797
## P1P 2.03 5.71 -8.85 13.56 1.00 2932 2900
## P2P 0.76 0.81 -0.87 2.36 1.00 3572 3061
## P3P 0.80 5.74 -10.43 12.28 1.00 2891 2983
## K1K 1.08 5.72 -10.28 12.09 1.00 2944 3108
## K2K -0.34 0.74 -1.79 1.15 1.00 3968 2507
## K3K 1.47 5.76 -9.92 12.61 1.00 2939 3170
##
## Further Distributional Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 3.06 0.14 2.80 3.36 1.00 3606 2777
##
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Wow, that’s a lot of parameters! The reason there are so many is that the random intercept and the three random slopes, each of which has three levels, all have a standard deviation and a correlation with one another.
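If you only want the variance components without scrolling through the full summary, you can pull them out directly. A small sketch:
# Standard deviations and correlations of the random effects only
VarCorr(fit_randomNPKslopes)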
Let’s take a look at the overall predictions for response to N addition. Note these means now average across a lot of things: the random intercepts by site and block (differing baselines), the random slopes by site (differing responses), and the other fixed effects (predictions are made averaged across all levels of P and K addition).
N_emmeans <- emmeans(fit_randomNPKslopes, ~ N)
N_emmeans
## N emmean lower.HPD upper.HPD
## 0N 14.3 7.89 20.7
## 1N 15.1 8.56 21.4
## 2N 18.2 10.98 24.7
## 3N 16.3 9.84 23.5
##
## Results are averaged over the levels of: P, K
## Point estimate displayed: median
## HPD interval probability: 0.95
N_contrasts <- contrast(N_emmeans, 'pairwise')
N_contrasts
## contrast estimate lower.HPD upper.HPD
## 0N - 1N -0.799 -11.68 10.24
## 0N - 2N -3.845 -6.46 -1.43
## 0N - 3N -2.165 -13.30 9.11
## 1N - 2N -3.091 -14.71 7.73
## 1N - 3N -1.298 -4.02 1.51
## 2N - 3N 1.772 -9.50 12.93
##
## Results are averaged over the levels of: P, K
## Point estimate displayed: median
## HPD interval probability: 0.95
The point estimates of the marginal means are closer together than in the model where we ignored P and K treatments and variation in response by site, and the 95% credible intervals are wider.
What about the site-level variation in responses? You can use the ranef() function to get a quick summary of the random effects. I won't show the output here to save space.
ranef(fit_randomNPKslopes)
We can use the tidybayes function add_epred_draws() to generate predictions for any desired combination of fixed and random effects, and plot their posterior distributions. Here epred refers to the expectation of the posterior predictive distribution.
Here, we will create a plot of site-level variation in N response. Predictions are made for all levels of N addition (fixed effect) at each site (random intercept and slope), at the control levels of P and K treatment, and averaging over the block-level random intercepts. Medians are connected with a line.
N_epred <- expand_grid(N = unique(stvincent$N),
                       P = '0P',
                       K = '0K',
                       site = unique(stvincent$site)) %>%
  add_epred_draws(fit_randomNPKslopes, re_formula = ~ (1 + N | site), value = 'yield')
ggplot(N_epred, aes(x = N, y = yield)) +
  stat_interval() +
  stat_summary(fun = median, geom = 'point', size = 2) +
  stat_summary(aes(x = as.numeric(N)), fun = median, geom = 'line') +
  scale_color_brewer(palette = 'Greens') +
  facet_wrap(~ site)
If you compare this plot with the original boxplots we made of the raw data at the beginning of this lesson, you might notice the trends within site are a little bit less extreme than the raw data might indicate. The difference between the model’s predictions of yield for each N level is a little bit smaller than the differences in the raw data for each site. That illustrates what’s called “shrinkage”: the mixed model is adjusting for the overestimation of trends we would have if we fit separate models for each field. The slopes “shrink” toward the mean overall slope. This makes the model provide better predictions for new data, at the expense of fitting the original dataset a little less closely. Check out this animated demo of shrinkage for more details!
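If you'd like to look at the shrunken site-level estimates directly, coef() returns the combined (fixed plus random) coefficients for each site, which you can compare against the site-by-site raw means. A sketch, using the N2N coefficient name from the summary above:
# 3-D array of site-level coefficients: site x summary statistic x parameter
site_coefs <- coef(fit_randomNPKslopes)$site
# Posterior summary of the 2N effect at each site
site_coefs[, , 'N2N']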
Incidentally, generating this kind of plot with uncertainty intervals around random effect estimates is not easy to do for frequentist models. You usually have to do some kind of bootstrapping technique. But Bayesian models make it easy to estimate uncertainty around any quantity in the model with posterior distributions!
We can also add interaction terms if we want to test whether any of the fixed effects interact. For example, is the effect of N on yield stronger if no P is added, or if a high level of P is added? You can specify the interaction between fixed effects a and b by adding a:b to the fixed-effect part of the model formula, or a:b:c for three-way interactions, and so on.
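As a concrete illustration, a model formula with a two-way N-by-P interaction might look like the sketch below (shown only as a formula, not fit here):
# Sketch only: the fixed-effect part now includes the N:P interaction
yield ~ 1 + N + P + K + N:P + (1 + N + P + K | site) + (1 | block:site)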
I will leave the interaction models for you to fit as an exercise.
The models with different specifications for fixed and random effects have different numbers of parameters. We have used the pp_check() plot to "eyeball" how well the models fit the data. We can also use leave-one-out (LOO) cross-validation to produce an information criterion for any of the models we've fit so far, and then compare them. In brms this is a two-step process: first use add_criterion() to compute an information criterion for each model, then use loo_compare() to rank a set of models by that criterion.
Let's compare all the models we've fit: the intercept-only model, the model with the N fixed effect only, the model with NPK fixed effects, and the model with NPK fixed effects and random slopes.
fit_interceptonly_moresamples <- add_criterion(fit_interceptonly_moresamples, 'loo')
fit_fixedN_priors <- add_criterion(fit_fixedN_priors, 'loo')
fit_fixedNPK <- add_criterion(fit_fixedNPK, 'loo')
fit_randomNPKslopes <- add_criterion(fit_randomNPKslopes, 'loo')
loo_compare(fit_interceptonly_moresamples, fit_fixedN_priors, fit_fixedNPK, fit_randomNPKslopes)
## elpd_diff se_diff
## fit_randomNPKslopes 0.0 0.0
## fit_fixedN_priors -21.7 8.1
## fit_fixedNPK -22.7 7.7
## fit_interceptonly_moresamples -53.1 12.2
NOTE: You may see a warning about problematic observations. The number of problematic observations is small enough that we can ignore this warning for now. If many observations are found to be problematic, you might need to follow the recommended steps to deal with that issue.
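If you ever do see many problematic observations, one common remedy is to recompute the criterion with exact refits for those observations. A sketch (reloo = TRUE refits the model once per problematic observation, so it can be slow):
fit_randomNPKslopes <- add_criterion(fit_randomNPKslopes, 'loo', reloo = TRUE)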
This output ranks the models from best fit to worst by ELPD, or expected log pointwise predictive density (a measure of how well the model, fit to all but one data point from the original dataset, predicts the held-out data point). The elpd_diff column is the difference between each model and the one with the highest ELPD (thus the top row will always have an elpd_diff of zero). The se_diff column is the standard error of that difference.
We see that the fit with all random slopes included is ranked best in our comparison. The models without any random slopes perform far worse (by about 20 log units), and the intercept-only model is a very bad fit (not surprisingly). This should emphasize to us that averaging the response to fertilization across all sites does not tell the whole story in this system. We have to account for site-level variation in response. As a follow-up, it might be a good idea to measure different covariates at each site to try to account for this variation in response, instead of just treating among-site variation as random.
“Credible intervals are great and all,” you may be saying to yourself, “but what about a p-value? How am I supposed to know if x has an effect on y or not?” It’s understandable to feel that way. Bayesian analysis frees us from the restriction of always having to do all our tests against a (not necessarily realistic or relevant) null hypothesis and trying to reject it. But many people are more comfortable still thinking in the null hypothesis significance testing framework, and it may be important and necessary to test null hypotheses in certain real-world situations. Luckily, there are some Bayesian analogues to p-values that allow us to assess the strength of evidence for null hypotheses (see Makowski et al. 2019, Frontiers in Psychology).
One type of analogue to a p-value for a Bayesian model is the Bayes factor (BF). This is the ratio of evidence between two models (going back to Bayes' theorem, it's \(\frac{P(model_1|data)}{P(model_2|data)}\)). BFs can range from 0 to infinity. A BF of 1 means a 1:1 ratio (we have equal evidence for both models). As BF increases, we have more and more evidence for model 1.
There’s no universally agreed-upon “cutoff” for declaring “significance” of a BF. There are two different interpretation tables shown in the Wikipedia article on Bayes factors. Roughly speaking, a BF of 10 or greater would indicate strong evidence.
The R package bayestestR, also part of easystats, allows us to compute a BF for individual model parameters by calculating the ratio of evidence for the posterior to evidence for the prior distribution of each parameter. We can use these values as a measure of the strength of evidence for each parameter’s posterior being different than the prior. Because our prior distributions were centered around zero, we can consider them roughly like “null hypotheses.” Thus a BF = 1 means our data didn’t change our belief about the “null” prior at all, BF > 1 would mean we have stronger evidence for the posterior, and BF < 1 means we believe more in our original null prior after seeing the data.
Let's say we want to test the set of null hypotheses for our pairwise comparisons between N addition levels. We have a posterior distribution of the pairwise difference between the means for each N level. We need to sample from the prior distribution of the model (for this we use unupdate()) and calculate the same set of comparisons. Then we can calculate a BF for each of the six pairwise comparisons. Like p-values, BFs may be one-sided or two-sided (two-sided is the default). By default, the null hypothesis that the difference is 0 is tested, but this default may be changed.
fit_randomNPKslopes_prior <- unupdate(fit_randomNPKslopes)
N_emmeans_prior <- emmeans(fit_randomNPKslopes_prior, ~ N)
N_contrasts_prior <- contrast(N_emmeans_prior, 'pairwise')
bayesfactor_parameters(posterior = N_contrasts, prior = N_contrasts_prior)
## Warning: Bayes factors might not be precise.
## For precise Bayes factors, sampling at least 40,000 posterior samples is
## recommended.
## Bayes Factor (Savage-Dickey density ratio)
##
## Parameter | BF
## -----------------
## 0N - 1N | 0.636
## 0N - 2N | 7.56
## 0N - 3N | 0.633
## 1N - 2N | 0.486
## 1N - 3N | 0.157
## 2N - 3N | 0.460
##
## * Evidence Against The Null: 0
Here we see that BF < 1 for most of the fixed effect slopes, except for the comparison between 0N and 2N, where BF is ~8. This indicates moderately strong evidence that the 2N treatment has different mean yield than the unfertilized control, but we have no evidence for any other fertilization effect on yield.
Warning: Bayes factors are very sensitive to your choice of prior. This is true even if your prior has little or no influence on the shape of the posterior. If you do decide to report BFs in a publication, be sure to carefully document how they were calculated.
Also notice that we see a warning about generating more posterior samples to get a precise estimate of BF. This is a good idea in practice but we won’t bother with that for this lesson.
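If you did want to follow that recommendation, you could draw more posterior samples with update() instead of rewriting the whole brm() call. A sketch (the iteration numbers are arbitrary, chosen so that 4 chains yield 40,000 post-warmup draws); you would then repeat the emmeans, contrast, and bayesfactor_parameters steps with the new fit:
fit_moredraws <- update(fit_randomNPKslopes, chains = 4, iter = 11000, warmup = 1000)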
Recently, additional analogues to p-values have been developed for Bayesian models and implemented in the bayestestR package. I will briefly show you two of them here.
The first is the MAP-based p-value. It compares the posterior density at a single null value to the density at the mode of the posterior distribution of a model parameter or estimate (the "maximum a posteriori" or MAP value). This value has a similar interpretation to a p-value: a lower value means lower odds that the null hypothesis is true. We use the function p_pointnull() to generate a MAP-based p-value for each of our contrasts.
p_pointnull(N_contrasts, null = 0)
## MAP-based p-value
##
## Parameter | p (MAP)
## -------------------
## 0N - 1N | 0.962
## 0N - 2N | 0.015
## 0N - 3N | 0.869
## 1N - 2N | 0.858
## 1N - 3N | 0.562
## 2N - 3N | 0.940
The interpretation is similar here, where we have evidence that the difference between 0N and 2N is not zero, but not for any of the others. Note that this is not as sensitive to your choice of prior as BF is, but it is sensitive to the method used to estimate the posterior density — I won’t get into the details here.
Having the full distribution of our posterior estimates means we aren't limited to testing the hypothesis that a value is exactly zero. We may also define a difference between means that we think is biologically or practically relevant, and test whether we have evidence that our observed difference is greater than that. For example, we might decide that yield differences of less than 1 are practically equivalent. In other words, if we would only increase yields by +1, it would not be worth fertilizing with N. In that case (-1, 1) is a two-sided "region of practical equivalence" (ROPE). We can use p_rope() to get probabilities that each pairwise difference lies inside that range.
p_rope(N_contrasts, range = c(-1, 1))
## Proportion of samples inside the ROPE [-1.00, 1.00]
##
## Parameter | p (ROPE)
## --------------------
## 0N - 1N | 0.131
## 0N - 2N | 0.014
## 0N - 3N | 0.116
## 1N - 2N | 0.114
## 1N - 3N | 0.358
## 2N - 3N | 0.121
We get a fairly similar interpretation to the other p-value analogues. What if we decided a yield benefit of 0.5 is worth it? We can reduce the width of the ROPE:
p_rope(N_contrasts, range = c(-0.5, 0.5))
## Proportion of samples inside the ROPE [-0.50, 0.50]
##
## Parameter | p (ROPE)
## --------------------
## 0N - 1N | 0.066
## 0N - 2N | 0.006
## 0N - 3N | 0.054
## 1N - 2N | 0.058
## 1N - 3N | 0.180
## 2N - 3N | 0.056
We start to see modest evidence that the difference is greater than 0.5 in magnitude for many of the pairwise comparisons. An advantage of the ROPE-based "p-value" is that it doesn't use the prior distribution anywhere in the calculation, so it is not sensitive to your choice of prior. It also forces you to be explicit about the effect size that you judge to be relevant, which is a nice bonus. Of course, it is very sensitive to your choice of the upper and lower bounds of the ROPE, so if you present or publish something like this, I'd recommend presenting probabilities for a range of ROPEs (one way to do that is sketched below).
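Here is one way you might tabulate the ROPE probability over several candidate half-widths (a sketch; the half-widths are arbitrary, and as.data.frame() is used so we don't depend on the exact column names p_rope() returns):
# Proportion of posterior samples inside (-w, w) for several half-widths w
rope_widths <- c(0.25, 0.5, 1, 2)
rope_table <- do.call(rbind, lapply(rope_widths, function(w) {
  cbind(half_width = w, as.data.frame(p_rope(N_contrasts, range = c(-w, w))))
}))
rope_table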
What did we just learn? Let’s take a look back at the learning objectives!
You now understand…
You now can…
That’s an amazing set of skills!
As I’ve already mentioned, this is not intended to be a replacement for a full Bayesian course. Richard McElreath has written an amazing Bayesian course called Statistical Rethinking. It has a print book and ebook version, as well as a set of free video lectures. Solomon Kurz translated all the code examples in the book to brms, ggplot2, and tidyverse code: Statistical Rethinking Recoded. Both are highly recommended.
Another useful book-length course is Bayes Rules!. I personally haven’t gone through the whole thing but other folks have praised it. It’s based on the rstan interface.
Here are some links to tutorials and lessons for learning basic Bayesian statistics with brms.
Here is a tutorial that goes through all the cool plots and tables you can make from a brms model with Matthew Kay’s tidybayes package. I’ve shown a few of them in this lesson but there are many other visualizations available.
Paul Buerkner published two papers on brms:
There are also other ways to fit Bayesian mixed models in R.
You can use Stan without R as well, if that’s your thing.
These three exercises should help you test your understanding of Bayesian statistics and using brms.
public domain image by Mont Sudbury
Imagine you read in “The Field Guide to North American Werewolves” that about 1 in every 1000 people in North America is a werewolf. You have a werewolf detector that is 99% accurate (it correctly identifies werewolves 99% of the time, and correctly identifies people as not werewolves also 99% of the time). You point your werewolf detector at the mailman and it says “DANGER! WEREWOLF DETECTED!” You use Bayes’ Theorem to calculate that the probability that the mailman really is a werewolf is 9%.
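If you want to check that 9% figure, the arithmetic is a direct application of Bayes' Theorem, with \(W\) meaning "is a werewolf" and \(D\) meaning "the detector says werewolf":
\[P(W|D) = \frac{P(D|W)P(W)}{P(D)} = \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.01 \times 0.999} \approx 0.09\]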
Question: In this situation, what is the data, what is the likelihood, what is the prior, and what is the posterior?
Take a look at these two MCMC trace plots. Which one do you think indicates better model convergence? What would you recommend the researcher do in each case?
Add one or more interaction terms to the model that we have been working with. For example, you could test for the two-way interactions N:P, N:K, and P:K. Examine the model diagnostics, show or plot the parameter estimates, and use LOO cross-validation to compare this model to the ones we fit previously. What can you conclude?
Hint: add one or more interaction terms, e.g. + N:P, to the fixed-effects part of the model formula. To compare models with LOO cross-validation, call add_criterion(model, 'loo') on each model and then use loo_compare().