This lesson picks up where Lesson 1 left off, continuing a practical introduction to fitting Bayesian multilevel models in R and Stan using the brms package (Bayesian Regression Models using Stan).
Download the worksheet for this lesson here.
We started out learning the basic ideas behind Bayes.
We learned how to harness the power of the Hamiltonian Monte Carlo method to fit Bayesian models, using the Stan language via the R package brms. Specifically, we learned how to:
Of course, we only scratched the surface of those complex topics. But if you don’t remember any of that, now would be a good time to review Lesson 1.
After teaching Lesson 1, I got feedback from the participants. People asked me to teach them how to fit mixed models that correspond to common study designs in agricultural science. I also was asked to take a little deeper dive into prior distributions and show how specifying different priors can impact the conclusions we come to. I designed Lessons 2 and 3 based on that feedback. In Lesson 2, you are going to learn how to:
You will learn some more practical skills for fitting Bayesian models and talking about the results in papers or presentations. We’ve seen these packages already in Lesson 1 but we will keep practicing them. In this lesson we will:
Load all the packages and set a plotting theme. Also define a color scale that is colorblind-friendly. Create a directory called fits in your current working directory to store fitted model objects, if it does not already exist.
library(brms)
library(agridat)
library(tidyverse)
library(emmeans)
library(tidybayes)
library(ggdist)
library(bayestestR)
library(gt)
theme_set(theme_bw())
colorblind_friendly <- scale_color_manual(values = unname(palette.colors(5)[-1]), aesthetics = c('colour', 'fill'))
if (!dir.exists('fits')) dir.create('fits')
Set brms options. The option mc.cores = 4 means we will use four processor cores in parallel to fit the models, meaning we can allow four Markov chains to sample simultaneously. The option brms.file_refit = 'on_change' means that if a model has already been fit and saved to a file, and the code and data are unchanged, the model will be reloaded from the file instead of being fit again. If you are running this lesson on the cloud server, there are some pre-fit model objects to save time during the workshop.
options(mc.cores = 4, brms.file_refit = 'on_change')
If you are running this code on the Posit Cloud server, set these options instead of the ones above. This will use the faster and less memory-intensive cmdstanr package to fit the models behind the scenes. Unfortunately, the CmdStan software that cmdstanr uses is difficult to install on USDA laptops, but it can be done. Check out instructions for installing CmdStan here, and if those don’t work, instructions for a workaround to install CmdStan using Anaconda. If you end up wanting to run Bayesian models on your own data after this workshop, another option is to run brms code on the SciNet cluster, where the cmdstanr package can easily be configured.
options(mc.cores = 4, brms.file_refit = 'on_change', brms.backend = 'cmdstanr')
Before we dive into more complex models, let’s review Bayesian statistics a little bit. Don’t feel bad if it takes you a few times to really grasp the concepts. Most people need a lot of repetition before these ideas become second nature.
Let’s start by going over Bayes’ theorem again.
\[P(A|B) = \frac{P(B|A)P(A)}{P(B)}\]
This means the probability of A being true given that B is true (\(P(A|B)\)) is equal to the probability that B is true given that A is true (\(P(B|A)\)) times the ratio of probabilities that A and B are true (\(\frac{P(A)}{P(B)}\)). Using Bayes’ theorem in the context of fitting a statistical model to data, we can state this as:
\[P(model|data) = \frac {P(data|model)P(model)}{P(data)}\]
or
\[posterior \propto likelihood \times prior\]
What does this mean? It means that we have some prior knowledge about the world. We formally state that knowledge by giving a statistical distribution of what we think the parameters of the statistical model describing the world are, without having collected any data yet. This is called the prior, which in Bayes’ theorem is represented by \(P(model)\) because it’s the probability of the model not dependent on the data. Then, we go out and collect some data. We use the data to calculate \(P(data|model)\), which we call the likelihood. That’s a function of the data that tells us, for every possible combination of parameter values our model could have, how likely it was that we would observe the data we did. We also divide by the marginal probability, \(P(data)\), which is basically just a normalizing constant: it ensures that the posterior has an area under the curve of 1. (In a probability distribution, adding up the probabilities of all possible mutually exclusive outcomes has to result in a sum of 1, or 100% probability.) Then, when we multiply the prior and likelihood together and apply that normalizing constant, we get \(P(model|data)\), which we call the posterior. That is our new knowledge of the world, after collecting the data, represented by a new statistical distribution for our parameter estimates. Because it is the product of the prior and the likelihood, it represents a combination of what we knew before we got the data (the prior), updated by the new information we got from the data (the likelihood).
Likelihood is the basis of both the classical statistical models that you are familiar with, and of Bayesian models. What makes Bayesian models unique is the use of prior information. But that’s also what scares a lot of people. They are worried that using Bayesian models is like putting your thumb on the scale. The priors are subjective assumptions, they say. You can get whatever results you want because you can use any priors that you want, they say. It’s understandable to feel that way. Here is how I respond to those arguments:
Here is an illustration of how having more data increases the influence of the likelihood relative to the prior. We want to know the probability of a new vaccine successfully granting immunity to a disease.
Let’s say the baseline immunity rate is 50%. We will pessimistically assume that the vaccine is ineffective, so we will use a prior distribution whose most probable outcome is 50%, while allowing for outcomes both above and below that. Here are three scenarios:
In all three of these scenarios, the observed immunity rate is 70%. Our prior distribution is the same in each case. Here’s what the posterior distributions look like:
As the amount of data goes up, the evidence that the true immunity rate of vaccinated animals is 70% becomes stronger. Thus, the influence of our prior distribution centered at 50% keeps getting weaker and weaker, and the influence of the likelihood (informed by the data) keeps getting stronger. The more data you have, the more evidence you have, and the more the likelihood pulls the posterior towards itself and away from the prior. This feels intuitive, at least to me.
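If you want to see this prior-versus-likelihood tug-of-war for yourself, here is a minimal grid-approximation sketch. It is not the code used to make the figures above; the Beta(10, 10) prior and the sample sizes are my own illustrative choices.

# Prior peaked at 50% immunity; observed immunity rate is always 70%.
p_grid <- seq(0.001, 0.999, length.out = 1000)
prior <- dbeta(p_grid, 10, 10)
for (n in c(10, 100, 1000)) {
  likelihood <- dbinom(round(0.7 * n), size = n, prob = p_grid)
  posterior <- prior * likelihood
  posterior <- posterior / sum(posterior)  # normalize so probabilities sum to 1
  cat('n =', n, ': posterior mean =', round(sum(p_grid * posterior), 3), '\n')
}

As n grows, the posterior mean moves from near the prior’s 50% toward the observed 70%, exactly the pattern described above.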
I just said above that if you have lots of data, you don’t have to worry that much about the influence of the prior. But what if you have a very small dataset? Let’s look at an extreme case where we only have a very small amount of data on a controversial topic where people may have strong opinions about what the true effect “should” be.
In the agridat package, there is an example dataset called gathmann.bt. It includes abundance metrics of non-target arthropod species on a line of transgenic Bt maize and on a control line. You can type ?gathmann.bt in your console to pull up documentation for the dataset, including a reference to the original source. When Bt corn was introduced, some people were concerned that it would have bigger effects on non-target species than conventional insecticides. But others argued that it actually has smaller impacts.
Well, let’s get some data and resolve the debate! Unfortunately this dataset only has a very small sample size. We have \(n=8\) measurements of Thysanoptera abundance on Bt maize, and \(n=8\) measurements on the control line. Looking at the distribution of the raw data, it looks like the Bt line tends to have higher abundance.
data(gathmann.bt)
ggplot(gathmann.bt, aes(x = gen, y = thysan)) + geom_boxplot()
Edit the dataset so that the control line, ISO, is treated as the reference level in the model. Now if we get a positive coefficient it will mean that Bt has higher Thysanoptera abundance than control.
gathmann.bt <- mutate(gathmann.bt, gen = relevel(gen, ref = 'ISO'))
We want to compare the means of the two groups to each other. Let’s start by doing a frequentist analysis: a t-test.
t.test(formula = thysan ~ gen, data = gathmann.bt)
##
## Welch Two Sample t-test
##
## data: thysan by gen
## t = -3.2182, df = 13.834, p-value = 0.006274
## alternative hypothesis: true difference in means between group ISO and group Bt is not equal to 0
## 95 percent confidence interval:
## -9.25298 -1.84702
## sample estimates:
## mean in group ISO mean in group Bt
## 8.725 14.275
We get a p-value of 0.006 for a two-sided t-test, which we would usually say is enough evidence to reject the null hypothesis that the means are equal. The mean in the Bt group is higher, giving us significant evidence that Bt maize has higher abundance of at least one species of non-target arthropod than control maize. As always, the frequentist analysis only considers the likelihood, which is calculated from the data. It does not formally incorporate any prior knowledge we may have.
How would we do this with a Bayesian analysis? Let’s fit the model with a few different priors. I don’t really know what the units of the data are here, but they range from about 4 to 18. So we can say that anything that changes mean Thysanoptera abundance by +5 or -5 is a big effect.
I fit this model with no fewer than seven different priors. The default “flat” prior is completely uninformative. It assigns equal prior probability to any difference, including 0, +1, -1000, +1000000, you name it. This is not a realistic prior for most situations because we usually have at least some idea of the order of magnitude of the effect size, even if we have no idea if it will be positive or negative. Here are the other six prior distributions and what belief they might represent:
| prior distribution | belief |
|---|---|
| normal(0,5) | We don't know if Bt has a positive or negative effect and we think there's a moderate probability of a large effect in either direction |
| normal(0,1) | We don't know if Bt has a positive or negative effect and we think large effects in either direction are very improbable |
| normal(5,5) | We think Bt has a positive effect but we're very uncertain |
| normal(5,1) | We think Bt has a positive effect and we're very certain of that |
| normal(-5,5) | We think Bt has a negative effect but we're very uncertain |
| normal(-5,1) | We think Bt has a negative effect and we're very certain of that |
Here’s the code to fit the models.
Note: In this code we do not specify number of Markov chains, warmup iterations per chain, or total iterations including warmup and sampling. This means it will default to 4 chains, 1000 warmup iterations, and 2000 total iterations (1000 post-warmup sampling iterations) per chain. In practice for more complex models you may need to specify higher numbers of warmup iterations to aid convergence, and higher numbers of sampling iterations to accurately describe the shape of the posterior, especially its tails.
# Flat prior
m1gathmann <- brm(thysan ~ gen, data = gathmann.bt,
seed = 838, file = 'fits/m1gathmann')
# Prior saying that extreme differences between genotypes are unlikely
m2gathmann <- brm(thysan ~ gen, data = gathmann.bt,
prior = prior(normal(0, 5), class = b),
seed = 838, file = 'fits/m2gathmann')
# Prior with even stricter regularization
m3gathmann <- brm(thysan ~ gen, data = gathmann.bt,
prior = prior(normal(0, 1), class = b),
seed = 838, file = 'fits/m3gathmann')
# Prior if we think that Bt is more likely to have a positive impact on non-target species but we are uncertain
m4gathmann <- brm(thysan ~ gen, data = gathmann.bt,
prior = prior(normal(5, 5), class = b),
seed = 838, file = 'fits/m4gathmann')
# Prior if we think that Bt is more likely to have a positive impact on non-target species and we have strong prior evidence for that
m5gathmann <- brm(thysan ~ gen, data = gathmann.bt,
prior = prior(normal(5, 1), class = b),
seed = 838, file = 'fits/m5gathmann')
# Prior if we think that Bt is more likely to have a negative impact on non-target species but we are uncertain
m6gathmann <- brm(thysan ~ gen, data = gathmann.bt,
prior = prior(normal(-5, 5), class = b),
seed = 838, file = 'fits/m6gathmann')
# Prior if we think that Bt is more likely to have a negative impact on non-target species and we have strong prior evidence for that
m7gathmann <- brm(thysan ~ gen, data = gathmann.bt,
prior = prior(normal(-5, 1), class = b),
seed = 838, file = 'fits/m7gathmann')
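As an aside, if you ever need to override the sampler defaults mentioned in the note above, you can pass chains, warmup, and iter directly to brm(). Here is a sketch; the numbers and the name m_custom are arbitrary illustrations, not part of this lesson's analysis.

# Illustrative only: 4 chains, 2000 warmup and 2000 sampling iterations each
# (iter counts both warmup and sampling).
m_custom <- brm(thysan ~ gen, data = gathmann.bt,
                chains = 4, warmup = 2000, iter = 4000,
                seed = 838, file = 'fits/m_custom')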
We can pull out the posterior estimates for the difference between the means, and plot the median of the posterior with 95% credible interval.
get_post <- function(fit) {
fixedeff <- summary(fit)$fixed
setNames(fixedeff['genBt', c('Estimate','l-95% CI', 'u-95% CI')], c('median', 'lower95', 'upper95'))
}
gathmann_post_estimates <- cbind(
prior = c('flat', 'normal(0,5)', 'normal(0,1)', 'normal(5,5)', 'normal(5,1)', 'normal(-5,5)', 'normal(-5,1)'),
map_dfr(list(m1gathmann, m2gathmann, m3gathmann, m4gathmann, m5gathmann, m6gathmann, m7gathmann), get_post)
) %>%
mutate(prior = factor(prior, levels = unique(prior)))
ggplot(gathmann_post_estimates, aes(x = prior, y = median, ymin = lower95, ymax = upper95)) +
geom_pointrange() +
geom_hline(yintercept = 0, linetype = 'dashed', linewidth = 1) +
labs(y = 'effect of Bt on Thysanoptera abundance')
These are interesting results. We see that assuming a normal distribution on the prior with a standard deviation of 5 leads to very similar inference as a completely flat prior. This is true whether we set the mean of the normal to 0, +5, or -5. The informative prior \(\text{Normal}(5, 1)\) is similar but with a narrower uncertainty interval. However, if we set an informative prior with mean 0, \(\text{Normal}(0, 1)\), we are effectively saying that we believe the probability of Bt corn having either a strong positive or a strong negative effect is low a priori. This is reflected in our posterior. The point estimate is much closer to 0, and the 95% credible interval covers a lot of territory on either side of 0. This is even more of an issue with \(\text{Normal}(-5, 1)\) where the point estimate is negative and the 95% credible interval is completely negative.
So that we can roughly compare our results to the frequentist analysis, let’s calculate a Bayesian analogue of a p-value for each of these models. This is the “maximum a posteriori” or MAP p-value that we discussed in Lesson 1.
gathmann_pvalues <- data.frame(
prior = c('flat', 'normal(0,5)', 'normal(0,1)', 'normal(5,5)', 'normal(5,1)', 'normal(-5,5)', 'normal(-5,1)'),
pMAP = c(p_map(m1gathmann)$p_MAP[2], p_map(m2gathmann)$p_MAP[2], p_map(m3gathmann)$p_MAP[2], p_map(m4gathmann)$p_MAP[2], p_map(m5gathmann)$p_MAP[2], p_map(m6gathmann)$p_MAP[2], p_map(m7gathmann)$p_MAP[2])
)
| prior | pMAP |
|---|---|
| flat | 0.009 |
| normal(0,5) | 0.029 |
| normal(0,1) | 0.583 |
| normal(5,5) | 0.013 |
| normal(5,1) | < 0.001 |
| normal(-5,5) | 0.098 |
| normal(-5,1) | < 0.001 |
To me, the conclusion here is that the prior distribution can influence our inference if it is a very narrow distribution, especially in this fairly extreme case where we are trying to draw big conclusions from a tiny dataset. Many people would say that \(\text{Normal}(0, 1)\) is an uninformative prior because it is agnostic about whether Bt will have a positive or negative effect. It is saying there’s a fifty-fifty chance that Bt will have a positive impact. However, because it is such a narrow distribution, it is not giving very much probability to either extreme negative or extreme positive effects of Bt. So in fact, it is actually fairly informative! Now, this isn’t to say that the inference we get from the model with the narrow prior is wrong. Maybe \(n=8\) is not enough to make much of a conclusion, and a skeptical prior with low probability of extreme effects is actually a good thing. Just because a Bayesian result does not match a frequentist result does not mean it is wrong. But to revisit another point made above, we should be wary of presenting one of the above results as if it is the only correct one. The ideal thing would be to document the influence of the prior and let the reader decide.
I hope the exercise on priors has gotten you back into the swing of things, thinking like a Bayesian. So far, in all of Lesson 1 and for the beginning of Lesson 2, we have only fit fairly simple models. For those situations, Bayesian and frequentist methods work equally well. Now we are going to start to explore some models where Bayesian methods really shine in comparison to frequentist methods.
So far, we’ve covered general linear mixed models. Whether we’re in Bayesian-land or not, those models have the following basic form.
\[y \sim \text{Normal}(Xb + Zu, \sigma), u \sim \text{Normal}(0, \sigma_u)\]
Hey, stop talking like a statistician! What does that actually mean? Basically, the response (y) is a function of fixed effects (X is the predictors and b are the fixed effect parameters) + random effects (Z is a model matrix and u are the random effect parameters), with normally distributed noise around it with standard deviation \(\sigma\). We also model the random effects as being normally distributed, with a mean of zero and a standard deviation \(\sigma_u\). In other words, some of the random effects are positive and some are negative; a few are far from zero but most are close to zero, and the same goes for the residuals. This model is an amazingly powerful tool and we should all be thankful to the statisticians who developed it.
But not all datasets fit neatly into this little box. A lot of y variables that we might meet out in the wild can’t be “forced” to have normally distributed residuals, no matter what parameters you estimate for the fixed and random effects. For example, what if y is a binary variable with only 0 and 1 values? Or what if it is a highly skewed concentration variable with many values at or near zero, and a few high positive values? In that case, you are not going to be able to get residuals that are normally distributed, and centered at zero, with about half of them positive and half of them negative.
For example, if you have positive skewed concentration data, the residuals are going to be positive skewed as well. The high positive values in the “tail” of the data are going to have high positive residuals too. Or for the binary data, the residuals are going to tend to have two peaks, clustering around 0 and 1. That doesn’t seem very normal to me.
Another issue is that the predictions made from a general linear mixed model fit to data that don’t meet the normally distributed residuals assumption often don’t make sense. For example if you predict from the model fit to the binary data, you will get predicted probabilities greater than 1 or less than 0. You will get negative values if you predict from the model fit to the skewed data. If your goal is prediction, and not just inference, this is a problem. Even if you only care about inference, it’s still a bad sign if your model makes bad predictions … because it probably doesn’t fit the data well … which you need to make good inferences.
By the way you may notice I am only considering the assumption of normally distributed residuals, not questioning the assumption of normally distributed random effects. That’s because it’s a lot more challenging to relax the assumption of normally distributed random effects and it is rarely done in practice.
The “old school” way of dealing with this issue was to transform the data. Sometimes people would try dozens of transformations, like alchemists trying all sorts of mixtures to see if they could make lead into gold. This is still something that people do routinely. For the most part, I avoid doing this. The one data transformation that I still commonly use is the log transformation, because many processes and phenomena in agricultural science make sense to compare on a ratio scale, and the difference between the logs of two variables is the same as the ratio of those variables on their original scale. So we don’t have to throw log transformations out just yet. But transformations like arcsine-square-root definitely belong in the past! Why is that? Because of something called generalized linear models!
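As a quick illustration of that ratio-scale property, you can check in the console that the difference of two logs equals the log of their ratio (the numbers here are arbitrary):

a <- 12
b <- 4
log(a) - log(b)       # difference on the log scale ...
log(a / b)            # ... equals the log of the ratio on the original scale
exp(log(a) - log(b))  # back-transforming recovers the ratio itself: 3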
Generalized linear mixed models look like this:
\[g(y) \sim \text{Distribution}(Xb + Zu | \theta), u \sim \text{Normal}(0, \sigma_u)\]
Looks just like the general linear mixed model above, with two differences. First is that the distribution doesn’t have to be normal. It can be any of a wide variety of distributions which may have different parameters (represented by \(\theta\)). The second is that instead of modeling y as a function of fixed effects, random effects, and residuals, we are now modeling g(y). What’s g()? It is a link function. That’s a function that connects the expected value of the response distribution to the linear combination of fixed and random effects. Its inverse converts predictions from the linear predictor scale back into the scale of the data.
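For a concrete feel for what g() and its inverse do, you can poke at a family object in base R. This sketch assumes the logit link we will use later in this lesson:

fam <- binomial(link = 'logit')
fam$linkfun(0.75)      # g(): maps a mean on the data scale to the linear predictor scale (1.0986)
fam$linkinv(1.098612)  # inverse of g(): maps back to the data scale (~0.75)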
I won’t go that much more into detail here because this is more of an applied lesson to show you how to fit the models. But I would strongly encourage you to read about the theory behind GLMMs. There are quite a few non-technical and accessible texts on the subject. See my statistical learning resources page.
A generalized linear mixed model is just what it sounds like: a model that’s generalized (allows for response distributions that aren’t normal), linear (Y is modeled as a bunch of effects added together) and mixed (it includes both fixed and random effects). Let’s build it up little by little. First we’ll fit a model with two of the three: a generalized linear model. It has a non-normal distribution but for now we will only deal with fixed effects and not random effects.
In this lesson, we are going to use example data from the R package agridat, just as we did in Lesson 1.
It is a very simple dataset published by Crowder in 1978. There’s a parasitic plant called Orobanche aegyptiaca that can’t photosynthesize on its own. It leeches energy from the roots of other plants. Because it can’t survive without a host plant, it needs to wait until a suitable host is nearby to germinate. The seed can detect compounds given off by the host plant in the soil, which stimulate it to germinate.
You can type ?crowder.seeds in the console to pull up the documentation for the dataset.
The dataset comes from a 2×2 factorial experiment. There were two Orobanche genotypes (called 'O73' and 'O75' in the dataset) that were each treated with two different extracts: bean extract and cucumber extract. The experimenter set up a lot of plates; each plate had a variable number of seeds from one of the treatment combinations. The plates were laid out in a completely randomized design, so there is no blocking factor to account for. So no random effects are needed (at least for now).
For each plate we have a value for n, the total number of seeds on the plate, and germ, the number of seeds that germinated. Let’s compute a column for the proportion of seeds that germinated and call it p. Then let’s plot the distribution of p across the genotype by extract combinations.
data(crowder.seeds)
crowder.seeds <- mutate(crowder.seeds, p = germ/n)
ggplot(crowder.seeds, aes(x = gen, y = p, color = extract)) +
geom_boxplot() + colorblind_friendly
It seems like cucumber extract promotes germination more than bean extract. It looks like there is an interaction because the difference between bean and cucumber is greater for one genotype than the other.
What many people do in this case is fit a linear model directly to the proportion of germinated seeds (I see ARS scientists doing this all the time). This is often okay because if you have proportions anywhere between 0.05 and 0.95, the assumption of normally distributed error works out pretty well. Let’s try to do that here. We will put a sensible normal prior on the fixed effects. Because the response variable varies between 0 and 1, a normal prior on a categorical predictor’s coefficient with standard deviation 1 will encompass any possible effect size. Again we’re using all defaults for Monte Carlo sampling options.
crowder_lm <- brm(p ~ gen * extract,
prior = c(
prior(normal(0, 1), class = b)
),
data = crowder.seeds,
seed = 111, file = 'fits/crowder_lm')
We’ll look at this model’s predictions in a moment.
Let’s also take a look at the generalized linear model approach.
Here, our outcome is binary (a seed either germinates or it doesn’t). So, instead of the response variable being the proportion of germination, it needs to be specified with two parts: the number of successes (germinated seeds in each dish) and the total number of trials (all seeds in each dish, including the ones that didn’t germinate). In brms we write this as germ | trials(n).
The other new thing about the generalized linear model is the family argument, which takes the form family = &lt;distribution function&gt;(link = '&lt;link function&gt;'). This tells brm() what the response distribution is, and what the link function will be that determines the relationship between the observed data scale and the model prediction scale. Here we are going to use family = binomial(link = 'logit'). The binomial distribution describes the number of “successes” out of a fixed number of independent trials, where each trial has two possible outcomes (here, each seed either germinates or it doesn’t). But what is that “logit” thing?
Logit is the link function. Why do we need it? Well, our linear combination of covariates in a linear model can take any value, positive or negative. But we are trying to model the probability of germination, which can only be between 0 and 1. The link function is used to transform the unbounded value to a scale that ranges from 0 to 1. Logit is another word for the log-odds function, defined as
\[\text{logit}~p = \log \frac {p}{1-p}\]
Here is a graph of that function:
You can see that it maps values ranging from 0 to 1 to values ranging from negative to positive infinity. It is more or less a straight line when the probability is between about 0.25 and 0.75, but starts to get really steep as you get close to the boundaries. This makes sense because when an event has an extremely low (or high) chance of happening, a small absolute change in its probability represents a relatively much bigger change.
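If you want to reproduce a similar graph yourself, base R’s qlogis() is the logit function; this is just a quick sketch:

curve(qlogis(x), from = 0.001, to = 0.999,
      xlab = 'probability', ylab = 'logit (log odds)')
abline(v = c(0.25, 0.75), lty = 2)  # roughly where the curve stops being linear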
Putting it all together, here’s the code for the generalized linear model. As mentioned above, the differences are that we specify the number of successes out of the total number of trials as the response variable (germ | trials(n)), and the response distribution and link function as family = binomial(link = 'logit'). A normal(0, 5) prior is sensible for fixed effects on the log odds scale. Again we will use all default model fitting options.
crowder_glm <- brm(germ | trials(n) ~ gen * extract,
family = binomial(link = 'logit'),
prior = c(
prior(normal(0, 5), class = b)
),
data = crowder.seeds,
seed = 112, file = 'fits/crowder_glm')
Here are the posterior predictive check plots for each of these models. What do you notice about them?
pp_check(crowder_lm)
pp_check(crowder_glm)
Both posterior predictive check plots look reasonably good. The posterior draws from the model (thin blue lines) have a similar shape to the distribution of the observed data (thick black line). The two plots are different because the first model is modeling a proportion, so it should have values between 0 and 1, while the second model is modeling the number of germinated seeds, which is a positive whole number.
One thing to note is that the model predictions from the first model go below 0 and above 1. That’s because the general linear model assumes normally distributed error, so there is nothing stopping it from making nonsensical predictions of a probability of germination that is either negative or greater than 100%. The generalized linear model doesn’t have this issue.
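You can verify this directly from the fitted linear model. Here is a quick sketch using posterior_predict(), the brms function that simulates new observations from the fitted model:

pred_lm <- posterior_predict(crowder_lm)  # matrix of posterior draws x observations
mean(pred_lm < 0 | pred_lm > 1)           # proportion of impossible "probabilities"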
Now calculate the estimated marginal means for the linear model. Here the syntax pairwise ~ extract | gen means: within each level of genotype (gen), estimate the predicted mean for each level of extract and take the pairwise contrast between them.
emm_crowder_lm <- emmeans(crowder_lm, pairwise ~ extract | gen)
emm_crowder_lm$emmeans
## gen = O73:
## extract emmean lower.HPD upper.HPD
## bean 0.326 0.182 0.467
## cucumber 0.467 0.319 0.596
##
## gen = O75:
## extract emmean lower.HPD upper.HPD
## bean 0.374 0.238 0.511
## cucumber 0.715 0.594 0.840
##
## Point estimate displayed: median
## HPD interval probability: 0.95
emm_crowder_lm$contrasts
## gen = O73:
## contrast estimate lower.HPD upper.HPD
## bean - cucumber -0.141 -0.333 0.0638
##
## gen = O75:
## contrast estimate lower.HPD upper.HPD
## bean - cucumber -0.340 -0.530 -0.1641
##
## Point estimate displayed: median
## HPD interval probability: 0.95
You can see that the pairwise contrasts are taken by subtracting one modeled proportion from the other. What do we get when we estimate the marginal means in the same way for the generalized linear model?
emm_crowder_glm <- emmeans(crowder_glm, pairwise ~ extract | gen)
emm_crowder_glm$emmeans
## gen = O73:
## extract emmean lower.HPD upper.HPD
## bean -0.412 -0.771 -0.0568
## cucumber 0.130 -0.201 0.4625
##
## gen = O75:
## extract emmean lower.HPD upper.HPD
## bean -0.560 -0.827 -0.3363
## cucumber 0.766 0.524 1.0144
##
## Point estimate displayed: median
## Results are given on the logit (not the response) scale.
## HPD interval probability: 0.95
emm_crowder_glm$contrasts
## gen = O73:
## contrast estimate lower.HPD upper.HPD
## bean - cucumber -0.548 -1.02 -0.0401
##
## gen = O75:
## contrast estimate lower.HPD upper.HPD
## bean - cucumber -1.329 -1.67 -0.9808
##
## Point estimate displayed: median
## Results are given on the log odds ratio (not the response) scale.
## HPD interval probability: 0.95
You can see that the estimated marginal means are given on the logit scale, and the contrasts on the log odds ratio scale. This is sort of hard to interpret. So you can use the argument type = 'response', which transforms the results back to the data scale.

The logit link function in R is qlogis() and its inverse is plogis().
emmresponse_crowder_glm <- emmeans(crowder_glm, pairwise ~ extract | gen, type = 'response')
emmresponse_crowder_glm$emmeans
## gen = O73:
## extract prob lower.HPD upper.HPD
## bean 0.398 0.316 0.486
## cucumber 0.532 0.450 0.614
##
## gen = O75:
## extract prob lower.HPD upper.HPD
## bean 0.364 0.304 0.417
## cucumber 0.683 0.628 0.734
##
## Point estimate displayed: median
## Results are back-transformed from the logit scale
## HPD interval probability: 0.95
emmresponse_crowder_glm$contrasts
## gen = O73:
## contrast odds.ratio lower.HPD upper.HPD
## bean / cucumber 0.578 0.330 0.910
##
## gen = O75:
## contrast odds.ratio lower.HPD upper.HPD
## bean / cucumber 0.265 0.181 0.366
##
## Point estimate displayed: median
## Results are back-transformed from the log odds ratio scale
## HPD interval probability: 0.95
Now, the estimated marginal means and their credible intervals are given as probabilities. The contrasts are now odds ratios. This is the ratio of the odds of germination between the two levels of extract. An odds ratio of 1 indicates no difference between the odds. Because the ratio is bean / cucumber and the values are &lt;1, we have evidence that bean extract is less effective at stimulating germination than cucumber extract in both genotypes.
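To unpack what an odds ratio means, here is the arithmetic by hand for genotype O75, using the point estimates from the output above:

p_bean <- 0.364      # modeled germination probability, bean extract
p_cucumber <- 0.683  # modeled germination probability, cucumber extract
odds_bean <- p_bean / (1 - p_bean)
odds_cucumber <- p_cucumber / (1 - p_cucumber)
odds_bean / odds_cucumber  # ~0.27, matching the 0.265 contrast above up to rounding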
Let’s try to make an inference about the difference between the means. The function describe_posterior() is a workhorse from the bayestestR package. The posteriors we want to describe here are the posterior distributions of the differences between the means of the extract treatments within each genotype for the general linear model, and the odds ratios of the means for the generalized linear model. We have 1000 sampling iterations from each of 4 chains, thus for each difference we have 4000 draws from its posterior distribution.

By default describe_posterior() gives us the median of the 4000 values as the point estimate to characterize the distribution, and we specify that we want an equal-tailed credible interval using ci_method = 'eti'. The default width is 95%, so this will give us the 2.5% and 97.5% quantiles of the 4000 values. We also specify that we want two different null hypothesis tests (described in more detail back in Lesson 1). The p_map is the ratio of the probability of the null value to the most probable posterior value (point estimate); the lower, the more evidence in favor of a non-null difference. The pd is the percent of the posterior distribution that has the same direction (sign) as the point estimate; the closer to 100%, the more evidence in favor of a non-null difference.

Notice that we have to specify that the null value for the odds ratio contrasts is 1, because a ratio of 1 indicates no difference. The default, null = 0, is nonsensical because you cannot have a ratio of 0.
describe_posterior(emm_crowder_lm$contrasts, ci_method = 'eti', test = c('pd', 'p_map'))
## Summary of Posterior Distribution
##
## Parameter | Median | 95% CI | p (MAP) | pd
## ----------------------------------------------------------------
## bean - cucumber O73 | -0.14 | [-0.34, 0.06] | 0.297 | 92.95%
## bean - cucumber O75 | -0.34 | [-0.52, -0.16] | 0.003 | 99.83%
describe_posterior(emmresponse_crowder_glm$contrasts, ci_method = 'eti', test = c('pd', 'p_map'), null = 1)
## Summary of Posterior Distribution
##
## Parameter | Median | 95% CI | p (MAP) | pd
## ------------------------------------------------------------
## bean/cucumber O73 | 0.58 | [0.36, 0.95] | 0.062 | 98.42%
## bean/cucumber O75 | 0.26 | [0.19, 0.38] | < .001 | 100%
You can see that the generalized linear model provides stronger evidence in favor of differences. You could say it has “higher power.” This is because it is taking into account the greater precision we can get from having many partially independent binary outcomes (seeds) in each dish. In contrast, the plain old linear model is ignoring that. If your response variable is a proportion, then a dish with 1000 seeds where 200 germinated (20%) will provide as much evidence as a dish with 10 seeds where 2 germinated. But in the generalized linear model, the dish with more seeds will, understandably, contribute more evidence.
What if we wanted to make a statistical inference about the interaction between extract and genotype? Is the difference between the response to bean and cucumber extract different between the two genotypes? We can call the contrast() function and specify two levels of interaction contrast. This will first contrast the extracts within each genotype, then it will contrast those two contrasts with each other.
emmeans(crowder_glm, ~ extract + gen, type = 'response') %>%
contrast(interaction = c('revpairwise', 'pairwise')) %>%
describe_posterior(ci_method = 'eti', test = c('pd', 'p_map'))
## Summary of Posterior Distribution
##
## Parameter | Median | 95% CI | p (MAP) | pd
## --------------------------------------------------------------
## cucumber/bean O73/O75 | 0.46 | [0.25, 0.83] | < .001 | 100%
The ratio of the odds ratios is 0.46, indicating that our point estimate of the difference between extract responses in O73 is about half as big as for O75. There’s a 95% chance that the true ratio is between 0.25 and 0.83, and the evidence is in favor of a non-null ratio.
Now, because this is an applied stats lesson, in addition to showing you what statistical model is appropriate and how to fit it, I also want to give you the tools to go through all the steps of the “data to doc pipeline.” How would you report the results of this analysis in a manuscript? Let’s go ahead and create:
In your methods section, a description of the statistical model (excluding citations for methods and software) might be:
We fit a generalized linear model that modeled the germination probability for each dish as a function of extract type, genotype, and their interaction. The model assumed a binomial response distribution and logit link function. The model was fit with Hamiltonian Monte Carlo. Four chains were run, each with 1000 warmup iterations and 1000 sampling iterations. We verified model convergence by ensuring the R-hat statistics for all parameters were below 1.01. We assessed model fit by examining the posterior predictive check plot. We obtained posterior estimates of the marginal mean for each combination of extract and genotype, and of the odds ratio between each extract type within each genotype. We calculated the median and 95% equal-tailed credible interval for each estimated mean and odds ratio. To assess evidence for differences between extracts, we calculated the maximum a posteriori p-value (pMAP) and probability of direction (pd).
In the text part of your results section, you might say:
The modeled probability of germination was greater when seeds were treated with cucumber extract versus bean extract for both the O73 genotype (bean extract: posterior median 0.396, 95% CI [0.316, 0.478]; cucumber extract: 0.531 [0.458, 0.614]) and the O75 genotype (bean extract: 0.364 [0.305, 0.421]; cucumber extract: 0.681 [0.629, 0.734]). The odds ratio between cucumber and bean extract was smaller for the O73 genotype than the O75 genotype (ratio 0.46 [0.25, 0.83], pMAP < 0.001), indicating a greater relative effect of cucumber extract on germination probability in the O75 genotype.
A figure might show the median and several different credible intervals of the distributions of posterior means for each treatment combination, to give a fuller picture of the distribution. This requires using the tidybayes package to pull out the individual posterior draws from the emmeans object with gather_emmeans_draws(), and then using stat_interval() to simultaneously compute our desired credible intervals and plot them. Notice we also need to transform the means from the logit scale to the probability scale using mutate(p = plogis(.value)), because when we pull out the values from the emmeans object, the back-transformation information is lost.
gather_emmeans_draws(emmresponse_crowder_glm$emmeans) %>%
mutate(p = plogis(.value)) %>%
ggplot(aes(x = gen, y = p, group = interaction(gen, extract))) +
stat_interval(aes(color = extract, color_ramp = after_stat(level)), .width = c(0.5, 0.8, 0.95), position = position_dodge(width = .3)) +
stat_summary(fun = 'median', geom = 'point', size = 2, position = position_dodge(width = .3)) +
colorblind_friendly +
labs(x = 'Genotype', y = 'Modeled germination probability', color_ramp = 'CI width', color = 'Extract')
Finally, a table could present the medians (95% CI) of the means and the odds ratios for the two extracts within each genotype, which requires a little rearranging to get into a compact wide format. I won’t go into too much detail on the code. Focus on the median_qi() function, also from tidybayes. It summarizes posterior draws with a median point estimate and a quantile credible interval of a given width. Also notice that we need to back-transform the means with plogis() to go from logit scale to probability scale, and back-transform the contrasts with exp() to go from log odds ratio scale to odds ratio scale.
means_summary <- gather_emmeans_draws(emmresponse_crowder_glm$emmeans) %>%
mutate(.value = plogis(.value)) %>%
median_qi(.width = 0.95) %>%
pivot_wider(id_cols = gen, names_from = extract, values_from = c(.value, .lower, .upper))
contrasts_summary <- gather_emmeans_draws(emmresponse_crowder_glm$contrasts) %>%
mutate(.value = exp(.value)) %>%
median_qi(.width = 0.95)
left_join(means_summary, contrasts_summary) %>%
select(-c(contrast, .width, .point, .interval)) %>%
gt(caption = 'Germination probability') %>%
fmt_number(decimals = 3) %>%
cols_merge(c(.value_bean, .lower_bean, .upper_bean), pattern = '{1} [{2}, {3}]') %>%
cols_merge(c(.value_cucumber, .lower_cucumber, .upper_cucumber), pattern = '{1} [{2}, {3}]') %>%
cols_merge(c(.value, .lower, .upper), pattern = '{1} [{2}, {3}]') %>%
cols_label(gen = 'Genotype', .value_bean = 'Bean extract', .value_cucumber = 'Cucumber extract', .value = 'Bean:cucumber odds ratio')
| Genotype | Bean extract | Cucumber extract | Bean:cucumber odds ratio |
|---|---|---|---|
| O73 | 0.398 [0.314, 0.484] | 0.532 [0.449, 0.613] | 0.578 [0.357, 0.955] |
| O75 | 0.364 [0.308, 0.421] | 0.683 [0.627, 0.733] | 0.265 [0.188, 0.376] |
In Lesson 1 we fit a linear mixed model (LMM), which allowed for a mix of fixed treatment effects and random block effects, but assumed a normal response distribution. And just now we fit a generalized linear model (GLM) which allowed for a non-normal response distribution but did not include random effects. Now let’s take things to the next level with a GLMM. This powerful class of models can accommodate a wide variety of distributions as well as random effects. When it comes to estimating the parameters of GLMMs, Bayesian models really can’t be beat. Classical frequentist estimation algorithms often fail on datasets where the Bayesian approach can effortlessly get you an answer.
The model syntax for fitting GLMMs in brms won’t provide any big surprises for you. You have already seen how to specify a response family and link function, and how to include random terms on the right-hand side of the model formula. You’ll need to do both those things here.
We are going to use another example dataset from the agridat package, mcconway.turnip, the results of an experiment on turnips originally published by McConway et al. This dataset does not have a binary outcome; it has a continuous outcome (yield). This variable has a distribution that’s extremely common in agricultural datasets. Let’s take a look.
data(mcconway.turnip)
ggplot(mcconway.turnip, aes(x = yield)) + geom_histogram()
The distribution is far from symmetrical. We observe that yield cannot be a negative value; it has to be zero or greater. We see that there are many low values of yield, peaking close to zero, with a long tail of high values stretching up almost to 25. This kind of “positive right-skewed” data is all over the place in ag science. Here we will look at what kind of response family is appropriate in a GLMM fit to this kind of data.
Let’s also look at what kind of experimental design was used here. It is a 2×2×4 factorial experiment, with three factors: genotype (gen, either "Barkant" or "Marco"), planting date (date, either "21Aug1990" or "28Aug1990"), and planting density (density, 1, 2, 4, or 8 kg/ha). All 16 combinations are represented in the data. It is a randomized complete block design with four blocks. Because the blocks are complete, each block has 16 plots, one for each treatment combination. A single measurement of yield was taken from each plot.
We will treat density as a discrete variable with four levels, instead of worrying about whether the effect of planting density on yield is linear.
mcconway.turnip <- mutate(mcconway.turnip, density = factor(density))
Make a boxplot of the yield values in each treatment combination. Split the genotypes into separate panels using facet_wrap(), arrange the boxplots in increasing order of density on the x-axis, and show the planting dates for each density with a different color box.
ggplot(mcconway.turnip, aes(x = density, y = yield, group = interaction(density, date), color = date)) +
facet_wrap(~ gen) +
geom_boxplot() +
colorblind_friendly +
labs(x = 'Planting density (kg/ha)', y = 'Yield', color = 'Planting date')
Looking at the data, we can see that there is a clear interaction between planting date and density. At low density, yield is low regardless of planting date. At higher densities, yield is still low for the early planting date but high for the later one. There are not huge differences between the genotypes, but it does seem that Barkant has a bigger difference between the planting dates than Marco.
Another thing to notice is that there’s a relationship between the mean and the variance. The treatment combinations with lower values also have a narrower distribution. This is a problem for a general linear model, because a general linear model only has one standard deviation for all the residuals. If there is greater variance in some groups than others, that can lead to biased inference and inaccurate predictions.
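You can make that mean-variance relationship concrete with a quick summary. This is just a sketch using dplyr (loaded as part of the tidyverse):

# Group-level SDs tend to rise along with the group means.
mcconway.turnip %>%
  group_by(gen, date, density) %>%
  summarise(mean_yield = mean(yield), sd_yield = sd(yield), .groups = 'drop') %>%
  arrange(mean_yield)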
Let’s start by pretending we don’t care about any of that mean-variance business. Just fit a simple general linear mixed model to the data. We will include all main effects, two-way interactions, and the three-way interaction between genotype, planting date, and density using the syntax gen * date * density on the right-hand side of the model formula. To that, we will add (1 | block), which means that we want a random intercept for each block. This means the model will fit a separate intercept for each block and a variance parameter to model how much the blocks systematically vary from one another. As yield varies between 0 and 25 we will use normal(0, 5) as a prior. Once again, we’ll use all default MC sampling options.
mcconway_lmm <- brm(yield ~ gen * date * density + (1 | block),
prior = c(
prior(normal(0, 5), class = b)
),
data = mcconway.turnip,
seed = 113, file = 'fits/mcconway_lmm')
Note: In this model and the following ones, we get a warning about a few divergent transitions. They are few enough that we can ignore the warning for the purposes of this lesson. Divergent transitions can occur when you are fitting models to small datasets, and they may indicate that the posterior is being inadequately sampled. To deal with them you can either set narrower priors or tweak the sampler arguments so that the sampler makes smaller steps with each iteration; this is “safer” because it reduces the chance of divergent transitions, but it can make the sampling slower.
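If you did want to address the divergences rather than ignore them, the usual first step is to raise adapt_delta, which makes the sampler take smaller steps. Here is a sketch; the name mcconway_lmm_safer is hypothetical, and everything else matches the model above.

mcconway_lmm_safer <- brm(yield ~ gen * date * density + (1 | block),
                          prior = c(
                            prior(normal(0, 5), class = b)
                          ),
                          control = list(adapt_delta = 0.95),  # default is 0.8
                          data = mcconway.turnip,
                          seed = 113, file = 'fits/mcconway_lmm_safer')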
What does the posterior predictive check plot show?
pp_check(mcconway_lmm)
Clearly, our model is missing something. First of all, the draws from the posterior predictive (thin blue lines) clearly have a very different shape from the distribution of the observed data (thick black line). They are basically symmetrical and are not capturing the skewed shape of the data at all. Second, they are producing a lot of nonsensical values. A good portion of the distributions are < 0, meaning the fitted yield values for a lot of the data points are negative. As we discussed, this is not possible. You can also see the problem if you calculate estimated marginal means and look at the credible intervals, which we do here for density averaged over the other treatments:
emmeans(mcconway_lmm, ~ density)
## density emmean lower.HPD upper.HPD
## 1 2.08 -0.940 4.68
## 2 3.08 0.153 5.89
## 4 7.05 4.075 9.78
## 8 8.40 5.672 11.30
##
## Results are averaged over the levels of: gen, date
## Point estimate displayed: median
## HPD interval probability: 0.95
The 95% credible interval for planting density 1 includes negative values. We are saying that there is a substantial probability that the yield at planting density 1 is negative, which is nonsense. So we definitely have to do something to make the model fit the data better.

What can we do? Well, the “old school” way, that lots of people, including me, still do, is to log-transform the response variable. Usually, this is helpful for positive, right-skewed data in three ways. First, it makes the distribution more symmetrical and “normal-ish.” Second, it ensures that all back-transformed estimates are positive, because you invert the log by exponentiating, and \(e^x > 0\). Finally, it stabilizes the variance, meaning that it flattens out the relationship between the mean and the variance. To fit a model with a log-transformed response variable, we just need to put log() around the response variable on the left-hand side of the model formula. We also use a normal(0, 2) prior because yield only varies by about 2 to 3 units on a log scale.
mcconway_loglmm <- brm(log(yield) ~ gen * date * density + (1 | block),
prior = c(
prior(normal(0, 2), class = b)
),
data = mcconway.turnip,
seed = 114, file = 'fits/mcconway_loglmm')
Take a look at the posterior predictive check plot.
pp_check(mcconway_loglmm)
The model assuming normal error is doing a better job capturing the shape of the log-transformed data than it did with the untransformed data.
The model fit to the log-transformed response variable would probably be satisfactory in most cases. The transformation makes the data less skewed and deals with the unequal variance. However, when you transform data to a different scale, fit a model, and back-transform the model estimates back to the original scale, that introduces bias. It’s better to keep the data on the original scale and pick a distribution that really matches the data. Well, we already saw that the normal distribution doesn’t do the trick. For positive, right-skewed data a good go-to option is the Gamma distribution. It accommodates positive right-skewed data and has the added bonus of having a variance that’s proportional to the mean. So it will be able to deal with the unequal variance we see among groups in our data.
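To see why the gamma handles the unequal variance, note that a gamma distribution with a fixed shape parameter has variance equal to the squared mean divided by the shape, so the standard deviation grows in proportion to the mean. A quick numeric check with arbitrary illustrative values:

shape <- 2
means <- c(2, 5, 10)
sds <- sqrt(means^2 / shape)  # SD is proportional to the mean for fixed shape
rbind(mean = means, sd = round(sds, 2))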
Here are some shapes that the gamma distribution can take. You can see that they are asymmetrical and always positive (the x-axis does not go below 0). Some of them look a lot like the skewed shape of our yield distribution.
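You can reproduce curves like these with dgamma(); the shape and rate values here are arbitrary illustrations, not fitted to the yield data:

curve(dgamma(x, shape = 1, rate = 0.5), from = 0, to = 25, ylab = 'density')
curve(dgamma(x, shape = 2, rate = 0.5), add = TRUE, lty = 2)
curve(dgamma(x, shape = 5, rate = 0.5), add = TRUE, lty = 3)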
We can fit a GLMM with a gamma response distribution and log link function using family = Gamma(link = 'log').
mcconway_glmm <- brm(yield ~ gen * date * density + (1 | block),
family = Gamma(link = 'log'),
prior = c(
prior(normal(0, 2), class = b)
),
data = mcconway.turnip,
seed = 115, file = 'fits/mcconway_glmm')
The posterior predictive check shows that the model is faithful to the positive, right-skewed distribution of the data!
pp_check(mcconway_glmm)
With our gamma GLMM, we can do fun things like estimate the marginal means for each combination of planting density, planting date, and genotype. These means and their credible intervals don’t include the random effect of block; they are the expected value for a block that’s “right in the middle” of the distribution of variation across blocks. Note that type = 'response' gives the predictions on the data scale, not the log link scale.
emm_mcconway <- emmeans(mcconway_glmm, ~ density + date + gen, type = 'response')
emm_mcconway
## density date gen response lower.HPD upper.HPD
## 1 21Aug1990 Barkant 2.61 0.787 5.32
## 2 21Aug1990 Barkant 3.50 1.314 7.67
## 4 21Aug1990 Barkant 3.79 1.018 7.88
## 8 21Aug1990 Barkant 5.01 1.780 10.90
## 1 28Aug1990 Barkant 3.69 0.790 7.82
## 2 28Aug1990 Barkant 5.04 1.560 10.93
## 4 28Aug1990 Barkant 12.27 2.866 25.82
## 8 28Aug1990 Barkant 14.40 3.750 29.70
## 1 21Aug1990 Marco 1.33 0.466 2.93
## 2 21Aug1990 Marco 2.41 0.776 5.07
## 4 21Aug1990 Marco 3.36 0.837 7.02
## 8 21Aug1990 Marco 5.02 1.563 10.19
## 1 28Aug1990 Marco 1.28 0.334 2.65
## 2 28Aug1990 Marco 2.17 0.569 4.73
## 4 28Aug1990 Marco 7.14 1.959 15.21
## 8 28Aug1990 Marco 9.63 2.945 19.74
##
## Point estimate displayed: median
## Results are back-transformed from the log scale
## HPD interval probability: 0.95
Then we can take contrasts between any permutation of those that we want. For example, we can take pairwise contrasts between each density level within each combination of planting date and genotype.
contrast(emm_mcconway, 'revpairwise', by = c('date', 'gen'))
## date = 21Aug1990, gen = Barkant:
## contrast ratio lower.HPD upper.HPD
## density2 / density1 1.33 0.504 2.53
## density4 / density1 1.45 0.601 2.77
## density4 / density2 1.10 0.413 2.15
## density8 / density1 1.89 0.845 3.65
## density8 / density2 1.43 0.525 2.79
## density8 / density4 1.31 0.482 2.51
##
## date = 28Aug1990, gen = Barkant:
## contrast ratio lower.HPD upper.HPD
## density2 / density1 1.37 0.480 2.63
## density4 / density1 3.30 1.193 6.28
## density4 / density2 2.43 0.862 4.86
## density8 / density1 3.86 1.584 7.69
## density8 / density2 2.84 0.985 5.68
## density8 / density4 1.17 0.459 2.34
##
## date = 21Aug1990, gen = Marco:
## contrast ratio lower.HPD upper.HPD
## density2 / density1 1.81 0.689 3.55
## density4 / density1 2.52 0.873 5.01
## density4 / density2 1.39 0.520 2.88
## density8 / density1 3.74 1.301 7.16
## density8 / density2 2.07 0.731 4.26
## density8 / density4 1.50 0.525 2.96
##
## date = 28Aug1990, gen = Marco:
## contrast ratio lower.HPD upper.HPD
## density2 / density1 1.69 0.590 3.32
## density4 / density1 5.56 2.046 11.05
## density4 / density2 3.29 1.180 6.62
## density8 / density1 7.39 2.774 14.58
## density8 / density2 4.40 1.591 8.63
## density8 / density4 1.33 0.479 2.61
##
## Point estimate displayed: median
## Results are back-transformed from the log scale
## HPD interval probability: 0.95
The above results show strong statistical evidence for a positive relationship between planting density and yield for the later planting date, but much weaker evidence for the earlier planting date. The effect sizes tend to be greater for the Marco genotype than for the Barkant genotype. You could do interaction contrasts (contrasts of contrasts, as we did above for the germination example) to test whether the density effects are greater in Marco than Barkant:
contrast(emm_mcconway, 'revpairwise', by = 'date', interaction = 'revpairwise')
## date = 21Aug1990:
## density_revpairwise gen_revpairwise ratio lower.HPD upper.HPD
## 2 / 1 Marco / Barkant 1.39 0.358 3.16
## 4 / 1 Marco / Barkant 1.72 0.482 4.14
## 4 / 2 Marco / Barkant 1.26 0.267 3.27
## 8 / 1 Marco / Barkant 1.98 0.513 4.52
## 8 / 2 Marco / Barkant 1.46 0.315 3.59
## 8 / 4 Marco / Barkant 1.13 0.266 2.81
##
## date = 28Aug1990:
## density_revpairwise gen_revpairwise ratio lower.HPD upper.HPD
## 2 / 1 Marco / Barkant 1.23 0.269 3.17
## 4 / 1 Marco / Barkant 1.67 0.356 4.29
## 4 / 2 Marco / Barkant 1.37 0.247 3.63
## 8 / 1 Marco / Barkant 1.91 0.465 5.01
## 8 / 2 Marco / Barkant 1.55 0.291 4.07
## 8 / 4 Marco / Barkant 1.14 0.278 2.97
##
## Point estimate displayed: median
## Results are back-transformed from the log scale
## HPD interval probability: 0.95
We can see from the above that the model predicts Marco to have somewhat higher yield differences between planting density levels than Barkant, but the evidence is relatively weak (the 95% credible intervals indicate that there’s substantial probability of no difference, or even of the Marco differences being lower).
As we did above, let’s present this model and interpretation of the results, as we might in a manuscript, including:
A description of the statistical model in the methods section might look like this:
We fit a generalized linear mixed model that modeled turnip yield as a function of genotype, planting date, planting density (discrete variable with four levels), and their interactions. The model had a gamma response distribution with log link function. A random intercept was fit to each block. The model was fit with Hamiltonian Monte Carlo. Four chains were run, each with 1000 warmup iterations and 1000 sampling iterations. We verified model convergence by ensuring the R-hat statistics for all parameters were below 1.01. We assessed model fit by examining the posterior predictive check plot. We obtained posterior estimates of the marginal mean for each combination of genotype, planting date, and planting density. We also estimated the yield ratio between all pairs of density treatments within each combination of planting date and genotype. We calculated the median and 95% equal-tailed credible interval for each estimated mean and ratio.
A partial write-up in the results section, mainly focusing on the density findings, might look like this:
We found evidence that turnip yield was highest at the highest levels of planting density. The effect of density was highest for the later planting date. There was not strong evidence for a different effect of planting density between the genotypes; it was positive for both. Mean yield for density level 8 kg/ha at the later planting date was 14.66 (95% CI [4.28, 31.27]) for the Barkant genotype, versus 3.82 [1.14, 8.05] at 1 kg/ha; the ratio of yields between highest and lowest density levels was 3.84 [1.32, 7.60]. For the Marco genotype, the mean yield at later planting date was 9.80 [3.32, 21.01] for 8 kg/ha density and 1.31 [0.40, 2.82] for 1 kg/ha density; the yield ratio between the two density levels was 7.45 [2.63, 14.81].
Here is some code to make a nice plot of the estimated marginal means, along with 95% credible intervals. Note that exp() is used to back-transform the posterior estimates.
gather_emmeans_draws(emm_mcconway) %>%
mutate(.value = exp(.value)) %>%
ggplot(aes(x = density, y = .value, group = interaction(density, date))) +
stat_interval(aes(color = date, color_ramp = after_stat(level)), position = position_dodge(width = .5)) +
stat_summary(fun = 'median', geom = 'point', size = 2, position = position_dodge(width = .5)) +
facet_wrap(~ gen) +
colorblind_friendly +
labs(x = 'Planting density (kg/ha)', y = 'Modeled yield', color_ramp = 'CI width', color = 'Planting date')
Creating a table is left as an exercise.
We’ve made it to the end of Lesson 2! As always, let’s revisit the learning objectives to see what we’ve learned. You now have experience:
Along the way, you learned about a lot of R packages, first and foremost brms, but also emmeans, tidybayes, and bayestestR, among others. Give yourself a well-deserved round of applause!
Here are some resources I recommend if you are interested in learning more about GLMMs, specifically Bayesian GLMMs.
I also recommend going back to Lesson 1 and checking out the list of learning resources at the end of that lesson.
These exercises should give you some more practice with the concepts we learned in this lesson. They get progressively more difficult as you go.
After we fit the Gamma GLMM to the mcconway.turnip dataset, we calculated posterior estimates of the yield ratios between different planting density levels within each combination of planting date and genotype. Calculate “Bayesian p-values” associated with those ratios.
Refit the GLM model that we fit to the crowder.seeds dataset, but this time use a normal(0, 0.5) prior on the fixed effects. Does this influence the conclusions we draw? Why or why not?
For this exercise, we will use another example dataset from agridat. Load the johnson.blight dataset. Make a boxplot showing the distribution of the number of rainy days in April and May (rain.am, a continuous variable) for years that did and did not have potato blight (blight is a binary variable, 0 if no blight occurred and 1 if blight occurred). Fit an appropriate GLM to test whether the number of rainy days in April and May influences the probability of potato blight. Ignore all other columns for the purpose of this exercise. Inspect the model summary. What can you conclude?
Hint: Use the binomial family with logit link. Each row is only a single outcome, so the formula should begin with blight | trials(1).