IMPORTANT NOTE: In this lesson, the numerical results of the R and SAS code may no longer be identical. This is because of details in model specifications and differences in the fitting algorithms between R and SAS.
Conceptual learning objectives
During this workshop, you will …
Learn what generalized linear mixed models (GLMMs) are
Learn more about different predictions you can make from models
Learn about how to deal with more complex covariance structures
Practical skills
We will …
Import the data from a CSV file
Clean and reshape the data
Calculate some summary statistics and make some exploratory plots
Fit a generalized linear mixed-effects model with repeated measures error structure
Make plots and tables of results
How to follow along with this workshop
Slides and text version of lessons are online
usda-ree-ars.github.io/SEAStats
Fill in R code in the worksheet (replace ... with code)
You can always copy and paste code from text version of lesson if you fall behind
Load R packages
glmmTMB is a sophisticated GLMM fitting package that is more flexible than lme4
DHARMa lets us make residual diagnostic plots and tests for certain kinds of GLMM
cbpp (contagious bovine pleuropneumonia): dataset pre-loaded with the lme4 package.
15 zebu cattle herds in Ethiopia
For each herd, number of zebu cattle that developed the disease and the total number of cattle in the herd
Repeated for 4 time periods (period is a factor)
Import data
SAS
filename csvFile url "https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/cbpp.csv";
proc import datafile = csvFile out = cbpp replace dbms = csv; run;
data cbpp; set cbpp;
rate = incidence/size;
run;
R
data(cbpp)
Examine the data
glimpse(cbpp)
Exploratory plots
Plot the proportion incidence/size for each herd at each time period
SAS
proc sgplot data = cbpp noautolegend; styleattrs datacontrastcolors = (black) datalinepatterns = (solid); series x = period y = rate / group = herd;run;
R
ggplot(cbpp, aes(x = period, y = incidence/size, group = herd)) +geom_line() +theme_bw()
Fit model
Measurements from different time periods from the same herd are not independent
We must account for this with a linear mixed model
Response variable is the proportion of cattle suffering from CBPP, incidence/size,
Fixed effect is the discrete time period, period,
Random intercept for each herd, (1 | herd)
lmm_cbpp <-lmer(incidence/size ~ period + (1| herd), data = cbpp)
Model diagnostic plots look reasonably good for a small dataset
check_model(lmm_cbpp)
Modeled estimates of the mean and 95% confidence interval of proportion CBPP for each period.
emmeans(lmm_cbpp, ~ period)
What is wrong here?
We cannot assume normally distributed error around mean proportions
We have to use a generalized linear mixed model (GLMM)
Linear models have a linear predictor term (fixed + random effects) that can be any number, positive or negative
GLMM uses a link function to convert response to a scale that can also be any number
After making model predictions, we take the inverse of the link to interpret model means and confidence intervals on the original scale
GLMM for binary response variable
Here the response is binary (0 = animal is healthy, 1 = animal has CBPP)
We want to predict the probability a randomly chosen individual animal from a given herd at a given time will have CBPP
Probability ranges from 0 to 1, so it can’t have normal error (bell curve has no upper and lower bound on the \(x\) axis)
Formula for predicting the probability of disease for individual \(i\) at time period \(j\) in herd \(k\)
\(\hat{y}_{i,j,k}\): predicted probability of disease for individual \(i\) at time period \(j\) in herd \(k\)
\(\beta_0\): intercept
\(P_j\): fixed effect of time period \(j\)
\(u_k\): random intercept for each herd
What is “logit?”
That’s the link function!
Transforms the probability, which ranges from 0 to 1, to a scale that can take any value, positive or negative
logit function transforms from the probability scale to the log-odds scale:
\[\text{logit}~p = \log \frac {p}{1-p}\]
The logit function
Maps values ranging from 0 to 1 to values ranging from negative to positive infinity
Roughly a straight line when the probability is between about 0.25 and 0.75
Gets steeper closer to 0 and 1
If you have an event that has an extremely low (or high) chance of happening, a little increase (or decrease) to the probability will be a big difference in the odds
Fit binomial GLMM with logit link: SAS
proc glimmix data = cbpp; class herd period; model incidence/size = period / dist = binomial; random herd; weight size; lsmeans period / ilink cl diff oddsratio adjust = tukey;run;
Fit binomial GLMM with logit link: R
Use glmer() instead of lmer(), same formula as before
Now we use family argument, for the “family” of response distributions we are using
Logit link is the default but to be explicit we specify binomial(link = 'logit')
weights = size accounts for the fact that the herds are different sizes; bigger herd = bigger contribution to the estimate
glmm_cbpp <-glmer(incidence/size ~ period + (1| herd), weights = size, data = cbpp, family =binomial(link ='logit'))
Model diagnostics
The default linear mixed model diagnostics from check_model() are not relevant here but we still need to check for issues
R package DHARMa simulates residuals and tests them