R for SAS users: Lesson 3

What is this workshop?

For SAS users who want to learn R
SAS code is provided for comparison purposes
Lesson 3 in a series

IMPORTANT NOTE: In this lesson, the numerical results of the R and SAS code may no longer be identical. This is because of details in model specifications and differences in the fitting algorithms between R and SAS.

Conceptual learning objectives

During this workshop, you will …

Learn what generalized linear mixed models (GLMMs) are
Learn more about different predictions you can make from models
Learn about how to deal with more complex covariance structures

Practical skills

We will …

Import the data from a CSV file
Clean and reshape the data
Calculate some summary statistics and make some exploratory plots
Fit a generalized linear mixed-effects model with repeated measures error structure
Make plots and tables of results

How to follow along with this workshop

Slides and text version of lessons are online
- usda-ree-ars.github.io/SEAStats
Fill in R code in the worksheet (replace ... with code)
You can always copy and paste code from text version of lesson if you fall behind

Load R packages

glmmTMB is a sophisticated GLMM fitting package that is more flexible than lme4
DHARMa lets us make residual diagnostic plots and tests for certain kinds of GLMM

library(tidyverse)
library(lme4)
library(easystats)
library(emmeans)
library(multcomp)
library(glmmTMB)
library(DHARMa)

theme_set(theme_bw())

CBPP dataset

Zebu, photo by Scott Bauer, USDA-ARS

cbpp (contagious bovine pleuropneumonia): dataset pre-loaded with the lme4 package.
15 zebu cattle herds in Ethiopia
For each herd, number of zebu cattle that developed the disease and the total number of cattle in the herd
Repeated for 4 time periods (period is a factor)

Import data

SAS

filename csvFile url "https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/cbpp.csv";
proc import datafile = csvFile out = cbpp replace dbms = csv; run;

data cbpp; set cbpp;
    rate = incidence/size;
run;

R

data(cbpp)

Examine the data

glimpse(cbpp)

Exploratory plots

Plot the proportion incidence/size for each herd at each time period

SAS

proc sgplot data = cbpp noautolegend;
    styleattrs datacontrastcolors = (black) datalinepatterns = (solid);
    series x = period y = rate / group = herd;
run;

R

ggplot(cbpp, aes(x = period, y = incidence/size, group = herd)) +
  geom_line() +
  theme_bw()

Fit model

Measurements from different time periods from the same herd are not independent
We must account for this with a linear mixed model
- Response variable is the proportion of cattle suffering from CBPP, incidence/size,
- Fixed effect is the discrete time period, period,
- Random intercept for each herd, (1 | herd)

lmm_cbpp <- lmer(incidence/size ~ period + (1 | herd), data = cbpp)

Model diagnostic plots look reasonably good for a small dataset

check_model(lmm_cbpp)

Modeled estimates of the mean and 95% confidence interval of proportion CBPP for each period.

emmeans(lmm_cbpp, ~ period)

What is wrong here?

We cannot assume normally distributed error around mean proportions
We have to use a generalized linear mixed model (GLMM)

Linear models have a linear predictor term (fixed + random effects) that can be any number, positive or negative
GLMM uses a link function to convert response to a scale that can also be any number
After making model predictions, we take the inverse of the link to interpret model means and confidence intervals on the original scale

GLMM for binary response variable

Here the response is binary (0 = animal is healthy, 1 = animal has CBPP)
We want to predict the probability a randomly chosen individual animal from a given herd at a given time will have CBPP
Probability ranges from 0 to 1, so it can’t have normal error (bell curve has no upper and lower bound on the \(x\) axis)

Formula for predicting the probability of disease for individual \(i\) at time period \(j\) in herd \(k\)

\[\text{logit}~\hat{y}_{i,j,k} = \beta_0 + P_j + u_k\]

\(\hat{y}_{i,j,k}\): predicted probability of disease for individual \(i\) at time period \(j\) in herd \(k\)
\(\beta_0\): intercept
\(P_j\): fixed effect of time period \(j\)
\(u_k\): random intercept for each herd

What is “logit?”

That’s the link function!
Transforms the probability, which ranges from 0 to 1, to a scale that can take any value, positive or negative
logit function transforms from the probability scale to the log-odds scale:

\[\text{logit}~p = \log \frac {p}{1-p}\]

The logit function

Maps values ranging from 0 to 1 to values ranging from negative to positive infinity
Roughly a straight line when the probability is between about 0.25 and 0.75
Gets steeper closer to 0 and 1
If you have an event that has an extremely low (or high) chance of happening, a little increase (or decrease) to the probability will be a big difference in the odds

Fit binomial GLMM with logit link: SAS

proc glimmix data = cbpp;
    class herd period;
    model incidence/size = period / dist = binomial;
    random herd;
    weight size;
    lsmeans period / ilink cl diff oddsratio adjust = tukey;
run;

Fit binomial GLMM with logit link: R

Use glmer() instead of lmer(), same formula as before
Now we use family argument, for the “family” of response distributions we are using
Logit link is the default but to be explicit we specify binomial(link = 'logit')
weights = size accounts for the fact that the herds are different sizes; bigger herd = bigger contribution to the estimate

glmm_cbpp <- glmer(incidence/size ~ period + (1 | herd), weights = size, data = cbpp, family = binomial(link = 'logit'))

Model diagnostics

The default linear mixed model diagnostics from check_model() are not relevant here but we still need to check for issues
R package DHARMa simulates residuals and tests them
- deviation from normality
- overdispersion
- outliers
- homogeneity of variance

sim_resid_cbpp <- simulateResiduals(glmm_cbpp)
plot(sim_resid_cbpp)

Estimated marginal means of CBPP probability

We use the argument type = 'response'
- Uses the inverse link function to convert the means and 95% CI from the log-odds scale to the probability scale
- In SAS the option is ilink (inverse link) in lsmeans and slice statements

emmeans(glmm_cbpp, pairwise ~ period, type = 'response', adjust = 'tukey')

Repeated measures binary data: fungicide

Diseased cotton seedlings, photo by University of Missouri Extension

Three fungicides (DACD, RT, and TSX) applied to cotton seedlings, and untreated control
Randomized complete block design with 5 blocks
400 plants per plot
At 12, 20, and 42 days after planting, the number of diseased seedlings out of the 400 in each plot was counted

Repeated measures

Same plots were measured three times
We cannot consider the three time points to be independent, we must model the covariance between them
“Repeated measures” a.k.a. “split plot in time”

Import the data

SAS

filename csvFile2 url "https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/fungicide.csv";
proc import datafile = csvFile2 out = fungicide replace dbms = csv; run;

R

If you are using cloud server:

fungicide <- read_csv('data/fungicide.csv')

If you are running the code locally:

fungicide <- read_csv('https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/fungicide.csv')

Change some columns to factors, with control set as first reference level for the fungicide treatment column

fungicide <- fungicide %>%
  mutate(
    block = factor(block),
    plot = factor(plot),
    time = factor(time),
    fungicide = relevel(factor(fungicide), ref = 'None')
  )

Exploratory plots

Plot raw data points and medians as semi-transparent horizontal bars
Colorblind-friendly color palette (recommended)
Change over time? Differences from control? Differences between treatments?

ggplot(fungicide, aes(x = time, color = fungicide, group = fungicide, y = n_healthy)) +
  geom_point(size = 2, position = position_dodge(width = .5)) +
  stat_summary(fun = 'median', geom = 'point', shape = 95, size = 10, position = position_dodge(width = .5), alpha = .5) +
  theme_bw() +
  scale_color_okabeito(order = c(9, 1, 2, 3))

Plot time trends colored by block
Use facet_wrap() to split treatments into different panels
Differences between blocks?

ggplot(fungicide, aes(x = time, color = block, y = n_healthy)) +
  geom_line(aes(group = block)) +
  geom_point(size = 2) +
  facet_wrap(~ fungicide) +
  theme_bw() +
  scale_color_okabeito()

Residual covariance matrix

Models we’ve fit so far assumed conditionally independent residuals (independent after accounting for fixed and random effects)
Those models still had a residual variance-covariance matrix \(\textbf{R} = \sigma^2\textbf{I}\)
- \(n \times n\) matrix, \(n\) is number of time points
- \(\sigma^2\) on every element of the diagonal: all residuals have same variance
- 0 everywhere else: zero covariance means no relationship between any two residuals

\[\textbf{R}=\sigma^2\begin{pmatrix}1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}\]

Covariance matrix for correlated residuals

Our big step is to put nonzero values into \(\textbf{R}\) to model residual autocorrelation
There are many ways to model residual autocorrelation, we will cover three of the simplest ones today
- Unstructured covariance
- Compound symmetry
- First-order autoregressive

Unstructured covariance

Every pair of time points is modeled as having its own unique correlation
Every single off-diagonal element of \(\textbf{R}\) (symmetrical) has its own covariance parameter \(\rho_{ij}\) that we have to estimate
Most flexible because it has the most parameters

\[\textbf{R}=\sigma^2\begin{pmatrix}1 & \rho_{12} & \rho_{13} & \rho_{14} \\ \rho_{12} & 1 & \rho_{23} & \rho_{24} \\ \rho_{13} & \rho_{23} & 1 & \rho_{34} \\ \rho_{14} & \rho_{24} & \rho_{34} & 1 \end{pmatrix}\]

Compound symmetry

Simplest model with the fewest parameters, but less flexible
Only a single covariance parameter \(\rho\) is estimated for all pairs of time points
Two residuals from the same individual are correlated the same, no matter how far apart in time they are

\[\textbf{R}=\sigma^2\begin{pmatrix}1 & \rho & \rho & \rho \\ \rho & 1 & \rho & \rho \\ \rho & \rho & 1 & \rho \\ \rho & \rho & \rho & 1 \end{pmatrix}\]

First-order autoregressive, or AR(1)

Correlation between two time points within same individual is higher, the closer in time they are
Like compound symmetry, only a single covariance parameter \(\rho\)
Pairs of times 1 unit apart have correlation \(\rho\), 2 time units apart have \(\rho^2\), then \(\rho^3\) . . .

\[\textbf{R}=\sigma^2\begin{pmatrix}1 & \rho & \rho^2 & \rho^3 \\ \rho & 1 & \rho & \rho^2 \\ \rho^2 & \rho & 1 & \rho \\ \rho^3 & \rho^2 & \rho & 1 \end{pmatrix}\]

Setting up model formula

glmmTMB package allows us to fit more complex covariance structures to the data
cbind(n_healthy, n_plants - n_healthy) ~ fungicide + time + fungicide:time + (1 | block)
Response variable is two column vectors side-by-side: cbind(n_healthy, n_plants - n_healthy).
- n_healthy: number of “successes” or healthy plants
- n_plants - n_healthy: number of “failures” or the total number of plants minus the number of healthy plants
Fixed effects are fungicide, time, and their interaction: fungicide + time + fungicide:time
Random intercept term for each block: (1 | block)

Fitting the models

Unstructured: us(0 + time | fungicide:block)
Compound symmetry: cs(0 + time | fungicide:block)
AR(1): ar1(0 + time | fungicide:block)

0 + time because we are only fitting the residual covariance for the time slope, not the intercept.

Fitting the models

fit_us <- glmmTMB(
  cbind(n_healthy, n_plants - n_healthy) ~ fungicide + time + fungicide:time + (1 | block) + us(0 + time | fungicide:block), 
  data = fungicide, family = binomial
)

fit_cs <- glmmTMB(
  cbind(n_healthy, n_plants - n_healthy) ~ fungicide + time + fungicide:time + (1 | block) + cs(0 + time | fungicide:block), 
  data = fungicide, family = binomial
)

fit_ar1 <- glmmTMB(
  cbind(n_healthy, n_plants - n_healthy) ~ fungicide + time + fungicide:time + (1 | block) + ar1(0 + time | fungicide:block), 
  data = fungicide, family = binomial
)

How did the models turn out?

Unstructured covariance model does not even converge (too many parameters for small dataset)
CS and AR1 converged
Check diagnostics using simulated residuals

sim_resid_csmodel <- simulateResiduals(fit_cs)
plot(sim_resid_csmodel)

sim_resid_ar1model <- simulateResiduals(fit_ar1)
plot(sim_resid_ar1model)

AIC comparison of CS and AR1 models

AIC measures how well the model fits the data with a penalty for having too many parameters (lower is better)
Difference is very small; we will go with the AR1 model moving forward

AIC(fit_cs, fit_ar1)

Estimated marginal means

Mean disease probability by fungicide treatment averaged across times, with pairwise contrasts

emm_fung <- emmeans(
  fit_ar1, pairwise ~ fungicide, 
  type = 'response', adjust = 'tukey'
)

Mean disease probability by fungicide treatment within each time point, with pairwise contrasts

emm_fung_by_time <- emmeans(
  fit_ar1, pairwise ~ fungicide | time, 
  type = 'response', adjust = 'tukey'
)

Odds ratios

In binomial model, the pairwise comparisons are odds ratios (back-transformed from the log odds ratio scale)
OR = 1 indicates no difference (same odds of outcome 1 and outcome 2)
OR < 1 indicates lower odds of outcome 1 vs outcome 2 (for example 1/3 = 1:1 odds vs. 3:1 odds, or 50% vs. 75% probability)
OR > 1 indicates higher odds of outcome 1 vs outcome 2 (for example 2 = 2:1 odds vs. 1:1 odds, or 67% vs. 50% probability)
For example, averaging across time points, OR for DACD:control is significantly < 1

Plots to visualize model predictions

cld() summarizes multiple comparisons with letters
Annotate means and 95% confidence interval error bars with letters
Multiple comparisons are Tukey-adjusted, 95% CIs are Sidak-adjusted

cld_fung <- cld(emm_fung$emmeans, adjust = 'tukey', Letters = letters)
cld_fung_by_time <- cld(emm_fung_by_time$emmeans, adjust = 'tukey', Letters = letters)

plot_data <- bind_rows(as_tibble(cld_fung), as_tibble(cld_fung_by_time)) %>%
  mutate(time = fct_na_value_to_level(time, 'overall') %>%
           factor(labels = c(paste(c(12, 20, 42), 'days after planting'), 'overall')))

ggplot(plot_data, aes(x = fungicide, y = prob, ymin = asymp.LCL, ymax = asymp.UCL, label = trimws(.group))) +
  geom_errorbar(linewidth = 0.9, width = 0.2) +
  geom_point(size = 3) +
  geom_text(aes(y  = asymp.UCL), vjust = -0.5, size = 5) +
  facet_wrap(~ time) +
  theme_bw() +
  scale_y_continuous(name = 'probability of healthy plant', limits = c(0.6, 0.9))

Fit models in SAS

Similar but not identical models
method = laplace, a maximum-likelihood method, specified to obtain AIC values

/* Unstructured covariance */
proc glimmix data = fungicide method = laplace;
    class block fungicide time;
    model n_healthy/n_plants = fungicide time fungicide*time;
    random intercept / subject = block;
    random time / subject = fungicide(block) type = un;
    lsmeans fungicide / ilink cl diff lines adjust = tukey;
    slice fungicide*time / sliceby = time ilink cl diff lines adjust = tukey;
run;

/* Compound symmetry */
proc glimmix data = fungicide method = laplace;
    class block fungicide time;
    model n_healthy/n_plants = fungicide time fungicide*time;
    random intercept / subject = block;
    random time / subject = fungicide(block) type = cs;
    lsmeans fungicide / ilink cl diff lines adjust = tukey;
    slice fungicide*time / sliceby = time ilink cl diff lines adjust = tukey;
run;

/* AR1 */
proc glimmix data = fungicide method = laplace;
    class block fungicide time;
    model n_healthy/n_plants = fungicide time fungicide*time;
    random intercept / subject = block;
    random time / subject = fungicide(block) type = ar(1);
    lsmeans fungicide / ilink cl diff lines adjust = tukey;
    slice fungicide*time / sliceby = time ilink cl diff lines adjust = tukey;
run;

Conclusion

What have you learned in this lesson?

What GLMMs are and how link functions work
How to fit GLMMs to repeated measures binomial data
How to fit GLMMs with different error covariance structures
How to compare means from GLMMs and plot them

Nice job! Pat yourselves on the back once again!