R for SAS users: Lesson 3

What is this workshop?

  • For SAS users who want to learn R
  • SAS code is provided for comparison purposes
  • Lesson 3 in a series

IMPORTANT NOTE: In this lesson, the numerical results of the R and SAS code may no longer be identical. This is because of details in model specifications and differences in the fitting algorithms between R and SAS.

Conceptual learning objectives

During this workshop, you will …

  • Learn what generalized linear mixed models (GLMMs) are
  • Learn more about different predictions you can make from models
  • Learn about how to deal with more complex covariance structures

Practical skills

We will …

  • Import the data from a CSV file
  • Clean and reshape the data
  • Calculate some summary statistics and make some exploratory plots
  • Fit a generalized linear mixed-effects model with repeated measures error structure
  • Make plots and tables of results

How to follow along with this workshop

  • Slides and text version of lessons are online
    • usda-ree-ars.github.io/SEAStats
  • Fill in R code in the worksheet (replace ... with code)
  • You can always copy and paste code from text version of lesson if you fall behind

Load R packages

  • glmmTMB is a sophisticated GLMM fitting package that is more flexible than lme4
  • DHARMa lets us make residual diagnostic plots and tests for certain kinds of GLMM
library(tidyverse)
library(lme4)
library(easystats)
library(emmeans)
library(multcomp)
library(glmmTMB)
library(DHARMa)

theme_set(theme_bw())

CBPP dataset


Zebu, photo by Scott Bauer, USDA-ARS

  • cbpp (contagious bovine pleuropneumonia): dataset pre-loaded with the lme4 package.
  • 15 zebu cattle herds in Ethiopia
  • For each herd, number of zebu cattle that developed the disease and the total number of cattle in the herd
  • Repeated for 4 time periods (period is a factor)

Import data

SAS

filename csvFile url "https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/cbpp.csv";
proc import datafile = csvFile out = cbpp replace dbms = csv; run;

data cbpp; set cbpp;
    rate = incidence/size;
run;

R

data(cbpp)

Examine the data

glimpse(cbpp)

Exploratory plots

  • Plot the proportion incidence/size for each herd at each time period

SAS

proc sgplot data = cbpp noautolegend;
    styleattrs datacontrastcolors = (black) datalinepatterns = (solid);
    series x = period y = rate / group = herd;
run;

R

ggplot(cbpp, aes(x = period, y = incidence/size, group = herd)) +
  geom_line() +
  theme_bw()

Fit model

  • Measurements from different time periods from the same herd are not independent
  • We must account for this with a linear mixed model
    • Response variable is the proportion of cattle suffering from CBPP, incidence/size,
    • Fixed effect is the discrete time period, period,
    • Random intercept for each herd, (1 | herd)
lmm_cbpp <- lmer(incidence/size ~ period + (1 | herd), data = cbpp)
  • Model diagnostic plots look reasonably good for a small dataset
check_model(lmm_cbpp)
  • Modeled estimates of the mean and 95% confidence interval of proportion CBPP for each period.
emmeans(lmm_cbpp, ~ period)
  • What is wrong here?
  • We cannot assume normally distributed error around mean proportions
  • We have to use a generalized linear mixed model (GLMM)
  • Linear models have a linear predictor term (fixed + random effects) that can be any number, positive or negative
  • GLMM uses a link function to convert response to a scale that can also be any number
  • After making model predictions, we take the inverse of the link to interpret model means and confidence intervals on the original scale

GLMM for binary response variable

  • Here the response is binary (0 = animal is healthy, 1 = animal has CBPP)
  • We want to predict the probability a randomly chosen individual animal from a given herd at a given time will have CBPP
  • Probability ranges from 0 to 1, so it can’t have normal error (bell curve has no upper and lower bound on the \(x\) axis)

Formula for predicting the probability of disease for individual \(i\) at time period \(j\) in herd \(k\)

\[\text{logit}~\hat{y}_{i,j,k} = \beta_0 + P_j + u_k\]

  • \(\hat{y}_{i,j,k}\): predicted probability of disease for individual \(i\) at time period \(j\) in herd \(k\)
  • \(\beta_0\): intercept
  • \(P_j\): fixed effect of time period \(j\)
  • \(u_k\): random intercept for each herd

What is “logit?”

  • That’s the link function!
  • Transforms the probability, which ranges from 0 to 1, to a scale that can take any value, positive or negative
  • logit function transforms from the probability scale to the log-odds scale:

\[\text{logit}~p = \log \frac {p}{1-p}\]

The logit function

graph of logit function

  • Maps values ranging from 0 to 1 to values ranging from negative to positive infinity
  • Roughly a straight line when the probability is between about 0.25 and 0.75
  • Gets steeper closer to 0 and 1
  • If you have an event that has an extremely low (or high) chance of happening, a little increase (or decrease) to the probability will be a big difference in the odds

Model diagnostics

  • The default linear mixed model diagnostics from check_model() are not relevant here but we still need to check for issues
  • R package DHARMa simulates residuals and tests them
    • deviation from normality
    • overdispersion
    • outliers
    • homogeneity of variance
sim_resid_cbpp <- simulateResiduals(glmm_cbpp)
plot(sim_resid_cbpp)

Estimated marginal means of CBPP probability

  • We use the argument type = 'response'
    • Uses the inverse link function to convert the means and 95% CI from the log-odds scale to the probability scale
    • In SAS the option is ilink (inverse link) in lsmeans and slice statements
emmeans(glmm_cbpp, pairwise ~ period, type = 'response', adjust = 'tukey')

Repeated measures binary data: fungicide


Diseased cotton seedlings, photo by University of Missouri Extension

  • Three fungicides (DACD, RT, and TSX) applied to cotton seedlings, and untreated control
  • Randomized complete block design with 5 blocks
  • 400 plants per plot
  • At 12, 20, and 42 days after planting, the number of diseased seedlings out of the 400 in each plot was counted

Repeated measures

  • Same plots were measured three times
  • We cannot consider the three time points to be independent, we must model the covariance between them
  • “Repeated measures” a.k.a. “split plot in time”

Import the data

SAS

filename csvFile2 url "https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/fungicide.csv";
proc import datafile = csvFile2 out = fungicide replace dbms = csv; run;

R

If you are using cloud server:

fungicide <- read_csv('data/fungicide.csv')

If you are running the code locally:

fungicide <- read_csv('https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/fungicide.csv')
  • Change some columns to factors, with control set as first reference level for the fungicide treatment column
fungicide <- fungicide %>%
  mutate(
    block = factor(block),
    plot = factor(plot),
    time = factor(time),
    fungicide = relevel(factor(fungicide), ref = 'None')
  )

Exploratory plots

  • Plot raw data points and medians as semi-transparent horizontal bars
  • Colorblind-friendly color palette (recommended)
  • Change over time? Differences from control? Differences between treatments?
ggplot(fungicide, aes(x = time, color = fungicide, group = fungicide, y = n_healthy)) +
  geom_point(size = 2, position = position_dodge(width = .5)) +
  stat_summary(fun = 'median', geom = 'point', shape = 95, size = 10, position = position_dodge(width = .5), alpha = .5) +
  theme_bw() +
  scale_color_okabeito(order = c(9, 1, 2, 3))
  • Plot time trends colored by block
  • Use facet_wrap() to split treatments into different panels
  • Differences between blocks?
ggplot(fungicide, aes(x = time, color = block, y = n_healthy)) +
  geom_line(aes(group = block)) +
  geom_point(size = 2) +
  facet_wrap(~ fungicide) +
  theme_bw() +
  scale_color_okabeito()

Residual covariance matrix

  • Models we’ve fit so far assumed conditionally independent residuals (independent after accounting for fixed and random effects)
  • Those models still had a residual variance-covariance matrix \(\textbf{R} = \sigma^2\textbf{I}\)
    • \(n \times n\) matrix, \(n\) is number of time points
    • \(\sigma^2\) on every element of the diagonal: all residuals have same variance
    • 0 everywhere else: zero covariance means no relationship between any two residuals

\[\textbf{R}=\sigma^2\begin{pmatrix}1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}\]

Covariance matrix for correlated residuals

  • Our big step is to put nonzero values into \(\textbf{R}\) to model residual autocorrelation
  • There are many ways to model residual autocorrelation, we will cover three of the simplest ones today
    • Unstructured covariance
    • Compound symmetry
    • First-order autoregressive

Unstructured covariance

  • Every pair of time points is modeled as having its own unique correlation
  • Every single off-diagonal element of \(\textbf{R}\) (symmetrical) has its own covariance parameter \(\rho_{ij}\) that we have to estimate
  • Most flexible because it has the most parameters

\[\textbf{R}=\sigma^2\begin{pmatrix}1 & \rho_{12} & \rho_{13} & \rho_{14} \\ \rho_{12} & 1 & \rho_{23} & \rho_{24} \\ \rho_{13} & \rho_{23} & 1 & \rho_{34} \\ \rho_{14} & \rho_{24} & \rho_{34} & 1 \end{pmatrix}\]

Compound symmetry

  • Simplest model with the fewest parameters, but less flexible
  • Only a single covariance parameter \(\rho\) is estimated for all pairs of time points
  • Two residuals from the same individual are correlated the same, no matter how far apart in time they are

\[\textbf{R}=\sigma^2\begin{pmatrix}1 & \rho & \rho & \rho \\ \rho & 1 & \rho & \rho \\ \rho & \rho & 1 & \rho \\ \rho & \rho & \rho & 1 \end{pmatrix}\]

First-order autoregressive, or AR(1)

  • Correlation between two time points within same individual is higher, the closer in time they are
  • Like compound symmetry, only a single covariance parameter \(\rho\)
  • Pairs of times 1 unit apart have correlation \(\rho\), 2 time units apart have \(\rho^2\), then \(\rho^3\) . . .

\[\textbf{R}=\sigma^2\begin{pmatrix}1 & \rho & \rho^2 & \rho^3 \\ \rho & 1 & \rho & \rho^2 \\ \rho^2 & \rho & 1 & \rho \\ \rho^3 & \rho^2 & \rho & 1 \end{pmatrix}\]

Setting up model formula

  • glmmTMB package allows us to fit more complex covariance structures to the data

  • cbind(n_healthy, n_plants - n_healthy) ~ fungicide + time + fungicide:time + (1 | block)

  • Response variable is two column vectors side-by-side: cbind(n_healthy, n_plants - n_healthy).

    • n_healthy: number of “successes” or healthy plants
    • n_plants - n_healthy: number of “failures” or the total number of plants minus the number of healthy plants
  • Fixed effects are fungicide, time, and their interaction: fungicide + time + fungicide:time

  • Random intercept term for each block: (1 | block)

Fitting the models

  • Unstructured: us(0 + time | fungicide:block)
  • Compound symmetry: cs(0 + time | fungicide:block)
  • AR(1): ar1(0 + time | fungicide:block)

0 + time because we are only fitting the residual covariance for the time slope, not the intercept.

Fitting the models

fit_us <- glmmTMB(
  cbind(n_healthy, n_plants - n_healthy) ~ fungicide + time + fungicide:time + (1 | block) + us(0 + time | fungicide:block), 
  data = fungicide, family = binomial
)

fit_cs <- glmmTMB(
  cbind(n_healthy, n_plants - n_healthy) ~ fungicide + time + fungicide:time + (1 | block) + cs(0 + time | fungicide:block), 
  data = fungicide, family = binomial
)

fit_ar1 <- glmmTMB(
  cbind(n_healthy, n_plants - n_healthy) ~ fungicide + time + fungicide:time + (1 | block) + ar1(0 + time | fungicide:block), 
  data = fungicide, family = binomial
)

How did the models turn out?

  • Unstructured covariance model does not even converge (too many parameters for small dataset)
  • CS and AR1 converged
  • Check diagnostics using simulated residuals
sim_resid_csmodel <- simulateResiduals(fit_cs)
plot(sim_resid_csmodel)

sim_resid_ar1model <- simulateResiduals(fit_ar1)
plot(sim_resid_ar1model)

AIC comparison of CS and AR1 models

  • AIC measures how well the model fits the data with a penalty for having too many parameters (lower is better)
  • Difference is very small; we will go with the AR1 model moving forward
AIC(fit_cs, fit_ar1)

Estimated marginal means

  • Mean disease probability by fungicide treatment averaged across times, with pairwise contrasts
emm_fung <- emmeans(
  fit_ar1, pairwise ~ fungicide, 
  type = 'response', adjust = 'tukey'
)
  • Mean disease probability by fungicide treatment within each time point, with pairwise contrasts
emm_fung_by_time <- emmeans(
  fit_ar1, pairwise ~ fungicide | time, 
  type = 'response', adjust = 'tukey'
)

Odds ratios

  • In binomial model, the pairwise comparisons are odds ratios (back-transformed from the log odds ratio scale)
  • OR = 1 indicates no difference (same odds of outcome 1 and outcome 2)
  • OR < 1 indicates lower odds of outcome 1 vs outcome 2 (for example 1/3 = 1:1 odds vs. 3:1 odds, or 50% vs. 75% probability)
  • OR > 1 indicates higher odds of outcome 1 vs outcome 2 (for example 2 = 2:1 odds vs. 1:1 odds, or 67% vs. 50% probability)
  • For example, averaging across time points, OR for DACD:control is significantly < 1

Plots to visualize model predictions

  • cld() summarizes multiple comparisons with letters
  • Annotate means and 95% confidence interval error bars with letters
  • Multiple comparisons are Tukey-adjusted, 95% CIs are Sidak-adjusted
cld_fung <- cld(emm_fung$emmeans, adjust = 'tukey', Letters = letters)
cld_fung_by_time <- cld(emm_fung_by_time$emmeans, adjust = 'tukey', Letters = letters)

plot_data <- bind_rows(as_tibble(cld_fung), as_tibble(cld_fung_by_time)) %>%
  mutate(time = fct_na_value_to_level(time, 'overall') %>%
           factor(labels = c(paste(c(12, 20, 42), 'days after planting'), 'overall')))
ggplot(plot_data, aes(x = fungicide, y = prob, ymin = asymp.LCL, ymax = asymp.UCL, label = trimws(.group))) +
  geom_errorbar(linewidth = 0.9, width = 0.2) +
  geom_point(size = 3) +
  geom_text(aes(y  = asymp.UCL), vjust = -0.5, size = 5) +
  facet_wrap(~ time) +
  theme_bw() +
  scale_y_continuous(name = 'probability of healthy plant', limits = c(0.6, 0.9))

Fit models in SAS

  • Similar but not identical models
  • method = laplace, a maximum-likelihood method, specified to obtain AIC values
/* Unstructured covariance */
proc glimmix data = fungicide method = laplace;
    class block fungicide time;
    model n_healthy/n_plants = fungicide time fungicide*time;
    random intercept / subject = block;
    random time / subject = fungicide(block) type = un;
    lsmeans fungicide / ilink cl diff lines adjust = tukey;
    slice fungicide*time / sliceby = time ilink cl diff lines adjust = tukey;
run;

/* Compound symmetry */
proc glimmix data = fungicide method = laplace;
    class block fungicide time;
    model n_healthy/n_plants = fungicide time fungicide*time;
    random intercept / subject = block;
    random time / subject = fungicide(block) type = cs;
    lsmeans fungicide / ilink cl diff lines adjust = tukey;
    slice fungicide*time / sliceby = time ilink cl diff lines adjust = tukey;
run;

/* AR1 */
proc glimmix data = fungicide method = laplace;
    class block fungicide time;
    model n_healthy/n_plants = fungicide time fungicide*time;
    random intercept / subject = block;
    random time / subject = fungicide(block) type = ar(1);
    lsmeans fungicide / ilink cl diff lines adjust = tukey;
    slice fungicide*time / sliceby = time ilink cl diff lines adjust = tukey;
run;

Conclusion

What have you learned in this lesson?

  • What GLMMs are and how link functions work
  • How to fit GLMMs to repeated measures binomial data
  • How to fit GLMMs with different error covariance structures
  • How to compare means from GLMMs and plot them

Nice job! Pat yourselves on the back once again!