R for SAS users: Lesson 3

What is this workshop?

  • For SAS users who want to learn R
  • SAS code is provided for comparison purposes
  • Lesson 3 in a series

IMPORTANT NOTE: In this lesson, the numerical results of the R and SAS code may no longer be identical. This is because of differences in model specification details and in the fitting algorithms between R and SAS.

Conceptual learning objectives

During this workshop, you will …

  • Learn what generalized linear mixed models (GLMMs) are
  • Learn more about different predictions you can make from models
  • Learn how to deal with more complex covariance structures

Practical skills

We will …

  • Import the data from a CSV file
  • Clean and reshape the data
  • Calculate some summary statistics and make some exploratory plots
  • Fit a generalized linear mixed-effects model with repeated measures error structure
  • Make plots and tables of results

How to follow along with this workshop

  • Slides and text version of lessons are online
    • usda-ree-ars.github.io/SEAStats
  • Fill in R code in the worksheet (replace ... with code)
  • You can always copy and paste code from text version of lesson if you fall behind

Load R packages

  • glmmTMB is a sophisticated GLMM fitting package that is more flexible than lme4
  • DHARMa lets us make residual diagnostic plots and tests for certain kinds of GLMM
library(tidyverse)
library(lme4)
library(easystats)
library(emmeans)
library(multcomp)
library(glmmTMB)
library(DHARMa)

theme_set(theme_bw())

CBPP dataset


Zebu, photo by Scott Bauer, USDA-ARS

  • cbpp (contagious bovine pleuropneumonia): dataset pre-loaded with the lme4 package.
  • 15 zebu cattle herds in Ethiopia
  • For each herd, number of zebu cattle that developed the disease and the total number of cattle in the herd
  • Repeated for 4 time periods (period is a factor)

Import data

SAS

filename csvFile url "https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/cbpp.csv";
proc import datafile = csvFile out = cbpp replace dbms = csv; run;

data cbpp; set cbpp;
    rate = incidence/size;
run;

R

data(cbpp)

Examine the data

glimpse(cbpp)

Exploratory plots

  • Plot the proportion incidence/size for each herd at each time period

SAS

proc sgplot data = cbpp noautolegend;
    styleattrs datacontrastcolors = (black) datalinepatterns = (solid);
    series x = period y = rate / group = herd;
run;

R

ggplot(cbpp, aes(x = period, y = incidence/size, group = herd)) +
  geom_line() +
  theme_bw()

Fit model

  • Measurements from different time periods from the same herd are not independent
  • We must account for this with a linear mixed model
    • Response variable is the proportion of cattle suffering from CBPP, incidence/size
    • Fixed effect is the discrete time period, period
    • Random intercept for each herd, (1 | herd)
lmm_cbpp <- lmer(incidence/size ~ period + (1 | herd), data = cbpp)
  • Model diagnostic plots look reasonably good for a small dataset
check_model(lmm_cbpp)
  • Modeled estimates of the mean and 95% confidence interval of the proportion with CBPP for each period
emmeans(lmm_cbpp, ~ period)
  • What is wrong here?
  • We cannot assume normally distributed error around mean proportions
  • We have to use a generalized linear mixed model (GLMM)
  • GLMM uses a link function to convert response variable to a scale where its error is roughly normal
  • Once we’ve transformed the response to that scale, we can treat it just like a linear mixed model
  • Afterward, we “undo” the transformation to interpret model means and confidence intervals on the original scale

GLMM for binary response variable

  • Here the response is binary (0 = animal is healthy, 1 = animal has CBPP)
  • We want to predict the probability a randomly chosen individual animal from a given herd at a given time will have CBPP
  • Probability ranges from 0 to 1, so it can’t be normally distributed (a bell curve has no upper or lower bound on the \(x\) axis)

Formula for predicting the probability of disease for individual \(i\) at time period \(j\) in herd \(k\)

\[\text{logit}~\hat{y}_{i,j,k} = \beta_0 + P_j + u_k\]

  • \(\hat{y}_{i,j,k}\): predicted probability of disease for individual \(i\) at time period \(j\) in herd \(k\)
  • \(\beta_0\): intercept
  • \(P_j\): fixed effect of time period \(j\)
  • \(u_k\): random intercept for each herd

What is “logit”?

  • That’s the link function!
  • Transforms the probability, which ranges from 0 to 1, to a scale that can take any value, positive or negative
  • The logit function transforms from the probability scale to the log-odds scale:

\[\text{logit}~p = \log \frac {p}{1-p}\]
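  • A quick sketch to illustrate the formula (the probability values below are arbitrary examples): the logit computed by hand matches R’s built-in qlogis(), and plogis() is its inverse
p <- c(0.25, 0.5, 0.75)  # arbitrary example probabilities
log(p / (1 - p))         # logit by hand: log-odds
qlogis(p)                # same values from R's built-in logit function
plogis(qlogis(p))        # inverse logit recovers the original probabilities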

The logit function

graph of logit function

  • Maps values ranging from 0 to 1 to values ranging from negative to positive infinity
  • Roughly a straight line when the probability is between about 0.25 and 0.75
  • Gets steeper closer to 0 and 1
  • If an event has an extremely low (or high) chance of happening, a small increase (or decrease) in the probability makes a big difference in the odds
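
Fit the GLMM

  • The diagnostics and estimated marginal means in the following sections use a fitted model object, glmm_cbpp
  • A minimal sketch of one way to fit it: glmer() from lme4 with family = binomial (the logit link is the default); glmmTMB() with the same formula and family would also work
# Binomial GLMM with logit link: successes = incidence, failures = size - incidence
glmm_cbpp <- glmer(
  cbind(incidence, size - incidence) ~ period + (1 | herd),
  data = cbpp, family = binomial
)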

Model diagnostics

  • The default linear mixed model diagnostics from check_model() are not relevant here, but we still need to check for issues
  • The R package DHARMa simulates residuals and tests them for:
    • deviation from normality
    • overdispersion
    • outliers
    • homogeneity of variance
sim_resid_cbpp <- simulateResiduals(glmm_cbpp)
plot(sim_resid_cbpp)

Estimated marginal means of CBPP probability

  • We use the argument type = 'response'
    • Uses the inverse link function to back-transform the means and 95% CI from the log-odds scale to the probability scale
    • In SAS the option is ilink (inverse link) in lsmeans and slice statements
emmeans(glmm_cbpp, pairwise ~ period, type = 'response', adjust = 'tukey')

Repeated measures binary data: fungicide


Diseased cotton seedlings, photo by University of Missouri Extension

  • Three fungicides (DACD, RT, and TSX) applied to cotton seedlings, plus an untreated control
  • Randomized complete block design with 5 blocks
  • 400 plants per plot
  • At 12, 20, and 42 days after planting, the number of diseased seedlings out of the 400 in each plot was counted

Repeated measures

  • Same plots were measured three times
  • We cannot consider the three time points to be independent, so we must model the covariance between them
  • “Repeated measures” a.k.a. “split plot in time”

Import the data

SAS

filename csvFile2 url "https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/fungicide.csv";
proc import datafile = csvFile2 out = fungicide replace dbms = csv; run;

R

If you are using the cloud server:

fungicide <- read_csv('data/fungicide.csv')

If you are running the code locally:

fungicide <- read_csv('https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/fungicide.csv')
  • Change some columns to factors, with the untreated control ('None') set as the reference (first) level of the fungicide treatment column
fungicide <- fungicide %>%
  mutate(
    block = factor(block),
    plot = factor(plot),
    time = factor(time),
    fungicide = relevel(factor(fungicide), ref = 'None')
  )
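  • As a quick optional check (not required for the analysis), print the factor levels; 'None' should now be listed first, i.e., it is the reference level
levels(fungicide$fungicide)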

Exploratory plots

  • Plot raw data points, with medians shown as semi-transparent horizontal bars
  • Colorblind-friendly color palette (recommended)
  • Change over time? Differences from control? Differences between treatments?
ggplot(fungicide, aes(x = time, color = fungicide, group = fungicide, y = n_healthy)) +
  geom_point(size = 2, position = position_dodge(width = .5)) +
  stat_summary(fun = 'median', geom = 'point', shape = 95, size = 10, position = position_dodge(width = .5), alpha = .5) +
  theme_bw() +
  scale_color_okabeito(order = c(9, 1, 2, 3))
  • Plot time trends colored by block
  • Use facet_wrap() to split treatments into different panels
  • Differences between blocks?
ggplot(fungicide, aes(x = time, color = block, y = n_healthy)) +
  geom_line(aes(group = block)) +
  geom_point(size = 2) +
  facet_wrap(~ fungicide) +
  theme_bw() +
  scale_color_okabeito()

Setting up model formula

  • glmmTMB package allows us to fit more complex covariance structures to the data

  • cbind(n_healthy, n_plants - n_healthy) ~ fungicide + time + fungicide:time + (1 | block)

  • Response variable is two column vectors side-by-side: cbind(n_healthy, n_plants - n_healthy).

    • n_healthy: number of “successes” or healthy plants
    • n_plants - n_healthy: number of “failures” or the total number of plants minus the number of healthy plants
  • Fixed effects are fungicide, time, and their interaction: fungicide + time + fungicide:time

  • Random intercept term for each block: (1 | block)

Accounting for repeated measures: 3 different covariance structures

  • independent errors: separately estimated covariance parameter for each pair of time points
  • compound symmetry a.k.a. CS: fixes the covariance parameter for each pair of time points at a single value
  • first-order autoregressive a.k.a. AR(1): assumes the same correlation between each pair of consecutive time points, with correlation decreasing for time points that are further apart
  • CS and AR(1) are less flexible but have fewer parameters; this can help avoid overfitting
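  • For example, with three time points (and generic notation introduced here for illustration: \(\sigma^2\) = variance, \(\rho\) = correlation parameter), CS and AR(1) imply covariance matrices of the form

\[\Sigma_\text{CS} = \sigma^2 \begin{pmatrix} 1 & \rho & \rho \\ \rho & 1 & \rho \\ \rho & \rho & 1 \end{pmatrix}, \qquad \Sigma_\text{AR(1)} = \sigma^2 \begin{pmatrix} 1 & \rho & \rho^2 \\ \rho & 1 & \rho \\ \rho^2 & \rho & 1 \end{pmatrix}\]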

Fitting the models

  • Independent errors: (time | fungicide:block)
  • Compound symmetry: cs(time | fungicide:block)
  • AR(1): ar1(0 + time | fungicide:block)
    • 0 + time because AR(1) doesn’t “play well” with intercepts, so we explicitly exclude the intercept

Fitting the models

fit_inderr <- glmmTMB(
  cbind(n_healthy, n_plants - n_healthy) ~ fungicide + time + fungicide:time + (1 | block) + (time | fungicide:block), 
  data = fungicide, family = binomial
)

fit_cserr <- glmmTMB(
  cbind(n_healthy, n_plants - n_healthy) ~ fungicide + time + fungicide:time + (1 | block) + cs(time | fungicide:block), 
  data = fungicide, family = binomial
)

fit_ar1err <- glmmTMB(
  cbind(n_healthy, n_plants - n_healthy) ~ fungicide + time + fungicide:time + (1 | block) + ar1(0 + time | fungicide:block), 
  data = fungicide, family = binomial
)

How did the models turn out?

  • Independent errors model does not even converge (too many parameters for small dataset)
  • CS and AR1 converged
  • Check diagnostics using simulated residuals
sim_resid_csmodel <- simulateResiduals(fit_cserr)
plot(sim_resid_csmodel)

sim_resid_ar1model <- simulateResiduals(fit_ar1err)
plot(sim_resid_ar1model)

AIC comparison of CS and AR1 models

  • AIC measures how well the model fits the data with a penalty for having too many parameters (lower is better)
  • Difference is very small; we will go with the CS model moving forward
AIC(fit_cserr, fit_ar1err)

Estimated marginal means

  • Mean probability of a healthy plant by fungicide treatment, averaged across times, with pairwise contrasts
emm_fung <- emmeans(
  fit_cserr, pairwise ~ fungicide, 
  type = 'response', adjust = 'tukey'
)
  • Mean probability of a healthy plant by fungicide treatment within each time point, with pairwise contrasts
emm_fung_by_time <- emmeans(
  fit_cserr, pairwise ~ fungicide | time, 
  type = 'response', adjust = 'tukey'
)

Odds ratios

  • In a binomial model, the pairwise comparisons are odds ratios (back-transformed from the log odds ratio scale)
  • OR = 1 indicates no difference (same odds of outcome 1 and outcome 2)
  • OR < 1 indicates lower odds of outcome 1 vs outcome 2 (for example 1/3 = 1:1 odds vs. 3:1 odds, or 50% vs. 75% probability)
  • OR > 1 indicates higher odds of outcome 1 vs outcome 2 (for example 2 = 2:1 odds vs. 1:1 odds, or 67% vs. 50% probability)
  • For example, averaging across time points, OR for DACD:control is significantly < 1
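  • A quick sketch of the arithmetic behind the example ORs above, using arbitrary probabilities
# Odds and odds ratios computed from probabilities (arbitrary example values)
odds <- function(p) p / (1 - p)
odds(0.50)               # 50% probability = 1:1 odds = 1
odds(0.75)               # 75% probability = 3:1 odds = 3
odds(0.50) / odds(0.75)  # OR = 1/3: lower odds for outcome 1 vs. outcome 2
odds(2/3) / odds(0.50)   # OR = 2: higher odds for outcome 1 vs. outcome 2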

Plots to visualize model predictions

  • cld() summarizes multiple comparisons with letters
  • Annotate means and 95% confidence interval error bars with letters
  • Multiple comparisons are Tukey-adjusted, 95% CIs are Sidak-adjusted
cld_fung <- cld(emm_fung$emmeans, adjust = 'tukey', Letters = letters)
cld_fung_by_time <- cld(emm_fung_by_time$emmeans, adjust = 'tukey', Letters = letters)

plot_data <- bind_rows(as_tibble(cld_fung), as_tibble(cld_fung_by_time)) %>%
  mutate(time = fct_na_value_to_level(time, 'overall') %>%
           factor(labels = c(paste(c(12, 20, 42), 'days after planting'), 'overall')))
ggplot(plot_data, aes(x = fungicide, y = prob, ymin = asymp.LCL, ymax = asymp.UCL, label = trimws(.group))) +
  geom_errorbar(linewidth = 0.9, width = 0.2) +
  geom_point(size = 3) +
  geom_text(aes(y  = asymp.UCL), vjust = -0.5, size = 5) +
  facet_wrap(~ time) +
  theme_bw() +
  scale_y_continuous(name = 'probability of healthy plant', limits = c(0.6, 0.9))

Fit models in SAS

  • Similar but not identical models
  • method = laplace, a maximum-likelihood method, is specified to obtain AIC values
/* Independent errors */
proc glimmix data = fungicide method = laplace;
    class block fungicide time;
    model n_healthy/n_plants = fungicide time fungicide*time;
    random intercept / subject = block;
    random time / subject = fungicide(block);
    lsmeans fungicide / ilink cl diff lines adjust = tukey;
    slice fungicide*time / sliceby = time ilink cl diff lines adjust = tukey;
run;

/* Compound symmetry */
proc glimmix data = fungicide method = laplace;
    class block fungicide time;
    model n_healthy/n_plants = fungicide time fungicide*time;
    random intercept / subject = block;
    random time / subject = fungicide(block) type = cs;
    lsmeans fungicide / ilink cl diff lines adjust = tukey;
    slice fungicide*time / sliceby = time ilink cl diff lines adjust = tukey;
run;

/* AR1 */
proc glimmix data = fungicide method = laplace;
    class block fungicide time;
    model n_healthy/n_plants = fungicide time fungicide*time;
    random intercept / subject = block;
    random time / subject = fungicide(block) type = ar(1);
    lsmeans fungicide / ilink cl diff lines adjust = tukey;
    slice fungicide*time / sliceby = time ilink cl diff lines adjust = tukey;
run;

Conclusion

What have you learned in this lesson?

  • What GLMMs are and how link functions work
  • How to fit GLMMs to repeated measures binomial data
  • How to fit GLMMs with different error covariance structures
  • How to compare means from GLMMs and plot them

Nice job! Pat yourselves on the back once again!