Introduction

What is this workshop?

Welcome back to the R for SAS users workshop! This workshop is intended for SAS users who want to learn R. The people who will get the most out of this course are practicing researchers who have a decent working knowledge of SAS, and of basic statistical analysis (descriptive stats and regression models) as it applies to their field.

This is lesson 3 of 3 in a series. Lesson 1 covered the basics: importing data, cleaning and reshaping data, summary statistics, simple graphs and tables, and a few simple statistical models. Lesson 2 got a little more advanced, covering linear mixed models for more sophisticated experimental designs and how to produce and compare estimated marginal means.

Download the worksheet for this lesson here.

IMPORTANT NOTE: In this lesson, the numerical results of the R and SAS code may no longer be identical, as they were in previous lessons. This is because different fitting algorithms are used by SAS PROC GLIMMIX and by the R model fitting packages that we are demonstrating. A full discussion of these differences is outside the scope of this lesson!

What will you learn from this workshop?

Conceptual learning objectives

During this workshop, you will …

  • Learn what generalized linear mixed models (GLMMs) are
  • Learn more about different predictions you can make from models
  • Learn about how to deal with more complex covariance structures

Practical skills

As in Lessons 1 and 2, we will work through a “data to doc” pipeline in R, comparing R code to SAS code each step of the way. We will work with new example datasets.

We will …

  • Import the data from a CSV file
  • Clean and reshape the data
  • Calculate some summary statistics and make some exploratory plots
  • Fit a generalized linear mixed-effects model with repeated measures error structure
  • Make plots and tables of results

How to follow along with this workshop

  • Slides and text version of lessons are online
  • Fill in R code in the worksheet (replace ... with code)
  • This lesson also includes a template notebook that you can fill in
  • You can always copy and paste code from text version of lesson if you fall behind
  • Notes on best practices will be marked with PROTIP as we go along!

Cattle pneumonia example data analysis

As in the previous lessons, we will start with raw data and work our way to a finished product. Hopefully this is becoming second nature to you by now!

Load R packages

Here we’ll load the R packages we are going to work with today. These are mostly the same as in the previous lessons: the tidyverse package for reading, manipulating, and plotting data, the lme4 package for fitting linear mixed models, the easystats package for model diagnostic plots, and the emmeans and multcomp packages for estimated marginal means and multiple comparisons. New for this lesson are glmmTMB, a more advanced mixed model fitting package, and DHARMa for GLMM residual diagnostic plots. We also set a default plotting theme.

library(tidyverse)
library(lme4)
library(easystats)
library(emmeans)
library(multcomp)
library(glmmTMB)
library(DHARMa)

theme_set(theme_bw())

The dataset

The first dataset we will use for this lesson is the cbpp or contagious bovine pleuropneumonia dataset. It comes pre-loaded with the lme4 package. For each of 15 herds of Ethiopian zebu cattle, the number of cattle that developed the disease and the total number of cattle in the herd were recorded for each of four time periods. The herds (1-15) are identified with numerical IDs, and the time periods are identified by the integers 1-4. Note that the period column is already a factor variable when you examine the pre-loaded dataset.


Zebu, photo by Scott Bauer, USDA-ARS

PROTIP: You can call ?cbpp in the R console to see documentation for the dataset. This can be done for any example dataset that comes pre-loaded with a package.

Note I’ve made a copy of the cbpp dataset as a CSV file so that you can import the same dataset into SAS if you want to follow along with the SAS code.

SAS

filename csvFile url "https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/cbpp.csv";
proc import datafile = csvFile out = cbpp replace dbms = csv; run;

data cbpp; set cbpp;
    rate = incidence/size;
run;

R

data(cbpp)
glimpse(cbpp)
Rows: 56
Columns: 4
$ herd      <fct> 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, …
$ incidence <dbl> 2, 3, 4, 0, 3, 1, 1, 8, 2, 0, 2, 2, 0, 2, 0, 5, 0, 0, 1, 3, …
$ size      <dbl> 14, 12, 9, 5, 22, 18, 21, 22, 16, 16, 20, 10, 10, 9, 6, 18, …
$ period    <fct> 1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, …

Exploratory plots

Let’s make a plot of the CBPP data. We will plot the proportion incidence/size for each herd at each time period.

SAS

proc sgplot data = cbpp noautolegend;
    styleattrs datacontrastcolors = (black) datalinepatterns = (solid);
    series x = period y = rate / group = herd;
run;
SGPlot output: line plot of rate (incidence/size) by period for each herd

R

ggplot(cbpp, aes(x = period, y = incidence/size, group = herd)) +
  geom_line() +
  theme_bw()

Plot of raw CBPP incidence data by time period and herd

We can see from the above that the proportion of cows in each herd that show disease symptoms varies quite a lot between herds and over time.

Fit linear model

As we’ve learned in the previous lessons, if we have non-independence in our measurements, we have to account for that using a linear mixed model. We definitely have that in this case because the same herds were sampled four times, so measurements from period 1-4 from the same herd are not independent of one another.

So here’s a linear mixed model fit using lmer() (SAS code not presented here). The response variable is the proportion of cattle suffering from CBPP, incidence/size, the fixed effect is the time period treated as a discrete categorical variable, period, and there is a random intercept for each herd, (1 | herd).

lmm_cbpp <- lmer(incidence/size ~ period + (1 | herd), data = cbpp)

So far so good, right?

Linear model residual diagnostics

The model diagnostic plots also look reasonably good for a small dataset.

check_model(lmm_cbpp)

Residual diagnostic plot for CBPP linear model

Assess linear model predictions

Let’s get the model’s estimate of the mean and 95% confidence interval of proportion incidence of CBPP for each of the four periods.

emmeans(lmm_cbpp, ~ period)
 period emmean     SE df lower.CL upper.CL
 1      0.2198 0.0323 52   0.1550    0.285
 2      0.0741 0.0335 52   0.0069    0.141
 3      0.0894 0.0335 52   0.0222    0.157
 4      0.0413 0.0348 52  -0.0286    0.111

Degrees-of-freedom method: kenward-roger 
Confidence level used: 0.95 

The estimated marginal means decrease over time from period 1 (0.22) to period 4 (0.04). But the 95% confidence limits indicate a problem. The lower 95% confidence limit on the mean disease incidence proportion is negative! It is not possible to have a proportion less than 0 or greater than 1. Our assumption of normally distributed error around the mean proportions isn’t valid.

Generalized linear mixed model

This is where we need to use generalized linear mixed models (GLMMs). A GLMM uses a link function to put the model’s predictions on a scale where the assumptions of a linear mixed model are close enough to being met. On that scale, from the model’s perspective, there’s no problem: we can fit it much as we would a linear mixed model, then undo the transformation afterward so that we can interpret the estimated means and confidence intervals on the original data scale.

Here, our response is binary (either a cow has the disease or it doesn’t). Modeling a binary response can be done with a generalized linear model, using a link function to put the predicted response on a scale where it can behave like a normally distributed variable. What we are predicting is the probability that an individual cow from a given herd in a given time period will have CBPP. A probability is bounded between 0 and 1, so it cannot really have normally distributed error: normal distributions (bell curves) can take any value, positive or negative, which does not play well with a bounded range like 0 to 1.

The formula for predicting the probability of disease for individual \(i\) at time period \(j\) in herd \(k\) is as follows:

\[\text{logit}~\hat{y}_{i,j,k} = \beta_0 + P_j + u_k\]

In this formula, \(\hat{y}_{i,j,k}\) is the predicted probability of disease for individual \(i\) at time period \(j\) in herd \(k\). \(\beta_0\) is the intercept, \(P_j\) is the fixed effect of time period \(j\), and \(u_k\) is the random effect, a separate intercept for each herd. You can see the model assumes that every individual in the same time period in the same herd has the same probability of having CBPP.

But what is that “logit” thing? That is the link function. It is used to transform the probability, which ranges from 0 to 1 and therefore can’t really have normally distributed error, to a scale that can take any value, positive or negative. Logit is another word for the log-odds function, defined as

\[\text{logit}~p = \log \frac {p}{1-p}\]

Here is a graph of that function:

graph of logit function

You can see that it maps values ranging from 0 to 1 to values ranging from negative to positive infinity. It is more or less a straight line when the probability is between about 0.25 and 0.75, but it gets very steep as you approach the boundaries. This makes sense: when an event has an extremely low (or high) chance of happening, even a small change in its probability is a relatively big difference.
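
If you want to reproduce a graph like this yourself, base R provides qlogis() for the logit function and plogis() for its inverse. Here is a minimal sketch using ggplot2; it is purely illustrative and not part of the analysis.

# qlogis() is base R's logit (log-odds) function
p_values <- seq(0.001, 0.999, by = 0.001)

ggplot(data.frame(p = p_values, logit_p = qlogis(p_values)), aes(x = p, y = logit_p)) +
  geom_line() +
  labs(x = 'probability p', y = 'logit(p) = log(p / (1 - p))')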

Fit binomial GLMM

To use statistical jargon, what we are going to do here is fit a binomial GLMM (binomial meaning our response variable is a binary outcome) with a logit link function to transform the predicted probabilities onto the log odds scale.

SAS

proc glimmix data = cbpp;
    class herd period;
    model incidence/size = period / dist = binomial;
    random herd;
    weight size;
    lsmeans period / ilink cl diff oddsratio adjust = tukey;
run;
Model Information
Data Set WORK.CBPP
Response Variable (Events) incidence
Response Variable (Trials) size
Response Distribution Binomial
Link Function Logit
Variance Function Default
Weight Variable size
Variance Matrix Not blocked
Estimation Technique Residual PL
Degrees of Freedom Method Containment
Class Level Information
Class Levels Values
herd 15 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
period 4 1 2 3 4
Number of Observations Read 56
Number of Observations Used 56
Number of Events 99
Number of Trials 842
Dimensions
G-side Cov. Parameters 1
Columns in X 5
Columns in Z 15
Subjects (Blocks in V) 1
Max Obs per Subject 56
Optimization Information
Optimization Technique Dual Quasi-Newton
Parameters in Optimization 1
Lower Boundaries 1
Upper Boundaries 0
Fixed Effects Profiled
Starting From Data
Iteration History
Iteration  Restarts  Subiterations  Objective Function  Change      Max Gradient
0          0         5              763.01311438        0.33363629  0.000019
1          0         3              927.32730767        0.09156786  1.469E-6
2          0         1              957.93604425        0.00294224  9.108E-6
3          0         1              958.76611047        0.00000413  2.345E-8
4          0         0              958.76678537        0.00000000  4.042E-8
Convergence criterion (PCONV=1.11022E-8) satisfied.
Fit Statistics
-2 Res Log Pseudo-Likelihood 958.77
Generalized Chi-Square 926.53
Gener. Chi-Square / DF 17.82
Covariance Parameter Estimates
Cov Parm  Estimate  Standard Error
herd      0.7536    0.2899
Type III Tests of Fixed Effects
Effect Num DF Den DF F Value Pr > F
period 3 38 75.66 <.0001
period Least Squares Means
period  Estimate  Std Error  DF  t Value  Pr > |t|  Alpha    Lower    Upper     Mean  Std Err Mean  Lower Mean  Upper Mean
1        -1.5248     0.2283  38    -6.68    <.0001   0.05  -1.9870  -1.0626  0.1788        0.03352      0.1206      0.2568
2        -2.1217     0.2331  38    -9.10    <.0001   0.05  -2.5935  -1.6498  0.1070        0.02227      0.06956     0.1611
3        -2.5721     0.2382  38   -10.80    <.0001   0.05  -3.0543  -2.0899  0.07096       0.01570      0.04503     0.1101
4        -2.6697     0.2441  38   -10.94    <.0001   0.05  -3.1640  -2.1755  0.06478       0.01479      0.04054     0.1020
Differences of period Least Squares Means
Adjustment for Multiple Comparisons: Tukey-Kramer
period  _period  Estimate  Std Error  DF  t Value  Pr > |t|   Adj P   Alpha    Lower   Upper  Adj Lower  Adj Upper  Odds Ratio  OR Lower CL  OR Upper CL  OR Adj Lower  OR Adj Upper
1       2         0.5969     0.07424  38     8.04    <.0001  <.0001   0.05   0.4466  0.7472     0.3974     0.7963       1.816        1.563        2.111         1.488         2.217
1       3         1.0473     0.08898  38    11.77    <.0001  <.0001   0.05   0.8672  1.2274     0.8083     1.2864       2.850        2.380        3.412         2.244         3.620
1       4         1.1450     0.1032   38    11.09    <.0001  <.0001   0.05   0.9360  1.3539     0.8677     1.4222       3.142        2.550        3.872         2.381         4.146
2       3         0.4504     0.09755  38     4.62    <.0001  0.0002   0.05   0.2530  0.6479     0.1884     0.7125       1.569        1.288        1.912         1.207         2.039
2       4         0.5481     0.1111   38     4.93    <.0001  <.0001   0.05   0.3231  0.7730     0.2496     0.8466       1.730        1.381        2.166         1.283         2.332
3       4         0.09764    0.1219   38     0.80    0.4281  0.8535   0.05  -0.1491  0.3444    -0.2299     0.4251       1.103        0.861        1.411         0.795         1.530

R

In R, we use the glmer() function instead of lmer() in this case. We specify a formula in the same way as we have been doing with lmer(), and the data frame from which the variables come, but we now have a new argument, family. This refers to the “family” of response distributions that we are going to be using. Here we specify binomial(link = 'logit'). Other link functions can be used with binomial GLMMs, but we will not get into that today.

Also notice weights = size which accounts for the fact that the herds are different sizes; a bigger herd is contributing more data to our estimate of the incidence of CBPP over time.

glmm_cbpp <- glmer(incidence/size ~ period + (1 | herd), weights = size, data = cbpp, family = binomial(link = 'logit'))

GLMM residual diagnostics

To check this model, it no longer makes sense to look at the residual diagnostic plots from check_model(). We aren’t going to get normally distributed residuals. However, there are still issues we potentially need to check for. The R package DHARMa allows you to simulate residuals from a fitted GLMM object and test them in a few different ways. On the left is a Q-Q plot comparing the simulated (scaled) residuals to their expected uniform distribution; it shows no issues with deviation from the expected distribution (K-S test), overdispersion, or outliers. The homogeneity of variance plot on the right also shows no difference in the variance of the residuals among the four time periods.

sim_resid_cbpp <- simulateResiduals(glmm_cbpp)
plot(sim_resid_cbpp)

DHARMa simulated residual diagnostic plot for CBPP GLMM model

Assess GLMM model predictions

Now we can look at the estimated marginal means of CBPP probability for each of the time periods. Notice we use the argument type = 'response'. This uses the inverse link function to back-transform the mean estimates and the endpoints of the 95% confidence intervals from the log-odds scale to the probability scale. This is much easier to interpret. In SAS the corresponding option is ilink (inverse link) in lsmeans and slice statements.

emmeans(glmm_cbpp, pairwise ~ period, type = 'response', adjust = 'tukey')
$emmeans
 period   prob     SE  df asymp.LCL asymp.UCL
 1      0.1981 0.0367 Inf    0.1357     0.280
 2      0.0839 0.0236 Inf    0.0478     0.143
 3      0.0740 0.0224 Inf    0.0404     0.132
 4      0.0484 0.0196 Inf    0.0216     0.105

Confidence level used: 0.95 
Intervals are back-transformed from the logit scale 

$contrasts
 contrast          odds.ratio    SE  df null z.ratio p.value
 period1 / period2       2.70 0.817 Inf    1   3.272  0.0059
 period1 / period3       3.09 0.998 Inf    1   3.495  0.0027
 period1 / period4       4.85 2.049 Inf    1   3.743  0.0010
 period2 / period3       1.15 0.431 Inf    1   0.363  0.9837
 period2 / period4       1.80 0.835 Inf    1   1.267  0.5843
 period3 / period4       1.57 0.751 Inf    1   0.945  0.7807

P value adjustment: tukey method for comparing a family of 4 estimates 
Tests are performed on the log odds ratio scale 

We see that period 1 has an estimated marginal mean of 19.8% disease incidence while the estimated mean incidences for the other three periods are much lower. As a result, the odds ratio is significantly greater than 1 when comparing period 1 to any of the other three periods, but among periods 2, 3, and 4 the odds ratios are not significantly different from 1.

In the end we got similar estimates of the mean incidence by period, but the 95% confidence intervals are now all positive. With the GLMM, we no longer get nonsensical negative probabilities.
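
If you want to see exactly what the back-transformation is doing, you can request the estimated marginal means on the model (log-odds) scale by omitting type = 'response' and then apply plogis(), base R's inverse logit, yourself. This is just an optional sanity check; the first value should come out very close to the 0.198 reported above for period 1.

emm_logit_cbpp <- emmeans(glmm_cbpp, ~ period)

# plogis() inverts the logit link, recovering the probabilities
# that type = 'response' reports
plogis(summary(emm_logit_cbpp)$emmean)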

Fungicide example data analysis

The dataset

The following example dataset was published in the useful book Analysis of Generalized Linear Mixed Models in the Agricultural and Natural Resources Sciences, co-written by some of my ARS statistician colleagues. The SAS code presented in this section is adapted from the book.

Diseased cotton seedlings
Diseased cotton seedlings, photo by University of Missouri Extension

The example dataset comes from a study by C. S. Rothrock in which three different fungicides were applied to cotton plants to test how effectively they reduced seedling diseases, and whether the reduction persisted over time. There are four treatments (three fungicides labeled DACD, RT, and TSX, and an untreated control labeled None) applied in a randomized complete block design (RCBD) with five blocks. There were two rows per plot, but we will not consider the multiple rows in our analysis. Each plot contained a total of 400 plants. At three time points (12, 20, and 42 days after planting), the number of diseased plants out of 400 was recorded for each plot.

Because the same plot was measured three times, we cannot consider those three measurements to be independent data points. We must model the relationship between them. You may see this called “repeated measures” or “split plot in time.”

NOTE: SAS output is not shown for the code presented in this section. Feel free to try out the SAS code on your own!

SAS

filename csvFile2 url "https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/fungicide.csv";
proc import datafile = csvFile2 out = fungicide replace dbms = csv; run;

R

Import the data.

If you are running this on the cloud server:

fungicide <- read_csv('data/fungicide.csv')
Rows: 60 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): fungicide
dbl (7): block, plot, n_plants, time, n_healthy_row1, n_healthy_row2, n_healthy

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

If you are running this on your own machine:

fungicide <- read_csv('https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/fungicide.csv') 

Change the appropriate columns to factors. Set the untreated control, 'None', to the reference level for fungicide.

fungicide <- fungicide %>%
  mutate(
    block = factor(block),
    plot = factor(plot),
    time = factor(time),
    fungicide = relevel(factor(fungicide), ref = 'None')
  )

Exploratory plots

Let’s take a look at the data. First let’s make a plot with a point for each of the raw data values. In addition, a semi-transparent horizontal bar is placed at the median for each treatment at each time point; we use stat_summary() to calculate a summary statistic and specify the shape of the point. We use position_dodge() to horizontally separate the points for each treatment. (Inspiration for this plot comes from this post about ggplot2.) We can see from the plot that the number of healthy plants decreases over time. The no-fungicide control tends to have fewer healthy plants than the three treated groups, but no big difference between the three fungicides jumps out at you.

PROTIP: A colorblind-friendly color scheme is used here because the default ggplot2 colors are not easily distinguishable by red-green colorblind people.

R

ggplot(fungicide, aes(x = time, color = fungicide, group = fungicide, y = n_healthy)) +
  geom_point(size = 2, position = position_dodge(width = .5)) +
  stat_summary(fun = 'median', geom = 'point', shape = 95, size = 10, position = position_dodge(width = .5), alpha = .5) +
  theme_bw() +
  scale_color_okabeito(order = c(9, 1, 2, 3))

plot of fungicide data showing variation due to treatment and time

To visualize whether there is an effect of block, plot the same data in a different way. Now, color indicates block instead of treatment. Fungicide treatments are split into different panels with facet_wrap(). Also, observations from the same plot over time are connected with lines to help visualize the trends. From this plot, it looks like there is some effect of block. Block 5 tends to have more healthy plants at all time points regardless of treatment, and blocks 1 and 2 have somewhat fewer. Any model that we fit is going to have to account for this block-level difference; doing so will make it easier to pull out the overall treatment effect while accounting for random variation among the blocks.

ggplot(fungicide, aes(x = time, color = block, y = n_healthy)) +
  geom_line(aes(group = block)) +
  geom_point(size = 2) +
  facet_wrap(~ fungicide) +
  theme_bw() +
  scale_color_okabeito()

plot of fungicide data showing variation due to block

SAS

Similar plots to the ones created above in R can be made with the following SAS code (output not shown).

proc sgplot data = fungicide;
    vbox n_healthy / category=time group=fungicide groupdisplay=cluster;
run;

proc sgpanel data = fungicide;
    panelby fungicide;
    vline time / response=n_healthy group=block;
run;

Fit models

R

To fit models to this dataset, we are going to need to use a more advanced linear mixed modeling package, glmmTMB. This package allows us to fit more complex covariance structures to the data.

Here’s how we will set up the model formula.

  • Our response variable is two column vectors side-by-side: cbind(n_healthy, n_plants - n_healthy).
    • The first value is the number of “successes” or healthy plants.
    • The second value is the number of “failures” or the total number of plants minus the number of healthy plants.
  • Our fixed effects are fungicide, time, and their interaction: fungicide + time + fungicide:time
  • There is a random intercept term for each block: (1 | block)
  • Next, we want to account for the repeated measures of plots within blocks. We are going to try three different ways of doing this, each of which represents a different way that correlation can exist between different time points in the same plot.
    • First, we will try an unstructured covariance model, with a separately estimated covariance parameter for each pair of time points.
    • Second, we will try a compound symmetry model, which fixes the covariance parameter for each pair of time points at a single value. Measurements from the same plot at different times are modeled as being correlated, but the correlation is equally strong no matter how far apart in time the time points are.
    • Finally, we will try a first-order autoregressive or AR(1) model, which has just as few parameters as the compound symmetry model: a single correlation parameter relates each time point to the one immediately preceding it, and the correlation between more distant time points falls off as powers of that parameter. In this model, measurements from the same plot at different times are modeled as being correlated, with stronger correlations for points closer together in time.

Note we use 0 + time in these covariance structure terms to exclude the intercept, so that a variance is estimated directly for each time point (rather than for an intercept plus contrasts among time points).

fit_us <- glmmTMB(cbind(n_healthy, n_plants - n_healthy) ~ fungicide + time + fungicide:time + (1 | block) + us(0 + time | fungicide:block), data = fungicide, family = binomial)
Warning in finalizeTMB(TMBStruc, obj, fit, h, data.tmb.old): Model convergence
problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')
Warning in finalizeTMB(TMBStruc, obj, fit, h, data.tmb.old): Model convergence
problem; false convergence (8). See vignette('troubleshooting'),
help('diagnose')
fit_cs <- glmmTMB(cbind(n_healthy, n_plants - n_healthy) ~ fungicide + time + fungicide:time + (1 | block) + cs(0 + time | fungicide:block), data = fungicide, family = binomial)

fit_ar1 <- glmmTMB(cbind(n_healthy, n_plants - n_healthy) ~ fungicide + time + fungicide:time + (1 | block) + ar1(0 + time | fungicide:block), data = fungicide, family = binomial)

As it turns out, the unstructured covariance model does not even converge. This is because it has too many parameters for the small dataset we are fitting the model to. When you get a warning like that, it means the model predictions aren’t valid. Resist the temptation to look at them!
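
The warning itself points you toward glmmTMB's troubleshooting tools. For example, the package's diagnose() function tries to identify which parameters are causing the convergence problem, which can help you decide how to simplify the model. This is an optional check:

# Investigate why the unstructured covariance model failed to converge
diagnose(fit_us)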

The compound symmetry model and the AR1 model are a little bit less flexible but have fewer parameters. This is a good thing because it helps us avoid overfitting on our relatively small dataset. We do not get any warnings about model convergence with those simpler models.
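
If you are curious what the two converged models actually estimated, VarCorr() prints the random-effect standard deviations and the correlations between time points for a fitted glmmTMB model. This is an optional look under the hood, not a required step in the workflow.

# Random-effect standard deviations and time-point correlations
VarCorr(fit_cs)
VarCorr(fit_ar1)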

Residual diagnostics and model comparison

Let’s check model diagnostics for the CS and AR1 models. Again we need to simulate residuals from the fitted model objects. The residual plots look great.

sim_resid_csmodel <- simulateResiduals(fit_cs)
plot(sim_resid_csmodel)

DHARMa simulated residual diagnostic plot for fungicide model

sim_resid_ar1model <- simulateResiduals(fit_ar1)
plot(sim_resid_ar1model)

DHARMa simulated residual diagnostic plot for fungicide model

Let’s compare the AIC (Akaike Information Criterion) of the CS and AR(1) models. AIC measures how well the model fits the data but assesses a penalty for having too many parameters. Lower means a better fit, accounting for potential overfitting.

AIC(fit_cs, fit_ar1)
        df      AIC
fit_cs  17 497.4627
fit_ar1 15 496.8801

We see that the two models have nearly identical AIC values. Thus, we will use the autoregressive model due to its greater simplicity, as indicated by the lower number of degrees of freedom. The difference is small enough that either one is probably acceptable.

Assess model predictions

Now, we can look at the results. There are a few different marginal means we can estimate. We can look at the effect of fungicide averaged across times as well as compare the effect of fungicide within each time point.

emm_fung <- emmeans(fit_ar1, pairwise ~ fungicide, type = 'response', adjust = 'tukey')
NOTE: Results may be misleading due to involvement in interactions
emm_fung_by_time <- emmeans(fit_ar1, pairwise ~ fungicide | time, type = 'response', adjust = 'tukey')

emm_fung
$emmeans
 fungicide  prob     SE  df asymp.LCL asymp.UCL
 None      0.734 0.0159 Inf     0.702     0.764
 DACD      0.789 0.0137 Inf     0.761     0.815
 RT        0.754 0.0152 Inf     0.723     0.782
 TSX       0.771 0.0145 Inf     0.741     0.798

Results are averaged over the levels of: time 
Confidence level used: 0.95 
Intervals are back-transformed from the logit scale 

$contrasts
 contrast    odds.ratio     SE  df null z.ratio p.value
 None / DACD      0.737 0.0555 Inf    1  -4.047  0.0003
 None / RT        0.904 0.0674 Inf    1  -1.356  0.5273
 None / TSX       0.823 0.0617 Inf    1  -2.594  0.0468
 DACD / RT        1.226 0.0926 Inf    1   2.694  0.0356
 DACD / TSX       1.117 0.0847 Inf    1   1.455  0.4649
 RT / TSX         0.911 0.0685 Inf    1  -1.239  0.6020

Results are averaged over the levels of: time 
P value adjustment: tukey method for comparing a family of 4 estimates 
Tests are performed on the log odds ratio scale 
emm_fung_by_time
$emmeans
time = 12:
 fungicide  prob     SE  df asymp.LCL asymp.UCL
 None      0.779 0.0168 Inf     0.744     0.810
 DACD      0.819 0.0148 Inf     0.788     0.846
 RT        0.794 0.0161 Inf     0.761     0.824
 TSX       0.815 0.0150 Inf     0.784     0.843

time = 20:
 fungicide  prob     SE  df asymp.LCL asymp.UCL
 None      0.721 0.0192 Inf     0.682     0.757
 DACD      0.779 0.0168 Inf     0.745     0.810
 RT        0.735 0.0187 Inf     0.696     0.769
 TSX       0.758 0.0177 Inf     0.722     0.791

time = 42:
 fungicide  prob     SE  df asymp.LCL asymp.UCL
 None      0.699 0.0199 Inf     0.659     0.737
 DACD      0.767 0.0173 Inf     0.732     0.799
 RT        0.728 0.0189 Inf     0.690     0.764
 TSX       0.732 0.0187 Inf     0.694     0.767

Confidence level used: 0.95 
Intervals are back-transformed from the logit scale 

$contrasts
time = 12:
 contrast    odds.ratio     SE  df null z.ratio p.value
 None / DACD      0.775 0.0836 Inf    1  -2.358  0.0853
 None / RT        0.912 0.0970 Inf    1  -0.871  0.8201
 None / TSX       0.798 0.0858 Inf    1  -2.103  0.1522
 DACD / RT        1.176 0.1277 Inf    1   1.490  0.4438
 DACD / TSX       1.029 0.1129 Inf    1   0.257  0.9941
 RT / TSX         0.875 0.0948 Inf    1  -1.233  0.6055

time = 20:
 contrast    odds.ratio     SE  df null z.ratio p.value
 None / DACD      0.733 0.0760 Inf    1  -2.994  0.0146
 None / RT        0.935 0.0954 Inf    1  -0.662  0.9114
 None / TSX       0.823 0.0847 Inf    1  -1.889  0.2327
 DACD / RT        1.275 0.1327 Inf    1   2.335  0.0903
 DACD / TSX       1.123 0.1178 Inf    1   1.109  0.6841
 RT / TSX         0.881 0.0909 Inf    1  -1.228  0.6090

time = 42:
 contrast    odds.ratio     SE  df null z.ratio p.value
 None / DACD      0.705 0.0724 Inf    1  -3.400  0.0038
 None / RT        0.867 0.0879 Inf    1  -1.412  0.4916
 None / TSX       0.850 0.0863 Inf    1  -1.600  0.3787
 DACD / RT        1.229 0.1271 Inf    1   1.990  0.1915
 DACD / TSX       1.205 0.1248 Inf    1   1.804  0.2712
 RT / TSX         0.981 0.1003 Inf    1  -0.187  0.9977

P value adjustment: tukey method for comparing a family of 4 estimates 
Tests are performed on the log odds ratio scale 

The above results show that the fungicides’ effect on increasing the probability of healthy seedlings is modest at best. The marginal probability of healthy plants in the untreated control is 73.4%, and the fungicides increase this by a few percentage points. Averaged across all three time points, the odds ratios comparing the untreated control to DACD and to TSX are significantly less than 1, indicating some level of effectiveness. However, we cannot reject the null hypothesis that RT has the same odds of healthy plants as the control.

Separating by time point, we can see that the estimated probability of healthy plants decreases over time. There is modest evidence for an interaction between fungicide and time: the DACD fungicide is generally the most effective and RT is generally ineffective, and the advantage of DACD over the untreated control grows over time (the None / DACD odds ratio moves further from 1 and is clearly significant only at the later time points).
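
If you want a formal omnibus test of the fungicide-by-time interaction to go along with these pairwise comparisons, one option (not shown in the SAS code above) is the joint_tests() function from the emmeans package, which produces F tests analogous to PROC GLIMMIX's Type III tests of fixed effects. This is a sketch of how you might call it:

# Omnibus tests of the fixed effects, including the fungicide:time interaction
joint_tests(fit_ar1)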

Let’s also make some plots to better visualize the model predictions. We will use the cld() function from the multcomp package to summarize the multiple comparisons with letters, annotating the means and 95% confidence interval error bars with those letters. The multiple comparisons are Tukey-adjusted, but because the Tukey adjustment only applies to pairwise comparisons, a Sidak adjustment is applied to the 95% confidence intervals instead, as you can tell from the note that displays when the code runs. That’s why the intervals shown on this plot are wider than the ones in the table above. You can print the CLD object to see the adjusted intervals.

The results averaged across time points and within each time point are combined into a single data frame to produce a multi-panel plot.

cld_fung <- cld(emm_fung$emmeans, adjust = 'tukey', Letters = letters)
Note: adjust = "tukey" was changed to "sidak"
because "tukey" is only appropriate for one set of pairwise comparisons
cld_fung_by_time <- cld(emm_fung_by_time$emmeans, adjust = 'tukey', Letters = letters)
Note: adjust = "tukey" was changed to "sidak"
because "tukey" is only appropriate for one set of pairwise comparisons
plot_data <- bind_rows(as_tibble(cld_fung), as_tibble(cld_fung_by_time)) %>%
  mutate(time = fct_na_value_to_level(time, 'overall') %>%
           factor(labels = c(paste(c(12, 20, 42), 'days after planting'), 'overall')))

ggplot(plot_data, aes(x = fungicide, y = prob, ymin = asymp.LCL, ymax = asymp.UCL, label = trimws(.group))) +
  geom_errorbar(linewidth = 0.9, width = 0.2) +
  geom_point(size = 3) +
  geom_text(aes(y  = asymp.UCL), vjust = -0.5, size = 5) +
  facet_wrap(~ time) +
  theme_bw() +
  scale_y_continuous(name = 'probability of healthy plant', limits = c(0.6, 0.9))

estimated marginal mean plot of fungicide by time

SAS

Similar models to the models fit above in R can be fit, and means comparisons evaluated, with the following code (output not shown). Because method = laplace, a maximum-likelihood method, is specified, AIC values are produced with the default output. The unstructured covariance model also fails to converge, just like we observed using R.

/* Unstructured covariance */
proc glimmix data = fungicide method = laplace;
    class block fungicide time;
    model n_healthy/n_plants = fungicide time fungicide*time;
    random intercept / subject = block;
    random time / subject = fungicide(block) type = un;
    lsmeans fungicide / ilink cl diff lines adjust = tukey;
    slice fungicide*time / sliceby = time ilink cl diff lines adjust = tukey;
run;

/* Compound symmetry */
proc glimmix data = fungicide method = laplace;
    class block fungicide time;
    model n_healthy/n_plants = fungicide time fungicide*time;
    random intercept / subject = block;
    random time / subject = fungicide(block) type = cs;
    lsmeans fungicide / ilink cl diff lines adjust = tukey;
    slice fungicide*time / sliceby = time ilink cl diff lines adjust = tukey;
run;

/* AR1 */
proc glimmix data = fungicide method = laplace;
    class block fungicide time;
    model n_healthy/n_plants = fungicide time fungicide*time;
    random intercept / subject = block;
    random time / subject = fungicide(block) type = ar(1);
    lsmeans fungicide / ilink cl diff lines adjust = tukey;
    slice fungicide*time / sliceby = time ilink cl diff lines adjust = tukey;
run;

Conclusion

What have you learned in this lesson?

  • What GLMMs are and how link functions work
  • How to fit GLMMs to repeated measures binomial data
  • How to fit GLMMs with different error covariance structures
  • How to compare means from GLMMs and plot them

Nice job! Pat yourselves on the back once again!

Exercises

As in the previous lessons, here are some exercises for additional practice. Also as in the previous lessons, these exercises give you some SAS code and ask you to write some R code that does the same job or close to it. For this set of exercises, we will use another dataset from the agridat package: repeated measures of lettuce weights originally collected and analyzed by Levi Pederson for his MS thesis.

The lettuce plants were grown hydroponically and were randomly assigned to three different density treatments: treatment 1 had the lowest density, treatment 2 intermediate, and treatment 3 the highest. Plants were weighed twice a week for 40 days. More details can be found in the help documentation: ?pederson.lettuce.repeated.

Load the data like this:

library(agridat)
data('pederson.lettuce.repeated')

Here are trends of weight over time for the different plants. I’ve also plotted a line with the median weight by day for each treatment.

ggplot(pederson.lettuce.repeated, aes(x = day, y = weight, color = trt)) +
  geom_line(aes(group = plant), linewidth = .5, alpha = .5) +
  stat_summary(fun = 'median', geom = 'line', linewidth = 1.5) +
  theme_bw() +
  scale_color_okabeito()

Line charts of weight over time for each plant, colored by treatment, with medians

Here is a histogram of the weight data across all treatments and time points.

ggplot(pederson.lettuce.repeated, aes(x = weight)) + 
  geom_histogram() +
  theme_bw() +
  scale_y_continuous(expand = expansion(add = c(0, 1)))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Histogram of all lettuce weight data points

Exercise 1

Think about a model formula to analyze these data.

    1. What is the response variable?
    2. What should the fixed effect(s) in the model be?
    3. How are the repeated measures grouped?

Exercise 2

Use glmmTMB() to fit a model to this dataset. Use compound symmetry covariance structure to account for the repeated measures.

In SAS, the class, model, and random statements for a similar model fit with PROC GLIMMIX would be:

class plant trt day;
model weight = trt day trt*day;
random day / subject=plant residual type=cs;
  • Hint: The random effects part of the model formula will be ...(0 + day | plant) but what should you put in place of the ...?

Exercise 3

Simulate residuals and plot the simulated residual diagnostics. What do you notice?

  • Hint: Use the function simulateResiduals().

Exercise 4

Fit the model again with family = 'lognormal' to account for the positive, right-skewed distribution of weights seen in the histogram. Simulate residuals again and plot. What do you notice now?

In SAS, the model statement would be:

model weight = trt day trt*day / dist=lognormal;

Exercise 5

Fit the model again with the lognormal response distribution and AR(1) covariance structure. Check residual diagnostics. Use AIC to compare this model with the compound symmetry lognormal model you fit in the previous exercise. What can you conclude?

In SAS, the model statement would be:

random day / subject=plant residual type=ar(1);
  • Hint: 0 + factor(day) must be in your random effects specification.

Exercise 6

Use the following code to estimate treatment means at day 1, day 21, and day 40, for the best model you identified so far. Replace the ... with the name of the fitted model object.

trt_means_by_day <- emmeans(..., ~ trt | day, at = list(day = c(1, 21, 40)), type = 'response')

Now use cld with Tukey adjustment to determine whether the difference between the means of any pair of treatments is significant at any of the time points. What can you conclude?

In SAS, the corresponding statement, which would produce means and comparisons for all days, would be:

slice trt*day / sliceby=day diff cl ilink linestable adjust=tukey; 

Click here for answers