Welcome back to the R for SAS users workshop! This workshop is intended for SAS users who want to learn R. The people who will get the most out of this course are practicing researchers who have a decent working knowledge of SAS, and of basic statistical analysis (descriptive stats and regression models) as it applies to their field.
This is lesson 3 of 3 in a series. Lesson 1 covered the basics: importing data, cleaning and reshaping data, summary statistics, simple graphs and tables, and a few simple statistical models. Lesson 2 got a little more advanced, covering linear mixed models for more sophisticated experimental designs and how to produce and compare estimated marginal means.
Download the worksheet for this lesson here.
IMPORTANT NOTE: In this lesson, the numerical results of the R and SAS code may no longer be identical, as they were in previous lessons. This is because different fitting algorithms are used by SAS PROC GLIMMIX and by the R model fitting packages that we are demonstrating. A full discussion of these differences is outside the scope of this lesson!
During this workshop, you will learn how to fit generalized linear mixed models (GLMMs) in R and compare the results with SAS PROC GLIMMIX. As in Lessons 1 and 2, we will work through a “data to doc” pipeline in R, comparing R code to SAS code each step of the way. We will use yet more new datasets. As in the previous lessons, we will start with raw data and work our way to a finished product. Hopefully this is becoming second nature to you by now!
Here we’ll load the R packages we are going to work with today. These are mostly the same as in the previous lessons: the tidyverse package for reading, manipulating, and plotting data, the lme4 package for fitting linear mixed models, the easystats package for model diagnostic plots, and the emmeans and multcomp packages for estimating and comparing marginal means. New today are glmmTMB, a more advanced mixed model fitting package, and DHARMa, which provides residual diagnostic plots for GLMMs. We also set a default plotting theme.
library(tidyverse)
library(lme4)
library(easystats)
library(emmeans)
library(multcomp)
library(glmmTMB)
library(DHARMa)
theme_set(theme_bw())
The first dataset we will use for this lesson is the
cbpp
or contagious bovine pleuropneumonia dataset. It is
pre-loaded with the lme4 package. The number of
Ethiopian zebu cattle that developed the disease in each herd, and the
total number of cattle in the herd, are recorded for each of four time
periods. The herds (1-15) are identified with numerical IDs, and the
time periods are identified by the integers 1-4. Note the
period
column is already a factor
variable
when you examine the pre-loaded dataset.
Zebu, photo by Scott Bauer, USDA-ARS
PROTIP: You can call
?cbpp
in the R console to see documentation for the dataset. This can be done for any example dataset that comes pre-loaded with a package.
Note I’ve made a copy of the cbpp
dataset as a CSV file
so that you can import the same dataset into SAS if you want to follow
along with the SAS code.
filename csvFile url "https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/cbpp.csv";
proc import datafile = csvFile out = cbpp replace dbms = csv; run;
data cbpp; set cbpp;
rate = incidence/size;
run;
data(cbpp)
glimpse(cbpp)
Rows: 56
Columns: 4
$ herd <fct> 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, …
$ incidence <dbl> 2, 3, 4, 0, 3, 1, 1, 8, 2, 0, 2, 2, 0, 2, 0, 5, 0, 0, 1, 3, …
$ size <dbl> 14, 12, 9, 5, 22, 18, 21, 22, 16, 16, 20, 10, 10, 9, 6, 18, …
$ period <fct> 1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, …
Let’s make a plot of the CBPP data. We will plot the proportion
incidence/size
for each herd at each time period.
proc sgplot data = cbpp noautolegend;
styleattrs datacontrastcolors = (black) datalinepatterns = (solid);
series x = period y = rate / group = herd;
run;
ggplot(cbpp, aes(x = period, y = incidence/size, group = herd)) +
geom_line() +
theme_bw()
We can see from the above that the proportion of cows in each herd that show disease symptoms varies quite a lot between herds and over time.
As we’ve learned in the previous lessons, if we have non-independence in our measurements, we have to account for that using a linear mixed model. We definitely have that in this case because the same herds were sampled four times, so measurements from period 1-4 from the same herd are not independent of one another.
So here’s a linear mixed model fit using lmer()
(SAS
code not presented here). The response variable is the proportion of
cattle suffering from CBPP, incidence/size
, the fixed
effect is the time period treated as a discrete categorical variable,
period
, and there is a random intercept for each herd,
(1 | herd)
.
lmm_cbpp <- lmer(incidence/size ~ period + (1 | herd), data = cbpp)
So far so good, right?
The model diagnostic plots also look reasonably good for a small dataset.
check_model(lmm_cbpp)
Let’s get the model’s estimate of the mean and 95% confidence interval of proportion incidence of CBPP for each of the four periods.
emmeans(lmm_cbpp, ~ period)
period emmean SE df lower.CL upper.CL
1 0.2198 0.0323 52 0.1550 0.285
2 0.0741 0.0335 52 0.0069 0.141
3 0.0894 0.0335 52 0.0222 0.157
4 0.0413 0.0348 52 -0.0286 0.111
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
The estimated marginal means decrease over time from period 1 (0.22) to period 4 (0.04). But the 95% confidence limits indicate a problem. The lower 95% confidence limit on the mean disease incidence proportion is negative! It is not possible to have a proportion less than 0 or greater than 1. Our assumption of normally distributed error around the mean proportions isn’t valid.
This is where we need to use generalized linear mixed models (GLMMs). What GLMMs do is use a link function to essentially convert the response variable onto a scale where its error actually is close enough to normal. Then, from the model’s perspective, there’s no problem. We can treat it as if it were a linear mixed model, and then just undo the transformation afterward so that we can interpret the estimates of model means and confidence intervals on the original data scale.
Here, our response is binary (either a cow has the disease or it doesn’t), so we can model it with a generalized linear model. What we are predicting is the probability that an individual cow from a given herd in a given time period will have CBPP. Probability is bounded between 0 and 1, so it cannot really have normally distributed error: normal distributions (bell curves) can take any value, positive or negative, which doesn’t play well with a bounded range. That’s why we use a link function to convert the predicted probability onto a scale that can be normally distributed.
The formula for predicting the probability of disease for individual \(i\) at time period \(j\) in herd \(k\) is as follows:
\[\text{logit}~\hat{y}_{i,j,k} = \beta_0 + P_j + u_k\]
In this formula, \(\hat{y}_{i,j,k}\) is the predicted probability of disease for individual \(i\) at time period \(j\) in herd \(k\). \(\beta_0\) is the intercept, \(P_j\) is the fixed effect of time period \(j\), and \(u_k\) is the random effect, a separate intercept for each herd. You can see the model assumes that every individual in the same time period in the same herd has the same probability of having CBPP.
But what is that “logit” thing? That is the link function. It is used to transform the probability, which ranges from 0 to 1 and therefore can’t really have normally distributed error, to a scale that can take any value, positive or negative. Logit is another word for the log-odds function, defined as
\[\text{logit}~p = \log \frac {p}{1-p}\]
Here is a graph of that function:
You can see that it maps values ranging from 0 to 1 to values ranging from negative to positive infinity. It is more or less a straight line when the probability is between about 0.25 and 0.75, but gets very steep as you approach the boundaries. This makes sense: if an event has an extremely low (or high) chance of happening, a small absolute change in its probability represents a relatively much bigger difference.
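If you’d like to draw this curve yourself, here is one way to do it with ggplot2 and qlogis(), R’s built-in logit function (the data frame name logit_data is just for this example):
# The logit (log odds) function for probabilities between 0 and 1.
# qlogis() is the logit; plogis() is its inverse.
logit_data <- tibble(p = seq(0.001, 0.999, by = 0.001))
ggplot(logit_data, aes(x = p, y = qlogis(p))) +
  geom_line() +
  labs(x = 'probability', y = 'log odds (logit)')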
To use statistical jargon, what we are going to do here is fit a binomial GLMM (binomial meaning our response variable is a binary outcome) with a logit link function to transform the predicted probabilities onto the log odds scale.
proc glimmix data = cbpp;
class herd period;
model incidence/size = period / dist = binomial;
random herd;
weight size;
lsmeans period / ilink cl diff oddsratio adjust = tukey;
run;
| Model Information | |
|---|---|
| Data Set | WORK.CBPP |
| Response Variable (Events) | incidence |
| Response Variable (Trials) | size |
| Response Distribution | Binomial |
| Link Function | Logit |
| Variance Function | Default |
| Weight Variable | size |
| Variance Matrix | Not blocked |
| Estimation Technique | Residual PL |
| Degrees of Freedom Method | Containment |
| Class Level Information | | |
|---|---|---|
| Class | Levels | Values |
| herd | 15 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
| period | 4 | 1 2 3 4 |
| Number of Observations Read | 56 |
|---|---|
| Number of Observations Used | 56 |
| Number of Events | 99 |
| Number of Trials | 842 |
| Dimensions | |
|---|---|
| G-side Cov. Parameters | 1 |
| Columns in X | 5 |
| Columns in Z | 15 |
| Subjects (Blocks in V) | 1 |
| Max Obs per Subject | 56 |
| Optimization Information | |
|---|---|
| Optimization Technique | Dual Quasi-Newton |
| Parameters in Optimization | 1 |
| Lower Boundaries | 1 |
| Upper Boundaries | 0 |
| Fixed Effects | Profiled |
| Starting From | Data |
| Iteration History | | | | | |
|---|---|---|---|---|---|
| Iteration | Restarts | Subiterations | Objective Function | Change | Max Gradient |
| 0 | 0 | 5 | 763.01311438 | 0.33363629 | 0.000019 |
| 1 | 0 | 3 | 927.32730767 | 0.09156786 | 1.469E-6 |
| 2 | 0 | 1 | 957.93604425 | 0.00294224 | 9.108E-6 |
| 3 | 0 | 1 | 958.76611047 | 0.00000413 | 2.345E-8 |
| 4 | 0 | 0 | 958.76678537 | 0.00000000 | 4.042E-8 |

Convergence criterion (PCONV=1.11022E-8) satisfied.
| Fit Statistics | |
|---|---|
| -2 Res Log Pseudo-Likelihood | 958.77 |
| Generalized Chi-Square | 926.53 |
| Gener. Chi-Square / DF | 17.82 |
| Covariance Parameter Estimates | | |
|---|---|---|
| Cov Parm | Estimate | Standard Error |
| herd | 0.7536 | 0.2899 |
| Type III Tests of Fixed Effects | | | | |
|---|---|---|---|---|
| Effect | Num DF | Den DF | F Value | Pr > F |
| period | 3 | 38 | 75.66 | <.0001 |
| period Least Squares Means | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| period | Estimate | Standard Error | DF | t Value | Pr > \|t\| | Alpha | Lower | Upper | Mean | Standard Error Mean | Lower Mean | Upper Mean |
| 1 | -1.5248 | 0.2283 | 38 | -6.68 | <.0001 | 0.05 | -1.9870 | -1.0626 | 0.1788 | 0.03352 | 0.1206 | 0.2568 |
| 2 | -2.1217 | 0.2331 | 38 | -9.10 | <.0001 | 0.05 | -2.5935 | -1.6498 | 0.1070 | 0.02227 | 0.06956 | 0.1611 |
| 3 | -2.5721 | 0.2382 | 38 | -10.80 | <.0001 | 0.05 | -3.0543 | -2.0899 | 0.07096 | 0.01570 | 0.04503 | 0.1101 |
| 4 | -2.6697 | 0.2441 | 38 | -10.94 | <.0001 | 0.05 | -3.1640 | -2.1755 | 0.06478 | 0.01479 | 0.04054 | 0.1020 |
| Differences of period Least Squares Means, Adjustment for Multiple Comparisons: Tukey-Kramer | | | | | | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| period | _period | Estimate | Standard Error | DF | t Value | Pr > \|t\| | Adj P | Alpha | Lower | Upper | Adj Lower | Adj Upper | Odds Ratio | Lower Confidence Limit for Odds Ratio | Upper Confidence Limit for Odds Ratio | Adj Lower Odds Ratio | Adj Upper Odds Ratio |
| 1 | 2 | 0.5969 | 0.07424 | 38 | 8.04 | <.0001 | <.0001 | 0.05 | 0.4466 | 0.7472 | 0.3974 | 0.7963 | 1.816 | 1.563 | 2.111 | 1.488 | 2.217 |
| 1 | 3 | 1.0473 | 0.08898 | 38 | 11.77 | <.0001 | <.0001 | 0.05 | 0.8672 | 1.2274 | 0.8083 | 1.2864 | 2.850 | 2.380 | 3.412 | 2.244 | 3.620 |
| 1 | 4 | 1.1450 | 0.1032 | 38 | 11.09 | <.0001 | <.0001 | 0.05 | 0.9360 | 1.3539 | 0.8677 | 1.4222 | 3.142 | 2.550 | 3.872 | 2.381 | 4.146 |
| 2 | 3 | 0.4504 | 0.09755 | 38 | 4.62 | <.0001 | 0.0002 | 0.05 | 0.2530 | 0.6479 | 0.1884 | 0.7125 | 1.569 | 1.288 | 1.912 | 1.207 | 2.039 |
| 2 | 4 | 0.5481 | 0.1111 | 38 | 4.93 | <.0001 | <.0001 | 0.05 | 0.3231 | 0.7730 | 0.2496 | 0.8466 | 1.730 | 1.381 | 2.166 | 1.283 | 2.332 |
| 3 | 4 | 0.09764 | 0.1219 | 38 | 0.80 | 0.4281 | 0.8535 | 0.05 | -0.1491 | 0.3444 | -0.2299 | 0.4251 | 1.103 | 0.861 | 1.411 | 0.795 | 1.530 |
In R, we use the glmer()
function instead of
lmer()
in this case. We specify a formula in the same way
as we have been doing with lmer()
, and the data frame from
which the variables come, but we now have a new argument,
family
. This refers to the “family” of response
distributions that we are going to be using. Here we specify
binomial(link = 'logit')
. Other link functions are possible
to use with binomial GLMMs, but we will not get into that today.
Also notice weights = size
which accounts for the fact
that the herds are different sizes; a bigger herd is contributing more
data to our estimate of the incidence of CBPP over time.
glmm_cbpp <- glmer(incidence/size ~ period + (1 | herd), weights = size, data = cbpp, family = binomial(link = 'logit'))
To check this model, it no longer makes sense to look at the residual
diagnostic plots from check_model()
. We aren’t going to get
normally distributed residuals. However, there are still issues we
potentially need to check for. The R package DHARMa
allows you to simulate residuals from a fitted GLMM object and test them
for a few different things. Here we see a normal Q-Q plot of the
simulated residuals which shows no issues with deviation from normality
(K-S test), overdispersion, or outliers. The homogeneity of variance
plot on the right also shows no difference in the variance of the
residuals between the four treatments.
sim_resid_cbpp <- simulateResiduals(glmm_cbpp)
plot(sim_resid_cbpp)
Now we can look at the estimated marginal means of CBPP probability
for each of the time periods. Notice we use the argument
type = 'response'
. This uses the inverse link function to
back-transform the mean estimates and the endpoints of the 95%
confidence intervals from the log-odds scale to the probability scale.
This is much easier to interpret. In SAS the corresponding option is
ilink
(inverse link) in lsmeans
and
slice
statements.
emmeans(glmm_cbpp, pairwise ~ period, type = 'response', adjust = 'tukey')
$emmeans
period prob SE df asymp.LCL asymp.UCL
1 0.1981 0.0367 Inf 0.1357 0.280
2 0.0839 0.0236 Inf 0.0478 0.143
3 0.0740 0.0224 Inf 0.0404 0.132
4 0.0484 0.0196 Inf 0.0216 0.105
Confidence level used: 0.95
Intervals are back-transformed from the logit scale
$contrasts
contrast odds.ratio SE df null z.ratio p.value
period1 / period2 2.70 0.817 Inf 1 3.272 0.0059
period1 / period3 3.09 0.998 Inf 1 3.495 0.0027
period1 / period4 4.85 2.049 Inf 1 3.743 0.0010
period2 / period3 1.15 0.431 Inf 1 0.363 0.9837
period2 / period4 1.80 0.835 Inf 1 1.267 0.5843
period3 / period4 1.57 0.751 Inf 1 0.945 0.7807
P value adjustment: tukey method for comparing a family of 4 estimates
Tests are performed on the log odds ratio scale
We see that period 1 has an estimated marginal mean of 19.8% disease incidence while the estimated mean incidences for the other three periods are much lower. As a result, the odds ratio is significantly greater than 1 when comparing period 1 to any of the other three periods, but among periods 2, 3, and 4 the odds ratios are not significantly different from 1.
In the end we got similar estimates of the mean incidence by period, but the 95% confidence intervals are now all positive. With the GLMM, we no longer get nonsensical negative probabilities.
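If you want to see explicitly what that back-transformation is doing, here is an optional check: get the marginal means on the log odds scale, then apply the inverse logit function plogis() by hand (the object name emm_cbpp_link is just for this example). The results should match the prob column above.
# Estimated marginal means on the link (log odds) scale
emm_cbpp_link <- emmeans(glmm_cbpp, ~ period)
# Back-transform the link-scale means to probabilities with the inverse logit
plogis(as.data.frame(emm_cbpp_link)$emmean)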
The following example dataset was published in the useful book Analysis of Generalized Linear Mixed Models in the Agricultural and Natural Resource Sciences, co-written by some of my ARS statistician colleagues. The SAS code presented in this section is adapted from the book.
Diseased cotton seedlings, photo by University of Missouri
Extension
The example dataset comes from a study by C. S. Rothrock where three
different fungicides were applied to cotton plants to test how
effectively they reduced seedling diseases, and whether the reduction
persisted over time. There are four treatments (three fungicides labeled
DACD
, RT
, and TSX
and untreated
control labeled None
) applied in a randomized complete
block design (RCBD) with five blocks. There were two rows per plot but
we will not consider the multiple rows in our analysis. In each plot
there were a total of 400 plants. At three time points (12, 20, and 42
days after planting) the number of diseased plants out of 400 was
recorded for each plot.
Because the same plot was measured three times, we cannot consider those three measurements to be independent data points. We must model the relationship between them. You may see this called “repeated measures” or “split plot in time.”
NOTE: SAS output is not shown for the code presented in this section. Feel free to try out the SAS code on your own!
filename csvFile2 url "https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/fungicide.csv";
proc import datafile = csvFile2 out = fungicide replace dbms = csv; run;
Import the data.
If you are running this on the cloud server:
fungicide <- read_csv('data/fungicide.csv')
Rows: 60 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): fungicide
dbl (7): block, plot, n_plants, time, n_healthy_row1, n_healthy_row2, n_healthy
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
If you are running this on your own machine:
fungicide <- read_csv('https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/fungicide.csv')
Change the appropriate columns to factors. Set the untreated control,
'None'
, to the reference level for fungicide.
fungicide <- fungicide %>%
mutate(
block = factor(block),
plot = factor(plot),
time = factor(time),
fungicide = relevel(factor(fungicide), ref = 'None')
)
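If you want to double-check that the releveling worked, you can optionally print the factor levels; the reference level, 'None', should be listed first.
# The first level printed is the reference level
levels(fungicide$fungicide)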
Let’s take a look at the data. First let’s make a plot with a point
for each of the raw data values. In addition, a semi-transparent
horizontal bar is placed at the median for each treatment at each time
point; we use stat_summary()
to calculate a summary
statistic and specify the shape of the point. We use
position_dodge()
to horizontally separate the points for
each treatment. (Inspiration for this plot comes from this post about
ggplot2.) We can see from the plot that the number of healthy plants
decreases over time. The no-fungicide control tends to have fewer
healthy plants than the three treated groups, but no big difference
between the three fungicides jumps out at you.
PROTIP: A colorblind-friendly color scheme is used here because the default
ggplot2
colors are not easily distinguishable by red-green colorblind people.
ggplot(fungicide, aes(x = time, color = fungicide, group = fungicide, y = n_healthy)) +
geom_point(size = 2, position = position_dodge(width = .5)) +
stat_summary(fun = 'median', geom = 'point', shape = 95, size = 10, position = position_dodge(width = .5), alpha = .5) +
theme_bw() +
scale_color_okabeito(order = c(9, 1, 2, 3))
To visualize whether there is an effect of block, plot the same data
in a different way. Now, color indicates block instead of treatment.
Fungicide treatments are split into different panels with
facet_wrap()
. Also, connect the same plot observed over
time with lines to help visualize. From this plot, it looks like there
is some effect of block. Block 5 tends to have more healthy plants at
all time points regardless of treatment. Blocks 1 and 2 have a little
bit fewer. Any model that we fit is going to have to account for this
block-level difference. Modeling block explicitly will make it easier for us to pull out the overall treatment effect while accounting for random variation among the blocks.
ggplot(fungicide, aes(x = time, color = block, y = n_healthy)) +
geom_line(aes(group = block)) +
geom_point(size = 2) +
facet_wrap(~ fungicide) +
theme_bw() +
scale_color_okabeito()
Similar plots to the ones created above in R can be made with the following SAS code (output not shown).
proc sgplot data = fungicide;
vbox n_healthy / category=time group=fungicide groupdisplay=cluster;
run;
proc sgpanel data = fungicide;
panelby fungicide;
vline time / response=n_healthy group=block;
run;
To fit models to this dataset, we are going to need to use a more advanced linear mixed modeling package, glmmTMB. This package allows us to fit more complex covariance structures to the data.
Here’s how we will set up the model formula:

- The response variable is the number of healthy plants out of the total number of plants in each plot, specified as cbind(n_healthy, n_plants - n_healthy).
- The fixed effects are fungicide, time, and their interaction: fungicide + time + fungicide:time.
- There is a random intercept for each block: (1 | block).
- The repeated measures over time within each plot (each fungicide-by-block combination) are modeled with a covariance structure, for example us(0 + time | fungicide:block) for unstructured covariance.

Note we use 0 + time to exclude the intercept from the residual covariance, only fitting it for the time slope.
fit_us <- glmmTMB(cbind(n_healthy, n_plants - n_healthy) ~ fungicide + time + fungicide:time + (1 | block) + us(0 + time | fungicide:block), data = fungicide, family = binomial)
Warning in finalizeTMB(TMBStruc, obj, fit, h, data.tmb.old): Model convergence
problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')
Warning in finalizeTMB(TMBStruc, obj, fit, h, data.tmb.old): Model convergence
problem; false convergence (8). See vignette('troubleshooting'),
help('diagnose')
fit_cs <- glmmTMB(cbind(n_healthy, n_plants - n_healthy) ~ fungicide + time + fungicide:time + (1 | block) + cs(0 + time | fungicide:block), data = fungicide, family = binomial)
fit_ar1 <- glmmTMB(cbind(n_healthy, n_plants - n_healthy) ~ fungicide + time + fungicide:time + (1 | block) + ar1(0 + time | fungicide:block), data = fungicide, family = binomial)
As it turns out, the unstructured covariance model does not even converge. This is because it has too many parameters for the small dataset we are fitting the model to. When you get a warning like that, it means the model predictions aren’t valid. Resist the temptation to look at them!
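As the warning text suggests, glmmTMB also provides a diagnose() function that can help pinpoint what went wrong with a fit like this. Running it on the failed unstructured model is optional but can be informative:
# Print diagnostic information about the non-converged unstructured model
diagnose(fit_us)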
The compound symmetry model and the AR1 model are a little bit less flexible but have fewer parameters. This is a good thing because it helps us avoid overfitting on our relatively small dataset. We do not get any warnings about model convergence with those simpler models.
Let’s check model diagnostics for the CS and AR1 models. Again we need to simulate residuals from the fitted model objects. The residual plots look great.
sim_resid_csmodel <- simulateResiduals(fit_cs)
plot(sim_resid_csmodel)
sim_resid_ar1model <- simulateResiduals(fit_ar1)
plot(sim_resid_ar1model)
Let’s compare the AIC (Akaike Information Criterion) of the CS and AR(1) models. AIC measures how well the model fits the data but assesses a penalty for having too many parameters. Lower means a better fit, accounting for potential overfitting.
AIC(fit_cs, fit_ar1)
df AIC
fit_cs 17 497.4627
fit_ar1 15 496.8801
We see that the two models have nearly identical AIC values. We will go with the autoregressive model because it is simpler, as shown by its smaller number of parameters (the df column). The difference is small enough that either model would probably be acceptable.
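Optionally, before moving on, you can also peek at the estimated variance components and the AR(1) correlation from the chosen model by printing its variance-covariance summary:
# Random-effect standard deviations and correlations, including the AR(1) parameter
VarCorr(fit_ar1)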
Now, we can look at the results. There are a few different marginal means we can estimate. We can look at the effect of fungicide averaged across times as well as compare the effect of fungicide within each time point.
emm_fung <- emmeans(fit_ar1, pairwise ~ fungicide, type = 'response', adjust = 'tukey')
NOTE: Results may be misleading due to involvement in interactions
emm_fung_by_time <- emmeans(fit_ar1, pairwise ~ fungicide | time, type = 'response', adjust = 'tukey')
emm_fung
$emmeans
fungicide prob SE df asymp.LCL asymp.UCL
None 0.734 0.0159 Inf 0.702 0.764
DACD 0.789 0.0137 Inf 0.761 0.815
RT 0.754 0.0152 Inf 0.723 0.782
TSX 0.771 0.0145 Inf 0.741 0.798
Results are averaged over the levels of: time
Confidence level used: 0.95
Intervals are back-transformed from the logit scale
$contrasts
contrast odds.ratio SE df null z.ratio p.value
None / DACD 0.737 0.0555 Inf 1 -4.047 0.0003
None / RT 0.904 0.0674 Inf 1 -1.356 0.5273
None / TSX 0.823 0.0617 Inf 1 -2.594 0.0468
DACD / RT 1.226 0.0926 Inf 1 2.694 0.0356
DACD / TSX 1.117 0.0847 Inf 1 1.455 0.4649
RT / TSX 0.911 0.0685 Inf 1 -1.239 0.6020
Results are averaged over the levels of: time
P value adjustment: tukey method for comparing a family of 4 estimates
Tests are performed on the log odds ratio scale
emm_fung_by_time
$emmeans
time = 12:
fungicide prob SE df asymp.LCL asymp.UCL
None 0.779 0.0168 Inf 0.744 0.810
DACD 0.819 0.0148 Inf 0.788 0.846
RT 0.794 0.0161 Inf 0.761 0.824
TSX 0.815 0.0150 Inf 0.784 0.843
time = 20:
fungicide prob SE df asymp.LCL asymp.UCL
None 0.721 0.0192 Inf 0.682 0.757
DACD 0.779 0.0168 Inf 0.745 0.810
RT 0.735 0.0187 Inf 0.696 0.769
TSX 0.758 0.0177 Inf 0.722 0.791
time = 42:
fungicide prob SE df asymp.LCL asymp.UCL
None 0.699 0.0199 Inf 0.659 0.737
DACD 0.767 0.0173 Inf 0.732 0.799
RT 0.728 0.0189 Inf 0.690 0.764
TSX 0.732 0.0187 Inf 0.694 0.767
Confidence level used: 0.95
Intervals are back-transformed from the logit scale
$contrasts
time = 12:
contrast odds.ratio SE df null z.ratio p.value
None / DACD 0.775 0.0836 Inf 1 -2.358 0.0853
None / RT 0.912 0.0970 Inf 1 -0.871 0.8201
None / TSX 0.798 0.0858 Inf 1 -2.103 0.1522
DACD / RT 1.176 0.1277 Inf 1 1.490 0.4438
DACD / TSX 1.029 0.1129 Inf 1 0.257 0.9941
RT / TSX 0.875 0.0948 Inf 1 -1.233 0.6055
time = 20:
contrast odds.ratio SE df null z.ratio p.value
None / DACD 0.733 0.0760 Inf 1 -2.994 0.0146
None / RT 0.935 0.0954 Inf 1 -0.662 0.9114
None / TSX 0.823 0.0847 Inf 1 -1.889 0.2327
DACD / RT 1.275 0.1327 Inf 1 2.335 0.0903
DACD / TSX 1.123 0.1178 Inf 1 1.109 0.6841
RT / TSX 0.881 0.0909 Inf 1 -1.228 0.6090
time = 42:
contrast odds.ratio SE df null z.ratio p.value
None / DACD 0.705 0.0724 Inf 1 -3.400 0.0038
None / RT 0.867 0.0879 Inf 1 -1.412 0.4916
None / TSX 0.850 0.0863 Inf 1 -1.600 0.3787
DACD / RT 1.229 0.1271 Inf 1 1.990 0.1915
DACD / TSX 1.205 0.1248 Inf 1 1.804 0.2712
RT / TSX 0.981 0.1003 Inf 1 -0.187 0.9977
P value adjustment: tukey method for comparing a family of 4 estimates
Tests are performed on the log odds ratio scale
The above results show that the fungicides’ effect on increasing the probability of healthy seedlings is modest at best. The marginal probability of healthy plants in the untreated control is 73.4%, and the fungicides increase this by only a few percentage points. Averaged across all three time points, the odds ratio comparing the untreated control to DACD, and to TSX, is significantly less than 1, indicating some level of effectiveness for those two fungicides. However, we cannot reject the null hypothesis that RT has the same odds of healthy plants as the control.
Separating by time point, we can see that the estimated probability of healthy plants decreases over time. There is modest evidence for an interaction between fungicide and time. The DACD fungicide is generally the most effective, and RT is generally ineffective. The interaction with time is seen because the effectiveness of DACD fungicide relative to the untreated control decreases over time.
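If you want a formal overall test of the fungicide-by-time interaction, analogous to the Type III tests of fixed effects from PROC GLIMMIX, one option is the joint_tests() function from emmeans:
# F-test-style joint tests of each fixed effect, including the interaction
joint_tests(fit_ar1)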
Let’s also make some plots to better visualize the model predictions.
We will use the cld()
function from the
multcomp package to summarize the multiple comparisons
with letters. We will annotate the means and 95% confidence interval
error bars with the letters. The multiple comparisons are
Tukey-adjusted. By default Sidak adjustment is made to the 95%
confidence intervals, as you can tell from the note that displays when
the code runs. That’s why the intervals shown on this plot are wider
than the ones in the table above. You can print the CLD object to see
the adjusted intervals.
The results averaged across time points and within each time point are combined into a single data frame to produce a multi-panel plot.
cld_fung <- cld(emm_fung$emmeans, adjust = 'tukey', Letters = letters)
Note: adjust = "tukey" was changed to "sidak"
because "tukey" is only appropriate for one set of pairwise comparisons
cld_fung_by_time <- cld(emm_fung_by_time$emmeans, adjust = 'tukey', Letters = letters)
Note: adjust = "tukey" was changed to "sidak"
because "tukey" is only appropriate for one set of pairwise comparisons
plot_data <- bind_rows(as_tibble(cld_fung), as_tibble(cld_fung_by_time)) %>%
mutate(time = fct_na_value_to_level(time, 'overall') %>%
factor(labels = c(paste(c(12, 20, 42), 'days after planting'), 'overall')))
ggplot(plot_data, aes(x = fungicide, y = prob, ymin = asymp.LCL, ymax = asymp.UCL, label = trimws(.group))) +
geom_errorbar(linewidth = 0.9, width = 0.2) +
geom_point(size = 3) +
geom_text(aes(y = asymp.UCL), vjust = -0.5, size = 5) +
facet_wrap(~ time) +
theme_bw() +
scale_y_continuous(name = 'probability of healthy plant', limits = c(0.6, 0.9))
Similar models to the models fit above in R can be fit, and means
comparisons evaluated, with the following code (output not shown).
Because method = laplace
, a maximum-likelihood method, is
specified, AIC values are produced with the default output. The
unstructured covariance model also fails to converge, just like we
observed using R.
/* Unstructured covariance */
proc glimmix data = fungicide method = laplace;
class block fungicide time;
model n_healthy/n_plants = fungicide time fungicide*time;
random intercept / subject = block;
random time / subject = fungicide(block) type = un;
lsmeans fungicide / ilink cl diff lines adjust = tukey;
slice fungicide*time / sliceby = time ilink cl diff lines adjust = tukey;
run;
/* Compound symmetry */
proc glimmix data = fungicide method = laplace;
class block fungicide time;
model n_healthy/n_plants = fungicide time fungicide*time;
random intercept / subject = block;
random time / subject = fungicide(block) type = cs;
lsmeans fungicide / ilink cl diff lines adjust = tukey;
slice fungicide*time / sliceby = time ilink cl diff lines adjust = tukey;
run;
/* AR1 */
proc glimmix data = fungicide method = laplace;
class block fungicide time;
model n_healthy/n_plants = fungicide time fungicide*time;
random intercept / subject = block;
random time / subject = fungicide(block) type = ar(1);
lsmeans fungicide / ilink cl diff lines adjust = tukey;
slice fungicide*time / sliceby = time ilink cl diff lines adjust = tukey;
run;
What have you learned in this lesson?

- How to fit a binomial GLMM with a logit link function, in both R and SAS
- How to check GLMM diagnostics with simulated residuals from DHARMa
- How to fit repeated measures GLMMs with different covariance structures using glmmTMB, and compare them with AIC
- How to estimate and compare marginal means on the response (probability) scale with emmeans
Nice job! Pat yourselves on the back once again!
As in the previous lessons, here are some exercises for additional practice. Also as in the previous lessons, these exercises give you some SAS code and ask you to write some R code that does the same job or close to it. For this set of exercises, we will use another dataset from the agridat package: repeated measures of lettuce weights originally collected and analyzed by Levi Pederson for his MS thesis.
The lettuce plants were grown hydroponically and were randomly
assigned to three different density treatments: treatment 1 had the
lowest density, treatment 2 intermediate, and treatment 3 the highest.
Plants were weighed twice a week for 40 days. More details can be found
in the help documentation: ?pederson.lettuce.repeated
.
Load the data like this:
library(agridat)
data('pederson.lettuce.repeated')
Here are trends of weight over time for the different plants. I’ve also plotted a line with the median weight by day for each treatment.
ggplot(pederson.lettuce.repeated, aes(x = day, y = weight, color = trt)) +
geom_line(aes(group = plant), linewidth = .5, alpha = .5) +
stat_summary(fun = 'median', geom = 'line', linewidth = 1.5) +
theme_bw() +
scale_color_okabeito()
Here is a histogram of the weight data across all treatments and time points.
ggplot(pederson.lettuce.repeated, aes(x = weight)) +
geom_histogram() +
theme_bw() +
scale_y_continuous(expand = expansion(add = c(0, 1)))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Think about a model formula to analyze these data.
Use glmmTMB()
to fit a model to this dataset. Use a
compound symmetry covariance structure to account for the repeated
measures.
In SAS, the class
, model
, and
random
statements for a similar model fit with
PROC GLIMMIX
would be:
class plant trt day;
model weight = trt day trt*day;
random day / subject=plant residual type=cs;
HINT: The covariance structure part of your model formula will look like ...(0 + day | plant), but what should you put in place of the ...?

Simulate residuals and plot the simulated residual diagnostics. What do you notice?

HINT: Use simulateResiduals().

Fit the model again with family = 'lognormal'
to account
for the positive, right-skewed distribution of weights seen in the
histogram. Simulate residuals again and plot. What do you notice
now?
In SAS, the model statement would be:
model weight = trt day trt*day / dist=lognormal;
Fit the model again with the lognormal response distribution and AR(1) covariance structure. Check residual diagnostics. Use AIC to compare this model with the compound symmetry lognormal model you fit in the previous exercise. What can you conclude?
In SAS, the corresponding random statement would be:
random day / subject=plant residual type=ar(1);
HINT: 0 + factor(day) must be in your random effects specification.

Use the following code to estimate treatment means at day 1, day 21,
and day 40, for the best model you identified so far. Replace the
...
with the name of the fitted model object.
trt_means_by_day <- emmeans(..., ~ trt | day, at = list(day = c(1, 21, 40)), type = 'response')
Now use cld
with Tukey adjustment to determine whether
the difference between the means of any pair of treatments is
significant at any of the time points. What can you conclude?
In SAS, the corresponding statement, which would produce means and comparisons for all days, would be:
slice trt*day / sliceby=day diff cl ilink linestable adjust=tukey;