Welcome back to the R for SAS users workshop! This workshop is intended for SAS users who want to learn R. The people who will get the most out of this course are practicing researchers who have a decent working knowledge of SAS, and of basic statistical analysis (descriptive stats and regression models) as it applies to their field.
This is lesson 3 of 3 in a series. Lesson 1 covered the basics: importing data, cleaning and reshaping data, summary statistics, simple graphs and tables, and a few simple statistical models. Lesson 2 got a little more advanced, covering linear mixed models for more sophisticated experimental designs and how to produce and compare estimated marginal means.
Download the worksheet for this lesson here.
IMPORTANT NOTE: In this lesson, the numerical results of the R and SAS code may no longer be identical, as they were in previous lessons. This is because different fitting algorithms are used by SAS PROC GLIMMIX and by the R model fitting packages that we are demonstrating. A full discussion of these differences is outside the scope of this lesson!
During this workshop, you will learn how to fit generalized linear mixed models (GLMMs) in R and compare the results with SAS PROC GLIMMIX. As in Lessons 1 and 2, we will work through a “data to doc” pipeline in R, comparing R code to SAS code each step of the way. We will use yet more new datasets. As in the previous lessons, we will start with raw data and work our way to a finished product. Hopefully this is becoming second nature to you by now!
Here we’ll load the R packages we are going to work with today. These are mostly the same as in the previous lessons: the tidyverse package for reading, manipulating, and plotting data, the lme4 package for fitting linear mixed models, the easystats package for model diagnostic plots, and the emmeans and multcomp packages for estimating and comparing marginal means. New today are glmmTMB, a more advanced mixed model fitting package, and DHARMa, which provides residual diagnostic plots for GLMMs. We also set a default plotting theme.
library(tidyverse)
library(lme4)
library(easystats)
library(emmeans)
library(multcomp)
library(glmmTMB)
library(DHARMa)
theme_set(theme_bw())
The first dataset we will use for this lesson is the
cbpp
or contagious bovine pleuropneumonia dataset. It is
pre-loaded with the lme4 package. The number of
Ethiopian zebu cattle that developed the disease in each herd, and the
total number of cattle in the herd, are recorded for each of four time
periods. The herds (1-15) are identified with numerical IDs, and the
time periods are identified by the integers 1-4. Note the
period
column is already a factor
variable
when you examine the pre-loaded dataset.
Zebu, photo by Scott Bauer, USDA-ARS
PROTIP: You can call
?cbpp
in the R console to see documentation for the dataset. This can be done for any example dataset that comes pre-loaded with a package.
Note I’ve made a copy of the cbpp
dataset as a CSV file
so that you can import the same dataset into SAS if you want to follow
along with the SAS code.
filename csvFile url "https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/cbpp.csv";
proc import datafile = csvFile out = cbpp replace dbms = csv; run;
data cbpp; set cbpp;
rate = incidence/size;
run;
data(cbpp)
glimpse(cbpp)
Rows: 56
Columns: 4
$ herd <fct> 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, …
$ incidence <dbl> 2, 3, 4, 0, 3, 1, 1, 8, 2, 0, 2, 2, 0, 2, 0, 5, 0, 0, 1, 3, …
$ size <dbl> 14, 12, 9, 5, 22, 18, 21, 22, 16, 16, 20, 10, 10, 9, 6, 18, …
$ period <fct> 1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, …
Let’s make a plot of the CBPP data. We will plot the proportion
incidence/size
for each herd at each time period.
proc sgplot data = cbpp noautolegend;
styleattrs datacontrastcolors = (black) datalinepatterns = (solid);
series x = period y = rate / group = herd;
run;
ggplot(cbpp, aes(x = period, y = incidence/size, group = herd)) +
geom_line() +
theme_bw()
We can see from the above that the proportion of cows in each herd that show disease symptoms varies quite a lot between herds and over time.
As we’ve learned in the previous lessons, if we have non-independence in our measurements, we have to account for that using a linear mixed model. We definitely have that in this case because the same herds were sampled four times, so measurements from period 1-4 from the same herd are not independent of one another.
So here’s a linear mixed model fit using lmer()
(SAS
code not presented here). The response variable is the proportion of
cattle suffering from CBPP, incidence/size
, the fixed
effect is the time period treated as a discrete categorical variable,
period
, and there is a random intercept for each herd,
(1 | herd)
.
lmm_cbpp <- lmer(incidence/size ~ period + (1 | herd), data = cbpp)
So far so good, right?
The model diagnostic plots also look reasonably good for a small dataset.
check_model(lmm_cbpp)
Let’s get the model’s estimate of the mean and 95% confidence interval of proportion incidence of CBPP for each of the four periods.
emmeans(lmm_cbpp, ~ period)
period emmean SE df lower.CL upper.CL
1 0.2198 0.0323 52 0.1550 0.285
2 0.0741 0.0335 52 0.0069 0.141
3 0.0894 0.0335 52 0.0222 0.157
4 0.0413 0.0348 52 -0.0286 0.111
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
The estimated marginal means decrease over time from period 1 (0.22) to period 4 (0.04). But the 95% confidence limits indicate a problem. The lower 95% confidence limit on the mean disease incidence proportion is negative! It is not possible to have a proportion less than 0 or greater than 1. Our assumption of normally distributed error around the mean proportions isn’t valid.
This is where we need to use generalized linear mixed models (GLMMs). What GLMMs do is use a link function to essentially convert the response variable onto a scale where its error actually is close enough to normal. Then, from the model’s perspective, there’s no problem. We can treat it as if it were a linear mixed model, and then just undo the transformation afterward so that we can interpret the estimates of model means and confidence intervals on the original data scale.
Here, our response is binary (either a cow has the disease or it doesn’t), so we can model it with a generalized linear model. What we are predicting is the probability that an individual cow from a given herd in a given time period will have CBPP. Probability is bounded between 0 and 1, so it cannot really have normally distributed error: normal distributions (bell curves) can take any value, positive or negative, which doesn’t play well with a bounded range. That’s why we use a link function to convert the predicted probability onto a scale that can be normally distributed.
The formula for predicting the probability of disease for individual \(i\) at time period \(j\) in herd \(k\) is as follows:
\[\text{logit}~\hat{y}_{i,j,k} = \beta_0 + P_j + u_k\]
In this formula, \(\hat{y}_{i,j,k}\) is the predicted probability of disease for individual \(i\) at time period \(j\) in herd \(k\). \(\beta_0\) is the intercept, \(P_j\) is the fixed effect of time period \(j\), and \(u_k\) is the random effect, a separate intercept for each herd. You can see the model assumes that every individual in the same time period in the same herd has the same probability of having CBPP.
But what is that “logit” thing? That is the link function. It is used to transform the probability, which ranges from 0 to 1 and therefore can’t really have normally distributed error, to a scale that can take any value, positive or negative. Logit is another word for the log-odds function, defined as
\[\text{logit}~p = \log \frac {p}{1-p}\]
Here is a graph of that function:
You can see that it maps values ranging from 0 to 1 to values ranging from negative to positive infinity. It is more or less a straight line when the probability is between about 0.25 and 0.75, but gets very steep as you approach the boundaries. This makes sense: if an event has an extremely low (or high) chance of happening, a small absolute change in its probability represents a relatively much bigger difference.
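If you’d like to draw this curve yourself, here is one way to do it with ggplot2 and qlogis(), R’s built-in logit function (the data frame name logit_data is just for this example):
# The logit (log odds) function for probabilities between 0 and 1.
# qlogis() is the logit; plogis() is its inverse.
logit_data <- tibble(p = seq(0.001, 0.999, by = 0.001))
ggplot(logit_data, aes(x = p, y = qlogis(p))) +
  geom_line() +
  labs(x = 'probability', y = 'log odds (logit)')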
To use statistical jargon, what we are going to do here is fit a binomial GLMM (binomial meaning our response variable is a binary outcome) with a logit link function to transform the predicted probabilities onto the log odds scale.
proc glimmix data = cbpp;
class herd period;
model incidence/size = period / dist = binomial;
random herd;
weight size;
lsmeans period / ilink cl diff oddsratio adjust = tukey;
run;
| Model Information | |
|---|---|
| Data Set | WORK.CBPP |
| Response Variable (Events) | incidence |
| Response Variable (Trials) | size |
| Response Distribution | Binomial |
| Link Function | Logit |
| Variance Function | Default |
| Weight Variable | size |
| Variance Matrix | Not blocked |
| Estimation Technique | Residual PL |
| Degrees of Freedom Method | Containment |
| Class Level Information | | |
|---|---|---|
| Class | Levels | Values |
| herd | 15 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
| period | 4 | 1 2 3 4 |
| Number of Observations Read | 56 |
|---|---|
| Number of Observations Used | 56 |
| Number of Events | 99 |
| Number of Trials | 842 |
| Dimensions | |
|---|---|
| G-side Cov. Parameters | 1 |
| Columns in X | 5 |
| Columns in Z | 15 |
| Subjects (Blocks in V) | 1 |
| Max Obs per Subject | 56 |
| Optimization Information | |
|---|---|
| Optimization Technique | Dual Quasi-Newton |
| Parameters in Optimization | 1 |
| Lower Boundaries | 1 |
| Upper Boundaries | 0 |
| Fixed Effects | Profiled |
| Starting From | Data |
| Iteration History | | | | | |
|---|---|---|---|---|---|
| Iteration | Restarts | Subiterations | Objective Function | Change | Max Gradient |
| 0 | 0 | 5 | 763.01311438 | 0.33363629 | 0.000019 |
| 1 | 0 | 3 | 927.32730767 | 0.09156786 | 1.469E-6 |
| 2 | 0 | 1 | 957.93604425 | 0.00294224 | 9.108E-6 |
| 3 | 0 | 1 | 958.76611047 | 0.00000413 | 2.345E-8 |
| 4 | 0 | 0 | 958.76678537 | 0.00000000 | 4.042E-8 |

Convergence criterion (PCONV=1.11022E-8) satisfied.
| Fit Statistics | |
|---|---|
| -2 Res Log Pseudo-Likelihood | 958.77 |
| Generalized Chi-Square | 926.53 |
| Gener. Chi-Square / DF | 17.82 |
| Covariance Parameter Estimates | | |
|---|---|---|
| Cov Parm | Estimate | Standard Error |
| herd | 0.7536 | 0.2899 |
| Type III Tests of Fixed Effects | | | | |
|---|---|---|---|---|
| Effect | Num DF | Den DF | F Value | Pr > F |
| period | 3 | 38 | 75.66 | <.0001 |
| period Least Squares Means | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| period | Estimate | Standard Error | DF | t Value | Pr > \|t\| | Alpha | Lower | Upper | Mean | Standard Error Mean | Lower Mean | Upper Mean |
| 1 | -1.5248 | 0.2283 | 38 | -6.68 | <.0001 | 0.05 | -1.9870 | -1.0626 | 0.1788 | 0.03352 | 0.1206 | 0.2568 |
| 2 | -2.1217 | 0.2331 | 38 | -9.10 | <.0001 | 0.05 | -2.5935 | -1.6498 | 0.1070 | 0.02227 | 0.06956 | 0.1611 |
| 3 | -2.5721 | 0.2382 | 38 | -10.80 | <.0001 | 0.05 | -3.0543 | -2.0899 | 0.07096 | 0.01570 | 0.04503 | 0.1101 |
| 4 | -2.6697 | 0.2441 | 38 | -10.94 | <.0001 | 0.05 | -3.1640 | -2.1755 | 0.06478 | 0.01479 | 0.04054 | 0.1020 |
| Differences of period Least Squares Means, Adjustment for Multiple Comparisons: Tukey-Kramer | | | | | | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| period | _period | Estimate | Standard Error | DF | t Value | Pr > \|t\| | Adj P | Alpha | Lower | Upper | Adj Lower | Adj Upper | Odds Ratio | Lower Confidence Limit for Odds Ratio | Upper Confidence Limit for Odds Ratio | Adj Lower Odds Ratio | Adj Upper Odds Ratio |
| 1 | 2 | 0.5969 | 0.07424 | 38 | 8.04 | <.0001 | <.0001 | 0.05 | 0.4466 | 0.7472 | 0.3974 | 0.7963 | 1.816 | 1.563 | 2.111 | 1.488 | 2.217 |
| 1 | 3 | 1.0473 | 0.08898 | 38 | 11.77 | <.0001 | <.0001 | 0.05 | 0.8672 | 1.2274 | 0.8083 | 1.2864 | 2.850 | 2.380 | 3.412 | 2.244 | 3.620 |
| 1 | 4 | 1.1450 | 0.1032 | 38 | 11.09 | <.0001 | <.0001 | 0.05 | 0.9360 | 1.3539 | 0.8677 | 1.4222 | 3.142 | 2.550 | 3.872 | 2.381 | 4.146 |
| 2 | 3 | 0.4504 | 0.09755 | 38 | 4.62 | <.0001 | 0.0002 | 0.05 | 0.2530 | 0.6479 | 0.1884 | 0.7125 | 1.569 | 1.288 | 1.912 | 1.207 | 2.039 |
| 2 | 4 | 0.5481 | 0.1111 | 38 | 4.93 | <.0001 | <.0001 | 0.05 | 0.3231 | 0.7730 | 0.2496 | 0.8466 | 1.730 | 1.381 | 2.166 | 1.283 | 2.332 |
| 3 | 4 | 0.09764 | 0.1219 | 38 | 0.80 | 0.4281 | 0.8535 | 0.05 | -0.1491 | 0.3444 | -0.2299 | 0.4251 | 1.103 | 0.861 | 1.411 | 0.795 | 1.530 |
In R, we use the glmer()
function instead of
lmer()
in this case. We specify a formula in the same way
as we have been doing with lmer()
, and the data frame from
which the variables come, but we now have a new argument,
family
. This refers to the “family” of response
distributions that we are going to be using. Here we specify
binomial(link = 'logit')
. Other link functions are possible
to use with binomial GLMMs, but we will not get into that today.
Also notice weights = size
which accounts for the fact
that the herds are different sizes; a bigger herd is contributing more
data to our estimate of the incidence of CBPP over time.
glmm_cbpp <- glmer(incidence/size ~ period + (1 | herd), weights = size, data = cbpp, family = binomial(link = 'logit'))
To check this model, it no longer makes sense to look at the residual
diagnostic plots from check_model()
. We aren’t going to get
normally distributed residuals. However, there are still issues we
potentially need to check for. The R package DHARMa
allows you to simulate residuals from a fitted GLMM object and test them
for a few different things. Here we see a normal Q-Q plot of the
simulated residuals which shows no issues with deviation from normality
(K-S test), overdispersion, or outliers. The homogeneity of variance
plot on the right also shows no difference in the variance of the
residuals between the four treatments.
sim_resid_cbpp <- simulateResiduals(glmm_cbpp)
plot(sim_resid_cbpp)
Now we can look at the estimated marginal means of CBPP probability
for each of the time periods. Notice we use the argument
type = 'response'
. This uses the inverse link function to
back-transform the mean estimates and the endpoints of the 95%
confidence intervals from the log-odds scale to the probability scale.
This is much easier to interpret. In SAS the corresponding option is
ilink
(inverse link) in lsmeans
and
slice
statements.
emmeans(glmm_cbpp, pairwise ~ period, type = 'response', adjust = 'tukey')
$emmeans
period prob SE df asymp.LCL asymp.UCL
1 0.1981 0.0367 Inf 0.1357 0.280
2 0.0839 0.0236 Inf 0.0478 0.143
3 0.0740 0.0224 Inf 0.0404 0.132
4 0.0484 0.0196 Inf 0.0216 0.105
Confidence level used: 0.95
Intervals are back-transformed from the logit scale
$contrasts
contrast odds.ratio SE df null z.ratio p.value
period1 / period2 2.70 0.817 Inf 1 3.272 0.0059
period1 / period3 3.09 0.998 Inf 1 3.495 0.0027
period1 / period4 4.85 2.049 Inf 1 3.743 0.0010
period2 / period3 1.15 0.431 Inf 1 0.363 0.9837
period2 / period4 1.80 0.835 Inf 1 1.267 0.5843
period3 / period4 1.57 0.751 Inf 1 0.945 0.7807
P value adjustment: tukey method for comparing a family of 4 estimates
Tests are performed on the log odds ratio scale
We see that period 1 has an estimated marginal mean of 19.8% disease incidence while the estimated mean incidences for the other three periods are much lower. As a result, the odds ratio is significantly greater than 1 when comparing period 1 to any of the other three periods, but among periods 2, 3, and 4 the odds ratios are not significantly different from 1.
In the end we got similar estimates of the mean incidence by period, but the 95% confidence intervals are now all positive. With the GLMM, we no longer get nonsensical negative probabilities.
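If you want to see explicitly what that back-transformation is doing, here is an optional check: get the marginal means on the log odds scale, then apply the inverse logit function plogis() by hand (the object name emm_cbpp_link is just for this example). The results should match the prob column above.
# Estimated marginal means on the link (log odds) scale
emm_cbpp_link <- emmeans(glmm_cbpp, ~ period)
# Back-transform the link-scale means to probabilities with the inverse logit
plogis(as.data.frame(emm_cbpp_link)$emmean)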
The following example dataset was published in the useful book Analysis of Generalized Linear Mixed Models in the Agricultural and Natural Resource Sciences, co-written by some of my ARS statistician colleagues. The SAS code presented in this section is adapted from the book.
Diseased cotton seedlings, photo by University of Missouri
Extension
The example dataset comes from a study by C. S. Rothrock where three
different fungicides were applied to cotton plants to test how
effectively they reduced seedling diseases, and whether the reduction
persisted over time. There are four treatments (three fungicides labeled
DACD
, RT
, and TSX
and untreated
control labeled None
) applied in a randomized complete
block design (RCBD) with five blocks. There were two rows per plot but
we will not consider the multiple rows in our analysis. In each plot
there were a total of 400 plants. At three time points (12, 20, and 42
days after planting) the number of diseased plants out of 400 was
recorded for each plot.
Because the same plot was measured three times, we cannot consider those three measurements to be independent data points. We must model the relationship between them. You may see this called “repeated measures” or “split plot in time.”
NOTE: SAS output is not shown for the code presented in this section. Feel free to try out the SAS code on your own!
filename csvFile2 url "https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/fungicide.csv";
proc import datafile = csvFile2 out = fungicide replace dbms = csv; run;
Import the data.
If you are running this on the cloud server:
fungicide <- read_csv('data/fungicide.csv')
Rows: 60 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): fungicide
dbl (7): block, plot, n_plants, time, n_healthy_row1, n_healthy_row2, n_healthy
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
If you are running this on your own machine:
fungicide <- read_csv('https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/fungicide.csv')
Change the appropriate columns to factors. Set the untreated control,
'None'
, to the reference level for fungicide.
fungicide <- fungicide %>%
mutate(
block = factor(block),
plot = factor(plot),
time = factor(time),
fungicide = relevel(factor(fungicide), ref = 'None')
)
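If you want to double-check that the releveling worked, you can optionally print the factor levels; the reference level, 'None', should be listed first.
# The first level printed is the reference level
levels(fungicide$fungicide)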
Let’s take a look at the data. First let’s make a plot with a point
for each of the raw data values. In addition, a semi-transparent
horizontal bar is placed at the median for each treatment at each time
point; we use stat_summary()
to calculate a summary
statistic and specify the shape of the point. We use
position_dodge()
to horizontally separate the points for
each treatment. (Inspiration for this plot comes from this post about
ggplot2.) We can see from the plot that the number of healthy plants
decreases over time. The no-fungicide control tends to have fewer
healthy plants than the three treated groups, but no big difference
between the three fungicides jumps out at you.
PROTIP: A colorblind-friendly color scheme is used here because the default
ggplot2
colors are not easily distinguishable by red-green colorblind people.
ggplot(fungicide, aes(x = time, color = fungicide, group = fungicide, y = n_healthy)) +
geom_point(size = 2, position = position_dodge(width = .5)) +
stat_summary(fun = 'median', geom = 'point', shape = 95, size = 10, position = position_dodge(width = .5), alpha = .5) +
theme_bw() +
scale_color_okabeito(order = c(9, 1, 2, 3))
To visualize whether there is an effect of block, plot the same data
in a different way. Now, color indicates block instead of treatment.
Fungicide treatments are split into different panels with
facet_wrap()
. Also, connect the same plot observed over
time with lines to help visualize. From this plot, it looks like there
is some effect of block. Block 5 tends to have more healthy plants at
all time points regardless of treatment. Blocks 1 and 2 have a little
bit fewer. Any model that we fit is going to have to account for this
block-level difference. Modeling block explicitly will make it easier for us to pull out the overall treatment effect while accounting for random variation among the blocks.
ggplot(fungicide, aes(x = time, color = block, y = n_healthy)) +
geom_line(aes(group = block)) +
geom_point(size = 2) +
facet_wrap(~ fungicide) +
theme_bw() +
scale_color_okabeito()
Similar plots to the ones created above in R can be made with the following SAS code (output not shown).
proc sgplot data = fungicide;
vbox n_healthy / category=time group=fungicide groupdisplay=cluster;
run;
proc sgpanel data = fungicide;
panelby fungicide;
vline time / response=n_healthy group=block;
run;
To fit models to this dataset, we are going to need to use a more advanced linear mixed modeling package, glmmTMB. This package allows us to fit more complex covariance structures to the data.
Here’s how we will set up the model formula:

- The response variable is the number of healthy plants out of the total number of plants in each plot, specified as cbind(n_healthy, n_plants - n_healthy).
- The fixed effects are fungicide, time, and their interaction: fungicide + time + fungicide:time.
- There is a random intercept for each block: (1 | block).
- The repeated measures over time within each plot (each fungicide-by-block combination) are modeled with a covariance structure, for example us(0 + time | fungicide:block) for unstructured covariance.

Note we use 0 + time to exclude the intercept from the residual covariance, only fitting it for the time slope.
fit_us <- glmmTMB(cbind(n_healthy, n_plants - n_healthy) ~ fungicide + time + fungicide:time + (1 | block) + us(0 + time | fungicide:block), data = fungicide, family = binomial)
Warning in finalizeTMB(TMBStruc, obj, fit, h, data.tmb.old): Model convergence
problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')
Warning in finalizeTMB(TMBStruc, obj, fit, h, data.tmb.old): Model convergence
problem; false convergence (8). See vignette('troubleshooting'),
help('diagnose')
fit_cs <- glmmTMB(cbind(n_healthy, n_plants - n_healthy) ~ fungicide + time + fungicide:time + (1 | block) + cs(0 + time | fungicide:block), data = fungicide, family = binomial)
fit_ar1 <- glmmTMB(cbind(n_healthy, n_plants - n_healthy) ~ fungicide + time + fungicide:time + (1 | block) + ar1(0 + time | fungicide:block), data = fungicide, family = binomial)
As it turns out, the unstructured covariance model does not even converge. This is because it has too many parameters for the small dataset we are fitting the model to. When you get a warning like that, it means the model predictions aren’t valid. Resist the temptation to look at them!
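As the warning text suggests, glmmTMB also provides a diagnose() function that can help pinpoint what went wrong with a fit like this. Running it on the failed unstructured model is optional but can be informative:
# Print diagnostic information about the non-converged unstructured model
diagnose(fit_us)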
The compound symmetry model and the AR1 model are a little bit less flexible but have fewer parameters. This is a good thing because it helps us avoid overfitting on our relatively small dataset. We do not get any warnings about model convergence with those simpler models.
Let’s check model diagnostics for the CS and AR1 models. Again we need to simulate residuals from the fitted model objects. The residual plots look great.
sim_resid_csmodel <- simulateResiduals(fit_cs)
plot(sim_resid_csmodel)
sim_resid_ar1model <- simulateResiduals(fit_ar1)
plot(sim_resid_ar1model)
Let’s compare the AIC (Akaike Information Criterion) of the CS and AR(1) models. AIC measures how well the model fits the data but assesses a penalty for having too many parameters. Lower means a better fit, accounting for potential overfitting.
AIC(fit_cs, fit_ar1)
df AIC
fit_cs 17 497.4627
fit_ar1 15 496.8801
We see that the two models have nearly identical AIC values. We will go with the autoregressive model because it is simpler, as shown by its smaller number of parameters (the df column). The difference is small enough that either model would probably be acceptable.
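Optionally, before moving on, you can also peek at the estimated variance components and the AR(1) correlation from the chosen model by printing its variance-covariance summary:
# Random-effect standard deviations and correlations, including the AR(1) parameter
VarCorr(fit_ar1)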
Now, we can look at the results. There are a few different marginal means we can estimate. We can look at the effect of fungicide averaged across times as well as compare the effect of fungicide within each time point.
emm_fung <- emmeans(fit_ar1, pairwise ~ fungicide, type = 'response', adjust = 'tukey')
NOTE: Results may be misleading due to involvement in interactions
emm_fung_by_time <- emmeans(fit_ar1, pairwise ~ fungicide | time, type = 'response', adjust = 'tukey')
emm_fung
$emmeans
fungicide prob SE df asymp.LCL asymp.UCL
None 0.734 0.0159 Inf 0.702 0.764
DACD 0.789 0.0137 Inf 0.761 0.815
RT 0.754 0.0152 Inf 0.723 0.782
TSX 0.771 0.0145 Inf 0.741 0.798
Results are averaged over the levels of: time
Confidence level used: 0.95
Intervals are back-transformed from the logit scale
$contrasts
contrast odds.ratio SE df null z.ratio p.value
None / DACD 0.737 0.0555 Inf 1 -4.047 0.0003
None / RT 0.904 0.0674 Inf 1 -1.356 0.5273
None / TSX 0.823 0.0617 Inf 1 -2.594 0.0468
DACD / RT 1.226 0.0926 Inf 1 2.694 0.0356
DACD / TSX 1.117 0.0847 Inf 1 1.455 0.4649
RT / TSX 0.911 0.0685 Inf 1 -1.239 0.6020
Results are averaged over the levels of: time
P value adjustment: tukey method for comparing a family of 4 estimates
Tests are performed on the log odds ratio scale
emm_fung_by_time
$emmeans
time = 12:
fungicide prob SE df asymp.LCL asymp.UCL
None 0.779 0.0168 Inf 0.744 0.810
DACD 0.819 0.0148 Inf 0.788 0.846
RT 0.794 0.0161 Inf 0.761 0.824
TSX 0.815 0.0150 Inf 0.784 0.843
time = 20:
fungicide prob SE df asymp.LCL asymp.UCL
None 0.721 0.0192 Inf 0.682 0.757
DACD 0.779 0.0168 Inf 0.745 0.810
RT 0.735 0.0187 Inf 0.696 0.769
TSX 0.758 0.0177 Inf 0.722 0.791
time = 42:
fungicide prob SE df asymp.LCL asymp.UCL
None 0.699 0.0199 Inf 0.659 0.737
DACD 0.767 0.0173 Inf 0.732 0.799
RT 0.728 0.0189 Inf 0.690 0.764
TSX 0.732 0.0187 Inf 0.694 0.767
Confidence level used: 0.95
Intervals are back-transformed from the logit scale
$contrasts
time = 12:
contrast odds.ratio SE df null z.ratio p.value
None / DACD 0.775 0.0836 Inf 1 -2.358 0.0853
None / RT 0.912 0.0970 Inf 1 -0.871 0.8201
None / TSX 0.798 0.0858 Inf 1 -2.103 0.1522
DACD / RT 1.176 0.1277 Inf 1 1.490 0.4438
DACD / TSX 1.029 0.1129 Inf 1 0.257 0.9941
RT / TSX 0.875 0.0948 Inf 1 -1.233 0.6055
time = 20:
contrast odds.ratio SE df null z.ratio p.value
None / DACD 0.733 0.0760 Inf 1 -2.994 0.0146
None / RT 0.935 0.0954 Inf 1 -0.662 0.9114
None / TSX 0.823 0.0847 Inf 1 -1.889 0.2327
DACD / RT 1.275 0.1327 Inf 1 2.335 0.0903
DACD / TSX 1.123 0.1178 Inf 1 1.109 0.6841
RT / TSX 0.881 0.0909 Inf 1 -1.228 0.6090
time = 42:
contrast odds.ratio SE df null z.ratio p.value
None / DACD 0.705 0.0724 Inf 1 -3.400 0.0038
None / RT 0.867 0.0879 Inf 1 -1.412 0.4916
None / TSX 0.850 0.0863 Inf 1 -1.600 0.3787
DACD / RT 1.229 0.1271 Inf 1 1.990 0.1915
DACD / TSX 1.205 0.1248 Inf 1 1.804 0.2712
RT / TSX 0.981 0.1003 Inf 1 -0.187 0.9977
P value adjustment: tukey method for comparing a family of 4 estimates
Tests are performed on the log odds ratio scale
The above results show that the fungicides’ effect on increasing the probability of healthy seedlings is modest at best. The marginal probability of healthy plants in the untreated control is 73.4%, and the fungicides increase this by only a few percentage points. Averaged across all three time points, the odds ratio comparing the untreated control to DACD, and to TSX, is significantly less than 1, indicating some level of effectiveness for those two fungicides. However, we cannot reject the null hypothesis that RT has the same odds of healthy plants as the control.
Separating by time point, we can see that the estimated probability of healthy plants decreases over time. There is modest evidence for an interaction between fungicide and time. The DACD fungicide is generally the most effective, and RT is generally ineffective. The interaction with time is seen because the effectiveness of DACD fungicide relative to the untreated control decreases over time.
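If you want a formal overall test of the fungicide-by-time interaction, analogous to the Type III tests of fixed effects from PROC GLIMMIX, one option is the joint_tests() function from emmeans:
# F-test-style joint tests of each fixed effect, including the interaction
joint_tests(fit_ar1)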
Let’s also make some plots to better visualize the model predictions.
We will use the cld()
function from the
multcomp package to summarize the multiple comparisons
with letters. We will annotate the means and 95% confidence interval
error bars with the letters. The multiple comparisons are
Tukey-adjusted. By default Sidak adjustment is made to the 95%
confidence intervals, as you can tell from the note that displays when
the code runs. That’s why the intervals shown on this plot are wider
than the ones in the table above. You can print the CLD object to see
the adjusted intervals.
The results averaged across time points and within each time point are combined into a single data frame to produce a multi-panel plot.
cld_fung <- cld(emm_fung$emmeans, adjust = 'tukey', Letters = letters)
Note: adjust = "tukey" was changed to "sidak"
because "tukey" is only appropriate for one set of pairwise comparisons
cld_fung_by_time <- cld(emm_fung_by_time$emmeans, adjust = 'tukey', Letters = letters)
Note: adjust = "tukey" was changed to "sidak"
because "tukey" is only appropriate for one set of pairwise comparisons
plot_data <- bind_rows(as_tibble(cld_fung), as_tibble(cld_fung_by_time)) %>%
mutate(time = fct_na_value_to_level(time, 'overall') %>%
factor(labels = c(paste(c(12, 20, 42), 'days after planting'), 'overall')))
ggplot(plot_data, aes(x = fungicide, y = prob, ymin = asymp.LCL, ymax = asymp.UCL, label = trimws(.group))) +
geom_errorbar(linewidth = 0.9, width = 0.2) +
geom_point(size = 3) +
geom_text(aes(y = asymp.UCL), vjust = -0.5, size = 5) +
facet_wrap(~ time) +
theme_bw() +
scale_y_continuous(name = 'probability of healthy plant', limits = c(0.6, 0.9))
Similar models to the models fit above in R can be fit, and means
comparisons evaluated, with the following code (output not shown).
Because method = laplace
, a maximum-likelihood method, is
specified, AIC values are produced with the default output. The
unstructured covariance model also fails to converge, just like we
observed using R.
/* Unstructured covariance */
proc glimmix data = fungicide method = laplace;
class block fungicide time;
model n_healthy/n_plants = fungicide time fungicide*time;
random intercept / subject = block;
random time / subject = fungicide(block) type = un;
lsmeans fungicide / ilink cl diff lines adjust = tukey;
slice fungicide*time / sliceby = time ilink cl diff lines adjust = tukey;
run;
/* Compound symmetry */
proc glimmix data = fungicide method = laplace;
class block fungicide time;
model n_healthy/n_plants = fungicide time fungicide*time;
random intercept / subject = block;
random time / subject = fungicide(block) type = cs;
lsmeans fungicide / ilink cl diff lines adjust = tukey;
slice fungicide*time / sliceby = time ilink cl diff lines adjust = tukey;
run;
/* AR1 */
proc glimmix data = fungicide method = laplace;
class block fungicide time;
model n_healthy/n_plants = fungicide time fungicide*time;
random intercept / subject = block;
random time / subject = fungicide(block) type = ar(1);
lsmeans fungicide / ilink cl diff lines adjust = tukey;
slice fungicide*time / sliceby = time ilink cl diff lines adjust = tukey;
run;
What have you learned in this lesson?

- How to fit a binomial GLMM with a logit link function, in both R and SAS
- How to check GLMM diagnostics with simulated residuals from DHARMa
- How to fit repeated measures GLMMs with different covariance structures using glmmTMB, and compare them with AIC
- How to estimate and compare marginal means on the response (probability) scale with emmeans
Nice job! Pat yourselves on the back once again!
As in the previous lessons, here are some exercises for additional practice. Also as in the previous lessons, these exercises give you some SAS code and ask you to write some R code that does the same job or close to it. For this set of exercises, we will use another dataset from the agridat package: repeated measures of lettuce weights originally collected and analyzed by Levi Pederson for his MS thesis.
The lettuce plants were grown hydroponically and were randomly
assigned to three different density treatments: treatment 1 had the
lowest density, treatment 2 intermediate, and treatment 3 the highest.
Plants were weighed twice a week for 40 days. More details can be found
in the help documentation: ?pederson.lettuce.repeated
.
Load the data like this:
library(agridat)
data('pederson.lettuce.repeated')
Here are trends of weight over time for the different plants. I’ve also plotted a line with the median weight by day for each treatment.
ggplot(pederson.lettuce.repeated, aes(x = day, y = weight, color = trt)) +
geom_line(aes(group = plant), linewidth = .5, alpha = .5) +
stat_summary(fun = 'median', geom = 'line', linewidth = 1.5) +
theme_bw() +
scale_color_okabeito()
Here is a histogram of the weight data across all treatments and time points.
ggplot(pederson.lettuce.repeated, aes(x = weight)) +
geom_histogram() +
theme_bw() +
scale_y_continuous(expand = expansion(add = c(0, 1)))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Think about a model formula to analyze these data.
Use glmmTMB()
to fit a model to this dataset. Use a
compound symmetry covariance structure to account for the repeated
measures.
In SAS, the class
, model
, and
random
statements for a similar model fit with
PROC GLIMMIX
would be:
class plant trt day;
model weight = trt day trt*day;
random day / subject=plant residual type=cs;
HINT: The covariance structure part of your model formula will look like ...(0 + day | plant), but what should you put in place of the ...?

Simulate residuals and plot the simulated residual diagnostics. What do you notice?

HINT: Use simulateResiduals().

Fit the model again with family = 'lognormal'
to account
for the positive, right-skewed distribution of weights seen in the
histogram. Simulate residuals again and plot. What do you notice
now?
In SAS, the model statement would be:
model weight = trt day trt*day / dist=lognormal;
Fit the model again with the lognormal response distribution and AR(1) covariance structure. Check residual diagnostics. Use AIC to compare this model with the compound symmetry lognormal model you fit in the previous exercise. What can you conclude?
In SAS, the corresponding random statement would be:
random day / subject=plant residual type=ar(1);
HINT: 0 + factor(day) must be in your random effects specification.

Use the following code to estimate treatment means at day 1, day 21,
and day 40, for the best model you identified so far. Replace the
...
with the name of the fitted model object.
trt_means_by_day <- emmeans(..., ~ trt | day, at = list(day = c(1, 21, 40)), type = 'response')
Now use cld
with Tukey adjustment to determine whether
the difference between the means of any pair of treatments is
significant at any of the time points. What can you conclude?
In SAS, the corresponding statement, which would produce means and comparisons for all days, would be:
slice trt*day / sliceby=day diff cl ilink linestable adjust=tukey;