R for SAS users: Lesson 2

What is this workshop?

For SAS users who want to learn R
Practicing researchers who have a decent working knowledge of SAS and of basic statistical analysis
SAS code is provided for comparison purposes
Lesson 2 in a series

Conceptual learning objectives

During this workshop, you will …

Review the steps of the “data to doc” pipeline we covered in Lesson 1
Learn how to fit linear mixed models to different experimental designs
Learn about estimated marginal means, otherwise known as least-square means

Practical skills

We will …

Import the data from a CSV file
Clean and reshape the data
Calculate some summary statistics and make some exploratory plots
Fit a linear mixed-effects model with categorical fixed effects
Fit and compare linear mixed-effects models with random intercepts and random slopes
Make plots and tables of results

How to follow along with this workshop

Slides and text version of lessons are online
- usda-ree-ars.github.io/SEAStats
Fill in R code in the worksheet (replace ... with code)
You can always copy and paste code from text version of lesson if you fall behind

Load R packages

lme4 is a modern package for mixed model fitting, limited to simpler models
For more complex models, alternatives like glmmTMB and brms (Bayesian) are available

library(tidyverse)
library(lme4)
library(easystats)
library(lmerTest)
library(emmeans)
library(multcomp)

theme_set(theme_bw())

Background on dataset

Fava beans!
Provided by Andrea Onofri, an R blogger

fava bean plant with flowers
Image by Josep Gesti

Compare performance (yield) of six fava bean genotypes
Two sowing treatments: spring sowing and autumn sowing

fava beans (ful medames) with hummus
Image by News from Jerusalem

Experimental design

Split-plot design
Sowing time (2 levels, spring and autumn) is main plot factor
Genotype (6 levels) is the subplot factor
Four replicated blocks
Entire experiment replicated in two locations (Badiola & Papiano)
Repeated in both locations for three seasons (2001-02, 2002-03, 2003-04)

Repeated split plot design

Main plot factor: sowing treatment (spring vs. autumn)

Subplot factor: genotype (6 genotypes in this study)
- The main sowing plots are split into six subplots, one for each genotype
- Note randomization of genotype position

Each block, or “rep,” is replicated four times
- Note randomization of both genotype and sowing position

The whole setup is repeated at two different locations

At each of the two locations, the experiment is repeated for three years
We’re considering location-year combination to be unique “environments”
Replication within environment and repetition at multiple environments are both important!

Import the data from a file

SAS

filename csvFile url "https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/fava.csv";
proc import datafile = csvFile out = fava replace dbms = csv; run;

R

If you are using cloud server:

fava <- read_csv('data/fava.csv')

If you are running code locally:

fava <- read_csv('https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/fava.csv')

Examine the data

SAS

proc print data = fava(obs = 10); run;

ods select variables;
proc contents data = fava; run;

R

fava

glimpse(fava)

Reminder that you can just type the name of the dataframe into the console to see the first few rows and some summary info

Pre-processing: SAS

Create new Environment column, combining Location and Year
- In SAS this is done with the data step

data fava; set fava;
    Environment = catx('_', Location, Year);
run;

Pre-processing: R

Convert Location, Year, Sowing, Genotype, and Block columns to factors in R
- In SAS this will be done with the class statement in the modeling proc
Also create Environment column as in SAS
mutate(): Transform existing columns or create new ones
Use %>% operator for better code readability
across(): Apply same function to many columns, can use : shorthand
Assign the result back to fava

fava <- fava %>%
  mutate(
    across(Location:Block, factor),
    Environment = interaction(Location, Year)
  )

Exploratory plots

SAS proc sgplot and proc sgpanel with vbox statement
R ggplot() with geom_boxplot(), optionally facet_grid() to make panels

SAS

proc sgplot data = fava;
    vbox Yield / group = Genotype;
run; 

proc sgplot data = fava;
    vbox Yield / category=Genotype group=Sowing groupdisplay=cluster;
run;
    
proc sgpanel data = fava;
    panelby Environment;
    vbox Yield / category=Genotype group=Sowing groupdisplay=cluster;
run;

R

ggplot(fava, aes(x = Genotype, y = Yield)) + 
  geom_boxplot()

ggplot(fava, aes(x = Genotype, y = Yield, fill = Sowing, group = interaction(Genotype, Sowing))) + 
  geom_boxplot()

ggplot(fava, aes(x = Genotype, y = Yield, fill = Sowing, group = interaction(Genotype, Sowing))) + 
  geom_boxplot() +
  facet_grid(Location ~ Year)

If everything is lumped together, not much difference between genotypes
If you split by sowing time, you can see a treatment effect
Things get messy when you also split by environment
- sowing time effect is reduced in 2003-04
- some genotype x environment interaction
Important to repeat experiments … and to visualize your data before analyzing!

Fit a model to one environment

Start simple and gradually build up to the final model we want
Let’s start by fitting a model to the data from one environment
Even simpler: first only fit the block model
- Only random effects, no fixed/treatment effects

Fit a model to one environment: SAS

Subset to one environment with data step
Use proc glimmix for mixed model fitting
My SAS code shows only some of the output

data papiano0203; set fava;
    where Location = 'papiano' & Year = '2002-2003';
run;

proc glimmix data = papiano0203;
    class Environment Block Sowing Genotype;
    model Yield = / solution;
    random Block Sowing(Block) / solution;
  ods select covparms parameterestimates tests3 solutionr;
run;

Fit a model to one environment: R

Yield ~ 1 + (1 | Block) + (1 | Sowing:Block)

The dependent or response variable (Yield) is on the left side of the formula
A tilde ~ separates the dependent from the independent variables
Independent variables, including both fixed and random terms, are separated with +
Here the only fixed effect is the global intercept (1)
Random effects terms have a design side (on the left hand) and group side (on the right hand) separated by |.
- 1 on the design side means only fit random intercepts and no random slopes
- (1 | Block): each block will get its own random intercept
- (1 | Sowing:Block): each main plot (unique combinations of Sowing and Block) will get its own random intercept

data argument specifies the data frame that has the variables in the formula
subset argument selects rows of the data frame to fit the model to
- Multiple conditions separated with &

fit_1env <- lmer(Yield ~ 1 + (1 | Block) + (1 | Sowing:Block), 
                 data = fava, 
                 subset = Location == 'papiano' & Year == '2002-2003')

Examine model output in R

summary(fit_1env)
ranef(fit_1env)
icc(fit_1env, by_group = TRUE)

summary(): residual summary, random effect variance/covariance, fixed effect estimates
- Here random effects are block (Block), main plot (Sowing:Block), and subplot (Residual because each data point is a subplot)
- Fixed effect is the global intercept
ranef() shows random effects, like random / solution in SAS
- Some main plots have higher yield than others but little difference between blocks
icc() shows intraclass correlation coefficient
- ~80% of variation at this location is among the main plots

Model with fixed effects: SAS

proc glimmix data = papiano0203;
    class Environment Block Sowing Genotype;
    model Yield = Genotype|Sowing / solution;
    random Block Sowing(Block) / solution;
  ods select covparms parameterestimates tests3 solutionr;
run;

Model with fixed effects: R

Fixed effects: Genotype + Sowing + Genotype:Sowing
- : indicates an interaction
Or use shorthand Genotype * Sowing
- * indicates all combinations of main effects and interaction terms
- Different symbols than SAS!

fit_1env_fixef <- lmer(Yield ~ Genotype * Sowing + (1 | Block) + (1 | Sowing:Block), 
                       data = fava, 
                       subset = Location == 'papiano' & Year == '2002-2003')

summary(fit_1env_fixef)
ranef(fit_1env_fixef)
icc(fit_1env_fixef, by_group = TRUE)

Overall intercept and 5 coefficients for genotype (the first of 6 is the reference level)
1 coefficient for sowing treatment (autumn is reference level)
5 interaction coefficients
We can safely ignore correlation between fixed effects because they were experimentally randomized

ICCs are much lower: we’ve added fixed effects that explain a lot of the variation
Now that sowing (main plot) is included, most random variation is by block

If you really want the ANOVA table, use anova()
- only if you fit the model with lmerTest package loaded
- set method of approximating degrees of freedom: anova(fit, ddf = 'Kenward-Roger')

Fit model to all environments

Let’s fit a model to the full dataset to see if effects vary across environments
Should environment be fixed or random? Answer: It depends.
- Here we will treat it as random
- We treat environments as samples from a population of potential environments
- Here we care more about overall effects, not comparing mean of environment A to mean of environment B
We need more random intercept terms now
- environment (combination of location and year)
- block nested within environment
- main plot nested within block nested within environment

Fit model to all environments: SAS

proc glimmix data = fava;
    class Environment Block Sowing Genotype;
    model Yield = Genotype|Sowing / solution;
    random Environment Block(Environment) Sowing(Block Environment) / solution;
  ods select parameterestimates tests3;
run;

Fit model to all environments: R

fit_allenv <- lmer(Yield ~ Genotype * Sowing + (1 | Environment) + (1 | Block:Environment) + (1 | Sowing:Block:Environment), 
                   data = fava)

Random intercepts: Environment, Block:Environment, Sowing:Block:Environment (main plot)
We get a singular fit message!

Singular fit

What does that mean?!?!?
At least one variance or covariance of the random effects, or linear combinations of random effects, is exactly equal to zero or one
OK, what does that mean in English?
The algorithm that fits the model parameters doesn’t have enough data to get a good estimate
- The model may be too complex for the data we have
- Or some of the random effects may be very small and can’t be distinguished from zero
- Some random effects, or combinations of them, may be perfectly correlated with each other

What can you do about a singular fit?

It often occurs when trying to fit a model with complex multilevel nested structure to a fairly small dataset
- That’s what we’re trying to do here!
Look at the variance-covariance matrix to see which terms are zero or perfectly correlated with others
Refit the model with fewer random effects terms
Or you can just go ahead with the singular fit model

VarCorr(fit_allenv)

Blocks within each environment have zero variance (block effect is so small it can’t be estimated)

Refit model without block random intercept

SAS

proc glimmix data = fava plots = residualpanel;
    class Environment Block Sowing Genotype;
    model Yield = Genotype|Sowing / solution;
    random Environment Sowing(Block Environment) / solution;
  ods select parameterestimates tests3;
run;

R

fit_allenv_noblock <- lmer(Yield ~ Genotype * Sowing + (1 | Environment) + (1 | Sowing:Block:Environment), 
                           data = fava)

This deals with the singular fit message!

Check model diagnostics

In SAS plots = residualpanel option within proc glimmix gives you diagnostics
In R we will use check_model() from the easystats package again

check_model(fit_allenv_noblock, check = c('linearity', 'homogeneity', 'qq'))

Diagnostic plots look great (no trends, straight Q-Q line)

Fit models with random slope terms

Random intercepts: “baseline” may vary by environment but same effect size of treatment across environments
Random slopes: allow one or more treatment effect to vary by environment
you could also have random slopes by block or main plot but we won’t do that here
Add additional terms to the design (left) side of the random effect formula in lme4 syntax
- (1 + Genotype + Sowing | Environment)
- Separate intercept (1), genotype effect, and sowing effect for each environment

SAS

proc glimmix data = fava plots = residualpanel;
    class Environment Block Sowing Genotype;
    model Yield = Genotype|Sowing / solution;
    random intercept Genotype Sowing / subject = Environment;
    random intercept / subject = Sowing(Block Environment);
  ods select parameterestimates tests3;
run;

R

fit_allenv_allrandomslopes <- lmer(Yield ~ Genotype * Sowing + (1 + Genotype + Sowing | Environment) + (1 | Sowing:Block:Environment), 
                                   data = fava)

Singular fit again!

VarCorr(fit_allenv_allrandomslopes)

No individual effects are zero but some linear combinations of random effects may be equivalent

Simpler random slopes model

Get rid of the random slope for genotype, keep the random slope for sowing.

SAS

proc glimmix data = fava plots = residualpanel;
    class Environment Block Sowing Genotype;
    model Yield = Genotype|Sowing / solution;
    random intercept Sowing / subject = Environment solution;
    random intercept / subject = Sowing(Block Environment);
  ods select parameterestimates tests3;
    store fit_allenv_sowingrandomslope;
run;

R

fit_allenv_sowingrandomslope <- lmer(Yield ~ Genotype * Sowing + (1 + Sowing | Environment) + (1 | Sowing:Block:Environment), 
                                     data = fava)

No more singular fit message!

ranef(fit_allenv_sowingrandomslope)$Environment

Random slope for spring sowing treatment is positive for the two 2003-04 environments
This makes sense if you look at the raw data plot
So it’s probably good to keep allowing sowing effect to vary by environment

Check residual diagnostics again

check_model(fit_allenv_sowingrandomslope, check = c('linearity', 'homogeneity', 'qq'))

Compare models

Use AIC information criterion to compare models
SAS proc glimmix outputs AIC by default
In R, use AIC() to compare several models at once

AIC(fit_allenv, fit_allenv_noblock, fit_allenv_allrandomslopes, fit_allenv_sowingrandomslope)

Sowing random slope model has lowest AIC (best fit)

Keep in mind AIC is a relative measure!

meme about statisticians

Estimated marginal means

Expected values of means by treatment, averaged across other fixed effects and random effects
Confidence limits with approximated degrees of freedom
Allows pairwise comparisons with multiple-comparisons adjustment of p-values
Popularized by SAS lsmeans
R package emmeans by University of Iowa statistician Russ Lenth does all lsmeans does, and more

Estimated marginal means: SAS

Separated out into proc plm but could also be done in proc glimmix

proc plm restore = fit_allenv_sowingrandomslope;
    lsmeans Genotype Sowing / diff linestable adjust = sidak;
    slice Genotype * Sowing / sliceby = Sowing diff linestable adjust = sidak;
  ods select lsmeans lsmlines slicediffs slicelines;
run;

Estimated marginal means: R

Syntax: emmeans(fit, ~ fixedeffects)
- fit is fitted model object
- one-sided formula, separate fixed effects with +
- group by using | to compare means by treatment, within another treatment

emm_geno <- emmeans(fit_allenv_sowingrandomslope, ~ Genotype)
emm_sowing <- emmeans(fit_allenv_sowingrandomslope, ~ Sowing)
emm_geno_x_sowing <- emmeans(fit_allenv_sowingrandomslope, ~ Genotype + Sowing)
emm_geno_by_sowing <- emmeans(fit_allenv_sowingrandomslope, ~ Genotype | Sowing)

Summary of emmeans

Type name of object in console to get summary
95% confidence intervals
By default, Kenward-Roger method used to approximate degrees of freedom
Also try plot(emm) to make a basic plot with confidence intervals

emm_geno
emm_sowing
emm_geno_x_sowing
emm_geno_by_sowing

Pairwise treatment contrasts

No adjustment needed for sowing (only one comparison)
There are 6 genotypes so there are 15 pairwise comparisons
p-values adjusted with Sidak adjustment, but other methods can be used

contrast(emm_geno, 'pairwise', adjust = 'sidak')
contrast(emm_sowing, 'pairwise')

Pairwise contrasts for multiple treatments

contrast(emm_geno_x_sowing, 'pairwise', adjust = 'sidak')
contrast(emm_geno_by_sowing, 'pairwise', adjust = 'sidak')

Multiple comparison letters

In SAS, we use lines and/or linestable options in lsmeans statement
In R, we can use cld() from the multcomp package (CLD = compact letter display)
Use emmeans object as first argument
Again, other adjustment methods may be used

cld(emm_geno, adjust = 'sidak', Letters = letters)
cld(emm_geno_by_sowing, adjust = 'sidak', Letters = letters)

Conclusion

What have you learned in this lesson?

How to fit linear mixed models to data from split-plot experiments
How to fit linear mixed models to data from experiments repeated at multiple locations
How to fit linear mixed models with different random intercepts and random slopes
How to diagnose and deal with singular fits
How to generate estimated marginal means and compare them with post-hoc hypothesis tests

Impressive!