R for SAS users: Lesson 2

What is this workshop?

  • For SAS users who want to learn R
  • Practicing researchers who have a decent working knowledge of SAS and of basic statistical analysis
  • SAS code is provided for comparison purposes
  • Lesson 2 in a series

Conceptual learning objectives

During this workshop, you will …

  • Review the steps of the “data to doc” pipeline we covered in Lesson 1
  • Learn how to fit linear mixed models to different experimental designs
  • Learn about estimated marginal means, otherwise known as least-square means

Practical skills

We will …

  • Import the data from a CSV file
  • Clean and reshape the data
  • Calculate some summary statistics and make some exploratory plots
  • Fit a linear mixed-effects model with categorical fixed effects
  • Fit and compare linear mixed-effects models with random intercepts and random slopes
  • Make plots and tables of results

How to follow along with this workshop

  • Slides and text version of lessons are online
    • usda-ree-ars.github.io/SEAStats
  • Fill in R code in the worksheet (replace ... with code)
  • You can always copy and paste code from text version of lesson if you fall behind

Load R packages

  • lme4 is a modern package for mixed model fitting, limited to simpler models
  • For more complex models, alternatives like glmmTMB and brms (Bayesian) are available
library(tidyverse)
library(lme4)
library(easystats)
library(lmerTest)
library(emmeans)
library(multcomp)

theme_set(theme_bw())

Background on dataset

  • Fava beans!
  • Provided by Andrea Onofri, an R blogger

fava bean plant with flowers
Image by Josep Gesti

  • Compare performance (yield) of six fava bean genotypes
  • Two sowing treatments: spring sowing and autumn sowing

fava beans (ful medames) with hummus
Image by News from Jerusalem

Experimental design

  • Split-plot design
  • Sowing time (2 levels, spring and autumn) is main plot factor
  • Genotype (6 levels) is the subplot factor
  • Four replicated blocks
  • Entire experiment replicated in two locations (Badiola & Papiano)
  • Repeated in both locations for three seasons (2001-02, 2002-03, 2003-04)

Repeated split plot design

  • Main plot factor: sowing treatment (spring vs. autumn)
  • Subplot factor: genotype (6 genotypes in this study)
    • The main sowing plots are split into six subplots, one for each genotype
    • Note randomization of genotype position
  • Each block, or “rep,” is replicated four times
    • Note randomization of both genotype and sowing position
  • The whole setup is repeated at two different locations
  • At each of the two locations, the experiment is repeated for three years
  • We’re considering location-year combination to be unique “environments”
  • Replication within environment and repetition at multiple environments are both important!

Import the data from a file

SAS

filename csvFile url "https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/fava.csv";
proc import datafile = csvFile out = fava replace dbms = csv; run;

R

If you are using cloud server:

fava <- read_csv('data/fava.csv')

If you are running code locally:

fava <- read_csv('https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/fava.csv')

Examine the data

SAS

proc print data = fava(obs = 10); run;

ods select variables;
proc contents data = fava; run;

R

fava

glimpse(fava)
  • Reminder that you can just type the name of the dataframe into the console to see the first few rows and some summary info

Pre-processing: SAS

  • Create new Environment column, combining Location and Year
    • In SAS this is done with the data step
data fava; set fava;
    Environment = catx('_', Location, Year);
run; 

Pre-processing: R

  • Convert Location, Year, Sowing, Genotype, and Block columns to factors in R
    • In SAS this will be done with the class statement in the modeling proc
  • Also create Environment column as in SAS
  • mutate(): Transform existing columns or create new ones
  • Use %>% operator for better code readability
  • across(): Apply same function to many columns, can use : shorthand
  • Assign the result back to fava
fava <- fava %>%
  mutate(
    across(Location:Block, factor),
    Environment = interaction(Location, Year)
  )

Exploratory plots

  • SAS proc sgplot and proc sgpanel with vbox statement
  • R ggplot() with geom_boxplot(), optionally facet_grid() to make panels

SAS

proc sgplot data = fava;
    vbox Yield / group = Genotype;
run; 

proc sgplot data = fava;
    vbox Yield / category=Genotype group=Sowing groupdisplay=cluster;
run;
    
proc sgpanel data = fava;
    panelby Environment;
    vbox Yield / category=Genotype group=Sowing groupdisplay=cluster;
run;

R

ggplot(fava, aes(x = Genotype, y = Yield)) + 
  geom_boxplot()

ggplot(fava, aes(x = Genotype, y = Yield, fill = Sowing, group = interaction(Genotype, Sowing))) + 
  geom_boxplot()

ggplot(fava, aes(x = Genotype, y = Yield, fill = Sowing, group = interaction(Genotype, Sowing))) + 
  geom_boxplot() +
  facet_grid(Location ~ Year)
  • If everything is lumped together, not much difference between genotypes
  • If you split by sowing time, you can see a treatment effect
  • Things get messy when you also split by environment
    • sowing time effect is reduced in 2003-04
    • some genotype x environment interaction
  • Important to repeat experiments … and to visualize your data before analyzing!

Fit a model to one environment

  • Start simple and gradually build up to the final model we want
  • Let’s start by fitting a model to the data from one environment
  • Even simpler: first only fit the block model
    • Only random effects, no fixed/treatment effects

Fit a model to one environment: SAS

  • Subset to one environment with data step
  • Use proc glimmix for mixed model fitting
  • My SAS code shows only some of the output
data papiano0203; set fava;
    where Location = 'papiano' & Year = '2002-2003';
run;

proc glimmix data = papiano0203;
    class Environment Block Sowing Genotype;
    model Yield = / solution;
    random Block Sowing(Block) / solution;
  ods select covparms parameterestimates tests3 solutionr;
run;

Fit a model to one environment: R

Yield ~ 1 + (1 | Block) + (1 | Sowing:Block)
  • The dependent or response variable (Yield) is on the left side of the formula
  • A tilde ~ separates the dependent from the independent variables
  • Independent variables, including both fixed and random terms, are separated with +
  • Here the only fixed effect is the global intercept (1)
  • Random effects terms have a design side (on the left hand) and group side (on the right hand) separated by |.
    • 1 on the design side means only fit random intercepts and no random slopes
    • (1 | Block): each block will get its own random intercept
    • (1 | Sowing:Block): each main plot (unique combinations of Sowing and Block) will get its own random intercept
  • data argument specifies the data frame that has the variables in the formula
  • subset argument selects rows of the data frame to fit the model to
    • Multiple conditions separated with &
fit_1env <- lmer(Yield ~ 1 + (1 | Block) + (1 | Sowing:Block), 
                 data = fava, 
                 subset = Location == 'papiano' & Year == '2002-2003')

Examine model output in R

summary(fit_1env)
ranef(fit_1env)
icc(fit_1env, by_group = TRUE)
  • summary(): residual summary, random effect variance/covariance, fixed effect estimates
    • Here random effects are block (Block), main plot (Sowing:Block), and subplot (Residual because each data point is a subplot)
    • Fixed effect is the global intercept
  • ranef() shows random effects, like random / solution in SAS
    • Some main plots have higher yield than others but little difference between blocks
  • icc() shows intraclass correlation coefficient
    • ~80% of variation at this location is among the main plots

Model with fixed effects: SAS

proc glimmix data = papiano0203;
    class Environment Block Sowing Genotype;
    model Yield = Genotype|Sowing / solution;
    random Block Sowing(Block) / solution;
  ods select covparms parameterestimates tests3 solutionr;
run;

Model with fixed effects: R

  • Fixed effects: Genotype + Sowing + Genotype:Sowing
    • : indicates an interaction
  • Or use shorthand Genotype * Sowing
    • * indicates all combinations of main effects and interaction terms
    • Different symbols than SAS!
fit_1env_fixef <- lmer(Yield ~ Genotype * Sowing + (1 | Block) + (1 | Sowing:Block), 
                       data = fava, 
                       subset = Location == 'papiano' & Year == '2002-2003')
summary(fit_1env_fixef)
ranef(fit_1env_fixef)
icc(fit_1env_fixef, by_group = TRUE)
  • Overall intercept and 5 coefficients for genotype (the first of 6 is the reference level)
  • 1 coefficient for sowing treatment (autumn is reference level)
  • 5 interaction coefficients
  • We can safely ignore correlation between fixed effects because they were experimentally randomized
  • ICCs are much lower: we’ve added fixed effects that explain a lot of the variation
  • Now that sowing (main plot) is included, most random variation is by block
  • If you really want the ANOVA table, use anova()
    • only if you fit the model with lmerTest package loaded
    • set method of approximating degrees of freedom: anova(fit, ddf = 'Kenward-Roger')

Fit model to all environments

  • Let’s fit a model to the full dataset to see if effects vary across environments
  • Should environment be fixed or random? Answer: It depends.
    • Here we will treat it as random
    • We treat environments as samples from a population of potential environments
    • Here we care more about overall effects, not comparing mean of environment A to mean of environment B
  • We need more random intercept terms now
    • environment (combination of location and year)
    • block nested within environment
    • main plot nested within block nested within environment

Fit model to all environments: SAS

proc glimmix data = fava;
    class Environment Block Sowing Genotype;
    model Yield = Genotype|Sowing / solution;
    random Environment Block(Environment) Sowing(Block Environment) / solution;
  ods select parameterestimates tests3;
run;

Fit model to all environments: R

fit_allenv <- lmer(Yield ~ Genotype * Sowing + (1 | Environment) + (1 | Block:Environment) + (1 | Sowing:Block:Environment), 
                   data = fava)
  • Random intercepts: Environment, Block:Environment, Sowing:Block:Environment (main plot)
  • We get a singular fit message!

Singular fit

  • What does that mean?!?!?
  • At least one variance or covariance of the random effects, or linear combinations of random effects, is exactly equal to zero or one
  • OK, what does that mean in English?
  • The algorithm that fits the model parameters doesn’t have enough data to get a good estimate
    • The model may be too complex for the data we have
    • Or some of the random effects may be very small and can’t be distinguished from zero
    • Some random effects, or combinations of them, may be perfectly correlated with each other

What can you do about a singular fit?

  • It often occurs when trying to fit a model with complex multilevel nested structure to a fairly small dataset
    • That’s what we’re trying to do here!
  • Look at the variance-covariance matrix to see which terms are zero or perfectly correlated with others
  • Refit the model with fewer random effects terms
  • Or you can just go ahead with the singular fit model
VarCorr(fit_allenv)
  • Blocks within each environment have zero variance (block effect is so small it can’t be estimated)

Refit model without block random intercept

SAS

proc glimmix data = fava plots = residualpanel;
    class Environment Block Sowing Genotype;
    model Yield = Genotype|Sowing / solution;
    random Environment Sowing(Block Environment) / solution;
  ods select parameterestimates tests3;
run;

R

fit_allenv_noblock <- lmer(Yield ~ Genotype * Sowing + (1 | Environment) + (1 | Sowing:Block:Environment), 
                           data = fava)
  • This deals with the singular fit message!

Check model diagnostics

  • In SAS plots = residualpanel option within proc glimmix gives you diagnostics
  • In R we will use check_model() from the easystats package again
check_model(fit_allenv_noblock, check = c('linearity', 'homogeneity', 'qq'))
  • Diagnostic plots look great (no trends, straight Q-Q line)

Fit models with random slope terms

  • Random intercepts: “baseline” may vary by environment but same effect size of treatment across environments
  • Random slopes: allow one or more treatment effect to vary by environment
  • you could also have random slopes by block or main plot but we won’t do that here
  • Add additional terms to the design (left) side of the random effect formula in lme4 syntax
    • (1 + Genotype + Sowing | Environment)
    • Separate intercept (1), genotype effect, and sowing effect for each environment

SAS

proc glimmix data = fava plots = residualpanel;
    class Environment Block Sowing Genotype;
    model Yield = Genotype|Sowing / solution;
    random intercept Genotype Sowing / subject = Environment;
    random intercept / subject = Sowing(Block Environment);
  ods select parameterestimates tests3;
run;

R

fit_allenv_allrandomslopes <- lmer(Yield ~ Genotype * Sowing + (1 + Genotype + Sowing | Environment) + (1 | Sowing:Block:Environment), 
                                   data = fava)
  • Singular fit again!
VarCorr(fit_allenv_allrandomslopes)
  • No individual effects are zero but some linear combinations of random effects may be equivalent

Simpler random slopes model

  • Get rid of the random slope for genotype, keep the random slope for sowing.

SAS

proc glimmix data = fava plots = residualpanel;
    class Environment Block Sowing Genotype;
    model Yield = Genotype|Sowing / solution;
    random intercept Sowing / subject = Environment solution;
    random intercept / subject = Sowing(Block Environment);
  ods select parameterestimates tests3;
    store fit_allenv_sowingrandomslope;
run;

R

fit_allenv_sowingrandomslope <- lmer(Yield ~ Genotype * Sowing + (1 + Sowing | Environment) + (1 | Sowing:Block:Environment), 
                                     data = fava)
  • No more singular fit message!
ranef(fit_allenv_sowingrandomslope)$Environment
  • Random slope for spring sowing treatment is positive for the two 2003-04 environments
  • This makes sense if you look at the raw data plot
  • So it’s probably good to keep allowing sowing effect to vary by environment

Check residual diagnostics again

check_model(fit_allenv_sowingrandomslope, check = c('linearity', 'homogeneity', 'qq'))

Compare models

  • Use AIC information criterion to compare models
  • SAS proc glimmix outputs AIC by default
  • In R, use AIC() to compare several models at once
AIC(fit_allenv, fit_allenv_noblock, fit_allenv_allrandomslopes, fit_allenv_sowingrandomslope)
  • Sowing random slope model has lowest AIC (best fit)

Keep in mind AIC is a relative measure!

meme about statisticians

Estimated marginal means

  • Expected values of means by treatment, averaged across other fixed effects and random effects
  • Confidence limits with approximated degrees of freedom
  • Allows pairwise comparisons with multiple-comparisons adjustment of p-values
  • Popularized by SAS lsmeans
  • R package emmeans by University of Iowa statistician Russ Lenth does all lsmeans does, and more

Estimated marginal means: SAS

  • Separated out into proc plm but could also be done in proc glimmix
proc plm restore = fit_allenv_sowingrandomslope;
    lsmeans Genotype Sowing / diff linestable adjust = sidak;
    slice Genotype * Sowing / sliceby = Sowing diff linestable adjust = sidak;
  ods select lsmeans lsmlines slicediffs slicelines;
run;

Estimated marginal means: R

  • Syntax: emmeans(fit, ~ fixedeffects)
    • fit is fitted model object
    • one-sided formula, separate fixed effects with +
    • group by using | to compare means by treatment, within another treatment
emm_geno <- emmeans(fit_allenv_sowingrandomslope, ~ Genotype)
emm_sowing <- emmeans(fit_allenv_sowingrandomslope, ~ Sowing)
emm_geno_x_sowing <- emmeans(fit_allenv_sowingrandomslope, ~ Genotype + Sowing)
emm_geno_by_sowing <- emmeans(fit_allenv_sowingrandomslope, ~ Genotype | Sowing)

Summary of emmeans

  • Type name of object in console to get summary
  • 95% confidence intervals
  • By default, Kenward-Roger method used to approximate degrees of freedom
  • Also try plot(emm) to make a basic plot with confidence intervals
emm_geno
emm_sowing
emm_geno_x_sowing
emm_geno_by_sowing

Pairwise treatment contrasts

  • No adjustment needed for sowing (only one comparison)
  • There are 6 genotypes so there are 15 pairwise comparisons
  • p-values adjusted with Sidak adjustment, but other methods can be used
contrast(emm_geno, 'pairwise', adjust = 'sidak')
contrast(emm_sowing, 'pairwise')

Pairwise contrasts for multiple treatments

contrast(emm_geno_x_sowing, 'pairwise', adjust = 'sidak')
contrast(emm_geno_by_sowing, 'pairwise', adjust = 'sidak')

Multiple comparison letters

  • In SAS, we use lines and/or linestable options in lsmeans statement
  • In R, we can use cld() from the multcomp package (CLD = compact letter display)
  • Use emmeans object as first argument
  • Again, other adjustment methods may be used
cld(emm_geno, adjust = 'sidak', Letters = letters)
cld(emm_geno_by_sowing, adjust = 'sidak', Letters = letters)

Conclusion

What have you learned in this lesson?

  • How to fit linear mixed models to data from split-plot experiments
  • How to fit linear mixed models to data from experiments repeated at multiple locations
  • How to fit linear mixed models with different random intercepts and random slopes
  • How to diagnose and deal with singular fits
  • How to generate estimated marginal means and compare them with post-hoc hypothesis tests

Impressive!