Introduction

What is this workshop?

This workshop is intended for SAS users who want to learn R. The people who will get the most out of this course are practicing researchers who have a decent working knowledge of SAS, and of basic statistical analysis (descriptive stats and regression models) as it applies to their field.

This is lesson 2 in a series. Lesson 1 covered the basics: importing data, cleaning and reshaping data, summary statistics, simple graphs and tables, and a few simple statistical models.

Download the worksheet for this lesson here.

What will you learn from this workshop?

Conceptual learning objectives

During this workshop, you will …

  • Review the steps of the “data to doc” pipeline we covered in Lesson 1
  • Learn how to fit linear mixed models to different experimental designs
  • Learn about estimated marginal means, otherwise known as least-square means

Practical skills

As in Lesson 1, we will work through a “data to doc” pipeline in R, comparing R code to SAS code each step of the way. We will use a different dataset this time.

We will …

  • Import the data from a CSV file
  • Clean and reshape the data
  • Calculate some summary statistics and make some exploratory plots
  • Fit a linear mixed-effects model with categorical fixed effects
  • Fit and compare linear mixed-effects models with random intercepts and random slopes
  • Make plots and tables of results

How to follow along with this workshop

  • Slides and text version of lessons are online
  • Fill in R code in the worksheet (replace ... with code)
  • You can always copy and paste code from text version of lesson if you fall behind
  • Notes on best practices will be marked with PROTIP as we go along!

Data to doc pipeline, take two

As in the previous lesson, we will start with raw data and work our way to a finished product. The first few steps of the pipeline will not be completely new to you if you did Lesson 1 … but it is good to get some extra practice!

Load R packages

Here we’ll load the R packages we are going to work with today. These are mostly the same as Lesson 1. This includes the tidyverse package for reading, manipulating, and plotting data, the lme4 package for fitting linear mixed models, and the easystats package which has some good model diagnostic plots. Also set a default plotting theme.

Note about packages: R has lots of mixed model fitting packages. This fits in with the idea that the best thing about R is “there are many ways to do something,” and the worst thing about R is also “there are many ways to do something.” I would say that nowadays the lme4 package is probably the most widely used package for fitting linear mixed models, so it’s important to be familiar with it if you are doing stats with R. The nlme package also has some useful capabilities so I would also recommend familiarizing yourself with it. Even more complex models can be fit with glmmTMB and other packages. Ultimately, you can fit multilevel models of any level of sophistication using Bayesian methods with packages like brms.

library(tidyverse)
library(lme4)
library(easystats)
library(lmerTest)
library(emmeans)
library(multcomp)

theme_set(theme_bw())

Background on dataset

In this lesson, we’re going to use a dataset kindly provided by Andrea Onofri, a regular contributor to r-bloggers. This tutorial is loosely based on this blog post.

fava bean plant with flowers

fava beans (ful medames) with hummus
fava beans growing in the field … and cooked into some delicious ful medames! (Images by Josep Gesti and News from Jerusalem)

Today we are going to explore the wonderful world of fava beans! This dataset, fava.csv, includes the results of a study comparing the performance (yield) of six different fava bean genotypes under two different sowing treatments: spring and autumn sowing.

Let’s try to understand the experimental design. This study was designed as a split-plot experiment with multiple replicates (blocks). Sowing time (spring and autumn) is the main plot factor and genotype (Chiaro, Collameno, Palomb, Scuro, Sicania, and Vesuvio) is the subplot factor. There are 2 sowing treatments and 6 genotypes, for a total of 12 genotype-treatment combinations. Furthermore, the experiment was repeated in two locations in Italy (Badiola and Papiano) for three years each (2001-2002, 2002-2003, and 2003-2004).

What does split-plot mean? This means that each block consists of two main plots (one sown in spring and the other in autumn), and each of those main plots is divided up into six subplots, with a different genotype planted in each one. There are four blocks at each of the locations. For the purposes of this lesson, we will consider each combination of location and year (six in all) to be a unique “environment.” It is crucial that we have replication within each environment, as well as repetition of the experiment in different environments so that we can determine how consistent the relative performance of the genotypes is under the different treatments, as environmental conditions vary.

The layout of the experiment at one of the two locations in one of the three seasons might look like the map below. Notice all plots sharing a sowing time are contiguous because the sowing main plots are split into genotype subplots, and notice that all twelve treatment combinations are present in each block. Also notice the randomized order of both treatments. This same setup with different randomized positions is repeated in each of the two locations in each of the three years.

Import the data from a file

As in Lesson 1, the example data for this lesson are hosted on GitHub, or in the data directory if you are on the cloud server.

SAS

filename csvFile url "https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/fava.csv";
proc import datafile = csvFile out = fava replace dbms = csv; run;

R

If you are running this on the cloud server:

fava <- read_csv('data/fava.csv')
Rows: 288 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Location, Year, Sowing, Genotype
dbl (2): Block, Yield

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

If you are running this on your own machine:

fava <- read_csv('https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/fava.csv')

Examine the data

In SAS, we would use proc print and proc contents to examine our dataset. In R tidyverse, we can type the name of the dataframe into the console to see the first few rows and print the data type of each column. The function glimpse() also provides a concise summary.

SAS

proc print data = fava(obs = 10); run;

ods select variables;
proc contents data = fava; run;
Obs Location Year Sowing Genotype Block Yield
1 papiano 2002-2003 autumn Chiaro 1 2.05
2 papiano 2002-2003 autumn Chiaro 2 2.5
3 papiano 2002-2003 autumn Chiaro 3 2.64
4 papiano 2002-2003 autumn Chiaro 4 2.45
5 papiano 2002-2003 autumn Collameno 1 2.01
6 papiano 2002-2003 autumn Collameno 2 2.19
7 papiano 2002-2003 autumn Collameno 3 2.38
8 papiano 2002-2003 autumn Collameno 4 2.35
9 papiano 2002-2003 autumn Palomb 2 2.38
10 papiano 2002-2003 autumn Palomb 1 2.42

Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat
5 Block Num 8 BEST12. BEST32.
4 Genotype Char 11 $11. $11.
1 Location Char 9 $9. $9.
3 Sowing Char 8 $8. $8.
2 Year Char 11 $11. $11.
6 Yield Num 8 BEST12. BEST32.

R

fava
# A tibble: 288 × 6
   Location Year      Sowing Genotype  Block Yield
   <chr>    <chr>     <chr>  <chr>     <dbl> <dbl>
 1 papiano  2002-2003 autumn Chiaro        1  2.05
 2 papiano  2002-2003 autumn Chiaro        2  2.5 
 3 papiano  2002-2003 autumn Chiaro        3  2.64
 4 papiano  2002-2003 autumn Chiaro        4  2.45
 5 papiano  2002-2003 autumn Collameno     1  2.01
 6 papiano  2002-2003 autumn Collameno     2  2.19
 7 papiano  2002-2003 autumn Collameno     3  2.38
 8 papiano  2002-2003 autumn Collameno     4  2.35
 9 papiano  2002-2003 autumn Palomb        2  2.38
10 papiano  2002-2003 autumn Palomb        1  2.42
# ℹ 278 more rows
glimpse(fava)
Rows: 288
Columns: 6
$ Location <chr> "papiano", "papiano", "papiano", "papiano", "papiano", "papia…
$ Year     <chr> "2002-2003", "2002-2003", "2002-2003", "2002-2003", "2002-200…
$ Sowing   <chr> "autumn", "autumn", "autumn", "autumn", "autumn", "autumn", "…
$ Genotype <chr> "Chiaro", "Chiaro", "Chiaro", "Chiaro", "Collameno", "Collame…
$ Block    <dbl> 1, 2, 3, 4, 1, 2, 3, 4, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2…
$ Yield    <dbl> 2.05, 2.50, 2.64, 2.45, 2.01, 2.19, 2.38, 2.35, 2.38, 2.42, 2…

Pre-processing

Let’s go ahead and convert the Location, Year, Sowing, Genotype, and Block columns to factors because they are all discrete. This isn’t strictly necessary for the ones other than Block because R will automatically convert the character variables to factors when fitting the models, but it is a good practice. We only need to do this in R right now because in SAS this is done using the class statement in the modeling proc.

Also, let’s create a new column called Environment which is the combination of Location and Year, because we’re considering location-year combinations to be unique environments for this analysis. The interaction function combines two or more factors to produce a new factor. We’ll do this in both SAS and R. In SAS it’s done with a data step.

SAS

data fava; set fava;
    Environment = catx('_', Location, Year);
run; 

R

The mutate() function is used to transform existing columns or create new ones. We do both here. Note that we assign the result back to fava, which “overwrites” the dataframe with our modified one. The pipe operator %>% is used to put each data-manipulation operation on its own line for better code readability.

fava <- fava %>%
  mutate(
    across(Location:Block, factor),
    Environment = interaction(Location, Year)
  )

Notice the across() function is used to apply the same function, factor(), to many columns. We can use the : shorthand to apply the function to a block of contiguous columns.

Exploratory plots

Let’s take a look at the data. In SAS we can use proc sgplot with a vbox statement to make boxplots, or proc sgpanel if we want multiple panels. In R with the ggplot2 package, we use geom_boxplot() to draw boxplots, the group aesthetic mapping to make multiple boxplots split up by a grouping variable, and facet_grid() to make a grid of panels split up by additional variables, each with their own boxplots.

SAS

proc sgplot data = fava;
    vbox Yield / group = Genotype;
run; 

proc sgplot data = fava;
    vbox Yield / category=Genotype group=Sowing groupdisplay=cluster;
run;
    
proc sgpanel data = fava;
    panelby Environment;
    vbox Yield / category=Genotype group=Sowing groupdisplay=cluster;
run;
The SGPlot Procedure 0 1 2 3 4 5 Yield Vesuvio Sicania Scuro Palomb Collameno Chiaro Genotype

The SGPlot Procedure Chiaro Collameno Palomb Scuro Sicania Vesuvio Genotype 0 1 2 3 4 5 Yield spring autumn Sowing

The SGPanel Procedure autumn spring Sowing Genotype Yield Environment = papiano_2003- ... Environment = papiano_2002- ... Environment = papiano_2001- ... Environment = badiola_2003-2 ... Environment = badiola_2002-2 ... Environment = badiola_2001-2 ... Chiaro Collameno Palomb Scuro Sicania Vesuvio Chiaro Collameno Palomb Scuro Sicania Vesuvio Chiaro Collameno Palomb Scuro Sicania Vesuvio 0 1 2 3 4 5 0 1 2 3 4 5

R

ggplot(fava, aes(x = Genotype, y = Yield)) + 
  geom_boxplot()

boxplot of yield by genotype

ggplot(fava, aes(x = Genotype, y = Yield, fill = Sowing, group = interaction(Genotype, Sowing))) + 
  geom_boxplot()

boxplot of yield by genotype crossed with sowing treatment

ggplot(fava, aes(x = Genotype, y = Yield, fill = Sowing, group = interaction(Genotype, Sowing))) + 
  geom_boxplot() +
  facet_grid(Location ~ Year)

separate boxplots of yield by genotype and sowing treatment for each location and year

PROTIP: I won’t get into plots that much in this lesson, but check out my ggplot2 lesson to learn more.

The first plot shows that when we lump all the data together, the genotypes’ yield distributions are pretty similar. But we have to disaggregate things a little bit to start to see patterns. First, when we split the boxplots for each genotype up by sowing treatment, we can see a consistent pattern where autumn sowing leads to increased yield, with roughly the same effect in all six genotypes. The final plot, “faceted” by location and year, starts to really get interesting when we drill down (real datasets are like that)! We see that the sowing time effect is present in 2001-02 and 2002-03, but seems to be much reduced in 2003-04 in both locations, maybe due to climate variation. Yield was much lower in 2002-03 for both locations. There seems to be a hint of genotype by environment interaction as well.

All of this should make it clear how important it is to repeat experiments under different environmental conditions, and show us that our statistical model will have to account for interactions between genotype and treatment, and for effects that vary between environments.

Fit a model to one environment

When we are fitting models, a good way to proceed is to start simple and gradually build up to the final model we want. In this case, let’s start by subsetting the data to one environment only, and see if we can successfully fit a model.

Block model

We’ll start even simpler by only fitting the block model (random effects only) and ignoring the treatments for now.

SAS

Here, we use a data step to subset a single environment (Papiano in 2002-03), then fit a model. For all the SAS model-fitting code in this lesson, proc glimmix is a good option.

NOTE: By default, proc glimmix produces a lot of output. I have selectively displayed only a little of that output for comparison purposes, using ods select statements.

data papiano0203; set fava;
    where Location = 'papiano' & Year = '2002-2003';
run;

proc glimmix data = papiano0203;
    class Environment Block Sowing Genotype;
    model Yield = / solution;
    random Block Sowing(Block) / solution;
  ods select covparms parameterestimates tests3 solutionr;
run;
Covariance Parameter Estimates
Cov Parm Estimate Standard
Error
Block 0 .
Sowing(Block) 0.4051 0.2263
Residual 0.1094 0.02445
Solutions for Fixed Effects
Effect Estimate Standard
Error
DF t Value Pr > |t|
Intercept 1.7279 0.2300 3 7.51 0.0049
Solution for Random Effects
Effect Sowing Block Estimate Std Err Pred DF t Value Pr > |t|
Block   1 0 . . . .
Block   2 0 . . . .
Block   3 0 . . . .
Block   4 0 . . . .
Sowing(Block) autumn 1 0.3465 0.2567 40 1.35 0.1847
Sowing(Block) spring 1 -0.6551 0.2567 40 -2.55 0.0146
Sowing(Block) autumn 2 0.5522 0.2567 40 2.15 0.0376
Sowing(Block) spring 2 -0.5562 0.2567 40 -2.17 0.0363
Sowing(Block) autumn 3 0.6655 0.2567 40 2.59 0.0133
Sowing(Block) spring 3 -0.5052 0.2567 40 -1.97 0.0560
Sowing(Block) autumn 4 0.7245 0.2567 40 2.82 0.0074
Sowing(Block) spring 4 -0.5722 0.2567 40 -2.23 0.0315

R

We will use the lmer() function from the lme4 package to fit the model in R. The model formula syntax is a little different than in SAS.

  • The dependent or response variable (Yield) is on the left side of the formula.
  • A tilde ~ separates the dependent from the independent variables.
  • Independent variables, including both fixed and random terms, are separated with +.
  • Here the only fixed effect is the global intercept (1).
  • The random effects terms have a design side (on the left hand) and group side (on the right hand) separated by |.
  • In this case, the 1 on the design side means only fit random intercepts and no random slopes.
  • In the term (1 | Block), we are saying each block will get its own random intercept.
  • In the term (1 | Sowing:Block), we are saying each main plot (defined by unique combinations of Sowing and Block) will get its own random intercept.

The data argument specifies the data frame that has the variables in the formula. Here we’re also using a subset argument to specify we are only fitting the model to the rows of the dataframe fava where Location is equal to the string 'papiano' and Year is equal to the string '2002-2003' (multiple logical conditions are separated with &).

fit_1env <- lmer(Yield ~ 1 + (1 | Block) + (1 | Sowing:Block), 
                 data = fava, 
                 subset = Location == 'papiano' & Year == '2002-2003')

Next let’s look at some of the output.

summary(fit_1env)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: Yield ~ 1 + (1 | Block) + (1 | Sowing:Block)
   Data: fava
 Subset: Location == "papiano" & Year == "2002-2003"

REML criterion at convergence: 55.3

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.3438 -0.6279 -0.2337  0.6757  2.1901 

Random effects:
 Groups       Name        Variance  Std.Dev. 
 Sowing:Block (Intercept) 4.051e-01 6.365e-01
 Block        (Intercept) 9.150e-09 9.566e-05
 Residual                 1.094e-01 3.307e-01
Number of obs: 48, groups:  Sowing:Block, 8; Block, 4

Fixed effects:
            Estimate Std. Error    df t value Pr(>|t|)    
(Intercept)    1.728      0.230 7.000   7.511 0.000136 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ranef(fit_1env)
$`Sowing:Block`
         (Intercept)
autumn:1   0.3464950
autumn:2   0.5522388
autumn:3   0.6654777
autumn:4   0.7244895
spring:1  -0.6551107
spring:2  -0.5562261
spring:3  -0.5051889
spring:4  -0.5721752

$Block
    (Intercept)
1 -6.969903e-09
2 -9.005043e-11
3  3.620027e-09
4  3.439927e-09

with conditional variances for "Sowing:Block" "Block" 
icc(fit_1env, by_group = TRUE)
# ICC by Group

Group        |       ICC
------------------------
Sowing:Block |     0.787
Block        | 1.778e-08

The summary() function gives you a summary of the residuals, the variance of the different random effects, and the fixed effects, very similar to what you get from proc glimmix output. The random effects are block (Block), main plot (Sowing:Block), and subplot (Residual because each data point is a subplot, so the model residuals are the errors for each individual subplot). The fixed effect is the global intercept.

In R, ranef() is the function that displays the random effects. This is equivalent to adding the solution option to the random statement in SAS. When we call ranef(), we see that some main plots have higher yield than others. However the random intercepts for blocks are very close to zero. This indicates that most of the variation is between main plots, with very little systematic differences between the blocks.

In R the icc() function provides the intraclass correlation coefficient separated by random effect term. We see that around 80% of the variation is at the main plot level at this particular environment (we haven’t included any fixed effects yet so this is just telling us what level the variation is found at). Almost no additional variation is explained by the block term (as we could have guessed from looking at ranef() output). The remaining 20% is residual variation at the subplot level. Because every data point is an individual subplot, it is not needed to put a special subplot term into the model. The model residuals, after accounting for any fixed effects, the block term, and the main plot term, represent the subplot error.

PROTIP: It is fairly straightforward to calculate the ICC in SAS by extracting the variance components from the model and taking ratios, but unfortunately there is no statement in PROC GLIMMIX that does it directly. Thus I didn’t include the ICC in the SAS results here. You can obtain the value 0.787 for the ICC of main plot using the ratio of the variance due to main plot to the total variance: 0.4051 / (0.4051 + 0.1094) = 0.787. The ICC of block is effectively 0.

Model with fixed effects

Let’s add some fixed effects for genotype, sowing treatment, and their interaction. The fixed effect portion of the model formula in R can be written as Genotype + Sowing + Genotype:Sowing, where the : indicates an interaction. Or it can be written using the shorthand Genotype * Sowing, where * indicates all combinations of main effects and interaction terms. The : is equivalent to * in SAS, and the * is equivalent to | (a little confusing)!

SAS

proc glimmix data = papiano0203;
    class Environment Block Sowing Genotype;
    model Yield = Genotype|Sowing / solution;
    random Block Sowing(Block) / solution;
  ods select covparms parameterestimates tests3 solutionr;
run;
Covariance Parameter Estimates
Cov Parm Estimate Standard
Error
Block 0.008972 0.01123
Sowing(Block) 0.003253 0.006894
Residual 0.03023 0.007805
Solutions for Fixed Effects
Effect Sowing Genotype Estimate Standard
Error
DF t Value Pr > |t|
Intercept     0.9375 0.1030 3 9.10 0.0028
Genotype   Chiaro 0.4675 0.1229 30 3.80 0.0007
Genotype   Collameno 0.04750 0.1229 30 0.39 0.7020
Genotype   Palomb -0.05500 0.1229 30 -0.45 0.6578
Genotype   Scuro 0.8100 0.1229 30 6.59 <.0001
Genotype   Sicania -0.1150 0.1229 30 -0.94 0.3570
Genotype   Vesuvio 0 . . . .
Sowing autumn   1.3275 0.1294 3 10.26 0.0020
Sowing spring   0 . . . .
Sowing*Genotype autumn Chiaro -0.3225 0.1739 30 -1.85 0.0735
Sowing*Genotype autumn Collameno -0.08000 0.1739 30 -0.46 0.6487
Sowing*Genotype autumn Palomb 0.4750 0.1739 30 2.73 0.0104
Sowing*Genotype autumn Scuro -0.9475 0.1739 30 -5.45 <.0001
Sowing*Genotype autumn Sicania 0.08500 0.1739 30 0.49 0.6285
Sowing*Genotype autumn Vesuvio 0 . . . .
Sowing*Genotype spring Chiaro 0 . . . .
Sowing*Genotype spring Collameno 0 . . . .
Sowing*Genotype spring Palomb 0 . . . .
Sowing*Genotype spring Scuro 0 . . . .
Sowing*Genotype spring Sicania 0 . . . .
Sowing*Genotype spring Vesuvio 0 . . . .
Type III Tests of Fixed Effects
Effect Num DF Den DF F Value Pr > F
Genotype 5 30 7.90 <.0001
Sowing 1 3 344.95 0.0003
Sowing*Genotype 5 30 15.05 <.0001
Solution for Random Effects
Effect Sowing Block Estimate Std Err Pred DF t Value Pr > |t|
Block   1 -0.1103 0.06610 30 -1.67 0.1056
Block   2 -0.00142 0.06610 30 -0.02 0.9829
Block   3 0.05728 0.06610 30 0.87 0.3931
Block   4 0.05443 0.06610 30 0.82 0.4168
Sowing(Block) autumn 1 -0.04926 0.05122 30 -0.96 0.3439
Sowing(Block) spring 1 0.009269 0.05122 30 0.18 0.8576
Sowing(Block) autumn 2 -0.00761 0.05122 30 -0.15 0.8828
Sowing(Block) spring 2 0.007098 0.05122 30 0.14 0.8907
Sowing(Block) autumn 3 0.01578 0.05122 30 0.31 0.7601
Sowing(Block) spring 3 0.004990 0.05122 30 0.10 0.9230
Sowing(Block) autumn 4 0.04109 0.05122 30 0.80 0.4287
Sowing(Block) spring 4 -0.02136 0.05122 30 -0.42 0.6797

R

fit_1env_fixef <- lmer(Yield ~ Genotype * Sowing + (1 | Block) + (1 | Sowing:Block), 
                       data = fava, 
                       subset = Location == 'papiano' & Year == '2002-2003')

summary(fit_1env_fixef)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: Yield ~ Genotype * Sowing + (1 | Block) + (1 | Sowing:Block)
   Data: fava
 Subset: Location == "papiano" & Year == "2002-2003"

REML criterion at convergence: -0.7

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-1.94377 -0.48885  0.07059  0.49097  1.63660 

Random effects:
 Groups       Name        Variance Std.Dev.
 Sowing:Block (Intercept) 0.003253 0.05703 
 Block        (Intercept) 0.008973 0.09472 
 Residual                 0.030229 0.17386 
Number of obs: 48, groups:  Sowing:Block, 8; Block, 4

Fixed effects:
                               Estimate Std. Error      df t value Pr(>|t|)    
(Intercept)                      2.4100     0.1030 21.3948  23.393  < 2e-16 ***
GenotypeCollameno               -0.1775     0.1229 30.0000  -1.444  0.15916    
GenotypePalomb                   0.2750     0.1229 30.0000   2.237  0.03288 *  
GenotypeScuro                   -0.2825     0.1229 30.0000  -2.298  0.02872 *  
GenotypeSicania                 -0.1750     0.1229 30.0000  -1.423  0.16492    
GenotypeVesuvio                 -0.1450     0.1229 30.0000  -1.179  0.24750    
Sowingspring                    -1.0050     0.1294 25.4398  -7.767 3.56e-08 ***
GenotypeCollameno:Sowingspring  -0.2425     0.1739 30.0000  -1.395  0.17333    
GenotypePalomb:Sowingspring     -0.7975     0.1739 30.0000  -4.587 7.46e-05 ***
GenotypeScuro:Sowingspring       0.6250     0.1739 30.0000   3.595  0.00115 ** 
GenotypeSicania:Sowingspring    -0.4075     0.1739 30.0000  -2.344  0.02591 *  
GenotypeVesuvio:Sowingspring    -0.3225     0.1739 30.0000  -1.855  0.07346 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) GntypC GntypP GntypScr GntypScn GntypV Swngsp GntC:S GntP:S
GentypCllmn -0.597                                                            
GenotypPlmb -0.597  0.500                                                     
GenotypeScr -0.597  0.500  0.500                                              
GenotypeScn -0.597  0.500  0.500  0.500                                       
GenotypeVsv -0.597  0.500  0.500  0.500    0.500                              
Sowingsprng -0.628  0.475  0.475  0.475    0.475    0.475                     
GntypCllm:S  0.422 -0.707 -0.354 -0.354   -0.354   -0.354 -0.672              
GntypPlmb:S  0.422 -0.354 -0.707 -0.354   -0.354   -0.354 -0.672  0.500       
GntypScr:Sw  0.422 -0.354 -0.354 -0.707   -0.354   -0.354 -0.672  0.500  0.500
GntypScn:Sw  0.422 -0.354 -0.354 -0.354   -0.707   -0.354 -0.672  0.500  0.500
GntypVsv:Sw  0.422 -0.354 -0.354 -0.354   -0.354   -0.707 -0.672  0.500  0.500
            GntypScr:S GntypScn:S
GentypCllmn                      
GenotypPlmb                      
GenotypeScr                      
GenotypeScn                      
GenotypeVsv                      
Sowingsprng                      
GntypCllm:S                      
GntypPlmb:S                      
GntypScr:Sw                      
GntypScn:Sw  0.500               
GntypVsv:Sw  0.500      0.500    
ranef(fit_1env_fixef)
$`Sowing:Block`
          (Intercept)
autumn:1 -0.049255030
autumn:2 -0.007614710
autumn:3  0.015778601
autumn:4  0.041091139
spring:1  0.009269246
spring:2  0.007098097
spring:3  0.004989209
spring:4 -0.021356553

$Block
   (Intercept)
1 -0.110292120
2 -0.001424963
3  0.057283504
4  0.054433579

with conditional variances for "Sowing:Block" "Block" 
icc(fit_1env_fixef, by_group = TRUE)
# ICC by Group

Group        |   ICC
--------------------
Sowing:Block | 0.077
Block        | 0.211

We have a lot more output to wade through in the summary because we now have a lot of fixed effects. There are 5 coefficients for genotype because there are 6 genotypes, and the first one is the reference level (intercept). There is 1 coefficient for sowing treatment to differentiate spring from the reference level of autumn. Finally, there are 5 interaction coefficients. The summary also shows us the correlation between the fixed effects, which in this case we can safely ignore because the fixed effects were experimentally randomized. You might look at those for an observational study.

We now see completely different ICC results. We are not surprised that the ICCs are much lower because we’ve added fixed effects that explain a lot of the variation. They add up to about 29% of the variation now. However, we see that now that we have accounted for genotype and sowing as fixed effects, the block term is the larger contributor to the random variation relative to main plot (Block:Sowing). This makes sense because the sowing effect seems to be large and when we did not include sowing in the model as a fixed effect, all that variation was assigned to main plot. Now that sowing is included, the main plot random variation is very low.

PROTIP: As before, we can get the ICC result from SAS by working directly with the variance components and getting the ratio of the variance due to each level of random effect to the total variance. ICC of main plot is .003253 / (.008972 + .003253 + 0.03023) = 0.077, and ICC of block is .008972 / (.008972 + .003253 + 0.03023) = 0.211.

PROTIP 2: If you really want an ANOVA table for your mixed model in R, as long as you fit the model with the lmerTest package loaded, you can use anova(fit) to get an ANOVA table, where fit is the name of your fitted model object. You can specify the method of approximating degrees of freedom: anova(fit, ddf = 'Kenward-Roger')

Fit model to all environments

So far so good, but all of that was only a model fit to one environment. We want to fit a model to all the data put together so that we can see whether there are consistent effects of genotype and sowing treatment on yield, that hold regardless of environmental conditions.

Should environment be fixed or random? Answer: It depends. Here, we are going to treat environment (combination of location and year) as a random effect. This means we are considering the environments at which the study was conducted to be samples from a population of potential environments. We want to make an inference about whether the effects we observe are consistent across that whole hypothetical population. We don’t necessarily care that much about specifically saying whether environment A is different from environment B. If we cared more about that, we could consider treating environment as a fixed effect in the analysis.

This means that we need to add more random intercept terms. In the previous model, we modeled all observations from the same block and from the same main plot within each block as being correlated with one another. We still want to do that, but we are going to add another level of nesting. We also want to model all observations from the same environment as being correlated with one another, as well as from the same block within environment, and the same main plot within block within environment. We are really putting the “multi” into “multilevel model!”

SAS

proc glimmix data = fava;
    class Environment Block Sowing Genotype;
    model Yield = Genotype|Sowing / solution;
    random Environment Block(Environment) Sowing(Block Environment) / solution;
  ods select parameterestimates tests3;
run;
Solutions for Fixed Effects
Effect Sowing Genotype Estimate Standard
Error
DF t Value Pr > |t|
Intercept     2.4146 0.4770 5 5.06 0.0039
Genotype   Chiaro 0.4333 0.1170 230 3.70 0.0003
Genotype   Collameno -0.05167 0.1170 230 -0.44 0.6592
Genotype   Palomb 0.02625 0.1170 230 0.22 0.8227
Genotype   Scuro 0.6850 0.1170 230 5.85 <.0001
Genotype   Sicania 0.3779 0.1170 230 3.23 0.0014
Genotype   Vesuvio 0 . . . .
Sowing autumn   1.1308 0.1367 23 8.27 <.0001
Sowing spring   0 . . . .
Sowing*Genotype autumn Chiaro -0.4017 0.1655 230 -2.43 0.0160
Sowing*Genotype autumn Collameno -0.1775 0.1655 230 -1.07 0.2845
Sowing*Genotype autumn Palomb 0.1825 0.1655 230 1.10 0.2712
Sowing*Genotype autumn Scuro -0.5933 0.1655 230 -3.59 0.0004
Sowing*Genotype autumn Sicania -0.4725 0.1655 230 -2.86 0.0047
Sowing*Genotype autumn Vesuvio 0 . . . .
Sowing*Genotype spring Chiaro 0 . . . .
Sowing*Genotype spring Collameno 0 . . . .
Sowing*Genotype spring Palomb 0 . . . .
Sowing*Genotype spring Scuro 0 . . . .
Sowing*Genotype spring Sicania 0 . . . .
Sowing*Genotype spring Vesuvio 0 . . . .
Type III Tests of Fixed Effects
Effect Num DF Den DF F Value Pr > F
Genotype 5 230 9.78 <.0001
Sowing 1 23 107.93 <.0001
Sowing*Genotype 5 230 6.50 <.0001

R

We remove the subset argument to fit the model to the full dataset.

fit_allenv <- lmer(Yield ~ Genotype * Sowing + (1 | Environment) + (1 | Block:Environment) + (1 | Sowing:Block:Environment), 
                   data = fava)
boundary (singular) fit: see help('isSingular')

We get a message that the fit is singular. What does this mean? Some components of the variance-covariance matrix of the random effects are either exactly zero or exactly one. OK, what about in English? Basically it means that the algorithm that fits the model parameters doesn’t have enough data to get a good estimate. This often happens when we are trying to fit a model that is too complex for the amount of data we have, or when the random effects are very small and can’t be distinguished from zero. The algorithm fails to estimate the random effects and return values of zero. Or in other cases, some of the random effects, or combinations of random effects, will be perfectly correlated with one another. This often happens when you are trying to fit a model with complex nested structure to a fairly small dataset … just like we are doing here!

To learn more about dealing with this issue, check out this article by Ben Bolker. Warning: the language is a little technical.

Returning to our model, let’s look at the variance-covariance matrix of the random effects to see why the fit is singular.

VarCorr(fit_allenv)
 Groups                   Name        Std.Dev.
 Sowing:Block:Environment (Intercept) 0.24525 
 Block:Environment        (Intercept) 0.00000 
 Environment              (Intercept) 1.14372 
 Residual                             0.40529 

We can see that the blocks within each environment have zero variance. This shows that the effect is so small that the algorithm could not estimate it. We can refit the model with that effect removed, and only random effects for environment and main plot within block within environment.

SAS

proc glimmix data = fava plots = residualpanel;
    class Environment Block Sowing Genotype;
    model Yield = Genotype|Sowing / solution;
    random Environment Sowing(Block Environment) / solution;
  ods select parameterestimates tests3;
run;
Solutions for Fixed Effects
Effect Sowing Genotype Estimate Standard
Error
DF t Value Pr > |t|
Intercept     2.4146 0.4768 5 5.06 0.0039
Genotype   Chiaro 0.4333 0.1170 230 3.70 0.0003
Genotype   Collameno -0.05167 0.1170 230 -0.44 0.6592
Genotype   Palomb 0.02625 0.1170 230 0.22 0.8227
Genotype   Scuro 0.6850 0.1170 230 5.85 <.0001
Genotype   Sicania 0.3779 0.1170 230 3.23 0.0014
Genotype   Vesuvio 0 . . . .
Sowing autumn   1.1308 0.1367 41 8.27 <.0001
Sowing spring   0 . . . .
Sowing*Genotype autumn Chiaro -0.4017 0.1655 230 -2.43 0.0160
Sowing*Genotype autumn Collameno -0.1775 0.1655 230 -1.07 0.2845
Sowing*Genotype autumn Palomb 0.1825 0.1655 230 1.10 0.2712
Sowing*Genotype autumn Scuro -0.5933 0.1655 230 -3.59 0.0004
Sowing*Genotype autumn Sicania -0.4725 0.1655 230 -2.86 0.0047
Sowing*Genotype autumn Vesuvio 0 . . . .
Sowing*Genotype spring Chiaro 0 . . . .
Sowing*Genotype spring Collameno 0 . . . .
Sowing*Genotype spring Palomb 0 . . . .
Sowing*Genotype spring Scuro 0 . . . .
Sowing*Genotype spring Sicania 0 . . . .
Sowing*Genotype spring Vesuvio 0 . . . .
Type III Tests of Fixed Effects
Effect Num DF Den DF F Value Pr > F
Genotype 5 230 9.78 <.0001
Sowing 1 41 107.89 <.0001
Sowing*Genotype 5 230 6.50 <.0001

R

fit_allenv_noblock <- lmer(Yield ~ Genotype * Sowing + (1 | Environment) + (1 | Sowing:Block:Environment), 
                           data = fava)

We no longer get the singular fit message.

Check model diagnostics

Now that we have a good model, we can look at the model diagnostics. This was done in SAS by adding the option plots = residualpanel. We will use check_model() from the easystats package in R, as we did in the previous lesson. Specify we want to check linearity and homogeneity of the residuals, and show a normal quantile-quantile plot of the residuals.

check_model(fit_allenv_noblock, check = c('linearity', 'homogeneity', 'qq'))

model residual diagnostic plots

All of these diagnostic plots look great. There is little to no trend in the residual variance. The individual data points’ residuals fall along the straight line of the normal Q-Q plot.

Fit models with random slope terms

One other thing we may want to include in our model is to allow one or more of the treatment effects to vary by environment. In the model we just fit, we only included random intercepts. This means that the “baseline” yield is allowed to vary between environments, but we are assuming a single effect size of treatment and genotype that holds across all environments. We can relax that constraint using “random slopes.” We can allow one or more of the effects to vary by environment. We could even have random slopes at the block or main plot level, but not only would that be way too complex of a model for our small dataset, it also doesn’t seem as likely to be a large effect. So we will stick with testing random slope terms by environment.

We can start with a complex model that includes random slope terms for both sowing treatment and genotype, with respect to environment. In the lme4 package in R, we can do this by adding terms on the design side of each random effect in the model formula. In this case, we want not only to have a different intercept for each environment, but also to have a different slope for sowing and for genotype. So the Environment term in the formula will be (1 + Genotype + Sowing | Environment).

SAS

proc glimmix data = fava plots = residualpanel;
    class Environment Block Sowing Genotype;
    model Yield = Genotype|Sowing / solution;
    random intercept Genotype Sowing / subject = Environment;
    random intercept / subject = Sowing(Block Environment);
  ods select parameterestimates tests3;
run;
Solutions for Fixed Effects
Effect Sowing Genotype Estimate Standard
Error
DF t Value Pr > |t|
Intercept     2.4146 0.4881 5 4.95 0.0043
Genotype   Chiaro 0.4333 0.1467 25 2.95 0.0068
Genotype   Collameno -0.05167 0.1467 25 -0.35 0.7277
Genotype   Palomb 0.02625 0.1467 25 0.18 0.8595
Genotype   Scuro 0.6850 0.1467 25 4.67 <.0001
Genotype   Sicania 0.3779 0.1467 25 2.58 0.0163
Genotype   Vesuvio 0 . . . .
Sowing autumn   1.1308 0.2176 5 5.20 0.0035
Sowing spring   0 . . . .
Sowing*Genotype autumn Chiaro -0.4017 0.1517 205 -2.65 0.0087
Sowing*Genotype autumn Collameno -0.1775 0.1517 205 -1.17 0.2434
Sowing*Genotype autumn Palomb 0.1825 0.1517 205 1.20 0.2304
Sowing*Genotype autumn Scuro -0.5933 0.1517 205 -3.91 0.0001
Sowing*Genotype autumn Sicania -0.4725 0.1517 205 -3.11 0.0021
Sowing*Genotype autumn Vesuvio 0 . . . .
Sowing*Genotype spring Chiaro 0 . . . .
Sowing*Genotype spring Collameno 0 . . . .
Sowing*Genotype spring Palomb 0 . . . .
Sowing*Genotype spring Scuro 0 . . . .
Sowing*Genotype spring Sicania 0 . . . .
Sowing*Genotype spring Vesuvio 0 . . . .
Type III Tests of Fixed Effects
Effect Num DF Den DF F Value Pr > F
Genotype 5 25 4.24 0.0063
Sowing 1 5 20.84 0.0060
Sowing*Genotype 5 205 7.73 <.0001

R

fit_allenv_allrandomslopes <- lmer(Yield ~ Genotype * Sowing + (1 + Genotype + Sowing | Environment) + (1 | Sowing:Block:Environment), 
                                   data = fava)
boundary (singular) fit: see help('isSingular')

Again, we get a singular fit message. Very likely, we are trying to fit a too-complex model to this small dataset.

VarCorr(fit_allenv_allrandomslopes)
 Groups                   Name              Std.Dev. Corr                
 Sowing:Block:Environment (Intercept)       0.119970                     
 Environment              (Intercept)       0.981421                     
                          GenotypeCollameno 0.314009 -0.107              
                          GenotypePalomb    0.245292  0.392 -0.144       
                          GenotypeScuro     0.168246  0.519 -0.810  0.641
                          GenotypeSicania   0.223958  0.920 -0.442  0.229
                          GenotypeVesuvio   0.071028  0.390 -0.481  0.937
                          Sowingspring      0.459510  0.325  0.876  0.250
 Residual                                   0.359447                     
                     
                     
                     
                     
                     
                     
  0.669              
  0.855  0.364       
 -0.433 -0.060 -0.086
                     

The variance-covariance matrix of the random effects doesn’t show any that are all zero. However, you can get a singular fit even when none of the individual random effects are zero. It can also occur when linear combinations of some of the random effects are equal to one another.

Let’s simplify by getting rid of the random slope for genotype and only keeping the random slope for sowing.

SAS

proc glimmix data = fava plots = residualpanel;
    class Environment Block Sowing Genotype;
    model Yield = Genotype|Sowing / solution;
    random intercept Sowing / subject = Environment solution;
    random intercept / subject = Sowing(Block Environment);
  ods select parameterestimates tests3;
    store fit_allenv_sowingrandomslope;
run;
Solutions for Fixed Effects
Effect Sowing Genotype Estimate Standard
Error
DF t Value Pr > |t|
Intercept     2.4146 0.4847 5 4.98 0.0042
Genotype   Chiaro 0.4333 0.1170 230 3.70 0.0003
Genotype   Collameno -0.05167 0.1170 230 -0.44 0.6592
Genotype   Palomb 0.02625 0.1170 230 0.22 0.8227
Genotype   Scuro 0.6850 0.1170 230 5.85 <.0001
Genotype   Sicania 0.3779 0.1170 230 3.23 0.0014
Genotype   Vesuvio 0 . . . .
Sowing autumn   1.1308 0.2217 5 5.10 0.0038
Sowing spring   0 . . . .
Sowing*Genotype autumn Chiaro -0.4017 0.1655 230 -2.43 0.0160
Sowing*Genotype autumn Collameno -0.1775 0.1655 230 -1.07 0.2845
Sowing*Genotype autumn Palomb 0.1825 0.1655 230 1.10 0.2712
Sowing*Genotype autumn Scuro -0.5933 0.1655 230 -3.59 0.0004
Sowing*Genotype autumn Sicania -0.4725 0.1655 230 -2.86 0.0047
Sowing*Genotype autumn Vesuvio 0 . . . .
Sowing*Genotype spring Chiaro 0 . . . .
Sowing*Genotype spring Collameno 0 . . . .
Sowing*Genotype spring Palomb 0 . . . .
Sowing*Genotype spring Scuro 0 . . . .
Sowing*Genotype spring Sicania 0 . . . .
Sowing*Genotype spring Vesuvio 0 . . . .
Type III Tests of Fixed Effects
Effect Num DF Den DF F Value Pr > F
Genotype 5 230 9.78 <.0001
Sowing 1 5 20.84 0.0060
Sowing*Genotype 5 230 6.50 <.0001

R

fit_allenv_sowingrandomslope <- lmer(Yield ~ Genotype * Sowing + (1 + Sowing | Environment) + (1 | Sowing:Block:Environment), 
                                     data = fava)

This gets rid of the singular fit message. Let’s look at the random intercepts and slopes for environment.

ranef(fit_allenv_sowingrandomslope)$Environment
                  (Intercept) Sowingspring
badiola.2001-2002   0.3427129  -0.45877557
papiano.2001-2002   0.9006396  -0.21349718
badiola.2002-2003  -1.4271618  -0.07956543
papiano.2002-2003  -1.2176835  -0.30602556
badiola.2003-2004   0.6962424   0.38184826
papiano.2003-2004   0.7052505   0.67601548

This shows that the random slope for the spring sowing treatment is positive for the two 2003-2004 environments. When you look back at a plot of the raw data, this makes sense. The effect of spring sowing on yield is much less negative for those two environments. It is probably a good thing to include the sowing random slope term by environment.

Let’s be thorough and take a look at the model diagnostics for the model we are sticking with, that includes the sowing random slope but not genotype.

check_model(fit_allenv_sowingrandomslope, check = c('linearity', 'homogeneity', 'qq'))

model residual diagnostic plots for random slope model

We see good residual diagnostics across the board.

Compare models

Let’s confirm our intuition that the random slope model is a good one by comparing the models’ Akaike Information Criteria (AIC). In SAS, this is output by default. In R, we can use the AIC() function to compare several models’ AIC at once.

AIC(fit_allenv, fit_allenv_noblock, fit_allenv_allrandomslopes, fit_allenv_sowingrandomslope)
                             df      AIC
fit_allenv                   16 432.2706
fit_allenv_noblock           15 430.2706
fit_allenv_allrandomslopes   42 413.9765
fit_allenv_sowingrandomslope 17 409.8349

We see that the lowest AIC (best fit) is the sowing random slope model, followed by the model including both random slopes. The random-intercept-only models are inferior.

PROTIP: AIC is a relative measure. It can only say what model is a better fit to the data out of the ones you are choosing from. It doesn’t say which one is “right.” For one thing, “all models are wrong,” and for another thing, we might be comparing a set of terrible models! Humility is important when you are doing stats!

Estimated marginal means

A great innovation of the statisticians at SAS was the development of least-square means (lsmeans). Almost all the SAS model fitting PROCs have a lsmeans statement. You can use it to get estimated values of treatment means at different levels of other fixed effects, and averaged over the random variation. It gives you confidence limits using an approximation method to estimate the degrees of freedom, and lets you do pairwise comparisons with proper p-value adjustment for multiple comparisons.

Statistician Russ Lenth from the University of Iowa implemented all of this same functionality, and then some, in the R package emmeans (estimated marginal means). Estimated marginal means is synonymous with least-square means (it may be a more appropriate term because least-square method isn’t always used to estimate them).

NOTE: I separated out the SAS code for lsmeans into a separate proc plm which allows you to separate model fitting and post-fitting analyses. That’s a good practice to keep in mind as a SAS user. We had to include a store statement in our call to proc glimmix so the model object can be restored and used later in proc plm.

SAS

proc plm restore = fit_allenv_sowingrandomslope;
    lsmeans Genotype Sowing / diff linestable adjust = sidak;
    slice Genotype * Sowing / sliceby = Sowing diff linestable adjust = sidak;
    ods select lsmeans lsmlines slicediffs slicelines;
run;

NOTE: PROCEDURE PLM used (Total process time):
      real time           0.96 seconds
      cpu time            0.34 seconds
      
Genotype Least Squares Means
Genotype Estimate Standard Error DF t Value Pr > |t|
Chiaro 3.2125 0.4719 230 6.81 <.0001
Collameno 2.8396 0.4719 230 6.02 <.0001
Palomb 3.0975 0.4719 230 6.56 <.0001
Scuro 3.3683 0.4719 230 7.14 <.0001
Sicania 3.1217 0.4719 230 6.62 <.0001
Vesuvio 2.9800 0.4719 230 6.31 <.0001
Sidak Grouping for Genotype Least Squares Means (Alpha=0.05)
LS-means with the same letter are not significantly different.
Genotype Estimate  
Scuro 3.3683   A
      A
Chiaro 3.2125 B A
    B  
Sicania 3.1217 B  
    B  
Palomb 3.0975 B  
    B  
Vesuvio 2.9800 B C
      C
Collameno 2.8396   C
Sowing Least Squares Means
Sowing Estimate Standard Error DF t Value Pr > |t|
autumn 3.5468 0.4788 5 7.41 0.0007
spring 2.6597 0.4788 5 5.55 0.0026
Sidak Grouping for Sowing Least Squares Means (Alpha=0.05)
LS-means with the same letter are not significantly different.
Sowing Estimate  
autumn 3.5468 A
     
spring 2.6597 B
Simple Differences of Sowing*Genotype Least Squares Means
Adjustment for Multiple Comparisons: Sidak
Slice Genotype _Genotype Estimate Standard Error DF t Value Pr > |t| Adj P
Sowing autumn Chiaro Collameno 0.2608 0.1170 230 2.23 0.0268 0.3342
Sowing autumn Chiaro Palomb -0.1771 0.1170 230 -1.51 0.1315 0.8794
Sowing autumn Chiaro Scuro -0.06000 0.1170 230 -0.51 0.6086 1.0000
Sowing autumn Chiaro Sicania 0.1262 0.1170 230 1.08 0.2817 0.9930
Sowing autumn Chiaro Vesuvio 0.03167 0.1170 230 0.27 0.7869 1.0000
Sowing autumn Collameno Palomb -0.4379 0.1170 230 -3.74 0.0002 0.0034
Sowing autumn Collameno Scuro -0.3208 0.1170 230 -2.74 0.0066 0.0943
Sowing autumn Collameno Sicania -0.1346 0.1170 230 -1.15 0.2512 0.9870
Sowing autumn Collameno Vesuvio -0.2292 0.1170 230 -1.96 0.0514 0.5465
Sowing autumn Palomb Scuro 0.1171 0.1170 230 1.00 0.3180 0.9968
Sowing autumn Palomb Sicania 0.3033 0.1170 230 2.59 0.0101 0.1417
Sowing autumn Palomb Vesuvio 0.2087 0.1170 230 1.78 0.0757 0.6930
Sowing autumn Scuro Sicania 0.1862 0.1170 230 1.59 0.1128 0.8338
Sowing autumn Scuro Vesuvio 0.09167 0.1170 230 0.78 0.4341 0.9998
Sowing autumn Sicania Vesuvio -0.09458 0.1170 230 -0.81 0.4197 0.9997
Sidak Grouping for Sowing*Genotype Least Squares Means Slice (Alpha=0.05)
LS-means with the same letter are not significantly different.
Slice Genotype Estimate  
Sowing autumn Palomb 3.7542   A
Sowing autumn       A
Sowing autumn Scuro 3.6371 B A
Sowing autumn     B A
Sowing autumn Chiaro 3.5771 B A
Sowing autumn     B A
Sowing autumn Vesuvio 3.5454 B A
Sowing autumn     B A
Sowing autumn Sicania 3.4508 B A
Sowing autumn     B  
Sowing autumn Collameno 3.3162 B  
Simple Differences of Sowing*Genotype Least Squares Means
Adjustment for Multiple Comparisons: Sidak
Slice Genotype _Genotype Estimate Standard Error DF t Value Pr > |t| Adj P
Sowing spring Chiaro Collameno 0.4850 0.1170 230 4.15 <.0001 0.0007
Sowing spring Chiaro Palomb 0.4071 0.1170 230 3.48 0.0006 0.0090
Sowing spring Chiaro Scuro -0.2517 0.1170 230 -2.15 0.0325 0.3909
Sowing spring Chiaro Sicania 0.05542 0.1170 230 0.47 0.6362 1.0000
Sowing spring Chiaro Vesuvio 0.4333 0.1170 230 3.70 0.0003 0.0040
Sowing spring Collameno Palomb -0.07792 0.1170 230 -0.67 0.5061 1.0000
Sowing spring Collameno Scuro -0.7367 0.1170 230 -6.30 <.0001 <.0001
Sowing spring Collameno Sicania -0.4296 0.1170 230 -3.67 0.0003 0.0045
Sowing spring Collameno Vesuvio -0.05167 0.1170 230 -0.44 0.6592 1.0000
Sowing spring Palomb Scuro -0.6587 0.1170 230 -5.63 <.0001 <.0001
Sowing spring Palomb Sicania -0.3517 0.1170 230 -3.01 0.0029 0.0432
Sowing spring Palomb Vesuvio 0.02625 0.1170 230 0.22 0.8227 1.0000
Sowing spring Scuro Sicania 0.3071 0.1170 230 2.62 0.0093 0.1302
Sowing spring Scuro Vesuvio 0.6850 0.1170 230 5.85 <.0001 <.0001
Sowing spring Sicania Vesuvio 0.3779 0.1170 230 3.23 0.0014 0.0211
Sidak Grouping for Sowing*Genotype Least Squares Means Slice (Alpha=0.05)
LS-means with the same letter are not significantly different.
Slice Genotype Estimate  
Sowing spring Scuro 3.0996 A
Sowing spring     A
Sowing spring Chiaro 2.8479 A
Sowing spring     A
Sowing spring Sicania 2.7925 A
Sowing spring      
Sowing spring Palomb 2.4408 B
Sowing spring     B
Sowing spring Vesuvio 2.4146 B
Sowing spring     B
Sowing spring Collameno 2.3629 B

R

We provide the emmeans() function first with the name of the fitted model object, then a one-sided formula with the fixed effects that we want to calculate means for on the right-hand side of the ~. If you want multiple effects crossed, use +. If you want to calculate the means for one treatment within another, use | for the grouping factor. (The difference will be clear when we do pairwise contrasts).

emm_geno <- emmeans(fit_allenv_sowingrandomslope, ~ Genotype)
NOTE: Results may be misleading due to involvement in interactions
emm_sowing <- emmeans(fit_allenv_sowingrandomslope, ~ Sowing)
NOTE: Results may be misleading due to involvement in interactions
emm_geno_x_sowing <- emmeans(fit_allenv_sowingrandomslope, ~ Genotype + Sowing)
emm_geno_by_sowing <- emmeans(fit_allenv_sowingrandomslope, ~ Genotype | Sowing)

Type the name of the object to print a handy summary including 95% confidence intervals estimated with the Kenward-Roger method of degrees of freedom approximation.

emm_geno
 Genotype  emmean    SE   df lower.CL upper.CL
 Chiaro      3.21 0.472 5.13     2.01     4.42
 Collameno   2.84 0.472 5.13     1.64     4.04
 Palomb      3.10 0.472 5.13     1.89     4.30
 Scuro       3.37 0.472 5.13     2.16     4.57
 Sicania     3.12 0.472 5.13     1.92     4.33
 Vesuvio     2.98 0.472 5.13     1.78     4.18

Results are averaged over the levels of: Sowing 
Degrees-of-freedom method: kenward-roger 
Confidence level used: 0.95 
emm_sowing
 Sowing emmean    SE df lower.CL upper.CL
 autumn   3.55 0.429  5     2.44     4.65
 spring   2.66 0.524  5     1.31     4.01

Results are averaged over the levels of: Genotype 
Degrees-of-freedom method: kenward-roger 
Confidence level used: 0.95 
emm_geno_x_sowing
 Genotype  Sowing emmean    SE   df lower.CL upper.CL
 Chiaro    autumn   3.58 0.435 5.31     2.48     4.68
 Collameno autumn   3.32 0.435 5.31     2.22     4.42
 Palomb    autumn   3.75 0.435 5.31     2.65     4.85
 Scuro     autumn   3.64 0.435 5.31     2.54     4.74
 Sicania   autumn   3.45 0.435 5.31     2.35     4.55
 Vesuvio   autumn   3.55 0.435 5.31     2.45     4.65
 Chiaro    spring   2.85 0.529 5.21     1.50     4.19
 Collameno spring   2.36 0.529 5.21     1.02     3.71
 Palomb    spring   2.44 0.529 5.21     1.10     3.79
 Scuro     spring   3.10 0.529 5.21     1.76     4.44
 Sicania   spring   2.79 0.529 5.21     1.45     4.14
 Vesuvio   spring   2.41 0.529 5.21     1.07     3.76

Degrees-of-freedom method: kenward-roger 
Confidence level used: 0.95 
emm_geno_by_sowing
Sowing = autumn:
 Genotype  emmean    SE   df lower.CL upper.CL
 Chiaro      3.58 0.435 5.31     2.48     4.68
 Collameno   3.32 0.435 5.31     2.22     4.42
 Palomb      3.75 0.435 5.31     2.65     4.85
 Scuro       3.64 0.435 5.31     2.54     4.74
 Sicania     3.45 0.435 5.31     2.35     4.55
 Vesuvio     3.55 0.435 5.31     2.45     4.65

Sowing = spring:
 Genotype  emmean    SE   df lower.CL upper.CL
 Chiaro      2.85 0.529 5.21     1.50     4.19
 Collameno   2.36 0.529 5.21     1.02     3.71
 Palomb      2.44 0.529 5.21     1.10     3.79
 Scuro       3.10 0.529 5.21     1.76     4.44
 Sicania     2.79 0.529 5.21     1.45     4.14
 Vesuvio     2.41 0.529 5.21     1.07     3.76

Degrees-of-freedom method: kenward-roger 
Confidence level used: 0.95 

Next, do pairwise treatment contrasts. For the ones involving genotype, there is more than one comparison. Thus we adjust the p-values using the Sidak adjustment for multiple comparisons. Feel free to use your preferred method (Tukey, Holm, Bonferroni, etc.)

contrast(emm_geno, 'pairwise', adjust = 'sidak')
 contrast            estimate     SE  df t.ratio p.value
 Chiaro - Collameno    0.3729 0.0827 230   4.508  0.0002
 Chiaro - Palomb       0.1150 0.0827 230   1.390  0.9341
 Chiaro - Scuro       -0.1558 0.0827 230  -1.884  0.6102
 Chiaro - Sicania      0.0908 0.0827 230   1.098  0.9917
 Chiaro - Vesuvio      0.2325 0.0827 230   2.810  0.0777
 Collameno - Palomb   -0.2579 0.0827 230  -3.118  0.0304
 Collameno - Scuro    -0.5288 0.0827 230  -6.391  <.0001
 Collameno - Sicania  -0.2821 0.0827 230  -3.410  0.0115
 Collameno - Vesuvio  -0.1404 0.0827 230  -1.697  0.7609
 Palomb - Scuro       -0.2708 0.0827 230  -3.274  0.0182
 Palomb - Sicania     -0.0242 0.0827 230  -0.292  1.0000
 Palomb - Vesuvio      0.1175 0.0827 230   1.420  0.9227
 Scuro - Sicania       0.2467 0.0827 230   2.982  0.0466
 Scuro - Vesuvio       0.3883 0.0827 230   4.694  0.0001
 Sicania - Vesuvio     0.1417 0.0827 230   1.712  0.7495

Results are averaged over the levels of: Sowing 
Degrees-of-freedom method: kenward-roger 
P value adjustment: sidak method for 15 tests 
contrast(emm_sowing, 'pairwise')
 contrast        estimate    SE df t.ratio p.value
 autumn - spring    0.887 0.194  5   4.565  0.0060

Results are averaged over the levels of: Genotype 
Degrees-of-freedom method: kenward-roger 

I won’t print the results of the pairwise contrasts for multiple factors here, but try these out to see how they differ.

contrast(emm_geno_x_sowing, 'pairwise', adjust = 'sidak')
contrast(emm_geno_by_sowing, 'pairwise', adjust = 'sidak')

You may also want to get multiple comparison letters to concisely summarize the significant differences (this is done with the lines and/or linestable options in the lsmeans statement in SAS). In R, the multcomp package has a function called cld(), for “compact letter display.” The first argument to cld() is an emmeans object. We also provide our favorite multiple-comparison adjustment method with the adjust argument. The argument Letters = letters makes the group labels a,b,c... instead of 1,2,3....

cld(emm_geno, adjust = 'sidak', Letters = letters)
 Genotype  emmean    SE   df lower.CL upper.CL .group
 Collameno   2.84 0.472 5.13    0.886     4.79  a    
 Vesuvio     2.98 0.472 5.13    1.026     4.93  ab   
 Palomb      3.10 0.472 5.13    1.144     5.05   b   
 Sicania     3.12 0.472 5.13    1.168     5.08   b   
 Chiaro      3.21 0.472 5.13    1.259     5.17   bc  
 Scuro       3.37 0.472 5.13    1.415     5.32    c  

Results are averaged over the levels of: Sowing 
Degrees-of-freedom method: kenward-roger 
Confidence level used: 0.95 
Conf-level adjustment: sidak method for 6 estimates 
P value adjustment: sidak method for 15 tests 
significance level used: alpha = 0.05 
NOTE: If two or more means share the same grouping symbol,
      then we cannot show them to be different.
      But we also did not show them to be the same. 
cld(emm_geno_by_sowing, adjust = 'sidak', Letters = letters)
Sowing = autumn:
 Genotype  emmean    SE   df lower.CL upper.CL .group
 Collameno   3.32 0.435 5.31    1.545     5.09  a    
 Sicania     3.45 0.435 5.31    1.680     5.22  ab   
 Vesuvio     3.55 0.435 5.31    1.775     5.32  ab   
 Chiaro      3.58 0.435 5.31    1.806     5.35  ab   
 Scuro       3.64 0.435 5.31    1.866     5.41  ab   
 Palomb      3.75 0.435 5.31    1.983     5.53   b   

Sowing = spring:
 Genotype  emmean    SE   df lower.CL upper.CL .group
 Collameno   2.36 0.529 5.21    0.188     4.54  a    
 Vesuvio     2.41 0.529 5.21    0.240     4.59  a    
 Palomb      2.44 0.529 5.21    0.266     4.62  a    
 Sicania     2.79 0.529 5.21    0.618     4.97   b   
 Chiaro      2.85 0.529 5.21    0.673     5.02   b   
 Scuro       3.10 0.529 5.21    0.925     5.27   b   

Degrees-of-freedom method: kenward-roger 
Confidence level used: 0.95 
Conf-level adjustment: sidak method for 6 estimates 
P value adjustment: sidak method for 15 tests 
significance level used: alpha = 0.05 
NOTE: If two or more means share the same grouping symbol,
      then we cannot show them to be different.
      But we also did not show them to be the same. 

PROTIP: You can call the function plot() and pass any of your emmeans or contrast objects to it as an argument. There are built-in methods that create plots of the means or contrasts, and their 95% confidence intervals. I will not get into that here. In future lessons we will take a closer look at those plots and how to customize them.

Conclusion

What have you learned in this lesson?

  • How to fit linear mixed models to data from split-plot experiments
  • How to fit linear mixed models to data from experiments repeated at multiple locations
  • How to fit linear mixed models with different random intercepts and random slopes
  • How to diagnose and deal with singular fits
  • How to generate estimated marginal means and compare them with post-hoc hypothesis tests

Impressive!

This was Lesson 2 in a series. Future lessons will cover:

  • generalized linear mixed models with different error distributions
  • making plots of model predictions, including estimated marginal means
  • an even wider variety of experimental designs
  • generation of reports and documents using RMarkdown

Exercises

As in the previous lesson, here are some exercises for additional practice. These exercises give you some SAS code and ask you to write some R code that does the same job or close to it.

For these exercises, we will be using a classic dataset from the notes of Gertrude Cox, a pioneering statistician who incidentally was the first female professor at N.C. State University! This dataset comes from a strip-split-plot experiment examining how soil type, fertilizer, and calcium affect barley yield.

The barley dataset is provided with the R package agridat. You can load it into your R workspace by calling the function data() after loading the package.

library(agridat)
data('cox.stripsplit')

In the experiment, there are four replicated blocks. Within each block, there are four vertical strips each with a different fertilizer treatment. Each strip is split in half to yield two subplots per strip, each with a different level of calcium. Perpendicularly to the fertilizer strips there are horizontal strips with different soil types.

To help visualize, this is what one of the four blocks might look like. Note that the fertilizer (F, shown as different colors) is applied in vertical strips in a random order and the soil type (S, shown as different shadings) is applied perpendicularly. Within each plot formed by the intersection of two strips, there is a subplot for each calcium (C) treatment, arranged randomly. The other three blocks would look similar but with different randomized locations of F, S, and C levels.

More details can be found in the help documentation for the dataset. Type ?cox.stripsplit.

Exercise 1

  1. What are the fixed effects?
  2. What are the random effects? To help think about this, imagine a measurement from a single subplot (combination of fertilizer, soil, and calcium treatment within a rep). What other subplot measurements would this measurement be related to, for reasons other than sharing a treatment?

Exercise 2

Specify the model with the fixed and random effects we identified in Exercise 1. Include only random intercept effects, no random slopes. The model and random statements to fit the same model using SAS proc glimmix are:

model yield = soil|fert|calcium / ddfm=kr;
random rep fert(rep) soil(rep) fert(soil*rep);
  • Hint 1: There will be three fixed effects separated with *, which will estimate main effects and all interactions.
  • Hint 2: There will be four random effect terms. They will all have the form (1 | <something>).

Exercise 3

Did you get any warning messages when fitting the model? Can you diagnose the issue? What, if anything, should you do about it?

Exercise 4

Fit the model again, this time adding a random slope for the main effect of calcium, grouped by rep. This corresponds to the following two random statements in proc glimmix:

random fert(rep) soil(rep);
random intercept calcium / subject=rep;
  • Hint: Include + calcium somewhere in the random effects part of the model formula.

What happens when you do this? Can you diagnose the issue?

Exercise 5

Write code that gets a similar result as you would get if you inserted the following slice statement into your SAS proc glimmix code, for the model with only random intercepts.

slice fert*calcium / sliceby=calcium diff cl linestable adjust=tukey adjdfe=row;

Your code should do the following:

  1. Find estimated marginal means for each fertilizer treatment level, separately by calcium treatment level.
  2. Take pairwise comparisons between these means, using the Tukey adjustment.
  3. Generate the multiple comparisons letters. What can we say about the effect of fertilizer and how it differs by calcium level?
  • Hint 1: You will need to call emmeans(), contrast(), and cld().
  • Hint 2: You will need to include adjust = 'tukey' in two of the three function calls.

Click here for answers