This workshop is intended for SAS users who want to learn R. The people who will get the most out of this course are practicing researchers who have a decent working knowledge of SAS, and of basic statistical analysis (descriptive stats and regression models) as it applies to their field.
This is lesson 2 in a series. Lesson 1 covered the basics: importing data, cleaning and reshaping data, summary statistics, simple graphs and tables, and a few simple statistical models.
During this workshop, you will …
As in Lesson 1, we will work through a “data to doc” pipeline in R, comparing R code to SAS code each step of the way. We will use a different dataset this time.
We will …
... with
code)As in the previous lesson, we will start with raw data and work our way to a finished product. The first few steps of the pipeline will not be completely new to you if you did Lesson 1 … but it is good to get some extra practice!
Here we’ll load the R packages we are going to work with today. These are mostly the same as Lesson 1. This includes the tidyverse package for reading, manipulating, and plotting data, the lme4 package for fitting linear mixed models, and the easystats package which has some good model diagnostic plots. Also set a default plotting theme.
Note about packages: R has lots of mixed model fitting packages. This fits in with the idea that the best thing about R is “there are many ways to do something,” and the worst thing about R is also “there are many ways to do something.” I would say that nowadays the lme4 package is probably the most widely used package for fitting linear mixed models, so it’s important to be familiar with it if you are doing stats with R. The nlme package also has some useful capabilities so I would also recommend familiarizing yourself with it. Even more complex models can be fit with glmmTMB and other packages. Ultimately, you can fit multilevel models of any level of sophistication using Bayesian methods with packages like brms.
library(tidyverse)
library(lme4)
library(easystats)
library(lmerTest)
library(emmeans)
library(multcomp)
theme_set(theme_bw())In this lesson, we’re going to use a dataset kindly provided by Andrea Onofri, a regular contributor to r-bloggers. This tutorial is loosely based on this blog post.
fava beans growing in the field … and cooked into some
delicious ful medames! (Images by Josep Gesti and News from
Jerusalem)
Today we are going to explore the wonderful world of fava beans! This
dataset, fava.csv, includes the results of a study
comparing the performance (yield) of six different fava bean genotypes
under two different sowing treatments: spring and autumn sowing.
Let’s try to understand the experimental design. This study was designed as a split-plot experiment with multiple replicates (blocks). Sowing time (spring and autumn) is the main plot factor and genotype (Chiaro, Collameno, Palomb, Scuro, Sicania, and Vesuvio) is the subplot factor. There are 2 sowing treatments and 6 genotypes, for a total of 12 genotype-treatment combinations. Furthermore, the experiment was repeated in two locations in Italy (Badiola and Papiano) for three years each (2001-2002, 2002-2003, and 2003-2004).
What does split-plot mean? This means that each block consists of two main plots (one sown in spring and the other in autumn), and each of those main plots is divided up into six subplots, with a different genotype planted in each one. There are four blocks at each of the locations. For the purposes of this lesson, we will consider each combination of location and year (six in all) to be a unique “environment.” It is crucial that we have replication within each environment, as well as repetition of the experiment in different environments so that we can determine how consistent the relative performance of the genotypes is under the different treatments, as environmental conditions vary.
The layout of the experiment at one of the two locations in one of the three seasons might look like the map below. Notice all plots sharing a sowing time are contiguous because the sowing main plots are split into genotype subplots, and notice that all twelve treatment combinations are present in each block. Also notice the randomized order of both treatments. This same setup with different randomized positions is repeated in each of the two locations in each of the three years.
As in Lesson 1, the example data for this lesson are hosted on
GitHub, or in the data directory if you are on the cloud
server.
filename csvFile url "https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/fava.csv";
proc import datafile = csvFile out = fava replace dbms = csv; run;If you are running this on the cloud server:
fava <- read_csv('data/fava.csv')Rows: 288 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Location, Year, Sowing, Genotype
dbl (2): Block, Yield
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.If you are running this on your own machine:
fava <- read_csv('https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/fava.csv')In SAS, we would use proc print and
proc contents to examine our dataset. In R tidyverse, we
can type the name of the dataframe into the console to see the first few
rows and print the data type of each column. The function
glimpse() also provides a concise summary.
proc print data = fava(obs = 10); run;
ods select variables;
proc contents data = fava; run;| Obs | Location | Year | Sowing | Genotype | Block | Yield | 
|---|---|---|---|---|---|---|
| 1 | papiano | 2002-2003 | autumn | Chiaro | 1 | 2.05 | 
| 2 | papiano | 2002-2003 | autumn | Chiaro | 2 | 2.5 | 
| 3 | papiano | 2002-2003 | autumn | Chiaro | 3 | 2.64 | 
| 4 | papiano | 2002-2003 | autumn | Chiaro | 4 | 2.45 | 
| 5 | papiano | 2002-2003 | autumn | Collameno | 1 | 2.01 | 
| 6 | papiano | 2002-2003 | autumn | Collameno | 2 | 2.19 | 
| 7 | papiano | 2002-2003 | autumn | Collameno | 3 | 2.38 | 
| 8 | papiano | 2002-2003 | autumn | Collameno | 4 | 2.35 | 
| 9 | papiano | 2002-2003 | autumn | Palomb | 2 | 2.38 | 
| 10 | papiano | 2002-2003 | autumn | Palomb | 1 | 2.42 | 
| Alphabetic List of Variables and Attributes | |||||
|---|---|---|---|---|---|
| # | Variable | Type | Len | Format | Informat | 
| 5 | Block | Num | 8 | BEST12. | BEST32. | 
| 4 | Genotype | Char | 11 | $11. | $11. | 
| 1 | Location | Char | 9 | $9. | $9. | 
| 3 | Sowing | Char | 8 | $8. | $8. | 
| 2 | Year | Char | 11 | $11. | $11. | 
| 6 | Yield | Num | 8 | BEST12. | BEST32. | 
fava# A tibble: 288 × 6
   Location Year      Sowing Genotype  Block Yield
   <chr>    <chr>     <chr>  <chr>     <dbl> <dbl>
 1 papiano  2002-2003 autumn Chiaro        1  2.05
 2 papiano  2002-2003 autumn Chiaro        2  2.5 
 3 papiano  2002-2003 autumn Chiaro        3  2.64
 4 papiano  2002-2003 autumn Chiaro        4  2.45
 5 papiano  2002-2003 autumn Collameno     1  2.01
 6 papiano  2002-2003 autumn Collameno     2  2.19
 7 papiano  2002-2003 autumn Collameno     3  2.38
 8 papiano  2002-2003 autumn Collameno     4  2.35
 9 papiano  2002-2003 autumn Palomb        2  2.38
10 papiano  2002-2003 autumn Palomb        1  2.42
# ℹ 278 more rowsglimpse(fava)Rows: 288
Columns: 6
$ Location <chr> "papiano", "papiano", "papiano", "papiano", "papiano", "papia…
$ Year     <chr> "2002-2003", "2002-2003", "2002-2003", "2002-2003", "2002-200…
$ Sowing   <chr> "autumn", "autumn", "autumn", "autumn", "autumn", "autumn", "…
$ Genotype <chr> "Chiaro", "Chiaro", "Chiaro", "Chiaro", "Collameno", "Collame…
$ Block    <dbl> 1, 2, 3, 4, 1, 2, 3, 4, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2…
$ Yield    <dbl> 2.05, 2.50, 2.64, 2.45, 2.01, 2.19, 2.38, 2.35, 2.38, 2.42, 2…Let’s go ahead and convert the Location, Year, Sowing, Genotype, and
Block columns to factors because they are all discrete. This isn’t
strictly necessary for the ones other than Block because R will
automatically convert the character variables to factors when fitting
the models, but it is a good practice. We only need to do this in R
right now because in SAS this is done using the class
statement in the modeling proc.
Also, let’s create a new column called Environment which
is the combination of Location and Year,
because we’re considering location-year combinations to be unique
environments for this analysis. The interaction function
combines two or more factors to produce a new factor. We’ll do this
in both SAS and R. In SAS it’s done with a data
step.
data fava; set fava;
    Environment = catx('_', Location, Year);
run; The mutate() function is used to transform existing
columns or create new ones. We do both here. Note that we assign the
result back to fava, which “overwrites” the dataframe with
our modified one. The pipe operator %>% is used to put
each data-manipulation operation on its own line for better code
readability.
fava <- fava %>%
  mutate(
    across(Location:Block, factor),
    Environment = interaction(Location, Year)
  )Notice the across() function is used to apply the same
function, factor(), to many columns. We can use the
: shorthand to apply the function to a block of contiguous
columns.
Let’s take a look at the data. In SAS we can use
proc sgplot with a vbox statement to make
boxplots, or proc sgpanel if we want multiple panels. In R
with the ggplot2 package, we use
geom_boxplot() to draw boxplots, the group
aesthetic mapping to make multiple boxplots split up by a grouping
variable, and facet_grid() to make a grid of panels split
up by additional variables, each with their own boxplots.
proc sgplot data = fava;
    vbox Yield / group = Genotype;
run; 
proc sgplot data = fava;
    vbox Yield / category=Genotype group=Sowing groupdisplay=cluster;
run;
    
proc sgpanel data = fava;
    panelby Environment;
    vbox Yield / category=Genotype group=Sowing groupdisplay=cluster;
run;ggplot(fava, aes(x = Genotype, y = Yield)) + 
  geom_boxplot()ggplot(fava, aes(x = Genotype, y = Yield, fill = Sowing, group = interaction(Genotype, Sowing))) + 
  geom_boxplot()ggplot(fava, aes(x = Genotype, y = Yield, fill = Sowing, group = interaction(Genotype, Sowing))) + 
  geom_boxplot() +
  facet_grid(Location ~ Year)PROTIP: I won’t get into plots that much in this lesson, but check out my ggplot2 lesson to learn more.
The first plot shows that when we lump all the data together, the genotypes’ yield distributions are pretty similar. But we have to disaggregate things a little bit to start to see patterns. First, when we split the boxplots for each genotype up by sowing treatment, we can see a consistent pattern where autumn sowing leads to increased yield, with roughly the same effect in all six genotypes. The final plot, “faceted” by location and year, starts to really get interesting when we drill down (real datasets are like that)! We see that the sowing time effect is present in 2001-02 and 2002-03, but seems to be much reduced in 2003-04 in both locations, maybe due to climate variation. Yield was much lower in 2002-03 for both locations. There seems to be a hint of genotype by environment interaction as well.
All of this should make it clear how important it is to repeat experiments under different environmental conditions, and show us that our statistical model will have to account for interactions between genotype and treatment, and for effects that vary between environments.
When we are fitting models, a good way to proceed is to start simple and gradually build up to the final model we want. In this case, let’s start by subsetting the data to one environment only, and see if we can successfully fit a model.
We’ll start even simpler by only fitting the block model (random effects only) and ignoring the treatments for now.
Here, we use a data step to subset a single environment
(Papiano in 2002-03), then fit a model. For all the SAS model-fitting
code in this lesson, proc glimmix is a good option.
NOTE: By default,
proc glimmixproduces a lot of output. I have selectively displayed only a little of that output for comparison purposes, usingods selectstatements.
data papiano0203; set fava;
    where Location = 'papiano' & Year = '2002-2003';
run;
proc glimmix data = papiano0203;
    class Environment Block Sowing Genotype;
    model Yield = / solution;
    random Block Sowing(Block) / solution;
  ods select covparms parameterestimates tests3 solutionr;
run;| Covariance Parameter Estimates | ||
|---|---|---|
| Cov Parm | Estimate | Standard Error | 
| Block | 0 | . | 
| Sowing(Block) | 0.4051 | 0.2263 | 
| Residual | 0.1094 | 0.02445 | 
| Solutions for Fixed Effects | |||||
|---|---|---|---|---|---|
| Effect | Estimate | Standard Error | DF | t Value | Pr > |t| | 
| Intercept | 1.7279 | 0.2300 | 3 | 7.51 | 0.0049 | 
| Solution for Random Effects | |||||||
|---|---|---|---|---|---|---|---|
| Effect | Sowing | Block | Estimate | Std Err Pred | DF | t Value | Pr > |t| | 
| Block | 1 | 0 | . | . | . | . | |
| Block | 2 | 0 | . | . | . | . | |
| Block | 3 | 0 | . | . | . | . | |
| Block | 4 | 0 | . | . | . | . | |
| Sowing(Block) | autumn | 1 | 0.3465 | 0.2567 | 40 | 1.35 | 0.1847 | 
| Sowing(Block) | spring | 1 | -0.6551 | 0.2567 | 40 | -2.55 | 0.0146 | 
| Sowing(Block) | autumn | 2 | 0.5522 | 0.2567 | 40 | 2.15 | 0.0376 | 
| Sowing(Block) | spring | 2 | -0.5562 | 0.2567 | 40 | -2.17 | 0.0363 | 
| Sowing(Block) | autumn | 3 | 0.6655 | 0.2567 | 40 | 2.59 | 0.0133 | 
| Sowing(Block) | spring | 3 | -0.5052 | 0.2567 | 40 | -1.97 | 0.0560 | 
| Sowing(Block) | autumn | 4 | 0.7245 | 0.2567 | 40 | 2.82 | 0.0074 | 
| Sowing(Block) | spring | 4 | -0.5722 | 0.2567 | 40 | -2.23 | 0.0315 | 
We will use the lmer() function from the
lme4 package to fit the model in R. The model formula
syntax is a little different than in SAS.
Yield) is on the
left side of the formula.~ separates the dependent from the independent
variables.+.1).|.1 on the design side means only fit
random intercepts and no random slopes.(1 | Block), we are saying each block will
get its own random intercept.(1 | Sowing:Block), we are saying each main
plot (defined by unique combinations of Sowing and Block) will get its
own random intercept.The data argument specifies the data frame that has the
variables in the formula. Here we’re also using a subset
argument to specify we are only fitting the model to the rows of the
dataframe fava where Location is equal to the
string 'papiano' and Year is equal to the
string '2002-2003' (multiple logical conditions are
separated with &).
fit_1env <- lmer(Yield ~ 1 + (1 | Block) + (1 | Sowing:Block), 
                 data = fava, 
                 subset = Location == 'papiano' & Year == '2002-2003')Next let’s look at some of the output.
summary(fit_1env)Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: Yield ~ 1 + (1 | Block) + (1 | Sowing:Block)
   Data: fava
 Subset: Location == "papiano" & Year == "2002-2003"
REML criterion at convergence: 55.3
Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.3438 -0.6279 -0.2337  0.6757  2.1901 
Random effects:
 Groups       Name        Variance  Std.Dev. 
 Sowing:Block (Intercept) 4.051e-01 6.365e-01
 Block        (Intercept) 9.150e-09 9.566e-05
 Residual                 1.094e-01 3.307e-01
Number of obs: 48, groups:  Sowing:Block, 8; Block, 4
Fixed effects:
            Estimate Std. Error    df t value Pr(>|t|)    
(Intercept)    1.728      0.230 7.000   7.511 0.000136 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1ranef(fit_1env)$`Sowing:Block`
         (Intercept)
autumn:1   0.3464950
autumn:2   0.5522388
autumn:3   0.6654777
autumn:4   0.7244895
spring:1  -0.6551107
spring:2  -0.5562261
spring:3  -0.5051889
spring:4  -0.5721752
$Block
    (Intercept)
1 -6.969903e-09
2 -9.005043e-11
3  3.620027e-09
4  3.439927e-09
with conditional variances for "Sowing:Block" "Block" icc(fit_1env, by_group = TRUE)# ICC by Group
Group        |       ICC
------------------------
Sowing:Block |     0.787
Block        | 1.778e-08The summary() function gives you a summary of the
residuals, the variance of the different random effects, and the fixed
effects, very similar to what you get from proc glimmix
output. The random effects are block (Block), main plot
(Sowing:Block), and subplot (Residual because
each data point is a subplot, so the model residuals are the errors for
each individual subplot). The fixed effect is the global intercept.
In R, ranef() is the function that displays the random
effects. This is equivalent to adding the solution option
to the random statement in SAS. When we call
ranef(), we see that some main plots have higher yield than
others. However the random intercepts for blocks are very close to zero.
This indicates that most of the variation is between main plots, with
very little systematic differences between the blocks.
In R the icc() function provides the intraclass
correlation coefficient separated by random effect term. We see that
around 80% of the variation is at the main plot level at this particular
environment (we haven’t included any fixed effects yet so this is just
telling us what level the variation is found at). Almost no additional
variation is explained by the block term (as we could have guessed from
looking at ranef() output). The remaining 20% is residual
variation at the subplot level. Because every data point is an
individual subplot, it is not needed to put a special subplot term into
the model. The model residuals, after accounting for any fixed effects,
the block term, and the main plot term, represent the subplot error.
PROTIP: It is fairly straightforward to calculate the ICC in SAS by extracting the variance components from the model and taking ratios, but unfortunately there is no statement in PROC GLIMMIX that does it directly. Thus I didn’t include the ICC in the SAS results here. You can obtain the value 0.787 for the ICC of main plot using the ratio of the variance due to main plot to the total variance:
0.4051 / (0.4051 + 0.1094) = 0.787. The ICC of block is effectively 0.
Let’s add some fixed effects for genotype, sowing treatment, and
their interaction. The fixed effect portion of the model formula in R
can be written as Genotype + Sowing + Genotype:Sowing,
where the : indicates an interaction. Or it can be written
using the shorthand Genotype * Sowing, where *
indicates all combinations of main effects and interaction terms. The
: is equivalent to * in SAS, and the
* is equivalent to | (a little confusing)!
proc glimmix data = papiano0203;
    class Environment Block Sowing Genotype;
    model Yield = Genotype|Sowing / solution;
    random Block Sowing(Block) / solution;
  ods select covparms parameterestimates tests3 solutionr;
run;| Covariance Parameter Estimates | ||
|---|---|---|
| Cov Parm | Estimate | Standard Error | 
| Block | 0.008972 | 0.01123 | 
| Sowing(Block) | 0.003253 | 0.006894 | 
| Residual | 0.03023 | 0.007805 | 
| Solutions for Fixed Effects | |||||||
|---|---|---|---|---|---|---|---|
| Effect | Sowing | Genotype | Estimate | Standard Error | DF | t Value | Pr > |t| | 
| Intercept | 0.9375 | 0.1030 | 3 | 9.10 | 0.0028 | ||
| Genotype | Chiaro | 0.4675 | 0.1229 | 30 | 3.80 | 0.0007 | |
| Genotype | Collameno | 0.04750 | 0.1229 | 30 | 0.39 | 0.7020 | |
| Genotype | Palomb | -0.05500 | 0.1229 | 30 | -0.45 | 0.6578 | |
| Genotype | Scuro | 0.8100 | 0.1229 | 30 | 6.59 | <.0001 | |
| Genotype | Sicania | -0.1150 | 0.1229 | 30 | -0.94 | 0.3570 | |
| Genotype | Vesuvio | 0 | . | . | . | . | |
| Sowing | autumn | 1.3275 | 0.1294 | 3 | 10.26 | 0.0020 | |
| Sowing | spring | 0 | . | . | . | . | |
| Sowing*Genotype | autumn | Chiaro | -0.3225 | 0.1739 | 30 | -1.85 | 0.0735 | 
| Sowing*Genotype | autumn | Collameno | -0.08000 | 0.1739 | 30 | -0.46 | 0.6487 | 
| Sowing*Genotype | autumn | Palomb | 0.4750 | 0.1739 | 30 | 2.73 | 0.0104 | 
| Sowing*Genotype | autumn | Scuro | -0.9475 | 0.1739 | 30 | -5.45 | <.0001 | 
| Sowing*Genotype | autumn | Sicania | 0.08500 | 0.1739 | 30 | 0.49 | 0.6285 | 
| Sowing*Genotype | autumn | Vesuvio | 0 | . | . | . | . | 
| Sowing*Genotype | spring | Chiaro | 0 | . | . | . | . | 
| Sowing*Genotype | spring | Collameno | 0 | . | . | . | . | 
| Sowing*Genotype | spring | Palomb | 0 | . | . | . | . | 
| Sowing*Genotype | spring | Scuro | 0 | . | . | . | . | 
| Sowing*Genotype | spring | Sicania | 0 | . | . | . | . | 
| Sowing*Genotype | spring | Vesuvio | 0 | . | . | . | . | 
| Type III Tests of Fixed Effects | ||||
|---|---|---|---|---|
| Effect | Num DF | Den DF | F Value | Pr > F | 
| Genotype | 5 | 30 | 7.90 | <.0001 | 
| Sowing | 1 | 3 | 344.95 | 0.0003 | 
| Sowing*Genotype | 5 | 30 | 15.05 | <.0001 | 
| Solution for Random Effects | |||||||
|---|---|---|---|---|---|---|---|
| Effect | Sowing | Block | Estimate | Std Err Pred | DF | t Value | Pr > |t| | 
| Block | 1 | -0.1103 | 0.06610 | 30 | -1.67 | 0.1056 | |
| Block | 2 | -0.00142 | 0.06610 | 30 | -0.02 | 0.9829 | |
| Block | 3 | 0.05728 | 0.06610 | 30 | 0.87 | 0.3931 | |
| Block | 4 | 0.05443 | 0.06610 | 30 | 0.82 | 0.4168 | |
| Sowing(Block) | autumn | 1 | -0.04926 | 0.05122 | 30 | -0.96 | 0.3439 | 
| Sowing(Block) | spring | 1 | 0.009269 | 0.05122 | 30 | 0.18 | 0.8576 | 
| Sowing(Block) | autumn | 2 | -0.00761 | 0.05122 | 30 | -0.15 | 0.8828 | 
| Sowing(Block) | spring | 2 | 0.007098 | 0.05122 | 30 | 0.14 | 0.8907 | 
| Sowing(Block) | autumn | 3 | 0.01578 | 0.05122 | 30 | 0.31 | 0.7601 | 
| Sowing(Block) | spring | 3 | 0.004990 | 0.05122 | 30 | 0.10 | 0.9230 | 
| Sowing(Block) | autumn | 4 | 0.04109 | 0.05122 | 30 | 0.80 | 0.4287 | 
| Sowing(Block) | spring | 4 | -0.02136 | 0.05122 | 30 | -0.42 | 0.6797 | 
fit_1env_fixef <- lmer(Yield ~ Genotype * Sowing + (1 | Block) + (1 | Sowing:Block), 
                       data = fava, 
                       subset = Location == 'papiano' & Year == '2002-2003')
summary(fit_1env_fixef)Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: Yield ~ Genotype * Sowing + (1 | Block) + (1 | Sowing:Block)
   Data: fava
 Subset: Location == "papiano" & Year == "2002-2003"
REML criterion at convergence: -0.7
Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-1.94377 -0.48885  0.07059  0.49097  1.63660 
Random effects:
 Groups       Name        Variance Std.Dev.
 Sowing:Block (Intercept) 0.003253 0.05703 
 Block        (Intercept) 0.008973 0.09472 
 Residual                 0.030229 0.17386 
Number of obs: 48, groups:  Sowing:Block, 8; Block, 4
Fixed effects:
                               Estimate Std. Error      df t value Pr(>|t|)    
(Intercept)                      2.4100     0.1030 21.3948  23.393  < 2e-16 ***
GenotypeCollameno               -0.1775     0.1229 30.0000  -1.444  0.15916    
GenotypePalomb                   0.2750     0.1229 30.0000   2.237  0.03288 *  
GenotypeScuro                   -0.2825     0.1229 30.0000  -2.298  0.02872 *  
GenotypeSicania                 -0.1750     0.1229 30.0000  -1.423  0.16492    
GenotypeVesuvio                 -0.1450     0.1229 30.0000  -1.179  0.24750    
Sowingspring                    -1.0050     0.1294 25.4398  -7.767 3.56e-08 ***
GenotypeCollameno:Sowingspring  -0.2425     0.1739 30.0000  -1.395  0.17333    
GenotypePalomb:Sowingspring     -0.7975     0.1739 30.0000  -4.587 7.46e-05 ***
GenotypeScuro:Sowingspring       0.6250     0.1739 30.0000   3.595  0.00115 ** 
GenotypeSicania:Sowingspring    -0.4075     0.1739 30.0000  -2.344  0.02591 *  
GenotypeVesuvio:Sowingspring    -0.3225     0.1739 30.0000  -1.855  0.07346 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
            (Intr) GntypC GntypP GntypScr GntypScn GntypV Swngsp GntC:S GntP:S
GentypCllmn -0.597                                                            
GenotypPlmb -0.597  0.500                                                     
GenotypeScr -0.597  0.500  0.500                                              
GenotypeScn -0.597  0.500  0.500  0.500                                       
GenotypeVsv -0.597  0.500  0.500  0.500    0.500                              
Sowingsprng -0.628  0.475  0.475  0.475    0.475    0.475                     
GntypCllm:S  0.422 -0.707 -0.354 -0.354   -0.354   -0.354 -0.672              
GntypPlmb:S  0.422 -0.354 -0.707 -0.354   -0.354   -0.354 -0.672  0.500       
GntypScr:Sw  0.422 -0.354 -0.354 -0.707   -0.354   -0.354 -0.672  0.500  0.500
GntypScn:Sw  0.422 -0.354 -0.354 -0.354   -0.707   -0.354 -0.672  0.500  0.500
GntypVsv:Sw  0.422 -0.354 -0.354 -0.354   -0.354   -0.707 -0.672  0.500  0.500
            GntypScr:S GntypScn:S
GentypCllmn                      
GenotypPlmb                      
GenotypeScr                      
GenotypeScn                      
GenotypeVsv                      
Sowingsprng                      
GntypCllm:S                      
GntypPlmb:S                      
GntypScr:Sw                      
GntypScn:Sw  0.500               
GntypVsv:Sw  0.500      0.500    ranef(fit_1env_fixef)$`Sowing:Block`
          (Intercept)
autumn:1 -0.049255030
autumn:2 -0.007614710
autumn:3  0.015778601
autumn:4  0.041091139
spring:1  0.009269246
spring:2  0.007098097
spring:3  0.004989209
spring:4 -0.021356553
$Block
   (Intercept)
1 -0.110292120
2 -0.001424963
3  0.057283504
4  0.054433579
with conditional variances for "Sowing:Block" "Block" icc(fit_1env_fixef, by_group = TRUE)# ICC by Group
Group        |   ICC
--------------------
Sowing:Block | 0.077
Block        | 0.211We have a lot more output to wade through in the summary because we now have a lot of fixed effects. There are 5 coefficients for genotype because there are 6 genotypes, and the first one is the reference level (intercept). There is 1 coefficient for sowing treatment to differentiate spring from the reference level of autumn. Finally, there are 5 interaction coefficients. The summary also shows us the correlation between the fixed effects, which in this case we can safely ignore because the fixed effects were experimentally randomized. You might look at those for an observational study.
We now see completely different ICC results. We are not surprised
that the ICCs are much lower because we’ve added fixed effects that
explain a lot of the variation. They add up to about 29% of the
variation now. However, we see that now that we have accounted for
genotype and sowing as fixed effects, the block term is the larger
contributor to the random variation relative to main plot
(Block:Sowing). This makes sense because the sowing effect
seems to be large and when we did not include sowing in the model as a
fixed effect, all that variation was assigned to main plot. Now that
sowing is included, the main plot random variation is very low.
PROTIP: As before, we can get the ICC result from SAS by working directly with the variance components and getting the ratio of the variance due to each level of random effect to the total variance. ICC of main plot is
.003253 / (.008972 + .003253 + 0.03023) = 0.077, and ICC of block is.008972 / (.008972 + .003253 + 0.03023) = 0.211.
PROTIP 2: If you really want an ANOVA table for your mixed model in R, as long as you fit the model with the lmerTest package loaded, you can use
anova(fit)to get an ANOVA table, wherefitis the name of your fitted model object. You can specify the method of approximating degrees of freedom:anova(fit, ddf = 'Kenward-Roger')
So far so good, but all of that was only a model fit to one environment. We want to fit a model to all the data put together so that we can see whether there are consistent effects of genotype and sowing treatment on yield, that hold regardless of environmental conditions.
Should environment be fixed or random? Answer: It depends. Here, we are going to treat environment (combination of location and year) as a random effect. This means we are considering the environments at which the study was conducted to be samples from a population of potential environments. We want to make an inference about whether the effects we observe are consistent across that whole hypothetical population. We don’t necessarily care that much about specifically saying whether environment A is different from environment B. If we cared more about that, we could consider treating environment as a fixed effect in the analysis.
This means that we need to add more random intercept terms. In the previous model, we modeled all observations from the same block and from the same main plot within each block as being correlated with one another. We still want to do that, but we are going to add another level of nesting. We also want to model all observations from the same environment as being correlated with one another, as well as from the same block within environment, and the same main plot within block within environment. We are really putting the “multi” into “multilevel model!”
proc glimmix data = fava;
    class Environment Block Sowing Genotype;
    model Yield = Genotype|Sowing / solution;
    random Environment Block(Environment) Sowing(Block Environment) / solution;
  ods select parameterestimates tests3;
run;| Solutions for Fixed Effects | |||||||
|---|---|---|---|---|---|---|---|
| Effect | Sowing | Genotype | Estimate | Standard Error | DF | t Value | Pr > |t| | 
| Intercept | 2.4146 | 0.4770 | 5 | 5.06 | 0.0039 | ||
| Genotype | Chiaro | 0.4333 | 0.1170 | 230 | 3.70 | 0.0003 | |
| Genotype | Collameno | -0.05167 | 0.1170 | 230 | -0.44 | 0.6592 | |
| Genotype | Palomb | 0.02625 | 0.1170 | 230 | 0.22 | 0.8227 | |
| Genotype | Scuro | 0.6850 | 0.1170 | 230 | 5.85 | <.0001 | |
| Genotype | Sicania | 0.3779 | 0.1170 | 230 | 3.23 | 0.0014 | |
| Genotype | Vesuvio | 0 | . | . | . | . | |
| Sowing | autumn | 1.1308 | 0.1367 | 23 | 8.27 | <.0001 | |
| Sowing | spring | 0 | . | . | . | . | |
| Sowing*Genotype | autumn | Chiaro | -0.4017 | 0.1655 | 230 | -2.43 | 0.0160 | 
| Sowing*Genotype | autumn | Collameno | -0.1775 | 0.1655 | 230 | -1.07 | 0.2845 | 
| Sowing*Genotype | autumn | Palomb | 0.1825 | 0.1655 | 230 | 1.10 | 0.2712 | 
| Sowing*Genotype | autumn | Scuro | -0.5933 | 0.1655 | 230 | -3.59 | 0.0004 | 
| Sowing*Genotype | autumn | Sicania | -0.4725 | 0.1655 | 230 | -2.86 | 0.0047 | 
| Sowing*Genotype | autumn | Vesuvio | 0 | . | . | . | . | 
| Sowing*Genotype | spring | Chiaro | 0 | . | . | . | . | 
| Sowing*Genotype | spring | Collameno | 0 | . | . | . | . | 
| Sowing*Genotype | spring | Palomb | 0 | . | . | . | . | 
| Sowing*Genotype | spring | Scuro | 0 | . | . | . | . | 
| Sowing*Genotype | spring | Sicania | 0 | . | . | . | . | 
| Sowing*Genotype | spring | Vesuvio | 0 | . | . | . | . | 
| Type III Tests of Fixed Effects | ||||
|---|---|---|---|---|
| Effect | Num DF | Den DF | F Value | Pr > F | 
| Genotype | 5 | 230 | 9.78 | <.0001 | 
| Sowing | 1 | 23 | 107.93 | <.0001 | 
| Sowing*Genotype | 5 | 230 | 6.50 | <.0001 | 
We remove the subset argument to fit the model to the
full dataset.
fit_allenv <- lmer(Yield ~ Genotype * Sowing + (1 | Environment) + (1 | Block:Environment) + (1 | Sowing:Block:Environment), 
                   data = fava)boundary (singular) fit: see help('isSingular')We get a message that the fit is singular. What does this mean? Some components of the variance-covariance matrix of the random effects are either exactly zero or exactly one. OK, what about in English? Basically it means that the algorithm that fits the model parameters doesn’t have enough data to get a good estimate. This often happens when we are trying to fit a model that is too complex for the amount of data we have, or when the random effects are very small and can’t be distinguished from zero. The algorithm fails to estimate the random effects and return values of zero. Or in other cases, some of the random effects, or combinations of random effects, will be perfectly correlated with one another. This often happens when you are trying to fit a model with complex nested structure to a fairly small dataset … just like we are doing here!
To learn more about dealing with this issue, check out this article by Ben Bolker. Warning: the language is a little technical.
Returning to our model, let’s look at the variance-covariance matrix of the random effects to see why the fit is singular.
VarCorr(fit_allenv) Groups                   Name        Std.Dev.
 Sowing:Block:Environment (Intercept) 0.24525 
 Block:Environment        (Intercept) 0.00000 
 Environment              (Intercept) 1.14372 
 Residual                             0.40529 We can see that the blocks within each environment have zero variance. This shows that the effect is so small that the algorithm could not estimate it. We can refit the model with that effect removed, and only random effects for environment and main plot within block within environment.
proc glimmix data = fava plots = residualpanel;
    class Environment Block Sowing Genotype;
    model Yield = Genotype|Sowing / solution;
    random Environment Sowing(Block Environment) / solution;
  ods select parameterestimates tests3;
run;| Solutions for Fixed Effects | |||||||
|---|---|---|---|---|---|---|---|
| Effect | Sowing | Genotype | Estimate | Standard Error | DF | t Value | Pr > |t| | 
| Intercept | 2.4146 | 0.4768 | 5 | 5.06 | 0.0039 | ||
| Genotype | Chiaro | 0.4333 | 0.1170 | 230 | 3.70 | 0.0003 | |
| Genotype | Collameno | -0.05167 | 0.1170 | 230 | -0.44 | 0.6592 | |
| Genotype | Palomb | 0.02625 | 0.1170 | 230 | 0.22 | 0.8227 | |
| Genotype | Scuro | 0.6850 | 0.1170 | 230 | 5.85 | <.0001 | |
| Genotype | Sicania | 0.3779 | 0.1170 | 230 | 3.23 | 0.0014 | |
| Genotype | Vesuvio | 0 | . | . | . | . | |
| Sowing | autumn | 1.1308 | 0.1367 | 41 | 8.27 | <.0001 | |
| Sowing | spring | 0 | . | . | . | . | |
| Sowing*Genotype | autumn | Chiaro | -0.4017 | 0.1655 | 230 | -2.43 | 0.0160 | 
| Sowing*Genotype | autumn | Collameno | -0.1775 | 0.1655 | 230 | -1.07 | 0.2845 | 
| Sowing*Genotype | autumn | Palomb | 0.1825 | 0.1655 | 230 | 1.10 | 0.2712 | 
| Sowing*Genotype | autumn | Scuro | -0.5933 | 0.1655 | 230 | -3.59 | 0.0004 | 
| Sowing*Genotype | autumn | Sicania | -0.4725 | 0.1655 | 230 | -2.86 | 0.0047 | 
| Sowing*Genotype | autumn | Vesuvio | 0 | . | . | . | . | 
| Sowing*Genotype | spring | Chiaro | 0 | . | . | . | . | 
| Sowing*Genotype | spring | Collameno | 0 | . | . | . | . | 
| Sowing*Genotype | spring | Palomb | 0 | . | . | . | . | 
| Sowing*Genotype | spring | Scuro | 0 | . | . | . | . | 
| Sowing*Genotype | spring | Sicania | 0 | . | . | . | . | 
| Sowing*Genotype | spring | Vesuvio | 0 | . | . | . | . | 
| Type III Tests of Fixed Effects | ||||
|---|---|---|---|---|
| Effect | Num DF | Den DF | F Value | Pr > F | 
| Genotype | 5 | 230 | 9.78 | <.0001 | 
| Sowing | 1 | 41 | 107.89 | <.0001 | 
| Sowing*Genotype | 5 | 230 | 6.50 | <.0001 | 
fit_allenv_noblock <- lmer(Yield ~ Genotype * Sowing + (1 | Environment) + (1 | Sowing:Block:Environment), 
                           data = fava)We no longer get the singular fit message.
Now that we have a good model, we can look at the model diagnostics.
This was done in SAS by adding the option
plots = residualpanel. We will use
check_model() from the easystats package
in R, as we did in the previous lesson. Specify we want to check
linearity and homogeneity of the residuals, and show a normal
quantile-quantile plot of the residuals.
check_model(fit_allenv_noblock, check = c('linearity', 'homogeneity', 'qq'))All of these diagnostic plots look great. There is little to no trend in the residual variance. The individual data points’ residuals fall along the straight line of the normal Q-Q plot.
One other thing we may want to include in our model is to allow one or more of the treatment effects to vary by environment. In the model we just fit, we only included random intercepts. This means that the “baseline” yield is allowed to vary between environments, but we are assuming a single effect size of treatment and genotype that holds across all environments. We can relax that constraint using “random slopes.” We can allow one or more of the effects to vary by environment. We could even have random slopes at the block or main plot level, but not only would that be way too complex of a model for our small dataset, it also doesn’t seem as likely to be a large effect. So we will stick with testing random slope terms by environment.
We can start with a complex model that includes random slope terms
for both sowing treatment and genotype, with respect to environment. In
the lme4 package in R, we can do this by adding terms
on the design side of each random effect in the model formula.
In this case, we want not only to have a different intercept for each
environment, but also to have a different slope for sowing and for
genotype. So the Environment term in the formula will be
(1 + Genotype + Sowing | Environment).
proc glimmix data = fava plots = residualpanel;
    class Environment Block Sowing Genotype;
    model Yield = Genotype|Sowing / solution;
    random intercept Genotype Sowing / subject = Environment;
    random intercept / subject = Sowing(Block Environment);
  ods select parameterestimates tests3;
run;| Solutions for Fixed Effects | |||||||
|---|---|---|---|---|---|---|---|
| Effect | Sowing | Genotype | Estimate | Standard Error | DF | t Value | Pr > |t| | 
| Intercept | 2.4146 | 0.4881 | 5 | 4.95 | 0.0043 | ||
| Genotype | Chiaro | 0.4333 | 0.1467 | 25 | 2.95 | 0.0068 | |
| Genotype | Collameno | -0.05167 | 0.1467 | 25 | -0.35 | 0.7277 | |
| Genotype | Palomb | 0.02625 | 0.1467 | 25 | 0.18 | 0.8595 | |
| Genotype | Scuro | 0.6850 | 0.1467 | 25 | 4.67 | <.0001 | |
| Genotype | Sicania | 0.3779 | 0.1467 | 25 | 2.58 | 0.0163 | |
| Genotype | Vesuvio | 0 | . | . | . | . | |
| Sowing | autumn | 1.1308 | 0.2176 | 5 | 5.20 | 0.0035 | |
| Sowing | spring | 0 | . | . | . | . | |
| Sowing*Genotype | autumn | Chiaro | -0.4017 | 0.1517 | 205 | -2.65 | 0.0087 | 
| Sowing*Genotype | autumn | Collameno | -0.1775 | 0.1517 | 205 | -1.17 | 0.2434 | 
| Sowing*Genotype | autumn | Palomb | 0.1825 | 0.1517 | 205 | 1.20 | 0.2304 | 
| Sowing*Genotype | autumn | Scuro | -0.5933 | 0.1517 | 205 | -3.91 | 0.0001 | 
| Sowing*Genotype | autumn | Sicania | -0.4725 | 0.1517 | 205 | -3.11 | 0.0021 | 
| Sowing*Genotype | autumn | Vesuvio | 0 | . | . | . | . | 
| Sowing*Genotype | spring | Chiaro | 0 | . | . | . | . | 
| Sowing*Genotype | spring | Collameno | 0 | . | . | . | . | 
| Sowing*Genotype | spring | Palomb | 0 | . | . | . | . | 
| Sowing*Genotype | spring | Scuro | 0 | . | . | . | . | 
| Sowing*Genotype | spring | Sicania | 0 | . | . | . | . | 
| Sowing*Genotype | spring | Vesuvio | 0 | . | . | . | . | 
| Type III Tests of Fixed Effects | ||||
|---|---|---|---|---|
| Effect | Num DF | Den DF | F Value | Pr > F | 
| Genotype | 5 | 25 | 4.24 | 0.0063 | 
| Sowing | 1 | 5 | 20.84 | 0.0060 | 
| Sowing*Genotype | 5 | 205 | 7.73 | <.0001 | 
fit_allenv_allrandomslopes <- lmer(Yield ~ Genotype * Sowing + (1 + Genotype + Sowing | Environment) + (1 | Sowing:Block:Environment), 
                                   data = fava)boundary (singular) fit: see help('isSingular')Again, we get a singular fit message. Very likely, we are trying to fit a too-complex model to this small dataset.
VarCorr(fit_allenv_allrandomslopes) Groups                   Name              Std.Dev. Corr                
 Sowing:Block:Environment (Intercept)       0.119970                     
 Environment              (Intercept)       0.981421                     
                          GenotypeCollameno 0.314009 -0.107              
                          GenotypePalomb    0.245292  0.392 -0.144       
                          GenotypeScuro     0.168246  0.519 -0.810  0.641
                          GenotypeSicania   0.223958  0.920 -0.442  0.229
                          GenotypeVesuvio   0.071028  0.390 -0.481  0.937
                          Sowingspring      0.459510  0.325  0.876  0.250
 Residual                                   0.359447                     
                     
                     
                     
                     
                     
                     
  0.669              
  0.855  0.364       
 -0.433 -0.060 -0.086
                     The variance-covariance matrix of the random effects doesn’t show any that are all zero. However, you can get a singular fit even when none of the individual random effects are zero. It can also occur when linear combinations of some of the random effects are equal to one another.
Let’s simplify by getting rid of the random slope for genotype and only keeping the random slope for sowing.
proc glimmix data = fava plots = residualpanel;
    class Environment Block Sowing Genotype;
    model Yield = Genotype|Sowing / solution;
    random intercept Sowing / subject = Environment solution;
    random intercept / subject = Sowing(Block Environment);
  ods select parameterestimates tests3;
    store fit_allenv_sowingrandomslope;
run;| Solutions for Fixed Effects | |||||||
|---|---|---|---|---|---|---|---|
| Effect | Sowing | Genotype | Estimate | Standard Error | DF | t Value | Pr > |t| | 
| Intercept | 2.4146 | 0.4847 | 5 | 4.98 | 0.0042 | ||
| Genotype | Chiaro | 0.4333 | 0.1170 | 230 | 3.70 | 0.0003 | |
| Genotype | Collameno | -0.05167 | 0.1170 | 230 | -0.44 | 0.6592 | |
| Genotype | Palomb | 0.02625 | 0.1170 | 230 | 0.22 | 0.8227 | |
| Genotype | Scuro | 0.6850 | 0.1170 | 230 | 5.85 | <.0001 | |
| Genotype | Sicania | 0.3779 | 0.1170 | 230 | 3.23 | 0.0014 | |
| Genotype | Vesuvio | 0 | . | . | . | . | |
| Sowing | autumn | 1.1308 | 0.2217 | 5 | 5.10 | 0.0038 | |
| Sowing | spring | 0 | . | . | . | . | |
| Sowing*Genotype | autumn | Chiaro | -0.4017 | 0.1655 | 230 | -2.43 | 0.0160 | 
| Sowing*Genotype | autumn | Collameno | -0.1775 | 0.1655 | 230 | -1.07 | 0.2845 | 
| Sowing*Genotype | autumn | Palomb | 0.1825 | 0.1655 | 230 | 1.10 | 0.2712 | 
| Sowing*Genotype | autumn | Scuro | -0.5933 | 0.1655 | 230 | -3.59 | 0.0004 | 
| Sowing*Genotype | autumn | Sicania | -0.4725 | 0.1655 | 230 | -2.86 | 0.0047 | 
| Sowing*Genotype | autumn | Vesuvio | 0 | . | . | . | . | 
| Sowing*Genotype | spring | Chiaro | 0 | . | . | . | . | 
| Sowing*Genotype | spring | Collameno | 0 | . | . | . | . | 
| Sowing*Genotype | spring | Palomb | 0 | . | . | . | . | 
| Sowing*Genotype | spring | Scuro | 0 | . | . | . | . | 
| Sowing*Genotype | spring | Sicania | 0 | . | . | . | . | 
| Sowing*Genotype | spring | Vesuvio | 0 | . | . | . | . | 
| Type III Tests of Fixed Effects | ||||
|---|---|---|---|---|
| Effect | Num DF | Den DF | F Value | Pr > F | 
| Genotype | 5 | 230 | 9.78 | <.0001 | 
| Sowing | 1 | 5 | 20.84 | 0.0060 | 
| Sowing*Genotype | 5 | 230 | 6.50 | <.0001 | 
fit_allenv_sowingrandomslope <- lmer(Yield ~ Genotype * Sowing + (1 + Sowing | Environment) + (1 | Sowing:Block:Environment), 
                                     data = fava)This gets rid of the singular fit message. Let’s look at the random intercepts and slopes for environment.
ranef(fit_allenv_sowingrandomslope)$Environment                  (Intercept) Sowingspring
badiola.2001-2002   0.3427129  -0.45877557
papiano.2001-2002   0.9006396  -0.21349718
badiola.2002-2003  -1.4271618  -0.07956543
papiano.2002-2003  -1.2176835  -0.30602556
badiola.2003-2004   0.6962424   0.38184826
papiano.2003-2004   0.7052505   0.67601548This shows that the random slope for the spring sowing treatment is positive for the two 2003-2004 environments. When you look back at a plot of the raw data, this makes sense. The effect of spring sowing on yield is much less negative for those two environments. It is probably a good thing to include the sowing random slope term by environment.
Let’s be thorough and take a look at the model diagnostics for the model we are sticking with, that includes the sowing random slope but not genotype.
check_model(fit_allenv_sowingrandomslope, check = c('linearity', 'homogeneity', 'qq'))We see good residual diagnostics across the board.
Let’s confirm our intuition that the random slope model is a good one
by comparing the models’ Akaike Information Criteria (AIC). In SAS, this
is output by default. In R, we can use the AIC() function
to compare several models’ AIC at once.
AIC(fit_allenv, fit_allenv_noblock, fit_allenv_allrandomslopes, fit_allenv_sowingrandomslope)                             df      AIC
fit_allenv                   16 432.2706
fit_allenv_noblock           15 430.2706
fit_allenv_allrandomslopes   42 413.9765
fit_allenv_sowingrandomslope 17 409.8349We see that the lowest AIC (best fit) is the sowing random slope model, followed by the model including both random slopes. The random-intercept-only models are inferior.
PROTIP: AIC is a relative measure. It can only say what model is a better fit to the data out of the ones you are choosing from. It doesn’t say which one is “right.” For one thing, “all models are wrong,” and for another thing, we might be comparing a set of terrible models! Humility is important when you are doing stats!
A great innovation of the statisticians at SAS was the development of
least-square means (lsmeans). Almost all the SAS model
fitting PROCs have a lsmeans statement. You can use it to
get estimated values of treatment means at different levels of other
fixed effects, and averaged over the random variation. It gives you
confidence limits using an approximation method to estimate the degrees
of freedom, and lets you do pairwise comparisons with proper p-value
adjustment for multiple comparisons.
Statistician Russ Lenth from the University of Iowa implemented all of this same functionality, and then some, in the R package emmeans (estimated marginal means). Estimated marginal means is synonymous with least-square means (it may be a more appropriate term because least-square method isn’t always used to estimate them).
NOTE: I separated out the SAS code for
lsmeansinto a separateproc plmwhich allows you to separate model fitting and post-fitting analyses. That’s a good practice to keep in mind as a SAS user. We had to include astorestatement in our call toproc glimmixso the model object can be restored and used later inproc plm.
proc plm restore = fit_allenv_sowingrandomslope;
    lsmeans Genotype Sowing / diff linestable adjust = sidak;
    slice Genotype * Sowing / sliceby = Sowing diff linestable adjust = sidak;
    ods select lsmeans lsmlines slicediffs slicelines;
run;
NOTE: PROCEDURE PLM used (Total process time):
      real time           0.96 seconds
      cpu time            0.34 seconds
      | Genotype Least Squares Means | |||||
|---|---|---|---|---|---|
| Genotype | Estimate | Standard Error | DF | t Value | Pr > |t| | 
| Chiaro | 3.2125 | 0.4719 | 230 | 6.81 | <.0001 | 
| Collameno | 2.8396 | 0.4719 | 230 | 6.02 | <.0001 | 
| Palomb | 3.0975 | 0.4719 | 230 | 6.56 | <.0001 | 
| Scuro | 3.3683 | 0.4719 | 230 | 7.14 | <.0001 | 
| Sicania | 3.1217 | 0.4719 | 230 | 6.62 | <.0001 | 
| Vesuvio | 2.9800 | 0.4719 | 230 | 6.31 | <.0001 | 
| Sidak Grouping for Genotype Least Squares Means (Alpha=0.05) | |||
|---|---|---|---|
| LS-means with the same letter are not significantly different. | |||
| Genotype | Estimate | ||
| Scuro | 3.3683 | A | |
| A | |||
| Chiaro | 3.2125 | B | A | 
| B | |||
| Sicania | 3.1217 | B | |
| B | |||
| Palomb | 3.0975 | B | |
| B | |||
| Vesuvio | 2.9800 | B | C | 
| C | |||
| Collameno | 2.8396 | C | |
| Sowing Least Squares Means | |||||
|---|---|---|---|---|---|
| Sowing | Estimate | Standard Error | DF | t Value | Pr > |t| | 
| autumn | 3.5468 | 0.4788 | 5 | 7.41 | 0.0007 | 
| spring | 2.6597 | 0.4788 | 5 | 5.55 | 0.0026 | 
| Sidak Grouping for Sowing Least Squares Means (Alpha=0.05) | ||
|---|---|---|
| LS-means with the same letter are not significantly different. | ||
| Sowing | Estimate | |
| autumn | 3.5468 | A | 
| spring | 2.6597 | B | 
| Simple Differences of Sowing*Genotype Least Squares Means Adjustment for Multiple Comparisons: Sidak | ||||||||
|---|---|---|---|---|---|---|---|---|
| Slice | Genotype | _Genotype | Estimate | Standard Error | DF | t Value | Pr > |t| | Adj P | 
| Sowing autumn | Chiaro | Collameno | 0.2608 | 0.1170 | 230 | 2.23 | 0.0268 | 0.3342 | 
| Sowing autumn | Chiaro | Palomb | -0.1771 | 0.1170 | 230 | -1.51 | 0.1315 | 0.8794 | 
| Sowing autumn | Chiaro | Scuro | -0.06000 | 0.1170 | 230 | -0.51 | 0.6086 | 1.0000 | 
| Sowing autumn | Chiaro | Sicania | 0.1262 | 0.1170 | 230 | 1.08 | 0.2817 | 0.9930 | 
| Sowing autumn | Chiaro | Vesuvio | 0.03167 | 0.1170 | 230 | 0.27 | 0.7869 | 1.0000 | 
| Sowing autumn | Collameno | Palomb | -0.4379 | 0.1170 | 230 | -3.74 | 0.0002 | 0.0034 | 
| Sowing autumn | Collameno | Scuro | -0.3208 | 0.1170 | 230 | -2.74 | 0.0066 | 0.0943 | 
| Sowing autumn | Collameno | Sicania | -0.1346 | 0.1170 | 230 | -1.15 | 0.2512 | 0.9870 | 
| Sowing autumn | Collameno | Vesuvio | -0.2292 | 0.1170 | 230 | -1.96 | 0.0514 | 0.5465 | 
| Sowing autumn | Palomb | Scuro | 0.1171 | 0.1170 | 230 | 1.00 | 0.3180 | 0.9968 | 
| Sowing autumn | Palomb | Sicania | 0.3033 | 0.1170 | 230 | 2.59 | 0.0101 | 0.1417 | 
| Sowing autumn | Palomb | Vesuvio | 0.2087 | 0.1170 | 230 | 1.78 | 0.0757 | 0.6930 | 
| Sowing autumn | Scuro | Sicania | 0.1862 | 0.1170 | 230 | 1.59 | 0.1128 | 0.8338 | 
| Sowing autumn | Scuro | Vesuvio | 0.09167 | 0.1170 | 230 | 0.78 | 0.4341 | 0.9998 | 
| Sowing autumn | Sicania | Vesuvio | -0.09458 | 0.1170 | 230 | -0.81 | 0.4197 | 0.9997 | 
| Sidak Grouping for Sowing*Genotype Least Squares Means Slice (Alpha=0.05) | ||||
|---|---|---|---|---|
| LS-means with the same letter are not significantly different. | ||||
| Slice | Genotype | Estimate | ||
| Sowing autumn | Palomb | 3.7542 | A | |
| Sowing autumn | A | |||
| Sowing autumn | Scuro | 3.6371 | B | A | 
| Sowing autumn | B | A | ||
| Sowing autumn | Chiaro | 3.5771 | B | A | 
| Sowing autumn | B | A | ||
| Sowing autumn | Vesuvio | 3.5454 | B | A | 
| Sowing autumn | B | A | ||
| Sowing autumn | Sicania | 3.4508 | B | A | 
| Sowing autumn | B | |||
| Sowing autumn | Collameno | 3.3162 | B | |
| Simple Differences of Sowing*Genotype Least Squares Means Adjustment for Multiple Comparisons: Sidak | ||||||||
|---|---|---|---|---|---|---|---|---|
| Slice | Genotype | _Genotype | Estimate | Standard Error | DF | t Value | Pr > |t| | Adj P | 
| Sowing spring | Chiaro | Collameno | 0.4850 | 0.1170 | 230 | 4.15 | <.0001 | 0.0007 | 
| Sowing spring | Chiaro | Palomb | 0.4071 | 0.1170 | 230 | 3.48 | 0.0006 | 0.0090 | 
| Sowing spring | Chiaro | Scuro | -0.2517 | 0.1170 | 230 | -2.15 | 0.0325 | 0.3909 | 
| Sowing spring | Chiaro | Sicania | 0.05542 | 0.1170 | 230 | 0.47 | 0.6362 | 1.0000 | 
| Sowing spring | Chiaro | Vesuvio | 0.4333 | 0.1170 | 230 | 3.70 | 0.0003 | 0.0040 | 
| Sowing spring | Collameno | Palomb | -0.07792 | 0.1170 | 230 | -0.67 | 0.5061 | 1.0000 | 
| Sowing spring | Collameno | Scuro | -0.7367 | 0.1170 | 230 | -6.30 | <.0001 | <.0001 | 
| Sowing spring | Collameno | Sicania | -0.4296 | 0.1170 | 230 | -3.67 | 0.0003 | 0.0045 | 
| Sowing spring | Collameno | Vesuvio | -0.05167 | 0.1170 | 230 | -0.44 | 0.6592 | 1.0000 | 
| Sowing spring | Palomb | Scuro | -0.6587 | 0.1170 | 230 | -5.63 | <.0001 | <.0001 | 
| Sowing spring | Palomb | Sicania | -0.3517 | 0.1170 | 230 | -3.01 | 0.0029 | 0.0432 | 
| Sowing spring | Palomb | Vesuvio | 0.02625 | 0.1170 | 230 | 0.22 | 0.8227 | 1.0000 | 
| Sowing spring | Scuro | Sicania | 0.3071 | 0.1170 | 230 | 2.62 | 0.0093 | 0.1302 | 
| Sowing spring | Scuro | Vesuvio | 0.6850 | 0.1170 | 230 | 5.85 | <.0001 | <.0001 | 
| Sowing spring | Sicania | Vesuvio | 0.3779 | 0.1170 | 230 | 3.23 | 0.0014 | 0.0211 | 
| Sidak Grouping for Sowing*Genotype Least Squares Means Slice (Alpha=0.05) | |||
|---|---|---|---|
| LS-means with the same letter are not significantly different. | |||
| Slice | Genotype | Estimate | |
| Sowing spring | Scuro | 3.0996 | A | 
| Sowing spring | A | ||
| Sowing spring | Chiaro | 2.8479 | A | 
| Sowing spring | A | ||
| Sowing spring | Sicania | 2.7925 | A | 
| Sowing spring | |||
| Sowing spring | Palomb | 2.4408 | B | 
| Sowing spring | B | ||
| Sowing spring | Vesuvio | 2.4146 | B | 
| Sowing spring | B | ||
| Sowing spring | Collameno | 2.3629 | B | 
We provide the emmeans() function first with the name of
the fitted model object, then a one-sided formula with the fixed effects
that we want to calculate means for on the right-hand side of the
~. If you want multiple effects crossed, use
+. If you want to calculate the means for one treatment
within another, use | for the grouping factor. (The
difference will be clear when we do pairwise contrasts).
emm_geno <- emmeans(fit_allenv_sowingrandomslope, ~ Genotype)NOTE: Results may be misleading due to involvement in interactionsemm_sowing <- emmeans(fit_allenv_sowingrandomslope, ~ Sowing)NOTE: Results may be misleading due to involvement in interactionsemm_geno_x_sowing <- emmeans(fit_allenv_sowingrandomslope, ~ Genotype + Sowing)
emm_geno_by_sowing <- emmeans(fit_allenv_sowingrandomslope, ~ Genotype | Sowing)Type the name of the object to print a handy summary including 95% confidence intervals estimated with the Kenward-Roger method of degrees of freedom approximation.
emm_geno Genotype  emmean    SE   df lower.CL upper.CL
 Chiaro      3.21 0.472 5.13     2.01     4.42
 Collameno   2.84 0.472 5.13     1.64     4.04
 Palomb      3.10 0.472 5.13     1.89     4.30
 Scuro       3.37 0.472 5.13     2.16     4.57
 Sicania     3.12 0.472 5.13     1.92     4.33
 Vesuvio     2.98 0.472 5.13     1.78     4.18
Results are averaged over the levels of: Sowing 
Degrees-of-freedom method: kenward-roger 
Confidence level used: 0.95 emm_sowing Sowing emmean    SE df lower.CL upper.CL
 autumn   3.55 0.429  5     2.44     4.65
 spring   2.66 0.524  5     1.31     4.01
Results are averaged over the levels of: Genotype 
Degrees-of-freedom method: kenward-roger 
Confidence level used: 0.95 emm_geno_x_sowing Genotype  Sowing emmean    SE   df lower.CL upper.CL
 Chiaro    autumn   3.58 0.435 5.31     2.48     4.68
 Collameno autumn   3.32 0.435 5.31     2.22     4.42
 Palomb    autumn   3.75 0.435 5.31     2.65     4.85
 Scuro     autumn   3.64 0.435 5.31     2.54     4.74
 Sicania   autumn   3.45 0.435 5.31     2.35     4.55
 Vesuvio   autumn   3.55 0.435 5.31     2.45     4.65
 Chiaro    spring   2.85 0.529 5.21     1.50     4.19
 Collameno spring   2.36 0.529 5.21     1.02     3.71
 Palomb    spring   2.44 0.529 5.21     1.10     3.79
 Scuro     spring   3.10 0.529 5.21     1.76     4.44
 Sicania   spring   2.79 0.529 5.21     1.45     4.14
 Vesuvio   spring   2.41 0.529 5.21     1.07     3.76
Degrees-of-freedom method: kenward-roger 
Confidence level used: 0.95 emm_geno_by_sowingSowing = autumn:
 Genotype  emmean    SE   df lower.CL upper.CL
 Chiaro      3.58 0.435 5.31     2.48     4.68
 Collameno   3.32 0.435 5.31     2.22     4.42
 Palomb      3.75 0.435 5.31     2.65     4.85
 Scuro       3.64 0.435 5.31     2.54     4.74
 Sicania     3.45 0.435 5.31     2.35     4.55
 Vesuvio     3.55 0.435 5.31     2.45     4.65
Sowing = spring:
 Genotype  emmean    SE   df lower.CL upper.CL
 Chiaro      2.85 0.529 5.21     1.50     4.19
 Collameno   2.36 0.529 5.21     1.02     3.71
 Palomb      2.44 0.529 5.21     1.10     3.79
 Scuro       3.10 0.529 5.21     1.76     4.44
 Sicania     2.79 0.529 5.21     1.45     4.14
 Vesuvio     2.41 0.529 5.21     1.07     3.76
Degrees-of-freedom method: kenward-roger 
Confidence level used: 0.95 Next, do pairwise treatment contrasts. For the ones involving genotype, there is more than one comparison. Thus we adjust the p-values using the Sidak adjustment for multiple comparisons. Feel free to use your preferred method (Tukey, Holm, Bonferroni, etc.)
contrast(emm_geno, 'pairwise', adjust = 'sidak') contrast            estimate     SE  df t.ratio p.value
 Chiaro - Collameno    0.3729 0.0827 230   4.508  0.0002
 Chiaro - Palomb       0.1150 0.0827 230   1.390  0.9341
 Chiaro - Scuro       -0.1558 0.0827 230  -1.884  0.6102
 Chiaro - Sicania      0.0908 0.0827 230   1.098  0.9917
 Chiaro - Vesuvio      0.2325 0.0827 230   2.810  0.0777
 Collameno - Palomb   -0.2579 0.0827 230  -3.118  0.0304
 Collameno - Scuro    -0.5288 0.0827 230  -6.391  <.0001
 Collameno - Sicania  -0.2821 0.0827 230  -3.410  0.0115
 Collameno - Vesuvio  -0.1404 0.0827 230  -1.697  0.7609
 Palomb - Scuro       -0.2708 0.0827 230  -3.274  0.0182
 Palomb - Sicania     -0.0242 0.0827 230  -0.292  1.0000
 Palomb - Vesuvio      0.1175 0.0827 230   1.420  0.9227
 Scuro - Sicania       0.2467 0.0827 230   2.982  0.0466
 Scuro - Vesuvio       0.3883 0.0827 230   4.694  0.0001
 Sicania - Vesuvio     0.1417 0.0827 230   1.712  0.7495
Results are averaged over the levels of: Sowing 
Degrees-of-freedom method: kenward-roger 
P value adjustment: sidak method for 15 tests contrast(emm_sowing, 'pairwise') contrast        estimate    SE df t.ratio p.value
 autumn - spring    0.887 0.194  5   4.565  0.0060
Results are averaged over the levels of: Genotype 
Degrees-of-freedom method: kenward-roger I won’t print the results of the pairwise contrasts for multiple factors here, but try these out to see how they differ.
contrast(emm_geno_x_sowing, 'pairwise', adjust = 'sidak')
contrast(emm_geno_by_sowing, 'pairwise', adjust = 'sidak')You may also want to get multiple comparison letters to concisely
summarize the significant differences (this is done with the
lines and/or linestable options in the
lsmeans statement in SAS). In R, the
multcomp package has a function called
cld(), for “compact letter display.” The first argument to
cld() is an emmeans object. We also provide
our favorite multiple-comparison adjustment method with the
adjust argument. The argument
Letters = letters makes the group labels
a,b,c... instead of 1,2,3....
cld(emm_geno, adjust = 'sidak', Letters = letters) Genotype  emmean    SE   df lower.CL upper.CL .group
 Collameno   2.84 0.472 5.13    0.886     4.79  a    
 Vesuvio     2.98 0.472 5.13    1.026     4.93  ab   
 Palomb      3.10 0.472 5.13    1.144     5.05   b   
 Sicania     3.12 0.472 5.13    1.168     5.08   b   
 Chiaro      3.21 0.472 5.13    1.259     5.17   bc  
 Scuro       3.37 0.472 5.13    1.415     5.32    c  
Results are averaged over the levels of: Sowing 
Degrees-of-freedom method: kenward-roger 
Confidence level used: 0.95 
Conf-level adjustment: sidak method for 6 estimates 
P value adjustment: sidak method for 15 tests 
significance level used: alpha = 0.05 
NOTE: If two or more means share the same grouping symbol,
      then we cannot show them to be different.
      But we also did not show them to be the same. cld(emm_geno_by_sowing, adjust = 'sidak', Letters = letters)Sowing = autumn:
 Genotype  emmean    SE   df lower.CL upper.CL .group
 Collameno   3.32 0.435 5.31    1.545     5.09  a    
 Sicania     3.45 0.435 5.31    1.680     5.22  ab   
 Vesuvio     3.55 0.435 5.31    1.775     5.32  ab   
 Chiaro      3.58 0.435 5.31    1.806     5.35  ab   
 Scuro       3.64 0.435 5.31    1.866     5.41  ab   
 Palomb      3.75 0.435 5.31    1.983     5.53   b   
Sowing = spring:
 Genotype  emmean    SE   df lower.CL upper.CL .group
 Collameno   2.36 0.529 5.21    0.188     4.54  a    
 Vesuvio     2.41 0.529 5.21    0.240     4.59  a    
 Palomb      2.44 0.529 5.21    0.266     4.62  a    
 Sicania     2.79 0.529 5.21    0.618     4.97   b   
 Chiaro      2.85 0.529 5.21    0.673     5.02   b   
 Scuro       3.10 0.529 5.21    0.925     5.27   b   
Degrees-of-freedom method: kenward-roger 
Confidence level used: 0.95 
Conf-level adjustment: sidak method for 6 estimates 
P value adjustment: sidak method for 15 tests 
significance level used: alpha = 0.05 
NOTE: If two or more means share the same grouping symbol,
      then we cannot show them to be different.
      But we also did not show them to be the same. PROTIP: You can call the function
plot()and pass any of youremmeansorcontrastobjects to it as an argument. There are built-in methods that create plots of the means or contrasts, and their 95% confidence intervals. I will not get into that here. In future lessons we will take a closer look at those plots and how to customize them.
What have you learned in this lesson?
Impressive!
This was Lesson 2 in a series. Future lessons will cover:
As in the previous lesson, here are some exercises for additional practice. These exercises give you some SAS code and ask you to write some R code that does the same job or close to it.
For these exercises, we will be using a classic dataset from the notes of Gertrude Cox, a pioneering statistician who incidentally was the first female professor at N.C. State University! This dataset comes from a strip-split-plot experiment examining how soil type, fertilizer, and calcium affect barley yield.
The barley dataset is provided with the R package
agridat. You can load it into your R workspace by
calling the function data() after loading the package.
library(agridat)
data('cox.stripsplit')In the experiment, there are four replicated blocks. Within each block, there are four vertical strips each with a different fertilizer treatment. Each strip is split in half to yield two subplots per strip, each with a different level of calcium. Perpendicularly to the fertilizer strips there are horizontal strips with different soil types.
To help visualize, this is what one of the four blocks might look like. Note that the fertilizer (F, shown as different colors) is applied in vertical strips in a random order and the soil type (S, shown as different shadings) is applied perpendicularly. Within each plot formed by the intersection of two strips, there is a subplot for each calcium (C) treatment, arranged randomly. The other three blocks would look similar but with different randomized locations of F, S, and C levels.
More details can be found in the help documentation for the dataset.
Type ?cox.stripsplit.
Specify the model with the fixed and random effects we identified in
Exercise 1. Include only random intercept effects, no random slopes. The
model and random statements to fit the same
model using SAS proc glimmix are:
model yield = soil|fert|calcium / ddfm=kr;
random rep fert(rep) soil(rep) fert(soil*rep);*, which will estimate main effects
and all interactions.(1 | <something>).Did you get any warning messages when fitting the model? Can you diagnose the issue? What, if anything, should you do about it?
Fit the model again, this time adding a random slope for the main
effect of calcium, grouped by rep. This corresponds to the following two
random statements in proc glimmix:
random fert(rep) soil(rep);
random intercept calcium / subject=rep;+ calcium
somewhere in the random effects part of the model formula.What happens when you do this? Can you diagnose the issue?
Write code that gets a similar result as you would get if you
inserted the following slice statement into your SAS
proc glimmix code, for the model with only random
intercepts.
slice fert*calcium / sliceby=calcium diff cl linestable adjust=tukey adjdfe=row;Your code should do the following:
emmeans(), contrast(), and
cld().adjust = 'tukey' in two of the three function
calls.