This workshop is intended for SAS users who want to learn R. The people who will get the most out of this course are practicing researchers who have a decent working knowledge of SAS, and of basic statistical analysis (descriptive stats and regression models) as it applies to their field.
This is lesson 2 in a series. Lesson 1 covered the basics: importing data, cleaning and reshaping data, summary statistics, simple graphs and tables, and a few simple statistical models.
During this workshop, you will …
As in Lesson 1, we will work through a “data to doc” pipeline in R, comparing R code to SAS code each step of the way. We will use a different dataset this time.
We will …
...
with
code)As in the previous lesson, we will start with raw data and work our way to a finished product. The first few steps of the pipeline will not be completely new to you if you did Lesson 1 … but it is good to get some extra practice!
Here we’ll load the R packages we are going to work with today. These are mostly the same as Lesson 1. This includes the tidyverse package for reading, manipulating, and plotting data, the lme4 package for fitting linear mixed models, and the easystats package which has some good model diagnostic plots. Also set a default plotting theme.
Note about packages: R has lots of mixed model fitting packages. This fits in with the idea that the best thing about R is “there are many ways to do something,” and the worst thing about R is also “there are many ways to do something.” I would say that nowadays the lme4 package is probably the most widely used package for fitting linear mixed models, so it’s important to be familiar with it if you are doing stats with R. The nlme package also has some useful capabilities so I would also recommend familiarizing yourself with it. Even more complex models can be fit with glmmTMB and other packages. Ultimately, you can fit multilevel models of any level of sophistication using Bayesian methods with packages like brms.
library(tidyverse)
library(lme4)
library(easystats)
library(lmerTest)
library(emmeans)
library(multcomp)
theme_set(theme_bw())
In this lesson, we’re going to use a dataset kindly provided by Andrea Onofri, a regular contributor to r-bloggers. This tutorial is loosely based on this blog post.
fava beans growing in the field … and cooked into some
delicious ful medames! (Images by Josep Gesti and News from
Jerusalem)
Today we are going to explore the wonderful world of fava beans! This
dataset, fava.csv
, includes the results of a study
comparing the performance (yield) of six different fava bean genotypes
under two different sowing treatments: spring and autumn sowing.
Let’s try to understand the experimental design. This study was designed as a split-plot experiment with multiple replicates (blocks). Sowing time (spring and autumn) is the main plot factor and genotype (Chiaro, Collameno, Palomb, Scuro, Sicania, and Vesuvio) is the subplot factor. There are 2 sowing treatments and 6 genotypes, for a total of 12 genotype-treatment combinations. Furthermore, the experiment was repeated in two locations in Italy (Badiola and Papiano) for three years each (2001-2002, 2002-2003, and 2003-2004).
What does split-plot mean? This means that each block consists of two main plots (one sown in spring and the other in autumn), and each of those main plots is divided up into six subplots, with a different genotype planted in each one. There are four blocks at each of the locations. For the purposes of this lesson, we will consider each combination of location and year (six in all) to be a unique “environment.” It is crucial that we have replication within each environment, as well as repetition of the experiment in different environments so that we can determine how consistent the relative performance of the genotypes is under the different treatments, as environmental conditions vary.
The layout of the experiment at one of the two locations in one of the three seasons might look like the map below. Notice all plots sharing a sowing time are contiguous because the sowing main plots are split into genotype subplots, and notice that all twelve treatment combinations are present in each block. Also notice the randomized order of both treatments. This same setup with different randomized positions is repeated in each of the two locations in each of the three years.
As in Lesson 1, the example data for this lesson are hosted on
GitHub, or in the data
directory if you are on the cloud
server.
filename csvFile url "https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/fava.csv";
proc import datafile = csvFile out = fava replace dbms = csv; run;
If you are running this on the cloud server:
fava <- read_csv('data/fava.csv')
Rows: 288 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): Location, Year, Sowing, Genotype
dbl (2): Block, Yield
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
If you are running this on your own machine:
fava <- read_csv('https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/fava.csv')
In SAS, we would use proc print
and
proc contents
to examine our dataset. In R tidyverse, we
can type the name of the dataframe into the console to see the first few
rows and print the data type of each column. The function
glimpse()
also provides a concise summary.
proc print data = fava(obs = 10); run;
ods select variables;
proc contents data = fava; run;
Obs | Location | Year | Sowing | Genotype | Block | Yield |
---|---|---|---|---|---|---|
1 | papiano | 2002-2003 | autumn | Chiaro | 1 | 2.05 |
2 | papiano | 2002-2003 | autumn | Chiaro | 2 | 2.5 |
3 | papiano | 2002-2003 | autumn | Chiaro | 3 | 2.64 |
4 | papiano | 2002-2003 | autumn | Chiaro | 4 | 2.45 |
5 | papiano | 2002-2003 | autumn | Collameno | 1 | 2.01 |
6 | papiano | 2002-2003 | autumn | Collameno | 2 | 2.19 |
7 | papiano | 2002-2003 | autumn | Collameno | 3 | 2.38 |
8 | papiano | 2002-2003 | autumn | Collameno | 4 | 2.35 |
9 | papiano | 2002-2003 | autumn | Palomb | 2 | 2.38 |
10 | papiano | 2002-2003 | autumn | Palomb | 1 | 2.42 |
Alphabetic List of Variables and Attributes | |||||
---|---|---|---|---|---|
# | Variable | Type | Len | Format | Informat |
5 | Block | Num | 8 | BEST12. | BEST32. |
4 | Genotype | Char | 11 | $11. | $11. |
1 | Location | Char | 9 | $9. | $9. |
3 | Sowing | Char | 8 | $8. | $8. |
2 | Year | Char | 11 | $11. | $11. |
6 | Yield | Num | 8 | BEST12. | BEST32. |
fava
# A tibble: 288 × 6
Location Year Sowing Genotype Block Yield
<chr> <chr> <chr> <chr> <dbl> <dbl>
1 papiano 2002-2003 autumn Chiaro 1 2.05
2 papiano 2002-2003 autumn Chiaro 2 2.5
3 papiano 2002-2003 autumn Chiaro 3 2.64
4 papiano 2002-2003 autumn Chiaro 4 2.45
5 papiano 2002-2003 autumn Collameno 1 2.01
6 papiano 2002-2003 autumn Collameno 2 2.19
7 papiano 2002-2003 autumn Collameno 3 2.38
8 papiano 2002-2003 autumn Collameno 4 2.35
9 papiano 2002-2003 autumn Palomb 2 2.38
10 papiano 2002-2003 autumn Palomb 1 2.42
# ℹ 278 more rows
glimpse(fava)
Rows: 288
Columns: 6
$ Location <chr> "papiano", "papiano", "papiano", "papiano", "papiano", "papia…
$ Year <chr> "2002-2003", "2002-2003", "2002-2003", "2002-2003", "2002-200…
$ Sowing <chr> "autumn", "autumn", "autumn", "autumn", "autumn", "autumn", "…
$ Genotype <chr> "Chiaro", "Chiaro", "Chiaro", "Chiaro", "Collameno", "Collame…
$ Block <dbl> 1, 2, 3, 4, 1, 2, 3, 4, 2, 1, 4, 3, 2, 1, 4, 3, 2, 1, 4, 3, 2…
$ Yield <dbl> 2.05, 2.50, 2.64, 2.45, 2.01, 2.19, 2.38, 2.35, 2.38, 2.42, 2…
Let’s go ahead and convert the Location, Year, Sowing, Genotype, and
Block columns to factors because they are all discrete. This isn’t
strictly necessary for the ones other than Block because R will
automatically convert the character variables to factors when fitting
the models, but it is a good practice. We only need to do this in R
right now because in SAS this is done using the class
statement in the modeling proc.
Also, let’s create a new column called Environment
which
is the combination of Location
and Year
,
because we’re considering location-year combinations to be unique
environments for this analysis. The interaction
function
combines two or more factors to produce a new factor. We’ll do this
in both SAS and R. In SAS it’s done with a data
step.
data fava; set fava;
Environment = catx('_', Location, Year);
run;
The mutate()
function is used to transform existing
columns or create new ones. We do both here. Note that we assign the
result back to fava
, which “overwrites” the dataframe with
our modified one. The pipe operator %>%
is used to put
each data-manipulation operation on its own line for better code
readability.
fava <- fava %>%
mutate(
across(Location:Block, factor),
Environment = interaction(Location, Year)
)
Notice the across()
function is used to apply the same
function, factor()
, to many columns. We can use the
:
shorthand to apply the function to a block of contiguous
columns.
Let’s take a look at the data. In SAS we can use
proc sgplot
with a vbox
statement to make
boxplots, or proc sgpanel
if we want multiple panels. In R
with the ggplot2 package, we use
geom_boxplot()
to draw boxplots, the group
aesthetic mapping to make multiple boxplots split up by a grouping
variable, and facet_grid()
to make a grid of panels split
up by additional variables, each with their own boxplots.
proc sgplot data = fava;
vbox Yield / group = Genotype;
run;
proc sgplot data = fava;
vbox Yield / category=Genotype group=Sowing groupdisplay=cluster;
run;
proc sgpanel data = fava;
panelby Environment;
vbox Yield / category=Genotype group=Sowing groupdisplay=cluster;
run;
ggplot(fava, aes(x = Genotype, y = Yield)) +
geom_boxplot()
ggplot(fava, aes(x = Genotype, y = Yield, fill = Sowing, group = interaction(Genotype, Sowing))) +
geom_boxplot()
ggplot(fava, aes(x = Genotype, y = Yield, fill = Sowing, group = interaction(Genotype, Sowing))) +
geom_boxplot() +
facet_grid(Location ~ Year)
PROTIP: I won’t get into plots that much in this lesson, but check out my ggplot2 lesson to learn more.
The first plot shows that when we lump all the data together, the genotypes’ yield distributions are pretty similar. But we have to disaggregate things a little bit to start to see patterns. First, when we split the boxplots for each genotype up by sowing treatment, we can see a consistent pattern where autumn sowing leads to increased yield, with roughly the same effect in all six genotypes. The final plot, “faceted” by location and year, starts to really get interesting when we drill down (real datasets are like that)! We see that the sowing time effect is present in 2001-02 and 2002-03, but seems to be much reduced in 2003-04 in both locations, maybe due to climate variation. Yield was much lower in 2002-03 for both locations. There seems to be a hint of genotype by environment interaction as well.
All of this should make it clear how important it is to repeat experiments under different environmental conditions, and show us that our statistical model will have to account for interactions between genotype and treatment, and for effects that vary between environments.
When we are fitting models, a good way to proceed is to start simple and gradually build up to the final model we want. In this case, let’s start by subsetting the data to one environment only, and see if we can successfully fit a model.
We’ll start even simpler by only fitting the block model (random effects only) and ignoring the treatments for now.
Here, we use a data
step to subset a single environment
(Papiano in 2002-03), then fit a model. For all the SAS model-fitting
code in this lesson, proc glimmix
is a good option.
NOTE: By default,
proc glimmix
produces a lot of output. I have selectively displayed only a little of that output for comparison purposes, usingods select
statements.
data papiano0203; set fava;
where Location = 'papiano' & Year = '2002-2003';
run;
proc glimmix data = papiano0203;
class Environment Block Sowing Genotype;
model Yield = / solution;
random Block Sowing(Block) / solution;
ods select covparms parameterestimates tests3 solutionr;
run;
Covariance Parameter Estimates | ||
---|---|---|
Cov Parm | Estimate |
Standard Error |
Block | 0 | . |
Sowing(Block) | 0.4051 | 0.2263 |
Residual | 0.1094 | 0.02445 |
Solutions for Fixed Effects | |||||
---|---|---|---|---|---|
Effect | Estimate |
Standard Error |
DF | t Value | Pr > |t| |
Intercept | 1.7279 | 0.2300 | 3 | 7.51 | 0.0049 |
Solution for Random Effects | |||||||
---|---|---|---|---|---|---|---|
Effect | Sowing | Block | Estimate | Std Err Pred | DF | t Value | Pr > |t| |
Block | 1 | 0 | . | . | . | . | |
Block | 2 | 0 | . | . | . | . | |
Block | 3 | 0 | . | . | . | . | |
Block | 4 | 0 | . | . | . | . | |
Sowing(Block) | autumn | 1 | 0.3465 | 0.2567 | 40 | 1.35 | 0.1847 |
Sowing(Block) | spring | 1 | -0.6551 | 0.2567 | 40 | -2.55 | 0.0146 |
Sowing(Block) | autumn | 2 | 0.5522 | 0.2567 | 40 | 2.15 | 0.0376 |
Sowing(Block) | spring | 2 | -0.5562 | 0.2567 | 40 | -2.17 | 0.0363 |
Sowing(Block) | autumn | 3 | 0.6655 | 0.2567 | 40 | 2.59 | 0.0133 |
Sowing(Block) | spring | 3 | -0.5052 | 0.2567 | 40 | -1.97 | 0.0560 |
Sowing(Block) | autumn | 4 | 0.7245 | 0.2567 | 40 | 2.82 | 0.0074 |
Sowing(Block) | spring | 4 | -0.5722 | 0.2567 | 40 | -2.23 | 0.0315 |
We will use the lmer()
function from the
lme4 package to fit the model in R. The model formula
syntax is a little different than in SAS.
Yield
) is on the
left side of the formula.~
separates the dependent from the independent
variables.+
.1
).|
.1
on the design side means only fit
random intercepts and no random slopes.(1 | Block)
, we are saying each block will
get its own random intercept.(1 | Sowing:Block)
, we are saying each main
plot (defined by unique combinations of Sowing and Block) will get its
own random intercept.The data
argument specifies the data frame that has the
variables in the formula. Here we’re also using a subset
argument to specify we are only fitting the model to the rows of the
dataframe fava
where Location
is equal to the
string 'papiano'
and Year
is equal to the
string '2002-2003'
(multiple logical conditions are
separated with &
).
fit_1env <- lmer(Yield ~ 1 + (1 | Block) + (1 | Sowing:Block),
data = fava,
subset = Location == 'papiano' & Year == '2002-2003')
Next let’s look at some of the output.
summary(fit_1env)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: Yield ~ 1 + (1 | Block) + (1 | Sowing:Block)
Data: fava
Subset: Location == "papiano" & Year == "2002-2003"
REML criterion at convergence: 55.3
Scaled residuals:
Min 1Q Median 3Q Max
-1.3438 -0.6279 -0.2337 0.6757 2.1901
Random effects:
Groups Name Variance Std.Dev.
Sowing:Block (Intercept) 4.051e-01 6.365e-01
Block (Intercept) 9.150e-09 9.566e-05
Residual 1.094e-01 3.307e-01
Number of obs: 48, groups: Sowing:Block, 8; Block, 4
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 1.728 0.230 7.000 7.511 0.000136 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ranef(fit_1env)
$`Sowing:Block`
(Intercept)
autumn:1 0.3464950
autumn:2 0.5522388
autumn:3 0.6654777
autumn:4 0.7244895
spring:1 -0.6551107
spring:2 -0.5562261
spring:3 -0.5051889
spring:4 -0.5721752
$Block
(Intercept)
1 -6.969903e-09
2 -9.005043e-11
3 3.620027e-09
4 3.439927e-09
with conditional variances for "Sowing:Block" "Block"
icc(fit_1env, by_group = TRUE)
# ICC by Group
Group | ICC
------------------------
Sowing:Block | 0.787
Block | 1.778e-08
The summary()
function gives you a summary of the
residuals, the variance of the different random effects, and the fixed
effects, very similar to what you get from proc glimmix
output. The random effects are block (Block
), main plot
(Sowing:Block
), and subplot (Residual
because
each data point is a subplot, so the model residuals are the errors for
each individual subplot). The fixed effect is the global intercept.
In R, ranef()
is the function that displays the random
effects. This is equivalent to adding the solution
option
to the random
statement in SAS. When we call
ranef()
, we see that some main plots have higher yield than
others. However the random intercepts for blocks are very close to zero.
This indicates that most of the variation is between main plots, with
very little systematic differences between the blocks.
In R the icc()
function provides the intraclass
correlation coefficient separated by random effect term. We see that
around 80% of the variation is at the main plot level at this particular
environment (we haven’t included any fixed effects yet so this is just
telling us what level the variation is found at). Almost no additional
variation is explained by the block term (as we could have guessed from
looking at ranef()
output). The remaining 20% is residual
variation at the subplot level. Because every data point is an
individual subplot, it is not needed to put a special subplot term into
the model. The model residuals, after accounting for any fixed effects,
the block term, and the main plot term, represent the subplot error.
PROTIP: It is fairly straightforward to calculate the ICC in SAS by extracting the variance components from the model and taking ratios, but unfortunately there is no statement in PROC GLIMMIX that does it directly. Thus I didn’t include the ICC in the SAS results here. You can obtain the value 0.787 for the ICC of main plot using the ratio of the variance due to main plot to the total variance:
0.4051 / (0.4051 + 0.1094) = 0.787
. The ICC of block is effectively 0.
Let’s add some fixed effects for genotype, sowing treatment, and
their interaction. The fixed effect portion of the model formula in R
can be written as Genotype + Sowing + Genotype:Sowing
,
where the :
indicates an interaction. Or it can be written
using the shorthand Genotype * Sowing
, where *
indicates all combinations of main effects and interaction terms. The
:
is equivalent to *
in SAS, and the
*
is equivalent to |
(a little confusing)!
proc glimmix data = papiano0203;
class Environment Block Sowing Genotype;
model Yield = Genotype|Sowing / solution;
random Block Sowing(Block) / solution;
ods select covparms parameterestimates tests3 solutionr;
run;
Covariance Parameter Estimates | ||
---|---|---|
Cov Parm | Estimate |
Standard Error |
Block | 0.008972 | 0.01123 |
Sowing(Block) | 0.003253 | 0.006894 |
Residual | 0.03023 | 0.007805 |
Solutions for Fixed Effects | |||||||
---|---|---|---|---|---|---|---|
Effect | Sowing | Genotype | Estimate |
Standard Error |
DF | t Value | Pr > |t| |
Intercept | 0.9375 | 0.1030 | 3 | 9.10 | 0.0028 | ||
Genotype | Chiaro | 0.4675 | 0.1229 | 30 | 3.80 | 0.0007 | |
Genotype | Collameno | 0.04750 | 0.1229 | 30 | 0.39 | 0.7020 | |
Genotype | Palomb | -0.05500 | 0.1229 | 30 | -0.45 | 0.6578 | |
Genotype | Scuro | 0.8100 | 0.1229 | 30 | 6.59 | <.0001 | |
Genotype | Sicania | -0.1150 | 0.1229 | 30 | -0.94 | 0.3570 | |
Genotype | Vesuvio | 0 | . | . | . | . | |
Sowing | autumn | 1.3275 | 0.1294 | 3 | 10.26 | 0.0020 | |
Sowing | spring | 0 | . | . | . | . | |
Sowing*Genotype | autumn | Chiaro | -0.3225 | 0.1739 | 30 | -1.85 | 0.0735 |
Sowing*Genotype | autumn | Collameno | -0.08000 | 0.1739 | 30 | -0.46 | 0.6487 |
Sowing*Genotype | autumn | Palomb | 0.4750 | 0.1739 | 30 | 2.73 | 0.0104 |
Sowing*Genotype | autumn | Scuro | -0.9475 | 0.1739 | 30 | -5.45 | <.0001 |
Sowing*Genotype | autumn | Sicania | 0.08500 | 0.1739 | 30 | 0.49 | 0.6285 |
Sowing*Genotype | autumn | Vesuvio | 0 | . | . | . | . |
Sowing*Genotype | spring | Chiaro | 0 | . | . | . | . |
Sowing*Genotype | spring | Collameno | 0 | . | . | . | . |
Sowing*Genotype | spring | Palomb | 0 | . | . | . | . |
Sowing*Genotype | spring | Scuro | 0 | . | . | . | . |
Sowing*Genotype | spring | Sicania | 0 | . | . | . | . |
Sowing*Genotype | spring | Vesuvio | 0 | . | . | . | . |
Type III Tests of Fixed Effects | ||||
---|---|---|---|---|
Effect | Num DF | Den DF | F Value | Pr > F |
Genotype | 5 | 30 | 7.90 | <.0001 |
Sowing | 1 | 3 | 344.95 | 0.0003 |
Sowing*Genotype | 5 | 30 | 15.05 | <.0001 |
Solution for Random Effects | |||||||
---|---|---|---|---|---|---|---|
Effect | Sowing | Block | Estimate | Std Err Pred | DF | t Value | Pr > |t| |
Block | 1 | -0.1103 | 0.06610 | 30 | -1.67 | 0.1056 | |
Block | 2 | -0.00142 | 0.06610 | 30 | -0.02 | 0.9829 | |
Block | 3 | 0.05728 | 0.06610 | 30 | 0.87 | 0.3931 | |
Block | 4 | 0.05443 | 0.06610 | 30 | 0.82 | 0.4168 | |
Sowing(Block) | autumn | 1 | -0.04926 | 0.05122 | 30 | -0.96 | 0.3439 |
Sowing(Block) | spring | 1 | 0.009269 | 0.05122 | 30 | 0.18 | 0.8576 |
Sowing(Block) | autumn | 2 | -0.00761 | 0.05122 | 30 | -0.15 | 0.8828 |
Sowing(Block) | spring | 2 | 0.007098 | 0.05122 | 30 | 0.14 | 0.8907 |
Sowing(Block) | autumn | 3 | 0.01578 | 0.05122 | 30 | 0.31 | 0.7601 |
Sowing(Block) | spring | 3 | 0.004990 | 0.05122 | 30 | 0.10 | 0.9230 |
Sowing(Block) | autumn | 4 | 0.04109 | 0.05122 | 30 | 0.80 | 0.4287 |
Sowing(Block) | spring | 4 | -0.02136 | 0.05122 | 30 | -0.42 | 0.6797 |
fit_1env_fixef <- lmer(Yield ~ Genotype * Sowing + (1 | Block) + (1 | Sowing:Block),
data = fava,
subset = Location == 'papiano' & Year == '2002-2003')
summary(fit_1env_fixef)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: Yield ~ Genotype * Sowing + (1 | Block) + (1 | Sowing:Block)
Data: fava
Subset: Location == "papiano" & Year == "2002-2003"
REML criterion at convergence: -0.7
Scaled residuals:
Min 1Q Median 3Q Max
-1.94377 -0.48885 0.07059 0.49097 1.63660
Random effects:
Groups Name Variance Std.Dev.
Sowing:Block (Intercept) 0.003253 0.05703
Block (Intercept) 0.008973 0.09472
Residual 0.030229 0.17386
Number of obs: 48, groups: Sowing:Block, 8; Block, 4
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 2.4100 0.1030 21.3948 23.393 < 2e-16 ***
GenotypeCollameno -0.1775 0.1229 30.0000 -1.444 0.15916
GenotypePalomb 0.2750 0.1229 30.0000 2.237 0.03288 *
GenotypeScuro -0.2825 0.1229 30.0000 -2.298 0.02872 *
GenotypeSicania -0.1750 0.1229 30.0000 -1.423 0.16492
GenotypeVesuvio -0.1450 0.1229 30.0000 -1.179 0.24750
Sowingspring -1.0050 0.1294 25.4398 -7.767 3.56e-08 ***
GenotypeCollameno:Sowingspring -0.2425 0.1739 30.0000 -1.395 0.17333
GenotypePalomb:Sowingspring -0.7975 0.1739 30.0000 -4.587 7.46e-05 ***
GenotypeScuro:Sowingspring 0.6250 0.1739 30.0000 3.595 0.00115 **
GenotypeSicania:Sowingspring -0.4075 0.1739 30.0000 -2.344 0.02591 *
GenotypeVesuvio:Sowingspring -0.3225 0.1739 30.0000 -1.855 0.07346 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) GntypC GntypP GntypScr GntypScn GntypV Swngsp GntC:S GntP:S
GentypCllmn -0.597
GenotypPlmb -0.597 0.500
GenotypeScr -0.597 0.500 0.500
GenotypeScn -0.597 0.500 0.500 0.500
GenotypeVsv -0.597 0.500 0.500 0.500 0.500
Sowingsprng -0.628 0.475 0.475 0.475 0.475 0.475
GntypCllm:S 0.422 -0.707 -0.354 -0.354 -0.354 -0.354 -0.672
GntypPlmb:S 0.422 -0.354 -0.707 -0.354 -0.354 -0.354 -0.672 0.500
GntypScr:Sw 0.422 -0.354 -0.354 -0.707 -0.354 -0.354 -0.672 0.500 0.500
GntypScn:Sw 0.422 -0.354 -0.354 -0.354 -0.707 -0.354 -0.672 0.500 0.500
GntypVsv:Sw 0.422 -0.354 -0.354 -0.354 -0.354 -0.707 -0.672 0.500 0.500
GntypScr:S GntypScn:S
GentypCllmn
GenotypPlmb
GenotypeScr
GenotypeScn
GenotypeVsv
Sowingsprng
GntypCllm:S
GntypPlmb:S
GntypScr:Sw
GntypScn:Sw 0.500
GntypVsv:Sw 0.500 0.500
ranef(fit_1env_fixef)
$`Sowing:Block`
(Intercept)
autumn:1 -0.049255030
autumn:2 -0.007614710
autumn:3 0.015778601
autumn:4 0.041091139
spring:1 0.009269246
spring:2 0.007098097
spring:3 0.004989209
spring:4 -0.021356553
$Block
(Intercept)
1 -0.110292120
2 -0.001424963
3 0.057283504
4 0.054433579
with conditional variances for "Sowing:Block" "Block"
icc(fit_1env_fixef, by_group = TRUE)
# ICC by Group
Group | ICC
--------------------
Sowing:Block | 0.077
Block | 0.211
We have a lot more output to wade through in the summary because we now have a lot of fixed effects. There are 5 coefficients for genotype because there are 6 genotypes, and the first one is the reference level (intercept). There is 1 coefficient for sowing treatment to differentiate spring from the reference level of autumn. Finally, there are 5 interaction coefficients. The summary also shows us the correlation between the fixed effects, which in this case we can safely ignore because the fixed effects were experimentally randomized. You might look at those for an observational study.
We now see completely different ICC results. We are not surprised
that the ICCs are much lower because we’ve added fixed effects that
explain a lot of the variation. They add up to about 29% of the
variation now. However, we see that now that we have accounted for
genotype and sowing as fixed effects, the block term is the larger
contributor to the random variation relative to main plot
(Block:Sowing
). This makes sense because the sowing effect
seems to be large and when we did not include sowing in the model as a
fixed effect, all that variation was assigned to main plot. Now that
sowing is included, the main plot random variation is very low.
PROTIP: As before, we can get the ICC result from SAS by working directly with the variance components and getting the ratio of the variance due to each level of random effect to the total variance. ICC of main plot is
.003253 / (.008972 + .003253 + 0.03023) = 0.077
, and ICC of block is.008972 / (.008972 + .003253 + 0.03023) = 0.211
.
PROTIP 2: If you really want an ANOVA table for your mixed model in R, as long as you fit the model with the lmerTest package loaded, you can use
anova(fit)
to get an ANOVA table, wherefit
is the name of your fitted model object. You can specify the method of approximating degrees of freedom:anova(fit, ddf = 'Kenward-Roger')
So far so good, but all of that was only a model fit to one environment. We want to fit a model to all the data put together so that we can see whether there are consistent effects of genotype and sowing treatment on yield, that hold regardless of environmental conditions.
Should environment be fixed or random? Answer: It depends. Here, we are going to treat environment (combination of location and year) as a random effect. This means we are considering the environments at which the study was conducted to be samples from a population of potential environments. We want to make an inference about whether the effects we observe are consistent across that whole hypothetical population. We don’t necessarily care that much about specifically saying whether environment A is different from environment B. If we cared more about that, we could consider treating environment as a fixed effect in the analysis.
This means that we need to add more random intercept terms. In the previous model, we modeled all observations from the same block and from the same main plot within each block as being correlated with one another. We still want to do that, but we are going to add another level of nesting. We also want to model all observations from the same environment as being correlated with one another, as well as from the same block within environment, and the same main plot within block within environment. We are really putting the “multi” into “multilevel model!”
proc glimmix data = fava;
class Environment Block Sowing Genotype;
model Yield = Genotype|Sowing / solution;
random Environment Block(Environment) Sowing(Block Environment) / solution;
ods select parameterestimates tests3;
run;
Solutions for Fixed Effects | |||||||
---|---|---|---|---|---|---|---|
Effect | Sowing | Genotype | Estimate |
Standard Error |
DF | t Value | Pr > |t| |
Intercept | 2.4146 | 0.4770 | 5 | 5.06 | 0.0039 | ||
Genotype | Chiaro | 0.4333 | 0.1170 | 230 | 3.70 | 0.0003 | |
Genotype | Collameno | -0.05167 | 0.1170 | 230 | -0.44 | 0.6592 | |
Genotype | Palomb | 0.02625 | 0.1170 | 230 | 0.22 | 0.8227 | |
Genotype | Scuro | 0.6850 | 0.1170 | 230 | 5.85 | <.0001 | |
Genotype | Sicania | 0.3779 | 0.1170 | 230 | 3.23 | 0.0014 | |
Genotype | Vesuvio | 0 | . | . | . | . | |
Sowing | autumn | 1.1308 | 0.1367 | 23 | 8.27 | <.0001 | |
Sowing | spring | 0 | . | . | . | . | |
Sowing*Genotype | autumn | Chiaro | -0.4017 | 0.1655 | 230 | -2.43 | 0.0160 |
Sowing*Genotype | autumn | Collameno | -0.1775 | 0.1655 | 230 | -1.07 | 0.2845 |
Sowing*Genotype | autumn | Palomb | 0.1825 | 0.1655 | 230 | 1.10 | 0.2712 |
Sowing*Genotype | autumn | Scuro | -0.5933 | 0.1655 | 230 | -3.59 | 0.0004 |
Sowing*Genotype | autumn | Sicania | -0.4725 | 0.1655 | 230 | -2.86 | 0.0047 |
Sowing*Genotype | autumn | Vesuvio | 0 | . | . | . | . |
Sowing*Genotype | spring | Chiaro | 0 | . | . | . | . |
Sowing*Genotype | spring | Collameno | 0 | . | . | . | . |
Sowing*Genotype | spring | Palomb | 0 | . | . | . | . |
Sowing*Genotype | spring | Scuro | 0 | . | . | . | . |
Sowing*Genotype | spring | Sicania | 0 | . | . | . | . |
Sowing*Genotype | spring | Vesuvio | 0 | . | . | . | . |
Type III Tests of Fixed Effects | ||||
---|---|---|---|---|
Effect | Num DF | Den DF | F Value | Pr > F |
Genotype | 5 | 230 | 9.78 | <.0001 |
Sowing | 1 | 23 | 107.93 | <.0001 |
Sowing*Genotype | 5 | 230 | 6.50 | <.0001 |
We remove the subset
argument to fit the model to the
full dataset.
fit_allenv <- lmer(Yield ~ Genotype * Sowing + (1 | Environment) + (1 | Block:Environment) + (1 | Sowing:Block:Environment),
data = fava)
boundary (singular) fit: see help('isSingular')
We get a message that the fit is singular. What does this mean? Some components of the variance-covariance matrix of the random effects are either exactly zero or exactly one. OK, what about in English? Basically it means that the algorithm that fits the model parameters doesn’t have enough data to get a good estimate. This often happens when we are trying to fit a model that is too complex for the amount of data we have, or when the random effects are very small and can’t be distinguished from zero. The algorithm fails to estimate the random effects and return values of zero. Or in other cases, some of the random effects, or combinations of random effects, will be perfectly correlated with one another. This often happens when you are trying to fit a model with complex nested structure to a fairly small dataset … just like we are doing here!
To learn more about dealing with this issue, check out this article by Ben Bolker. Warning: the language is a little technical.
Returning to our model, let’s look at the variance-covariance matrix of the random effects to see why the fit is singular.
VarCorr(fit_allenv)
Groups Name Std.Dev.
Sowing:Block:Environment (Intercept) 0.24525
Block:Environment (Intercept) 0.00000
Environment (Intercept) 1.14372
Residual 0.40529
We can see that the blocks within each environment have zero variance. This shows that the effect is so small that the algorithm could not estimate it. We can refit the model with that effect removed, and only random effects for environment and main plot within block within environment.
proc glimmix data = fava plots = residualpanel;
class Environment Block Sowing Genotype;
model Yield = Genotype|Sowing / solution;
random Environment Sowing(Block Environment) / solution;
ods select parameterestimates tests3;
run;
Solutions for Fixed Effects | |||||||
---|---|---|---|---|---|---|---|
Effect | Sowing | Genotype | Estimate |
Standard Error |
DF | t Value | Pr > |t| |
Intercept | 2.4146 | 0.4768 | 5 | 5.06 | 0.0039 | ||
Genotype | Chiaro | 0.4333 | 0.1170 | 230 | 3.70 | 0.0003 | |
Genotype | Collameno | -0.05167 | 0.1170 | 230 | -0.44 | 0.6592 | |
Genotype | Palomb | 0.02625 | 0.1170 | 230 | 0.22 | 0.8227 | |
Genotype | Scuro | 0.6850 | 0.1170 | 230 | 5.85 | <.0001 | |
Genotype | Sicania | 0.3779 | 0.1170 | 230 | 3.23 | 0.0014 | |
Genotype | Vesuvio | 0 | . | . | . | . | |
Sowing | autumn | 1.1308 | 0.1367 | 41 | 8.27 | <.0001 | |
Sowing | spring | 0 | . | . | . | . | |
Sowing*Genotype | autumn | Chiaro | -0.4017 | 0.1655 | 230 | -2.43 | 0.0160 |
Sowing*Genotype | autumn | Collameno | -0.1775 | 0.1655 | 230 | -1.07 | 0.2845 |
Sowing*Genotype | autumn | Palomb | 0.1825 | 0.1655 | 230 | 1.10 | 0.2712 |
Sowing*Genotype | autumn | Scuro | -0.5933 | 0.1655 | 230 | -3.59 | 0.0004 |
Sowing*Genotype | autumn | Sicania | -0.4725 | 0.1655 | 230 | -2.86 | 0.0047 |
Sowing*Genotype | autumn | Vesuvio | 0 | . | . | . | . |
Sowing*Genotype | spring | Chiaro | 0 | . | . | . | . |
Sowing*Genotype | spring | Collameno | 0 | . | . | . | . |
Sowing*Genotype | spring | Palomb | 0 | . | . | . | . |
Sowing*Genotype | spring | Scuro | 0 | . | . | . | . |
Sowing*Genotype | spring | Sicania | 0 | . | . | . | . |
Sowing*Genotype | spring | Vesuvio | 0 | . | . | . | . |
Type III Tests of Fixed Effects | ||||
---|---|---|---|---|
Effect | Num DF | Den DF | F Value | Pr > F |
Genotype | 5 | 230 | 9.78 | <.0001 |
Sowing | 1 | 41 | 107.89 | <.0001 |
Sowing*Genotype | 5 | 230 | 6.50 | <.0001 |
fit_allenv_noblock <- lmer(Yield ~ Genotype * Sowing + (1 | Environment) + (1 | Sowing:Block:Environment),
data = fava)
We no longer get the singular fit message.
Now that we have a good model, we can look at the model diagnostics.
This was done in SAS by adding the option
plots = residualpanel
. We will use
check_model()
from the easystats package
in R, as we did in the previous lesson. Specify we want to check
linearity and homogeneity of the residuals, and show a normal
quantile-quantile plot of the residuals.
check_model(fit_allenv_noblock, check = c('linearity', 'homogeneity', 'qq'))
All of these diagnostic plots look great. There is little to no trend in the residual variance. The individual data points’ residuals fall along the straight line of the normal Q-Q plot.
One other thing we may want to include in our model is to allow one or more of the treatment effects to vary by environment. In the model we just fit, we only included random intercepts. This means that the “baseline” yield is allowed to vary between environments, but we are assuming a single effect size of treatment and genotype that holds across all environments. We can relax that constraint using “random slopes.” We can allow one or more of the effects to vary by environment. We could even have random slopes at the block or main plot level, but not only would that be way too complex of a model for our small dataset, it also doesn’t seem as likely to be a large effect. So we will stick with testing random slope terms by environment.
We can start with a complex model that includes random slope terms
for both sowing treatment and genotype, with respect to environment. In
the lme4 package in R, we can do this by adding terms
on the design side of each random effect in the model formula.
In this case, we want not only to have a different intercept for each
environment, but also to have a different slope for sowing and for
genotype. So the Environment
term in the formula will be
(1 + Genotype + Sowing | Environment)
.
proc glimmix data = fava plots = residualpanel;
class Environment Block Sowing Genotype;
model Yield = Genotype|Sowing / solution;
random intercept Genotype Sowing / subject = Environment;
random intercept / subject = Sowing(Block Environment);
ods select parameterestimates tests3;
run;
Solutions for Fixed Effects | |||||||
---|---|---|---|---|---|---|---|
Effect | Sowing | Genotype | Estimate |
Standard Error |
DF | t Value | Pr > |t| |
Intercept | 2.4146 | 0.4881 | 5 | 4.95 | 0.0043 | ||
Genotype | Chiaro | 0.4333 | 0.1467 | 25 | 2.95 | 0.0068 | |
Genotype | Collameno | -0.05167 | 0.1467 | 25 | -0.35 | 0.7277 | |
Genotype | Palomb | 0.02625 | 0.1467 | 25 | 0.18 | 0.8595 | |
Genotype | Scuro | 0.6850 | 0.1467 | 25 | 4.67 | <.0001 | |
Genotype | Sicania | 0.3779 | 0.1467 | 25 | 2.58 | 0.0163 | |
Genotype | Vesuvio | 0 | . | . | . | . | |
Sowing | autumn | 1.1308 | 0.2176 | 5 | 5.20 | 0.0035 | |
Sowing | spring | 0 | . | . | . | . | |
Sowing*Genotype | autumn | Chiaro | -0.4017 | 0.1517 | 205 | -2.65 | 0.0087 |
Sowing*Genotype | autumn | Collameno | -0.1775 | 0.1517 | 205 | -1.17 | 0.2434 |
Sowing*Genotype | autumn | Palomb | 0.1825 | 0.1517 | 205 | 1.20 | 0.2304 |
Sowing*Genotype | autumn | Scuro | -0.5933 | 0.1517 | 205 | -3.91 | 0.0001 |
Sowing*Genotype | autumn | Sicania | -0.4725 | 0.1517 | 205 | -3.11 | 0.0021 |
Sowing*Genotype | autumn | Vesuvio | 0 | . | . | . | . |
Sowing*Genotype | spring | Chiaro | 0 | . | . | . | . |
Sowing*Genotype | spring | Collameno | 0 | . | . | . | . |
Sowing*Genotype | spring | Palomb | 0 | . | . | . | . |
Sowing*Genotype | spring | Scuro | 0 | . | . | . | . |
Sowing*Genotype | spring | Sicania | 0 | . | . | . | . |
Sowing*Genotype | spring | Vesuvio | 0 | . | . | . | . |
Type III Tests of Fixed Effects | ||||
---|---|---|---|---|
Effect | Num DF | Den DF | F Value | Pr > F |
Genotype | 5 | 25 | 4.24 | 0.0063 |
Sowing | 1 | 5 | 20.84 | 0.0060 |
Sowing*Genotype | 5 | 205 | 7.73 | <.0001 |
fit_allenv_allrandomslopes <- lmer(Yield ~ Genotype * Sowing + (1 + Genotype + Sowing | Environment) + (1 | Sowing:Block:Environment),
data = fava)
boundary (singular) fit: see help('isSingular')
Again, we get a singular fit message. Very likely, we are trying to fit a too-complex model to this small dataset.
VarCorr(fit_allenv_allrandomslopes)
Groups Name Std.Dev. Corr
Sowing:Block:Environment (Intercept) 0.119970
Environment (Intercept) 0.981421
GenotypeCollameno 0.314009 -0.107
GenotypePalomb 0.245292 0.392 -0.144
GenotypeScuro 0.168246 0.519 -0.810 0.641
GenotypeSicania 0.223958 0.920 -0.442 0.229
GenotypeVesuvio 0.071028 0.390 -0.481 0.937
Sowingspring 0.459510 0.325 0.876 0.250
Residual 0.359447
0.669
0.855 0.364
-0.433 -0.060 -0.086
The variance-covariance matrix of the random effects doesn’t show any that are all zero. However, you can get a singular fit even when none of the individual random effects are zero. It can also occur when linear combinations of some of the random effects are equal to one another.
Let’s simplify by getting rid of the random slope for genotype and only keeping the random slope for sowing.
proc glimmix data = fava plots = residualpanel;
class Environment Block Sowing Genotype;
model Yield = Genotype|Sowing / solution;
random intercept Sowing / subject = Environment solution;
random intercept / subject = Sowing(Block Environment);
ods select parameterestimates tests3;
store fit_allenv_sowingrandomslope;
run;
Solutions for Fixed Effects | |||||||
---|---|---|---|---|---|---|---|
Effect | Sowing | Genotype | Estimate |
Standard Error |
DF | t Value | Pr > |t| |
Intercept | 2.4146 | 0.4847 | 5 | 4.98 | 0.0042 | ||
Genotype | Chiaro | 0.4333 | 0.1170 | 230 | 3.70 | 0.0003 | |
Genotype | Collameno | -0.05167 | 0.1170 | 230 | -0.44 | 0.6592 | |
Genotype | Palomb | 0.02625 | 0.1170 | 230 | 0.22 | 0.8227 | |
Genotype | Scuro | 0.6850 | 0.1170 | 230 | 5.85 | <.0001 | |
Genotype | Sicania | 0.3779 | 0.1170 | 230 | 3.23 | 0.0014 | |
Genotype | Vesuvio | 0 | . | . | . | . | |
Sowing | autumn | 1.1308 | 0.2217 | 5 | 5.10 | 0.0038 | |
Sowing | spring | 0 | . | . | . | . | |
Sowing*Genotype | autumn | Chiaro | -0.4017 | 0.1655 | 230 | -2.43 | 0.0160 |
Sowing*Genotype | autumn | Collameno | -0.1775 | 0.1655 | 230 | -1.07 | 0.2845 |
Sowing*Genotype | autumn | Palomb | 0.1825 | 0.1655 | 230 | 1.10 | 0.2712 |
Sowing*Genotype | autumn | Scuro | -0.5933 | 0.1655 | 230 | -3.59 | 0.0004 |
Sowing*Genotype | autumn | Sicania | -0.4725 | 0.1655 | 230 | -2.86 | 0.0047 |
Sowing*Genotype | autumn | Vesuvio | 0 | . | . | . | . |
Sowing*Genotype | spring | Chiaro | 0 | . | . | . | . |
Sowing*Genotype | spring | Collameno | 0 | . | . | . | . |
Sowing*Genotype | spring | Palomb | 0 | . | . | . | . |
Sowing*Genotype | spring | Scuro | 0 | . | . | . | . |
Sowing*Genotype | spring | Sicania | 0 | . | . | . | . |
Sowing*Genotype | spring | Vesuvio | 0 | . | . | . | . |
Type III Tests of Fixed Effects | ||||
---|---|---|---|---|
Effect | Num DF | Den DF | F Value | Pr > F |
Genotype | 5 | 230 | 9.78 | <.0001 |
Sowing | 1 | 5 | 20.84 | 0.0060 |
Sowing*Genotype | 5 | 230 | 6.50 | <.0001 |
fit_allenv_sowingrandomslope <- lmer(Yield ~ Genotype * Sowing + (1 + Sowing | Environment) + (1 | Sowing:Block:Environment),
data = fava)
This gets rid of the singular fit message. Let’s look at the random intercepts and slopes for environment.
ranef(fit_allenv_sowingrandomslope)$Environment
(Intercept) Sowingspring
badiola.2001-2002 0.3427129 -0.45877557
papiano.2001-2002 0.9006396 -0.21349718
badiola.2002-2003 -1.4271618 -0.07956543
papiano.2002-2003 -1.2176835 -0.30602556
badiola.2003-2004 0.6962424 0.38184826
papiano.2003-2004 0.7052505 0.67601548
This shows that the random slope for the spring sowing treatment is positive for the two 2003-2004 environments. When you look back at a plot of the raw data, this makes sense. The effect of spring sowing on yield is much less negative for those two environments. It is probably a good thing to include the sowing random slope term by environment.
Let’s be thorough and take a look at the model diagnostics for the model we are sticking with, that includes the sowing random slope but not genotype.
check_model(fit_allenv_sowingrandomslope, check = c('linearity', 'homogeneity', 'qq'))
We see good residual diagnostics across the board.
Let’s confirm our intuition that the random slope model is a good one
by comparing the models’ Akaike Information Criteria (AIC). In SAS, this
is output by default. In R, we can use the AIC()
function
to compare several models’ AIC at once.
AIC(fit_allenv, fit_allenv_noblock, fit_allenv_allrandomslopes, fit_allenv_sowingrandomslope)
df AIC
fit_allenv 16 432.2706
fit_allenv_noblock 15 430.2706
fit_allenv_allrandomslopes 42 413.9765
fit_allenv_sowingrandomslope 17 409.8349
We see that the lowest AIC (best fit) is the sowing random slope model, followed by the model including both random slopes. The random-intercept-only models are inferior.
PROTIP: AIC is a relative measure. It can only say what model is a better fit to the data out of the ones you are choosing from. It doesn’t say which one is “right.” For one thing, “all models are wrong,” and for another thing, we might be comparing a set of terrible models! Humility is important when you are doing stats!
A great innovation of the statisticians at SAS was the development of
least-square means (lsmeans
). Almost all the SAS model
fitting PROCs have a lsmeans
statement. You can use it to
get estimated values of treatment means at different levels of other
fixed effects, and averaged over the random variation. It gives you
confidence limits using an approximation method to estimate the degrees
of freedom, and lets you do pairwise comparisons with proper p-value
adjustment for multiple comparisons.
Statistician Russ Lenth from the University of Iowa implemented all of this same functionality, and then some, in the R package emmeans (estimated marginal means). Estimated marginal means is synonymous with least-square means (it may be a more appropriate term because least-square method isn’t always used to estimate them).
NOTE: I separated out the SAS code for
lsmeans
into a separateproc plm
which allows you to separate model fitting and post-fitting analyses. That’s a good practice to keep in mind as a SAS user. We had to include astore
statement in our call toproc glimmix
so the model object can be restored and used later inproc plm
.
proc plm restore = fit_allenv_sowingrandomslope;
lsmeans Genotype Sowing / diff linestable adjust = sidak;
slice Genotype * Sowing / sliceby = Sowing diff linestable adjust = sidak;
ods select lsmeans lsmlines slicediffs slicelines;
run;
NOTE: PROCEDURE PLM used (Total process time):
real time 0.96 seconds
cpu time 0.34 seconds
Genotype Least Squares Means | |||||
---|---|---|---|---|---|
Genotype | Estimate | Standard Error | DF | t Value | Pr > |t| |
Chiaro | 3.2125 | 0.4719 | 230 | 6.81 | <.0001 |
Collameno | 2.8396 | 0.4719 | 230 | 6.02 | <.0001 |
Palomb | 3.0975 | 0.4719 | 230 | 6.56 | <.0001 |
Scuro | 3.3683 | 0.4719 | 230 | 7.14 | <.0001 |
Sicania | 3.1217 | 0.4719 | 230 | 6.62 | <.0001 |
Vesuvio | 2.9800 | 0.4719 | 230 | 6.31 | <.0001 |
Sidak Grouping for Genotype Least Squares Means (Alpha=0.05) | |||
---|---|---|---|
LS-means with the same letter are not significantly different. | |||
Genotype | Estimate | ||
Scuro | 3.3683 | A | |
A | |||
Chiaro | 3.2125 | B | A |
B | |||
Sicania | 3.1217 | B | |
B | |||
Palomb | 3.0975 | B | |
B | |||
Vesuvio | 2.9800 | B | C |
C | |||
Collameno | 2.8396 | C |
Sowing Least Squares Means | |||||
---|---|---|---|---|---|
Sowing | Estimate | Standard Error | DF | t Value | Pr > |t| |
autumn | 3.5468 | 0.4788 | 5 | 7.41 | 0.0007 |
spring | 2.6597 | 0.4788 | 5 | 5.55 | 0.0026 |
Sidak Grouping for Sowing Least Squares Means (Alpha=0.05) | ||
---|---|---|
LS-means with the same letter are not significantly different. | ||
Sowing | Estimate | |
autumn | 3.5468 | A |
spring | 2.6597 | B |
Simple Differences of Sowing*Genotype Least Squares Means Adjustment for Multiple Comparisons: Sidak |
||||||||
---|---|---|---|---|---|---|---|---|
Slice | Genotype | _Genotype | Estimate | Standard Error | DF | t Value | Pr > |t| | Adj P |
Sowing autumn | Chiaro | Collameno | 0.2608 | 0.1170 | 230 | 2.23 | 0.0268 | 0.3342 |
Sowing autumn | Chiaro | Palomb | -0.1771 | 0.1170 | 230 | -1.51 | 0.1315 | 0.8794 |
Sowing autumn | Chiaro | Scuro | -0.06000 | 0.1170 | 230 | -0.51 | 0.6086 | 1.0000 |
Sowing autumn | Chiaro | Sicania | 0.1262 | 0.1170 | 230 | 1.08 | 0.2817 | 0.9930 |
Sowing autumn | Chiaro | Vesuvio | 0.03167 | 0.1170 | 230 | 0.27 | 0.7869 | 1.0000 |
Sowing autumn | Collameno | Palomb | -0.4379 | 0.1170 | 230 | -3.74 | 0.0002 | 0.0034 |
Sowing autumn | Collameno | Scuro | -0.3208 | 0.1170 | 230 | -2.74 | 0.0066 | 0.0943 |
Sowing autumn | Collameno | Sicania | -0.1346 | 0.1170 | 230 | -1.15 | 0.2512 | 0.9870 |
Sowing autumn | Collameno | Vesuvio | -0.2292 | 0.1170 | 230 | -1.96 | 0.0514 | 0.5465 |
Sowing autumn | Palomb | Scuro | 0.1171 | 0.1170 | 230 | 1.00 | 0.3180 | 0.9968 |
Sowing autumn | Palomb | Sicania | 0.3033 | 0.1170 | 230 | 2.59 | 0.0101 | 0.1417 |
Sowing autumn | Palomb | Vesuvio | 0.2087 | 0.1170 | 230 | 1.78 | 0.0757 | 0.6930 |
Sowing autumn | Scuro | Sicania | 0.1862 | 0.1170 | 230 | 1.59 | 0.1128 | 0.8338 |
Sowing autumn | Scuro | Vesuvio | 0.09167 | 0.1170 | 230 | 0.78 | 0.4341 | 0.9998 |
Sowing autumn | Sicania | Vesuvio | -0.09458 | 0.1170 | 230 | -0.81 | 0.4197 | 0.9997 |
Sidak Grouping for Sowing*Genotype Least Squares Means Slice (Alpha=0.05) | ||||
---|---|---|---|---|
LS-means with the same letter are not significantly different. | ||||
Slice | Genotype | Estimate | ||
Sowing autumn | Palomb | 3.7542 | A | |
Sowing autumn | A | |||
Sowing autumn | Scuro | 3.6371 | B | A |
Sowing autumn | B | A | ||
Sowing autumn | Chiaro | 3.5771 | B | A |
Sowing autumn | B | A | ||
Sowing autumn | Vesuvio | 3.5454 | B | A |
Sowing autumn | B | A | ||
Sowing autumn | Sicania | 3.4508 | B | A |
Sowing autumn | B | |||
Sowing autumn | Collameno | 3.3162 | B |
Simple Differences of Sowing*Genotype Least Squares Means Adjustment for Multiple Comparisons: Sidak |
||||||||
---|---|---|---|---|---|---|---|---|
Slice | Genotype | _Genotype | Estimate | Standard Error | DF | t Value | Pr > |t| | Adj P |
Sowing spring | Chiaro | Collameno | 0.4850 | 0.1170 | 230 | 4.15 | <.0001 | 0.0007 |
Sowing spring | Chiaro | Palomb | 0.4071 | 0.1170 | 230 | 3.48 | 0.0006 | 0.0090 |
Sowing spring | Chiaro | Scuro | -0.2517 | 0.1170 | 230 | -2.15 | 0.0325 | 0.3909 |
Sowing spring | Chiaro | Sicania | 0.05542 | 0.1170 | 230 | 0.47 | 0.6362 | 1.0000 |
Sowing spring | Chiaro | Vesuvio | 0.4333 | 0.1170 | 230 | 3.70 | 0.0003 | 0.0040 |
Sowing spring | Collameno | Palomb | -0.07792 | 0.1170 | 230 | -0.67 | 0.5061 | 1.0000 |
Sowing spring | Collameno | Scuro | -0.7367 | 0.1170 | 230 | -6.30 | <.0001 | <.0001 |
Sowing spring | Collameno | Sicania | -0.4296 | 0.1170 | 230 | -3.67 | 0.0003 | 0.0045 |
Sowing spring | Collameno | Vesuvio | -0.05167 | 0.1170 | 230 | -0.44 | 0.6592 | 1.0000 |
Sowing spring | Palomb | Scuro | -0.6587 | 0.1170 | 230 | -5.63 | <.0001 | <.0001 |
Sowing spring | Palomb | Sicania | -0.3517 | 0.1170 | 230 | -3.01 | 0.0029 | 0.0432 |
Sowing spring | Palomb | Vesuvio | 0.02625 | 0.1170 | 230 | 0.22 | 0.8227 | 1.0000 |
Sowing spring | Scuro | Sicania | 0.3071 | 0.1170 | 230 | 2.62 | 0.0093 | 0.1302 |
Sowing spring | Scuro | Vesuvio | 0.6850 | 0.1170 | 230 | 5.85 | <.0001 | <.0001 |
Sowing spring | Sicania | Vesuvio | 0.3779 | 0.1170 | 230 | 3.23 | 0.0014 | 0.0211 |
Sidak Grouping for Sowing*Genotype Least Squares Means Slice (Alpha=0.05) | |||
---|---|---|---|
LS-means with the same letter are not significantly different. | |||
Slice | Genotype | Estimate | |
Sowing spring | Scuro | 3.0996 | A |
Sowing spring | A | ||
Sowing spring | Chiaro | 2.8479 | A |
Sowing spring | A | ||
Sowing spring | Sicania | 2.7925 | A |
Sowing spring | |||
Sowing spring | Palomb | 2.4408 | B |
Sowing spring | B | ||
Sowing spring | Vesuvio | 2.4146 | B |
Sowing spring | B | ||
Sowing spring | Collameno | 2.3629 | B |
We provide the emmeans()
function first with the name of
the fitted model object, then a one-sided formula with the fixed effects
that we want to calculate means for on the right-hand side of the
~
. If you want multiple effects crossed, use
+
. If you want to calculate the means for one treatment
within another, use |
for the grouping factor. (The
difference will be clear when we do pairwise contrasts).
emm_geno <- emmeans(fit_allenv_sowingrandomslope, ~ Genotype)
NOTE: Results may be misleading due to involvement in interactions
emm_sowing <- emmeans(fit_allenv_sowingrandomslope, ~ Sowing)
NOTE: Results may be misleading due to involvement in interactions
emm_geno_x_sowing <- emmeans(fit_allenv_sowingrandomslope, ~ Genotype + Sowing)
emm_geno_by_sowing <- emmeans(fit_allenv_sowingrandomslope, ~ Genotype | Sowing)
Type the name of the object to print a handy summary including 95% confidence intervals estimated with the Kenward-Roger method of degrees of freedom approximation.
emm_geno
Genotype emmean SE df lower.CL upper.CL
Chiaro 3.21 0.472 5.13 2.01 4.42
Collameno 2.84 0.472 5.13 1.64 4.04
Palomb 3.10 0.472 5.13 1.89 4.30
Scuro 3.37 0.472 5.13 2.16 4.57
Sicania 3.12 0.472 5.13 1.92 4.33
Vesuvio 2.98 0.472 5.13 1.78 4.18
Results are averaged over the levels of: Sowing
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
emm_sowing
Sowing emmean SE df lower.CL upper.CL
autumn 3.55 0.429 5 2.44 4.65
spring 2.66 0.524 5 1.31 4.01
Results are averaged over the levels of: Genotype
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
emm_geno_x_sowing
Genotype Sowing emmean SE df lower.CL upper.CL
Chiaro autumn 3.58 0.435 5.31 2.48 4.68
Collameno autumn 3.32 0.435 5.31 2.22 4.42
Palomb autumn 3.75 0.435 5.31 2.65 4.85
Scuro autumn 3.64 0.435 5.31 2.54 4.74
Sicania autumn 3.45 0.435 5.31 2.35 4.55
Vesuvio autumn 3.55 0.435 5.31 2.45 4.65
Chiaro spring 2.85 0.529 5.21 1.50 4.19
Collameno spring 2.36 0.529 5.21 1.02 3.71
Palomb spring 2.44 0.529 5.21 1.10 3.79
Scuro spring 3.10 0.529 5.21 1.76 4.44
Sicania spring 2.79 0.529 5.21 1.45 4.14
Vesuvio spring 2.41 0.529 5.21 1.07 3.76
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
emm_geno_by_sowing
Sowing = autumn:
Genotype emmean SE df lower.CL upper.CL
Chiaro 3.58 0.435 5.31 2.48 4.68
Collameno 3.32 0.435 5.31 2.22 4.42
Palomb 3.75 0.435 5.31 2.65 4.85
Scuro 3.64 0.435 5.31 2.54 4.74
Sicania 3.45 0.435 5.31 2.35 4.55
Vesuvio 3.55 0.435 5.31 2.45 4.65
Sowing = spring:
Genotype emmean SE df lower.CL upper.CL
Chiaro 2.85 0.529 5.21 1.50 4.19
Collameno 2.36 0.529 5.21 1.02 3.71
Palomb 2.44 0.529 5.21 1.10 3.79
Scuro 3.10 0.529 5.21 1.76 4.44
Sicania 2.79 0.529 5.21 1.45 4.14
Vesuvio 2.41 0.529 5.21 1.07 3.76
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
Next, do pairwise treatment contrasts. For the ones involving genotype, there is more than one comparison. Thus we adjust the p-values using the Sidak adjustment for multiple comparisons. Feel free to use your preferred method (Tukey, Holm, Bonferroni, etc.)
contrast(emm_geno, 'pairwise', adjust = 'sidak')
contrast estimate SE df t.ratio p.value
Chiaro - Collameno 0.3729 0.0827 230 4.508 0.0002
Chiaro - Palomb 0.1150 0.0827 230 1.390 0.9341
Chiaro - Scuro -0.1558 0.0827 230 -1.884 0.6102
Chiaro - Sicania 0.0908 0.0827 230 1.098 0.9917
Chiaro - Vesuvio 0.2325 0.0827 230 2.810 0.0777
Collameno - Palomb -0.2579 0.0827 230 -3.118 0.0304
Collameno - Scuro -0.5288 0.0827 230 -6.391 <.0001
Collameno - Sicania -0.2821 0.0827 230 -3.410 0.0115
Collameno - Vesuvio -0.1404 0.0827 230 -1.697 0.7609
Palomb - Scuro -0.2708 0.0827 230 -3.274 0.0182
Palomb - Sicania -0.0242 0.0827 230 -0.292 1.0000
Palomb - Vesuvio 0.1175 0.0827 230 1.420 0.9227
Scuro - Sicania 0.2467 0.0827 230 2.982 0.0466
Scuro - Vesuvio 0.3883 0.0827 230 4.694 0.0001
Sicania - Vesuvio 0.1417 0.0827 230 1.712 0.7495
Results are averaged over the levels of: Sowing
Degrees-of-freedom method: kenward-roger
P value adjustment: sidak method for 15 tests
contrast(emm_sowing, 'pairwise')
contrast estimate SE df t.ratio p.value
autumn - spring 0.887 0.194 5 4.565 0.0060
Results are averaged over the levels of: Genotype
Degrees-of-freedom method: kenward-roger
I won’t print the results of the pairwise contrasts for multiple factors here, but try these out to see how they differ.
contrast(emm_geno_x_sowing, 'pairwise', adjust = 'sidak')
contrast(emm_geno_by_sowing, 'pairwise', adjust = 'sidak')
You may also want to get multiple comparison letters to concisely
summarize the significant differences (this is done with the
lines
and/or linestable
options in the
lsmeans
statement in SAS). In R, the
multcomp package has a function called
cld()
, for “compact letter display.” The first argument to
cld()
is an emmeans
object. We also provide
our favorite multiple-comparison adjustment method with the
adjust
argument. The argument
Letters = letters
makes the group labels
a,b,c...
instead of 1,2,3...
.
cld(emm_geno, adjust = 'sidak', Letters = letters)
Genotype emmean SE df lower.CL upper.CL .group
Collameno 2.84 0.472 5.13 0.886 4.79 a
Vesuvio 2.98 0.472 5.13 1.026 4.93 ab
Palomb 3.10 0.472 5.13 1.144 5.05 b
Sicania 3.12 0.472 5.13 1.168 5.08 b
Chiaro 3.21 0.472 5.13 1.259 5.17 bc
Scuro 3.37 0.472 5.13 1.415 5.32 c
Results are averaged over the levels of: Sowing
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
Conf-level adjustment: sidak method for 6 estimates
P value adjustment: sidak method for 15 tests
significance level used: alpha = 0.05
NOTE: If two or more means share the same grouping symbol,
then we cannot show them to be different.
But we also did not show them to be the same.
cld(emm_geno_by_sowing, adjust = 'sidak', Letters = letters)
Sowing = autumn:
Genotype emmean SE df lower.CL upper.CL .group
Collameno 3.32 0.435 5.31 1.545 5.09 a
Sicania 3.45 0.435 5.31 1.680 5.22 ab
Vesuvio 3.55 0.435 5.31 1.775 5.32 ab
Chiaro 3.58 0.435 5.31 1.806 5.35 ab
Scuro 3.64 0.435 5.31 1.866 5.41 ab
Palomb 3.75 0.435 5.31 1.983 5.53 b
Sowing = spring:
Genotype emmean SE df lower.CL upper.CL .group
Collameno 2.36 0.529 5.21 0.188 4.54 a
Vesuvio 2.41 0.529 5.21 0.240 4.59 a
Palomb 2.44 0.529 5.21 0.266 4.62 a
Sicania 2.79 0.529 5.21 0.618 4.97 b
Chiaro 2.85 0.529 5.21 0.673 5.02 b
Scuro 3.10 0.529 5.21 0.925 5.27 b
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
Conf-level adjustment: sidak method for 6 estimates
P value adjustment: sidak method for 15 tests
significance level used: alpha = 0.05
NOTE: If two or more means share the same grouping symbol,
then we cannot show them to be different.
But we also did not show them to be the same.
PROTIP: You can call the function
plot()
and pass any of youremmeans
orcontrast
objects to it as an argument. There are built-in methods that create plots of the means or contrasts, and their 95% confidence intervals. I will not get into that here. In future lessons we will take a closer look at those plots and how to customize them.
What have you learned in this lesson?
Impressive!
This was Lesson 2 in a series. Future lessons will cover:
As in the previous lesson, here are some exercises for additional practice. These exercises give you some SAS code and ask you to write some R code that does the same job or close to it.
For these exercises, we will be using a classic dataset from the notes of Gertrude Cox, a pioneering statistician who incidentally was the first female professor at N.C. State University! This dataset comes from a strip-split-plot experiment examining how soil type, fertilizer, and calcium affect barley yield.
The barley dataset is provided with the R package
agridat. You can load it into your R workspace by
calling the function data()
after loading the package.
library(agridat)
data('cox.stripsplit')
In the experiment, there are four replicated blocks. Within each block, there are four vertical strips each with a different fertilizer treatment. Each strip is split in half to yield two subplots per strip, each with a different level of calcium. Perpendicularly to the fertilizer strips there are horizontal strips with different soil types.
To help visualize, this is what one of the four blocks might look like. Note that the fertilizer (F, shown as different colors) is applied in vertical strips in a random order and the soil type (S, shown as different shadings) is applied perpendicularly. Within each plot formed by the intersection of two strips, there is a subplot for each calcium (C) treatment, arranged randomly. The other three blocks would look similar but with different randomized locations of F, S, and C levels.
More details can be found in the help documentation for the dataset.
Type ?cox.stripsplit
.
Specify the model with the fixed and random effects we identified in
Exercise 1. Include only random intercept effects, no random slopes. The
model
and random
statements to fit the same
model using SAS proc glimmix
are:
model yield = soil|fert|calcium / ddfm=kr;
random rep fert(rep) soil(rep) fert(soil*rep);
*
, which will estimate main effects
and all interactions.(1 | <something>)
.Did you get any warning messages when fitting the model? Can you diagnose the issue? What, if anything, should you do about it?
Fit the model again, this time adding a random slope for the main
effect of calcium, grouped by rep. This corresponds to the following two
random
statements in proc glimmix
:
random fert(rep) soil(rep);
random intercept calcium / subject=rep;
+ calcium
somewhere in the random effects part of the model formula.What happens when you do this? Can you diagnose the issue?
Write code that gets a similar result as you would get if you
inserted the following slice
statement into your SAS
proc glimmix
code, for the model with only random
intercepts.
slice fert*calcium / sliceby=calcium diff cl linestable adjust=tukey adjdfe=row;
Your code should do the following:
emmeans()
, contrast()
, and
cld()
.adjust = 'tukey'
in two of the three function
calls.