Lesson 4: Going further with linear mixed models

Lesson 4 learning objectives

At the end of this lesson, students will …

Fit a linear mixed model with crossed and nested random effects.
Fit a linear mixed model with a transformed response variable.

Crossed and nested random effects

We are finally going to use some real data!
Dataset of results from a study of Precision Zonal Management (PZM) conservation agriculture practices
Two experimental treatments
- two different tillage systems
- crossed with presence or absence of rye cover crop
Measured maize yield

Load needed packages and read the data.

library(tidyverse)
library(easystats)
library(lme4)
library(lmerTest)

maize <- read_csv('https://usda-ree-ars.github.io/SEAStats/mixed_models_in_R/datasets/pzm_maize.csv')

Examine the data

glimpse(maize)

Multiple locations (US states) and multiple years
plot and block columns have integers identifying the plots and blocks
tillage column has two different values, "cp" (chisel plow) and "rt" (ridge tillage)
cover.trt column also has two different values, "none" and "rye"

Create factor columns

Convert the yr, plot, and block columns to factor type
They are integers and we want to treat them as categorical for the model
Use mutate() with the pipe %>%

maize <- maize %>%
  mutate(
    yr = factor(yr),
    plot = factor(plot),
    block = factor(block)
  )

Boxplot of the data

Use ggplot (don’t worry about this code for now)

ggplot(maize, aes(y = yield, x = tillage, fill = cover.trt)) +
  facet_grid(location ~ yr) +
  geom_boxplot(position = position_dodge()) +
  theme_bw()

In some location-year combinations, the tillage treatments seem to result in different yields
Mostly minimal differences between the cover treatments
The majority of the variation is between locations and years.

Setting up the model

Fixed effect for tillage treatment
Fixed effect for cover treatment
Interaction between the two treatments

yield ~ tillage + cover.trt + tillage:cover.trt

Adding random effects

Should location and year be random or fixed?
Depends on the statistical inference we want to draw
If we are interested in change over time, year should be a fixed effect
If are not interested in comparing means of the different years, year should be a random effect
Treating years as a random population of environments
The same goes for location

Building model formula

Let’s say year and location will both be random effects
Fit them as crossed random effects
An intercept will be fit for each year and each location
This means we assume the effect of year is constant across locations, and vice versa

yield ~ tillage + cover.trt + tillage:cover.trt + (1 | yr) + (1 | location)

No random slopes (baseline yield differs across environments but not the treatment effect)

Building model formula some more

There are many blocks at each location, each of which has many plots
Block treated as a random effect nested within location
Block 1 in Illinois is not the same as Block 1 in Minnesota
Separate intercept fitted for each block in each location
We do not need to include plot explicitly, because the plots are individual data points
Use the : operator to separate nesting terms.

yield ~ tillage + cover.trt + tillage:cover.trt + (1 | yr) + (1 | location) + (1 | block:location)

Full model formula

fit_maize <- lmer(yield ~ tillage + cover.trt + tillage:cover.trt + (1 | yr) + (1 | location) + (1 | block:location), data = maize)

Alternative formula 1

Give each location-year combination its own intercept
This would mean location is nested within year as well
Three levels of nesting for block
Use this if the locations are far enough apart that the random variation of years isn’t consistent across locations

yield ~ tillage + cover.trt + tillage:cover.trt + (1 | location:yr) + (1 | block:location:yr)

Alternative formula 2

Random slopes
Must specify which effects are random in the design component
Specify which grouping factors get separate slopes fitted to them
For example, fit random slopes only to each main effect and only for the year-location grouping factor:

yield ~ tillage + cover.trt + tillage:cover.trt + (tillage + cover.trt | location:yr) + (1 | block:location:yr)

Data transformations

Not just to “make the data normal” (actually the residuals)
Transformations should be guided by theory and science
For example, response variables that are ratios or concentrations
Difference between two values on a logarithmic scale is multiplicative difference (ratio) on an arithmetic scale
Example: Difference between a ratio of 0.01 and a ratio of 1 is 100 times
Difference etween a ratio of 1 and 100 is 100 times
On a log scale those two differences are treated the same (which is good)

Log transformation example

Small subset of an ARS scientist’s data (noise added)
Different varieties of barley crossed with different types of fungicide
Multiple plots of each combination of variety and fungicide in each experimental block (column Rep)
Amount of DON (a mycotoxin) produced measured, units of parts per million (column DON_ppm)

barley_don <- read_csv('https://usda-ree-ars.github.io/SEAStats/mixed_models_in_R/datasets/barley_don.csv')

Examine the data

glimpse(barley_don)

ggplot(barley_don, aes(x = Variety, y = DON_ppm, fill = Fungicide)) +
  geom_boxplot(position = position_dodge()) +
  theme_bw()

Fit a model

Variety, fungicide, and interaction are the fixed effects
Each Rep gets a random intercept

fit_don <- lmer(DON_ppm ~ Variety + Fungicide + Variety:Fungicide + (1|Rep), data = barley_don)

Log transformed response

Same as before but the response variable is log-transformed
This can be done directly in the model formula
Concentration is on a ratio scale so we are looking at multiplicative changes

fit_don_log <- lmer(log(DON_ppm) ~ Variety + Fungicide + Variety:Fungicide + (1|Rep), data = barley_don)

Improvement to residuals

Transformations should be determined by the science, but it’s still good to look at the residuals
Residuals in untransformed model have non-homogeneous variance and bad match to normal quantile line
Both diagnostics better for the log-transformed response
Note: check argument to check_model() lets you select specific diagnostic plots to make
- See all options by looking at help documentation: ?check_model

check_model(fit_don, check = c('homogeneity', 'qq'))
check_model(fit_don_log, check = c('homogeneity', 'qq'))