Lesson 3: From linear model to linear mixed model

Lesson 3 learning objectives

At the end of this lesson, students will …

Understand the difference between a linear model and a linear mixed model.
Understand the difference between random intercepts and random slopes.
Be able to fit a mixed model with fixed effects, both continuous and categorical, and interactions.

Be able to fit a mixed model with random effects, both random intercepts and random slopes.
Examine diagnostic plots to ensure that model assumptions are met.
Examine summary information from mixed model output, including ANOVA tables.

Load the packages we need

library(tidyverse)
library(lme4)
library(lmerTest)
library(easystats)

Read in some example data

For this lesson, we will keep it simple and use simulated data
Read the soilN_biomass.csv file to a data frame called soilN_biomass
Simulates a study with five different fields with 20 plots each
In each plot, soil nitrogen and plant biomass is measured

soilN_biomass <- read_csv('https://usda-ree-ars.github.io/SEAStats/mixed_models_in_R/datasets/soilN_biomass.csv')

Examine the data

Use functions you learned in lesson 2

soilN_biomass
summary(soilN_biomass)
glimpse(soilN_biomass)

field: character column to identify which field the measurement comes from
soilN: numeric column for the soil N value
biomass: numeric column for the plant biomass value

Plot the data

Plot with ggplot() (Don’t worry about the code used to make this plot for now.)

ggplot(soilN_biomass, aes(x = soilN, y = biomass, color = field)) +
  geom_point(size = 1.5) +
  theme_bw()

The relationship between soil N and plant biomass is sort of unclear
How can we fit a model to this?

Model Syntax in R

At minimum, you need
- model formula
- data frame that the variables in the model formula come from
Formula has two sides: left and right, separated by tilde ~
- LHS ~ RHS

LHS and RHS

Left-hand side is the response variable (y)
Right-hand side contains the predictor variables (x)
Effects are separated by a plus sign +
Interaction effects are denoted with a colon :
The shorthand * indicates all possible combinations of interactions for a set of variables (I don’t recommend)

Example of a formula

weight ~ height + sex + height:sex
In SAS this would be model weight = height sex height*sex
“The expected value of weight is a linear combination of height, sex, and the product of height and sex.”

Intercept

An intercept is included by default
You can be explicit by adding 1 to the RHS
- weight ~ 1 + height
To fit a model without an intercept (force regression line through origin) replace 1 with 0
- weight ~ 0 + height

Linear model with continuous predictor

Relationship between soil nitrogen (x) and plant biomass (y) in a plot
Fit linear regression with lm() (linear model)
Model formula is just y ~ x
data argument is the data frame containing columns biomass and soilN

lm_fit <- lm(biomass ~ soilN, data = soilN_biomass)

Look at model output

Use summary()

summary(lm_fit)

Positive coefficient (~1.34) on soilN
For every 1 unit increase in soil N, there is a 1.34 unit increase in biomass

If you just want the coefficients you can use the coef() function:

lm_coefs <- coef(lm_fit)

Plot fitted trendline from regression

Same plot as before but with a line using intercept and slope from model

ggplot(soilN_biomass, aes(x = soilN, y = biomass, color = field)) +
  geom_point(size = 1.5) +
  theme_bw() +
  geom_abline(intercept = lm_coefs[1], slope = lm_coefs[2], size = 1)

Relatively poor fit – Why?

Linear regression assumes a single intercept and slope holds across all fields
But fields might have different unmeasured environmental conditions causing different relationships
Some fields might have higher soil phosphorus or better soil moisture conditions
Or just environmental “noise” with no discernible cause

There is a bigger problem with the simple linear regression
The 20 measurements that come from the same field are not independent because they have a common cause
The regression model has 98 degrees of freedom
- 100 data points - 2 parameters, the intercept and slope = 98
But if we use one average value from each field we will only have 3 degrees of freedom
Ideally we want something in between
Measurements from the same field are partially but not completely dependent on one another

Separate linear regressions

So why not fit separate models for each field?
This code is just for demonstration: fit five separate regressions, get intercept and slope for each field, then plot the lines

separate_lm_fits <- soilN_biomass %>%
  group_by(field) %>%
  group_map(~ lm(biomass ~ soilN, data = .))

separate_lm_coefs <- data.frame(field = c('a', 'b', 'c', 'd', 'e'), 
                                intercept = map_dbl(separate_lm_fits, ~ coef(.)[1]),
                                slope = map_dbl(separate_lm_fits, ~ coef(.)[2]))

ggplot(soilN_biomass, aes(x = soilN, y = biomass, color = field)) +
  geom_point(size = 1.5) +
  theme_bw() +
  geom_abline(aes(intercept = intercept, slope = slope, color = field), size = 1, data = separate_lm_coefs)

We now have a separate relationship for each field
But we might hypothesize that there is some overall tendency of soil N to affect plant biomass
This process occurs in every field but has different outcomes because of the different environments in each field
How do we account for this?

Our first linear mixed model

Goal of our scientific inference is to generalize what we find in a particular study to a larger population
Separate regressions for each field doesn’t generalize because we aren’t estimating the overall relationship
Linear mixed models (LMM) estimate the overall relationship and account for unknown variation between fields at the same time

What do you call this model?

Mixed models meme courtesy of Chelsea Parlett-Pelleriti on Twitter

Partial pooling

complete pooling in simple linear regression (all observations share intercept and slope parameters)
no pooling when we did separate regressions for each field (separate intercept and separate slope for each field)
Mixed models do partial pooling
- some parameters shared by all groups (the fixed effects)
- some parameters are unique to each group (the random effects)
- called “mixed” model because there’s a “mix” of fixed and random effects

Random intercept model

We will use lme4 package to fit mixed models
lmer() instead of lm()

mm_fit <- lmer(biomass ~ soilN + (1 | field), data = soilN_biomass)

+ (1 | field) – what’s that new thing?
A separate random intercept will be fit to each field

Syntax of random effect term

(1 | field)

Contained within parentheses ()
Has a vertical bar | in the middle of it.
Left of | is the design component (1 indicates an intercept)
Right of | are the grouping factors (here it is only field)

Visualizing model fit of random intercept model

Plot model fit with ggplot2 (again don’t worry about code)

pred_grid <- expand.grid(soilN = c(0, 10), field = letters[1:5])
mm_pred_group <- cbind(pred_grid, biomass = predict(mm_fit, newdata = pred_grid))
mm_pred_population <- data.frame(soilN = c(0, 10), biomass = predict(mm_fit, newdata = data.frame(soilN = c(0, 10)), re.form = ~ 0))

ggplot(soilN_biomass, aes(x = soilN, y = biomass, color = field)) +
  geom_point(size = 1.5) +
  theme_bw() +
  geom_line(data = mm_pred_group) +
  geom_line(data = mm_pred_population, color = 'black', size = 1)

Model now accounts for different “baseline” biomass levels in each field with different intercepts
Thick black line is population-level expectation for the “average” field
Mixed model treats intercepts as if they were drawn from a normal distribution
Population-level intercept is the mean of that distribution
But lines are all parallel (slope is the same for each field)

Random slope model

Slopes can also vary by field
Change the design (left-hand side of random effects term) to (soilN | field)
Slope with respect to soilN (and intercept) will be different for each field

mm_fit_randomslopes <- lmer(biomass ~ soilN + (soilN | field), data = soilN_biomass)

Visualizing model fit of random slope model

mm_pred_group_randomslopes <- cbind(pred_grid, biomass = predict(mm_fit_randomslopes, newdata = pred_grid))
mm_pred_population_randomslopes <- data.frame(soilN = c(0, 10), biomass = predict(mm_fit_randomslopes, newdata = data.frame(soilN = c(0, 10)), re.form = ~ 0))

ggplot(soilN_biomass, aes(x = soilN, y = biomass, color = field)) +
  geom_point(size = 1.5) +
  theme_bw() +
  geom_line(data = mm_pred_group_randomslopes) +
  geom_line(data = mm_pred_population_randomslopes, color = 'black', size = 1)

Group-level predictions for each field now have different slopes and intercepts

Delving into mixed model output

Diagnostic plots
Model summaries

Diagnostic plots

Does the statistical model meet all assumptions?
Residuals need to be roughly normally distributed and have no trend with respect to the fitted values
Random effects need to be normally distributed
check_model() from easystats package produces all the diagnostic plots you need

check_model(mm_fit_randomslopes)

Residual diagnostics

The plots look great! (because this is fake data)
Linearity and homogeneity of variance plots show no trend
Normal Q-Q plots for the overall residuals (bottom left) and the random effects (bottom right) both straight lines

I recommend against formal tests on residuals, e.g. Shapiro-Wilk test
Assumptions like “normal residuals” are only approximations
A perfect normal distribution does not exist in nature, just like a perfect circle does not exist
Just has to be “close enough” for the model to be valid

The most important assumption of a linear model

Are individual data points independent samples?
Or if not, did you account for any dependence between them in your model?
Mixed models allow us to account for the partial dependency of data points

Model summaries

summary() of the fitted model object gives us a lot of output
- Distributions of residuals and random effects
- Fixed effect coefficients with t-tests

summary(mm_fit_randomslopes)

Getting parameters from the fit object

ranef(): random effects
fixef(): fixed effects
coef(): coefficients
- Add the random effect for each group to the fixed effect
- Result is the group-level intercepts and slopes

ranef(mm_fit_randomslopes)
fixef(mm_fit_randomslopes)
coef(mm_fit_randomslopes)

ANOVA

Use the function anova()
We need the lmerTest package for this

anova(mm_fit_randomslopes)

Degrees of freedom for the F-test are estimated with an approximation
Because linear mixed models’ degrees of freedom are “somewhere in between”

Mixed models with categorical predictor

Categorical predictor such as treatment variable with a control and one or more treatment levels
New example data: still simulated but now each field has ten unfertilized control plots, ten low fertilizer plots, and ten high fertilizer plots
Read data and then sort treatment factor levels to a logical order

fert_biomass <- read_csv('https://usda-ree-ars.github.io/SEAStats/mixed_models_in_R/datasets/fertilizer_biomass.csv') %>%
  mutate(treatment = factor(treatment, levels = c('control', 'low', 'high')))

Look at the data

Box plot for each treatment within each field using ggplot2

ggplot(fert_biomass, aes(x = field, fill = treatment, y = biomass)) +
  geom_boxplot() + theme_bw()

A lot of variation among fields
Biomass means tend to increase from control to low to high.

Fit a model

Try the random intercept model first

fert_mm <- lmer(biomass ~ treatment + (1 | field), data = fert_biomass)

What does the model summary show us?

summary(fert_mm)

Three coefficients in the model: the intercept, treatmentlow, and treatmenthigh
Three treatments (control, low, and high) but only two have their own coefficient
This is because the mean value of one of the three treatments is set as the intercept
We only need \(n-1\) treatment coefficients for \(n\) levels if we have an intercept

Expected mean value of biomass in unfertilized control plots = ~18.1
Coefficients on low and high are positive, so the expected biomass is greater than the control
Expected value of biomass in the low fertilizer treatment = 18.1 + 2.5 = ~20.6

There are built-in t-tests for each coefficient, comparing them to 0
The t-test for the intercept is not really meaningful
The other two t-tests are testing the null hypothesis that the difference between each other treatment and the control is zero
But they don’t account for multiple comparisons
And there is no t-test showing whether low and high differ from one another
Typically do not report these default t-tests

Random slopes model with categorical predictor

fert_mm_slopes <- lmer(biomass ~ treatment + (treatment | field), data = fert_biomass)

We get a message that the fit is singular – what?!?

Singular fit

“Some components of the variance-covariance matrix of the random effects, or possibly one or more linear combinations of them, are either exactly zero or exactly one.”
Still … what?!?
The algorithm that fits the model parameters doesn’t have enough data to get a good estimate
Often occurs when fitting (overly) complex models
Or random effects are very small and cannot be distinguished from zero
We still get output (it’s a note, not an error or warning)
But we need to take a close look at the random effects and their variances

Random slopes look fine

ranef(fert_mm_slopes)

Look at the variance-covariance matrix of the random effects using VarCorr()

VarCorr(fert_mm_slopes)

Correlations between the random intercepts and the random slopes are exactly 1 or exactly -1
This means no additional information is being provided by the random slopes
We should just stick with the random intercepts model in this case
The random slopes model is probably still statistically valid, just a little too complex for the data
This is not surprising here because I simulated the data to have no difference in slopes between fields

Day 1 recap

What have we done so far?

Learned the basics of R
Learned how to work with data frames in R
Fit a linear model
Fit linear mixed models with random intercepts and random slopes

Impressive!

What will we do next?

Data transformations in linear mixed models
Crossed and nested random effects
Generalized linear mixed models
How to compare means and test specific hypotheses with contrasts