Lesson 3: From linear model to linear mixed model

Lesson 3 learning objectives

At the end of this lesson, students will …

  • Understand the difference between a linear model and a linear mixed model.
  • Understand the difference between random intercepts and random slopes.
  • Be able to fit a mixed model with fixed effects, both continuous and categorical, and interactions.
  • Be able to fit a mixed model with random effects, both random intercepts and random slopes.
  • Examine diagnostic plots to ensure that model assumptions are met.
  • Examine summary information from mixed model output, including ANOVA tables.

Load the packages we need

library(tidyverse)
library(lme4)
library(lmerTest)
library(easystats)

Read in some example data

  • For this lesson, we will keep it simple and use simulated data
  • Read the soilN_biomass.csv file to a data frame called soilN_biomass
  • The data simulate a study of five different fields with 20 plots each
  • In each plot, soil nitrogen and plant biomass are measured
soilN_biomass <- read_csv('https://usda-ree-ars.github.io/glmm-workshop-dec2022/datasets/soilN_biomass.csv')

Examine the data

  • Use functions you learned in lesson 2
soilN_biomass
summary(soilN_biomass)
glimpse(soilN_biomass)
  • field: character column to identify which field the measurement comes from
  • soilN: numeric column for the soil N value
  • biomass: numeric column for the plant biomass value

Plot the data

  • Plot with ggplot() (Don’t worry about the code used to make this plot for now.)
ggplot(soilN_biomass, aes(x = soilN, y = biomass, color = field)) +
  geom_point(size = 1.5) +
  theme_bw()
  • The relationship between soil N and plant biomass is sort of unclear
  • How can we fit a model to this?

Model Syntax in R

  • At minimum, you need
    • model formula
    • data frame that the variables in the model formula come from
  • Formula has two sides: left and right, separated by tilde ~
    • LHS ~ RHS

LHS and RHS

  • Left-hand side is the response variable (y)
  • Right-hand side contains the predictor variables (x)
  • Effects are separated by a plus sign +
  • Interaction effects are denoted with a colon :
  • The shorthand * expands to all main effects plus every possible interaction among a set of variables, so height * sex means height + sex + height:sex (I don’t recommend it)

Example of a formula

  • weight ~ height + sex + height:sex
  • In SAS this would be model weight = height sex height*sex
  • “The expected value of weight is a linear combination of height, sex, and the product of height and sex.”
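As a quick check of the formula syntax described above, here is a minimal sketch that asks R to list the terms in a formula. It uses only base R and needs no data; the object names (f_short, f_long, and so on) are made up for this illustration.

```r
# Two ways of writing the same model formula
f_short <- weight ~ height * sex                 # shorthand with *
f_long  <- weight ~ height + sex + height:sex    # every term written out

# terms() parses a formula; the "term.labels" attribute lists its effects
terms_short <- attr(terms(f_short), "term.labels")
terms_long  <- attr(terms(f_long), "term.labels")

print(terms_short)

# TRUE if both formulas contain exactly the same terms
identical(terms_short, terms_long)
```

Writing the terms out by hand is longer, but it makes every effect in the model visible at a glance.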

Intercept

  • An intercept is included by default
  • You can be explicit by adding 1 to the RHS
    • weight ~ 1 + height
  • To fit a model without an intercept (force regression line through origin) replace 1 with 0
    • weight ~ 0 + height
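One way to see what the 1 and the 0 do is to look at the design matrix R builds from each formula. This is a small sketch with a made-up data frame (toy); the numbers in it are invented for illustration and are not part of the lesson data.

```r
# Hypothetical toy data, invented for this illustration only
toy <- data.frame(height = c(150, 160, 170, 180),
                  weight = c(50, 58, 66, 75))

# model.matrix() returns the matrix of predictors used to fit the model
X_with    <- model.matrix(weight ~ 1 + height, data = toy)
X_without <- model.matrix(weight ~ 0 + height, data = toy)

colnames(X_with)     # includes a column of 1s named "(Intercept)"
colnames(X_without)  # only the height column remains
```

Dropping the intercept is rarely what you want for a regression like this one, because it forces the fitted line through the origin.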

Linear model with continuous predictor

  • Relationship between soil nitrogen (x) and plant biomass (y) in a plot
  • Fit linear regression with lm() (linear model)
  • Model formula is just y ~ x
  • data argument is the data frame containing columns biomass and soilN
lm_fit <- lm(biomass ~ soilN, data = soilN_biomass)

Look at model output

  • Use summary()
summary(lm_fit)
  • Positive coefficient (~1.34) on soilN
  • For every 1-unit increase in soil N, expected biomass increases by about 1.34 units

If you just want the coefficients you can use the coef() function:

lm_coefs <- coef(lm_fit)

Plot fitted trendline from regression

  • Same plot as before but with a line using intercept and slope from model
ggplot(soilN_biomass, aes(x = soilN, y = biomass, color = field)) +
  geom_point(size = 1.5) +
  theme_bw() +
  geom_abline(intercept = lm_coefs[1], slope = lm_coefs[2], size = 1)
  • Relatively poor fit – Why?
  • Linear regression assumes a single intercept and slope holds across all fields
  • But fields might have different unmeasured environmental conditions causing different relationships
  • Some fields might have higher soil phosphorus or better soil moisture conditions
  • Or just environmental “noise” with no discernible cause
  • There is a bigger problem with the simple linear regression
  • The 20 measurements that come from the same field are not independent because they have a common cause
  • The regression model has 98 degrees of freedom
    • 100 data points - 2 parameters (the intercept and slope) = 98
  • But if we use one average value from each field we will have only 3 degrees of freedom
    • 5 field means - 2 parameters = 3
  • Ideally we want something in between
  • Measurements from the same field are partially but not completely dependent on one another
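The two degrees-of-freedom extremes described above can be reproduced with a rough sketch. The data here are simulated on the spot with the same layout as the lesson (five fields, 20 plots each); the object names and the numbers used to generate the data are assumptions made up for this illustration, not the values behind soilN_biomass.csv.

```r
set.seed(1)

# Hypothetical simulated data: 5 fields x 20 plots
n_fields <- 5
n_plots  <- 20
sim <- data.frame(field = rep(letters[1:n_fields], each = n_plots),
                  soilN = runif(n_fields * n_plots, min = 0, max = 10))

# Each field gets its own shift in baseline biomass (made-up values)
field_shift <- rnorm(n_fields, sd = 3)
sim$biomass <- 15 + 1.3 * sim$soilN +
  field_shift[as.integer(factor(sim$field))] +
  rnorm(nrow(sim), sd = 2)

# Extreme 1: treat all 100 plots as independent
fit_all <- lm(biomass ~ soilN, data = sim)
df.residual(fit_all)    # 100 - 2

# Extreme 2: collapse to one mean per field, then regress
field_means <- aggregate(cbind(soilN, biomass) ~ field, data = sim, FUN = mean)
fit_means <- lm(biomass ~ soilN, data = field_means)
df.residual(fit_means)  # 5 - 2
```

Neither extreme is satisfying: the first overstates how much independent information we have, and the second throws most of the data away.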

Separate linear regressions

  • So why not fit separate models for each field?
  • This code is just for demonstration: fit five separate regressions, get intercept and slope for each field, then plot the lines
separate_lm_fits <- soilN_biomass %>%
  group_by(field) %>%
  group_map(~ lm(biomass ~ soilN, data = .))

separate_lm_coefs <- data.frame(field = c('a', 'b', 'c', 'd', 'e'), 
                                intercept = map_dbl(separate_lm_fits, ~ coef(.)[1]),
                                slope = map_dbl(separate_lm_fits, ~ coef(.)[2]))

ggplot(soilN_biomass, aes(x = soilN, y = biomass, color = field)) +
  geom_point(size = 1.5) +
  theme_bw() +
  geom_abline(aes(intercept = intercept, slope = slope, color = field), size = 1, data = separate_lm_coefs)
  • We now have a separate relationship for each field
  • But we might hypothesize that there is some overall tendency of soil N to affect plant biomass
  • This process occurs in every field but has different outcomes because of the different environments in each field
  • How do we account for this?

Our first linear mixed model

  • Goal of our scientific inference is to generalize what we find in a particular study to a larger population
  • Separate regressions for each field don’t generalize because we aren’t estimating the overall relationship
  • Linear mixed models (LMM) estimate the overall relationship and account for unknown variation between fields at the same time

What do you call this model?

[Image: mixed models meme, courtesy of Chelsea Parlett-Pelleriti on Twitter]

Partial pooling

  • complete pooling in simple linear regression (all observations share intercept and slope parameters)
  • no pooling when we did separate regressions for each field (separate intercept and separate slope for each field)
  • Mixed models do partial pooling
    • some parameters shared by all groups (the fixed effects)
    • some parameters are unique to each group (the random effects)
    • called “mixed” model because there’s a “mix” of fixed and random effects

Random intercept model

  • We will use lme4 package to fit mixed models
  • lmer() instead of lm()
mm_fit <- lmer(biomass ~ soilN + (1 | field), data = soilN_biomass)
  • + (1 | field) – what’s that new thing?
  • A separate random intercept will be fit to each field

Syntax of random effect term

(1 | field)

  • Contained within parentheses ()
  • Has a vertical bar | in the middle of it.
  • Left of | is the design component (1 indicates an intercept)
  • Right of | are the grouping factors (here it is only field)

Visualizing model fit of random intercept model

Plot model fit with ggplot2 (again don’t worry about code)

pred_grid <- expand.grid(soilN = c(0, 10), field = letters[1:5])
mm_pred_group <- cbind(pred_grid, biomass = predict(mm_fit, newdata = pred_grid))
mm_pred_population <- data.frame(soilN = c(0, 10), biomass = predict(mm_fit, newdata = data.frame(soilN = c(0, 10)), re.form = ~ 0))

ggplot(soilN_biomass, aes(x = soilN, y = biomass, color = field)) +
  geom_point(size = 1.5) +
  theme_bw() +
  geom_line(data = mm_pred_group) +
  geom_line(data = mm_pred_population, color = 'black', size = 1)
  • Model now accounts for different “baseline” biomass levels in each field with different intercepts
  • Thick black line is population-level expectation for the “average” field
  • Mixed model treats intercepts as if they were drawn from a normal distribution
  • Population-level intercept is the mean of that distribution
  • But lines are all parallel (slope is the same for each field)
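For those who like to see it written out, the random intercept model can be sketched as an equation. The notation is an assumption introduced here and does not come from the slides: \(i\) indexes plots, \(j\) indexes fields, \(\beta_0\) and \(\beta_1\) are the population-level intercept and slope, \(u_j\) is the random intercept for field \(j\), and \(\varepsilon_{ij}\) is the residual.

```latex
\[
\text{biomass}_{ij} = \beta_0 + u_j + \beta_1\,\text{soilN}_{ij} + \varepsilon_{ij},
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2),
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
\]
```

The field-level intercept is \(\beta_0 + u_j\), and because \(\beta_1\) has no \(j\) subscript, every field shares the same slope.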

Random slope model

  • Slopes can also vary by field
  • Change the design (left-hand side of random effects term) to (soilN | field), which is shorthand for (1 + soilN | field) because the intercept is included by default
  • Slope with respect to soilN (and intercept) will be different for each field
mm_fit_randomslopes <- lmer(biomass ~ soilN + (soilN | field), data = soilN_biomass)
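The random intercept is still part of a random slope term even when it is not written out. Here is a minimal sketch that checks this, assuming lme4 is installed. To keep it runnable without downloading anything, it uses the sleepstudy example data that ships with lme4 as a stand-in for the lesson data, with Days in the role of soilN and Subject in the role of field.

```r
library(lme4)

# Random slope term written two ways; the intercept is implied in the first
fit_implicit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
fit_explicit <- lmer(Reaction ~ Days + (1 + Days | Subject), data = sleepstudy)

# Both calls describe the same model, so the estimates should agree
all.equal(fixef(fit_implicit), fixef(fit_explicit))
all.equal(as.numeric(logLik(fit_implicit)), as.numeric(logLik(fit_explicit)))
```

Writing the 1 explicitly is a matter of taste; some people find it makes the random effects structure easier to read.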

Visualizing model fit of random slope model

mm_pred_group_randomslopes <- cbind(pred_grid, biomass = predict(mm_fit_randomslopes, newdata = pred_grid))
mm_pred_population_randomslopes <- data.frame(soilN = c(0, 10), biomass = predict(mm_fit_randomslopes, newdata = data.frame(soilN = c(0, 10)), re.form = ~ 0))

ggplot(soilN_biomass, aes(x = soilN, y = biomass, color = field)) +
  geom_point(size = 1.5) +
  theme_bw() +
  geom_line(data = mm_pred_group_randomslopes) +
  geom_line(data = mm_pred_population_randomslopes, color = 'black', size = 1)
  • Group-level predictions for each field now have different slopes and intercepts

Delving into mixed model output

  • Diagnostic plots
  • Model summaries

Diagnostic plots

  • Does the statistical model meet all assumptions?
  • Residuals need to be roughly normally distributed and have no trend with respect to the fitted values
  • Random effects need to be normally distributed
  • check_model() from the performance package (loaded as part of easystats) produces all the diagnostic plots you need
check_model(mm_fit_randomslopes)

Residual diagnostics

  • The plots look great! (because this is fake data)
  • Linearity and homogeneity of variance plots show no trend
  • Normal Q-Q plots for the overall residuals (bottom left) and the random effects (bottom right) both follow straight lines
  • I recommend against formal tests on residuals, e.g. Shapiro-Wilk test
  • Assumptions like “normal residuals” are only approximations
  • A perfect normal distribution does not exist in nature, just like a perfect circle does not exist
  • Just has to be “close enough” for the model to be valid

The most important assumption of a linear model

  • Are individual data points independent samples?
  • Or if not, did you account for any dependence between them in your model?
  • Mixed models allow us to account for the partial dependency of data points

Model summaries

  • summary() of the fitted model object gives us a lot of output
    • Quantiles of the scaled residuals, and variances of the random effects
    • Fixed effect coefficients with t-tests
summary(mm_fit_randomslopes)

Getting parameters from the fit object

  • ranef(): random effects
  • fixef(): fixed effects
  • coef(): coefficients
    • Add the random effect for each group to the fixed effect
    • Result is the group-level intercepts and slopes
ranef(mm_fit_randomslopes)
fixef(mm_fit_randomslopes)
coef(mm_fit_randomslopes)
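To make the link between these three functions concrete, here is a rough sketch that rebuilds the coef() output by hand. It assumes lme4 is installed and again uses the built-in sleepstudy data as a stand-in for the lesson data; the object names (re, fe, manual, built_in) are made up for this illustration.

```r
library(lme4)

fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

re <- as.matrix(ranef(fit)$Subject)  # one row of deviations per group
fe <- fixef(fit)                     # population-level intercept and slope

# Add the fixed intercept to column 1 and the fixed slope to column 2
manual <- sweep(re, 2, fe, "+")

# coef() does the same addition for us
built_in <- as.matrix(coef(fit)$Subject)
all.equal(manual, built_in, check.attributes = FALSE)

head(manual)
```

The random effects are deviations from the population-level values, so they are centered near zero, while the coefficients are on the original scale of the response.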

ANOVA

  • Use the function anova()
  • Loading the lmerTest package makes anova() report F-tests with denominator degrees of freedom and p-values for lmer() fits
anova(mm_fit_randomslopes)
  • Degrees of freedom for the F-test are estimated with an approximation (Satterthwaite’s method is the lmerTest default)
  • This is needed because linear mixed models’ degrees of freedom are “somewhere in between” the two extremes we saw earlier
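Here is a small sketch of where the approximate degrees of freedom show up in the ANOVA table. It assumes lmerTest is installed and uses the sleepstudy data from lme4 (which lmerTest loads) as a stand-in for the lesson data, so the numbers will not match the soil N example.

```r
library(lmerTest)  # also loads lme4 and replaces lmer() with its own version

fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

# F-test for the fixed effect; DenDF is the approximated denominator df
aov_table <- anova(fit)
aov_table
aov_table$DenDF

# For comparison: residual df if we ignored the grouping entirely
df.residual(lm(Reaction ~ Days, data = sleepstudy))
```

The denominator degrees of freedom from the mixed model should come out smaller than the residual degrees of freedom of the plain regression, reflecting the fact that observations within a group are not fully independent.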

Mixed models with categorical predictor

  • Categorical predictor such as treatment variable with a control and one or more treatment levels
  • New example data: still simulated but now each field has ten unfertilized control plots, ten low fertilizer plots, and ten high fertilizer plots
  • Read data and then set the treatment factor levels to a logical order
fert_biomass <- read_csv('https://usda-ree-ars.github.io/glmm-workshop-dec2022/datasets/fertilizer_biomass.csv') %>%
  mutate(treatment = factor(treatment, levels = c('control', 'low', 'high')))

Look at the data

  • Box plot for each treatment within each field using ggplot2
ggplot(fert_biomass, aes(x = field, fill = treatment, y = biomass)) +
  geom_boxplot() + theme_bw()
  • A lot of variation among fields
  • Biomass means tend to increase from control to low to high.

Fit a model

  • Try the random intercept model first
fert_mm <- lmer(biomass ~ treatment + (1 | field), data = fert_biomass)
  • What does the model summary show us?
summary(fert_mm)
  • Three coefficients in the model: the intercept, treatmentlow, and treatmenthigh
  • Three treatments (control, low, and high) but only two have their own coefficient
  • This is because the mean value of one of the three treatments is set as the intercept
  • We only need \(n-1\) treatment coefficients for \(n\) levels if we have an intercept
  • Expected mean value of biomass in unfertilized control plots = ~18.1
  • Coefficients on low and high are positive, so the expected biomass is greater than the control
  • Expected value of biomass in the low fertilizer treatment = 18.1 + 2.5 = ~20.6
  • There are built-in t-tests for each coefficient, comparing them to 0
  • The t-test for the intercept is not really meaningful
  • The other two t-tests are testing the null hypothesis that the difference between each other treatment and the control is zero
  • But they don’t account for multiple comparisons
  • And there is no t-test showing whether low and high differ from one another
  • Typically do not report these default t-tests

Random slopes model with categorical predictor

fert_mm_slopes <- lmer(biomass ~ treatment + (treatment | field), data = fert_biomass)
  • We get a message that the fit is singular – what?!?

Singular fit

  • “Some components of the variance-covariance matrix of the random effects, or possibly one or more linear combinations of them, are either exactly zero or exactly one.”
  • Still … what?!?
  • The algorithm that fits the model parameters doesn’t have enough data to get a good estimate
  • Often occurs when fitting (overly) complex models
  • Or random effects are very small and cannot be distinguished from zero
  • We still get output (it’s a note, not an error or warning)
  • But we need to take a close look at the random effects and their variances
  • Random slopes look fine
ranef(fert_mm_slopes)
  • Look at the variance-covariance matrix of the random effects using VarCorr()
VarCorr(fert_mm_slopes)
  • Correlations between the random intercepts and the random slopes are exactly 1 or exactly -1
  • This means no additional information is being provided by the random slopes
  • We should just stick with the random intercepts model in this case
  • The random slopes model is probably still statistically valid, just a little too complex for the data
  • This is not surprising here because I simulated the data to have no difference in slopes between fields
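If you would rather check for a singular fit in code than rely on spotting the message, lme4 provides isSingular(). The sketch below assumes lme4 is installed and simulates its own data with the same layout as the fertilizer example (five fields, three treatments, ten plots each). All of the numbers used to generate the data are invented for illustration, and the fields are simulated to differ only in their intercepts. Whether a given simulated data set produces a singular fit is not guaranteed, so no expected result is shown.

```r
library(lme4)
set.seed(42)

# Hypothetical data: 5 fields x 3 treatments x 10 plots
sim <- expand.grid(plot = 1:10,
                   treatment = c("control", "low", "high"),
                   field = letters[1:5])

# Made-up treatment effects and field-level shifts in baseline biomass
treat_effect <- c(control = 0, low = 2.5, high = 5)
field_shift  <- rnorm(5, sd = 3)

sim$biomass <- 18 +
  unname(treat_effect[as.character(sim$treatment)]) +
  field_shift[as.integer(sim$field)] +
  rnorm(nrow(sim), sd = 2)

# Random slopes for treatment, even though none were simulated
fit <- lmer(biomass ~ treatment + (treatment | field), data = sim)

isSingular(fit)  # TRUE means the fit is singular
VarCorr(fit)     # look for variances near 0 or correlations near 1 or -1
```

If isSingular() returns TRUE, simplifying the random effects, for example to (1 | field), is usually the first thing to try.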

Day 1 recap

What have we done so far?

  • Learned the basics of R
  • Learned how to work with data frames in R
  • Fit a linear model
  • Fit linear mixed models with random intercepts and random slopes

Impressive!

What will we do tomorrow?

  • Data transformations in linear mixed models
  • Crossed and nested random effects
  • Generalized linear mixed models
  • How to compare means and test specific hypotheses with contrasts

Excited to see you all then!!!