Power Analysis Tutorial

Why learn power analysis?

  • The single most-requested topic for SEA stats training
  • Many scientists are required to do power analysis and view it as an annoyance
  • But power analysis is critical when designing studies (experimental or observational) that give us reliable answers to questions we care about
  • If we can’t do that, we can’t do science

Motivation to learn power analysis

When done properly, power analysis:

  • Helps you design your study, foreseeing any issues that might arise
  • Reduces the chance of doing weak and inconclusive studies that are doomed before they even start
  • Elevates your science, and your science makes the world a better place

If that doesn’t motivate you to learn about power analysis, I don’t know what will!

Prerequisites

You will get the most out of this course if you …

  • Know the basic principles of experimental design (randomization and replication)
  • Are familiar with statistical tests and models (t-test, analysis of variance, mixed models)
  • Know a little bit about data analysis using R

How to follow the course

  • Slides and text version of lessons are online
  • Fill in code in the worksheet (replace ... with code)
  • You can always copy and paste code from text version of lesson if you fall behind

At the end of this course, you will understand …

  • What statistical power is and why it’s important
  • Why we have to trade off between false positives and false negatives
  • Why power depends on sample size
  • What effect size is and why statistical power depends on it
  • That doing a power analysis after you’ve collected the data is pointless
  • That the more work you put into a power analysis, the better its quality and realism

At the end of this course, you will be able to …

  • Extract effect sizes from published papers or previous studies
  • Calculate exact statistical power for simple study designs
  • Estimate statistical power using simulation for study designs that require mixed models

Power basics

A framework for statistical power

  • We are working within the classical statistical framework of using evidence from the data to try to reject a null hypothesis
  • We can never be 100% certain of the truth without measuring every individual in the (sometimes hypothetical) population
  • We might be wrong because of natural variation, measurement error, biases in our sample, or any/all of those

Right and wrong in different ways

  • Assume for now we are trying to determine the truth about a binary (yes or no) outcome
  • You can be right or wrong in different ways, depending on what the truth is
    • False positive = Type I error
    • False negative = Type II error

Two ways to be wrong

  • \(Y\) is the truth, \(\hat{Y}\) is the model’s prediction; crossing them gives four possible outcomes: true positive, false positive, false negative, and true negative

Definition of statistical power

  • Power of a statistical test: the probability that the test will detect a phenomenon if the phenomenon is true
  • Follows directly from the idea of true and false positives and negatives
  • Power is the probability of declaring a true positive if the null hypothesis is false
  • In contrast, the significance level \(\alpha\) is the probability of declaring a false positive if the null hypothesis is true

POWER: If an effect exists, the chance that our study will find it

You can’t be right all the time

  • It is impossible to completely eliminate all false positives and all false negatives
  • The only way to be 100% certain you will never get a false positive is for your test to give 100% negative results
  • But a pregnancy test that always says “You’re not pregnant” is completely useless!
  • We have to find the sweet spot that reduces both false positives and false negatives to an acceptably low rate

Magic number: false positive rate

  • Traditional paradigm of Western science is conservative
  • Low probability of false positives, at the cost of a fairly high rate of false negatives
  • Familiar \(p < 0.05\) comes from the significance level \(\alpha = 0.05\): 5% probability of a false positive

Magic number 2: false negative rate

  • What about false negatives? Commonly we target 20% false negative rate, or \(\beta = 0.20\)
  • \(\frac{0.20}{0.05}\) = 4:1 ratio of probability of a false negative to probability of a false positive
  • No reason why we must use a 4:1 ratio

Tradition!

Power is a function of false negative rate

  • We usually express power as the probability of a positive result given that the null hypothesis is false
  • If null hypothesis is false, we can either get a true positive or a false negative
  • Probability of true positive + probability of false negative = 1
  • \(\beta = 0.2\) is the probability of a false negative, so power \(= 1 - 0.2 = 0.8\): 80% power is our magic number (see the formula below)
  • This is the level of statistical power often required by review boards
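
Putting the bullets above together in formula form:

\[\text{Power} = \Pr(\text{true positive} \mid H_0 \text{ false}) = 1 - \Pr(\text{false negative} \mid H_0 \text{ false}) = 1 - \beta = 1 - 0.2 = 0.8\]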

Tradeoff between false positives and false negatives

t-test with \(n = 50\) per group and a medium effect size
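
To get a feel for this tradeoff numerically, here is a minimal sketch using base R’s power.t.test() with the same assumptions as the figure: \(n = 50\) per group and a medium effect, taken here to mean \(d = 0.5\) (per the benchmarks later in the lesson). The particular \(\alpha\) values are chosen only for illustration.
# Power of a two-sample t-test with n = 50 per group and d = 0.5,
# across a range of significance levels (false positive rates)
alphas <- c(0.001, 0.01, 0.05, 0.1)
powers <- sapply(alphas, function(a) power.t.test(n = 50, delta = 0.5, sd = 1, sig.level = a)$power)

# Tightening the false positive rate (smaller alpha) lowers power (more false negatives)
data.frame(alpha = alphas, power = round(powers, 3))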

What’s wrong with an underpowered study: false negatives

  • An underpowered study is one that has a low probability of detecting a significant effect, if the effect truly exists in the world
  • Imagine we do an underpowered study and get a negative result: we can’t reject the null hypothesis
  • Because the power is low, we don’t know if we got a true negative or a false negative
  • No new knowledge was gained, we might as well not have done the study at all

What’s wrong with an underpowered study: false positives

  • But what if the underpowered study gives us a positive (significant) result?
  • Some people say that means the study was powerful enough after all, and power only matters if we don’t get a significant result

False positives in underpowered studies, continued

  • At small sample sizes when power is low, natural variation/measurement error/sampling bias can cause overestimates of the effect size
  • At low power, even though the false positive rate is fixed at 0.05, the ratio of false positives to true positives increases (see the sketch after this list)
  • A single positive result is not very convincing if it comes from an underpowered study
  • Publication bias in favor of positive results from small studies makes the problem worse
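
Here is a back-of-the-envelope sketch of that ratio, with purely hypothetical numbers (not from any study): assume only 10% of the effects we test for are real, keep \(\alpha = 0.05\), and compare an underpowered study to a well-powered one.
# Hypothetical: 10% of tested hypotheses are truly non-null, alpha fixed at 0.05
prior_true <- 0.1
alpha <- 0.05

ratio_fp_tp <- function(power) {
  false_pos <- alpha * (1 - prior_true)  # expected proportion of tests that are false positives
  true_pos <- power * prior_true         # expected proportion of tests that are true positives
  false_pos / true_pos
}

ratio_fp_tp(0.2)  # low power: false positives outnumber true positives
ratio_fp_tp(0.8)  # high power: most positive results are true positives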

Post hoc power analysis is pointless

  • Reviewers in applied science journals often ask scientists to do a power analysis on a study after the fact
  • But if you do a power analysis assuming the observed difference is the true difference, you will always find low power if you didn’t detect the effect, and high power if you did
  • If the sample size is small, we don’t know whether the observed difference is close to the true difference
  • Whether the study was underpowered could already have been determined before starting it
  • The power of a study after the fact is a mathematical function of the p-value
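
To see why, here is a hedged sketch for a two-sided z-test (which won’t exactly match every analysis) showing that “observed power” can be computed from the p-value alone, so it adds no new information: a result at exactly p = 0.05 always corresponds to roughly 50% observed power.
# Post hoc ('observed') power for a two-sided z-test, computed directly from the p-value
posthoc_power <- function(p, alpha = 0.05) {
  z_obs <- qnorm(1 - p / 2)       # test statistic implied by the p-value
  z_crit <- qnorm(1 - alpha / 2)  # critical value at the significance level
  pnorm(z_obs - z_crit) + pnorm(-z_obs - z_crit)
}

posthoc_power(c(0.20, 0.05, 0.01))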

Power isn’t everything

  • It is not wrong to do an underpowered study, if it is the only way we can learn about a system
  • It may be prohibitively expensive to add more experimental units
  • It is especially important to be careful and precise with measurements and to ensure your small sample is representative
  • Draw conclusions with caution, and be explicit about the limitations of your inference

Power analysis is a ballpark figure

  • If you knew exactly how large the effect is in your system, you would know exactly how many replicates you need
  • But if you knew exactly how large the effect is in your system, you wouldn’t need to do the study!
  • Power calculations are at best rough estimates

Image (c) Britannica

Err on the side of caution

  • If you collect too many samples: excess resources wasted/subjects harmed, but you still learn something
  • If you collect the “just right” number: resources used most efficiently to gain knowledge … but we can never know this exact number in advance!
  • If you collect too few samples: study is inconclusive and we do not learn anything. All resources spent are wasted

Image (c) Kidsotic

What do we need to know to do a power analysis?

Some are known and some need to be estimated:

  • Significance level
  • Power
  • Sample size
  • Effect size

(Un)known quantity 1: Significance level

  • Most classical stat analyses in ag science assume \(\alpha = 0.05\)
  • But \(\alpha\) may need to be adjusted if you have multiple comparisons
  • This adjustment is often (wrongly) neglected when doing power analysis

(Un)known quantity 2: Power

  • Sometimes power analysis is done assuming a desired level of power (often 80%, or \(\beta = 0.20\))
  • Or we may estimate the power to detect a given effect at a given sample size

(Un)known quantity 3: Sample size

  • We may know how many samples/replicates we can (afford to) collect or measure
  • Or we may calculate the minimum sample size needed to get a certain power
  • Often there is a range of feasible sample sizes, not just a single number

Sample size: the more the better, up to a point

  • Statistical inference: take a sample, calculate statistics, use statistics to estimate parameters (true values) of the sampled population
  • The bigger the sample, the closer our statistical estimates get to the true parameters

  • The rate at which our accuracy increases slows down as the estimate asymptotically approaches the true value
  • We get diminishing returns of adding more data
    • Power can’t be >100%
  • We need power analysis to figure out where that curve bends and where we should be on that curve
    • Most accurate estimates for the least investment of time and resources

(Un)known quantity 4: Effect size

  • The size of a practically meaningful effect that you want to be able to detect statistically
  • Often not a single number but a range
  • Probably the most important quantity for power analysis
  • Also the one that takes the most effort to determine

What is effect size and why is it important?

  • Measure of the magnitude of a difference between groups or strength of a relationship between variables
    • Difference in mean biomass of a control group and treatment group
    • Slope of the relationship between drug dosage and physiological response

Statistical power increases as effect size increases

  • The stronger the effect, the easier it is to see that effect above the background noise
  • True positives become more likely and false negatives less likely

t-test power depends on effect size

  • \(n = 50\) per group in all three cases
  • It’s easy to get high power at low false positive rate if the effect size is large
  • If the effect size is small, our only hope is to collect more data

Standardized effect sizes

  • Effect sizes are almost always in standardized units
  • Different studies measure response variables in different units (biomass in kg or g or lb, height in cm, m, or in)
  • If we convert to standard deviation units, we can compare results that are on different scales

What are some commonly used effect sizes?

  • Some based on magnitude of difference between means
  • Others based on the proportion of variance explained by a statistical model
  • These are only a small sampler of what’s out there
| Effect | Effect size metric | Description | Example |
| --- | --- | --- | --- |
| Difference of the mean of two groups | Cohen’s d | The difference between the two means divided by their combined standard deviation: how many standard deviations greater is one mean than the other? | The effect of a management intervention on biomass yield |
| Difference in a binary outcome between two groups | Odds ratio | The ratio of the odds of the outcome happening in one group to the odds of it happening in another group | The effect of a seed treatment on seed germination rate |
| Strength of relationship between two continuous variables | Pearson correlation coefficient r | You should know this one! | The effect of temperature on soil respiration |
| Variance explained by treatment group when comparing three or more groups | Cohen’s f | (Square root of) the ratio of variance explained by group to the residual variance not explained by group | The effect of multiple genotypes and environments on a trait |
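
As a quick, hedged illustration of how these metrics might be computed in R (all data below are made up; the effectsize package used here is loaded in the coding section below, and the odds ratio is computed by hand):
library(effectsize)

set.seed(10)
treat <- rnorm(20, mean = 12, sd = 3)    # made-up treatment group
control <- rnorm(20, mean = 9, sd = 3)   # made-up control group

# Cohen's d: standardized difference between two group means
cohens_d(treat, control)

# Odds ratio from a hypothetical 2x2 table of germinated vs. not germinated seeds
germ <- matrix(c(40, 10, 25, 25), nrow = 2,
               dimnames = list(c('germinated', 'not'), c('treated', 'untreated')))
(germ[1, 1] / germ[2, 1]) / (germ[1, 2] / germ[2, 2])

# Pearson correlation between two continuous variables
temperature <- rnorm(30, mean = 20, sd = 5)
respiration <- 0.5 * temperature + rnorm(30, mean = 0, sd = 2)
cor(temperature, respiration)

# Cohen's f from a one-way ANOVA comparing three groups
dat3 <- data.frame(group = rep(c('a', 'b', 'c'), each = 10),
                   y = rnorm(30, mean = rep(c(10, 11, 13), each = 10), sd = 2))
cohens_f(lm(y ~ group, data = dat3))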

What software should I use?

R packages:

  • WebPower (formula-based power calculations)
  • simr (simulation-based power for mixed models)

SAS procedures:

  • proc power
  • proc glmpower
  • proc glimmix

Estimating effect sizes

How to get information about effect size

  • Data from your own lab
  • The literature
  • Prior external information
  • Effect size benchmarks

Effect sizes from data

  • Find datasets that are roughly comparable to what you expect in your proposed study
    • Datasets where your response of interest is measured in a similar system
    • Datasets where a different variable is measured in your study system
  • Calculate effect sizes from the prior datasets, and design your study around that effect size

Let’s finally do some coding!

  • Load necessary packages
library(agridat)
library(effectsize)
library(lme4)
library(emmeans)
library(dplyr)
library(ggplot2)
library(purrr)   # for map_dbl(), used in the MDES curve section
library(WebPower)
library(simr)

Example of effect size calculation from data

Image (c) NCSU Extension
  • besag.beans dataset from agridat package
  • Yield data for different bean genotypes
  • Calculate Cohen’s f effect size
  • We will ignore blocking structure/spatial layout

Load and examine the data

data(besag.beans)

head(besag.beans)

Fit linear model

besag_fit <- lm(yield ~ gen, data = besag.beans)
  • anova() function calculates sums of squares for one-way ANOVA of yield by genotype
besag_anova <- anova(besag_fit)

Calculate Cohen’s f

  • f = square root of the ratio of the treatment sum of squares to the error sum of squares
    • Genotype is the treatment in this case
  • Compare our calculation with cohens_f() from effectsize package
sqrt(besag_anova$`Sum Sq`[1] / besag_anova$`Sum Sq`[2])

cohens_f(besag_fit)

Now what?

  • We get \(f = 0.85\) for the preliminary dataset
  • But this is just a single estimate from a single study
  • The true effect in your future study may differ
  • There is no “rule” about what effect size you should use to design your study
  • 50% of the observed effect size would be a good conservative choice

Effect sizes from the literature

  • Find studies on a similar topic or done in a similar system
  • If they published all their raw data you can proceed to calculating the effect size from the data
  • But you can still calculate effect size from published summary statistics, tables, or figures
  • Essentially a quick and informal meta-analysis

How many studies is enough?

  • Quality/relevance of the studies is more important than quantity
  • Not necessary to do an exhaustive literature search
  • Even a single study can be enough if it’s representative

Example: impact of transgenic Bt crops on abundance of soil organisms

  • We want to measure this well-studied effect in a new system or context
  • We can use previous studies on the same or similar response to get an idea of the range of effect sizes we can expect
  • Design our study to detect an effect on the low end of that range

Image (c) NCSU Entomology

Effect size for Bt impact

  • Difference of two means: mean soil organism abundance under isogenic line and transgenic line
  • Cohen’s d is standardized mean difference
    • Difference between the means divided by their pooled or “average” standard deviation
    • Tells us how far apart the means are in standard deviation units

Examples from the literature

  • Find published estimates of soil organismal abundance compared between transgenic Bt line and isogenic line
  • I selected these haphazardly from a 2022 meta-analysis of the topic
  • These are edited tables with only a small subset of the data

Frouz et al. 2008, Table 1

Frouz et al. 2008, data from Table 1

Cerevkova & Cagan 2015, Table 2

Cerevkova & Cagan 2015, data from Table 2

Arias-Martins et al. 2016, Table 4

Arias-Martins et al. 2016, data from Table 4

Getting from published results to effect sizes

  • Assume that the differences these researchers documented are the “true” differences in those systems
  • Also assume the Bt effect size in our system will be in the same neighborhood as the previous studies

Issues with extracting effect sizes from publications

  • Many if not most papers present more than one comparison
  • Not all papers present the means and standard deviations that we need
  • We can convert standard error of the mean to standard deviation: \(SD = SE\sqrt{n}\) (see the small helper after this list)
  • Again ignore any issues about distribution of the data, replication, etc.
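
A minimal helper for that conversion (the function name is mine, not part of the lesson’s worksheet):
# SD = SE * sqrt(n)
sd_from_se <- function(se, n) se * sqrt(n)

sd_from_se(se = 1.2, n = 10)  # e.g., a reported SE of 1.2 with n = 10 implies SD of about 3.8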

Cohen’s d formulas

Formula for Cohen’s d effect size:

\[d = \frac{\bar{x}_1 - \bar{x}_2}{s}\]

Formula for \(s\), the pooled standard deviation:

\[s = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2} {n_1 + n_2 - 2}}\]

  • Essentially a weighted average of the two standard deviations, with a correction for small sample size

R function for Cohen’s d

# Cohen's d from the two group means, standard deviations, and sample sizes
calc_Cohen_d <- function(xbar1, xbar2, s1, s2, n1, n2) {
  s <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))  # pooled SD
  (xbar1 - xbar2) / s
}

Import data transcribed from the published tables

Bt_means <- read.csv('https://usda-ree-ars.github.io/SEAStats/power_analysis_tutorial/Bt_effect_size_data.csv')

# If on the Posit Cloud server, you can instead import from the data folder:
# Bt_means <- read.csv('data/Bt_effect_size_data.csv')

Examine the first few rows of the data

head(Bt_means)

Calculate Cohen’s d for each row

Bt_Cohen_d <- with(Bt_means, calc_Cohen_d(xbar1 = Bt_mean, xbar2 = nonBt_mean, 
                                          s1 = Bt_SD, s2 = nonBt_SD, 
                                          n1 = Bt_n, n2 = nonBt_n))
  • Some effect sizes are undefined because, for a few cases, all observed values were zero (so the pooled standard deviation is zero)
  • Again we can get away with ignoring those because this is a ballpark estimate
Bt_Cohen_d <- na.omit(Bt_Cohen_d)

Examine effect size distribution

  • Take absolute value abs() of effect sizes because we are interested in detecting either direction of effect with a two-sided test
  • Histogram of effect sizes annotated with lines for median, 25% quantile, and 75% quantile
Bt_d_quantiles <- quantile(abs(Bt_Cohen_d), probs = c(.25, .5, .75))

ggplot(data.frame(d = abs(Bt_Cohen_d)), aes(x = d)) +
  geom_histogram(bins = 10) +
  geom_vline(xintercept = Bt_d_quantiles[1], color = 'forestgreen', linewidth = 1) +
  geom_vline(xintercept = Bt_d_quantiles[2], color = 'darkorange', linewidth = 2) +
  geom_vline(xintercept = Bt_d_quantiles[3], color = 'forestgreen', linewidth = 1) +
  theme_bw()

round(Bt_d_quantiles, 3)

What effect size to use?

  • There is no single correct answer; it depends what you can justify
  • We could use the median (\(d = 0.46\), a medium effect size), or a lower quantile
  • Or you could use this distribution to get an idea of the power for a predetermined design
    • For example, if your study has 80% power to detect \(d \geq 0.3\), you’d be happy
    • If your study has 80% power to detect \(d \geq 0.6\), back to the drawing board!
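
As a hedged sketch of that last idea, using base R’s power.t.test() with made-up group sizes (not an actual proposed design), we can compute the minimum detectable d at 80% power and compare it to the quantiles above:
# Minimum detectable Cohen's d at 80% power and alpha = 0.05 (sd = 1, so delta is in SD units)
power.t.test(n = 40, power = 0.8, sig.level = 0.05)$delta  # hypothetical 40 animals per group
power.t.test(n = 15, power = 0.8, sig.level = 0.05)$delta  # hypothetical 15 animals per group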

Effect sizes based on external criteria

  • You are a domain-knowledge expert so you can justify effect sizes without any prior data
  • What effect size is biologically/practically/economically relevant?
  • You can consult with stakeholders to determine this
    • Example: a stakeholder is interested in a treatment that will increase yield by 2 bushels/acre, or decrease mortality by 25%
  • The problem is often coming up with a way to convert the stakeholder’s wishes to a standardized metric
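
A minimal sketch of that conversion, with entirely hypothetical numbers: divide the stakeholder’s raw difference by an estimate of the standard deviation of the response, taken from prior data or the literature.
raw_difference <- 2  # stakeholder's target: 2 bushels/acre yield increase
sd_yield <- 5        # assumed SD of yield among plots (hypothetical; would come from prior data)

d_target <- raw_difference / sd_yield  # standardized effect size implied by the target
d_target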

Effect size based on no prior information

  • I often hear “Nothing like this has ever been done before, so I have no idea what the effect size should be”
  • Usually this indicates a failure to think creatively
  • You don’t need an identical previous study: anything with a similar response variable, or a different variable measured in a similar system, will do

Conventional effect size benchmarks

  • If you truly have no reasonable way of estimating effect size, we still have to assume one to calculate power
  • Most standardized effect size metrics have conventional benchmarks or thresholds
| d | f | Effect size |
| --- | --- | --- |
| 0.20 | 0.10 | Small |
| 0.50 | 0.25 | Moderate |
| 0.80 | 0.40 | Large |

Happy medium

  • Common practice: target medium effect size
  • Designing a study to detect a small or weak effect requires a huge sample size
  • Designing a study that can only detect a large effect is usually not worthwhile either, because it will likely miss smaller but still meaningful effects
  • I do not recommend blindly using the medium effect size all the time
  • The benchmarks are very general across fields and may not be appropriate for your study system
  • No substitute for finding relevant data or consulting experts/stakeholders

Power analysis using formulas

Plug-and-chug power analysis

  • Certain simple experimental designs have analytical equations for statistical power, with these variables:
    • Sample size
    • Desired false positive rate (alpha, often set to 0.05)
    • Effect size
    • Statistical power
  • Plug all but one into our formula and solve for the unknown

  • We can input alpha, the effect size we want to be able to detect, and the sample size we can realistically get, and estimate statistical power
  • We can input alpha, the effect size we want to be able to detect, and the target power, and estimate the minimum sample size needed to detect that effect size with the desired level of power
  • We can input alpha, power, and sample size, and estimate the smallest effect we can detect given that study design at the desired power level
    • minimum detectable effect size (MDES)

Simple power calculations with WebPower

  • WebPower package has many functions for different study designs
    • See help(package = 'WebPower')
  • Here we will use wp.anova() to calculate power for a one-way ANOVA design

Calculations with wp.anova()

In all cases assume k = 4 groups being compared

  • Calculate required sample size for 80% power, given effect size
  • Calculate minimum detectable effect size for 80% power, given sample size
  • Calculate power, given sample size and effect size

Sample size given power and effect size

  • k = 4: number of groups
  • f = 0.25: moderate Cohen’s f effect size
  • alpha = 0.05: typical significance level
  • power = 0.8: typical power to target
  • type = 'overall': power for the overall F-statistic in the ANOVA
    • Other type arguments are possible, for example the power to detect a difference in two specific group means
wp.anova(k = 4, f = 0.25, alpha = 0.05, power = 0.8, type = 'overall')

Effect size given power and sample size

  • What if we can only afford 60 total samples (15 per group)?
  • Do not specify f this time, but specify n = 60
wp.anova(k = 4, n = 60, alpha = 0.05, power = 0.8, type = 'overall')

Power given effect size and sample size

  • Our preliminary analysis tells us we can expect effects around f = 0.4
  • We can afford n = 60 samples
  • Specify all arguments except power
wp.anova(k = 4, n = 60, f = 0.4, alpha = 0.05, type = 'overall')

Running WebPower iteratively to generate a power curve

  • Graphically show how power or MDES changes over a range of assumptions
  • Define a number of sample sizes ns
  • Create all possible combinations of sample size and a range of effect sizes
ns <- c(20, 40, 60, 80, 100, 120, 140, 160, 180, 200)
n_f_grid <- expand.grid(n = ns, f = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6))

Power curve, continued

  • Call wp.anova() passing the vector of many n and f values to the function
  • Extract the $power column from the output of wp.anova() and add it to our original data frame
n_f_power <- wp.anova(k = 4, n = n_f_grid$n, f = n_f_grid$f, alpha = 0.05, type = 'overall')

n_f_grid$power <- n_f_power$power
  • Plot the results, with a dashed line drawn at 80% power
ggplot(n_f_grid, aes(x = n, y = power)) +
  geom_line(aes(group = f, color = factor(f)), linewidth = 0.8) +
  geom_point(aes(fill = factor(f)), size = 3, shape = 21) +
  geom_hline(yintercept = 0.8, linetype = 'dashed', linewidth = 1) +
  scale_color_brewer(name = "Cohen's f", palette = 'Reds', aesthetics = c('color', 'fill')) +
  scale_x_continuous(breaks = ns) +
  theme_bw()

Minimum detectable effect size curve

  • Use the same sample sizes as before
  • Target 80% power in all cases
  • Plot with a dashed line at \(f = 0.25\) (moderate effect size)
anova_MDES <- expand.grid(n = ns)
anova_MDES$MDES <- map_dbl(anova_MDES$n, ~ wp.anova(k = 4, n = ., alpha = 0.05, power = 0.8, type = 'overall')$f)

ggplot(anova_MDES, aes(x = n, y = MDES)) +
  geom_line(linewidth = 0.8) + geom_point(size = 3) + 
  geom_hline(yintercept = 0.25, linetype = 'dashed', linewidth = 1) +
  scale_x_continuous(breaks = ns) +
  theme_bw()

Power analysis using simulation

Why do we need to do power analysis by simulation?

  • Plug-and-chug power analysis is the exception, not the rule
  • Most experimental and observational study designs in ag science require mixed models to analyze
  • There is not a simple equation for power
  • Simulation to the rescue!

Basic power simulation workflow

  1. Generate a dataset using a given set of assumptions (effect size, sample size, variance components, etc.)
  2. Fit a statistical model to the dataset and calculate a p-value for your hypothesis of interest
  3. Repeat n times
  4. Your power is the proportion of times, out of n, that you observed a significant p-value

A simple simulation example: t-test

  • Assume the following:
    • Treatment mean = 12
    • Control mean = 9
    • Standard deviation of both groups = 3
    • Sample size of both groups = 30
  • t-test always assumes that individuals within each group have values normally distributed around a group mean

Simulate a single dataset

  • Define means, standard deviations, and sample sizes
  • Set random seed so we all get the same results
mu_t <- 12
mu_c <- 9
sd_t <- 3
sd_c <- 3
n_t <- 30
n_c <- 30

set.seed(1)
  • Draw random numbers from normal distributions with rnorm()
  • t.test() with all default options to compare means of the two random samples
x_t <- rnorm(n = n_t, mean = mu_t, sd = sd_t)
x_c <- rnorm(n = n_c, mean = mu_c, sd = sd_c)

t.test(x_t, x_c)

It’s significant … so what?

  • Result of a single simulation cannot give us power
  • Power is the probability of detecting an effect if it’s real
  • We cannot estimate a probability from a single yes/no outcome
  • We need to simulate many datasets and evaluate the significance for each one

Simulate 10000 datasets with a for loop

  • Initialize p_vals, a vector of zeros
  • for loop executes the same code 10000 times, increasing the value of i by 1 each time
    • Each time through the loop we draw rnorm()
    • We store the p-value in the ith element of p_vals
  • Proportion of the time that p < 0.05 = power!
n_sim <- 10000
p_vals <- rep(0, n_sim)

for (i in 1:n_sim) {
  x_t <- rnorm(n_t, mu_t, sd_t)
  x_c <- rnorm(n_c, mu_c, sd_c)
  p_vals[i] <- t.test(x_t, x_c)$p.value
}

mean(p_vals < 0.05)

Check our work

  • Compare simulation results with wp.t()
  • We need to supply effect size d
    • Means are 12 and 9, standard deviations are both 3, \(d = (12-9)/3 = 1\)
wp.t(n1 = 30, n2 = 30, d = 1, alpha = 0.05)
  • But we can’t always check our work like this

Power by simulation for mixed models

  • Mixed models: linear models that include random effects to account for grouping in the data
  • Also called multilevel models because they have more than one level of variance
    • Example: Individuals nested within blocks nested within environments

Mixed models have more unknowns

  • In addition to effect size and sample size, you also need variance components for each level
  • Statistical power depends on the variance at all levels
  • Split-split-plot or repeated measures in time may have many variance components
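
Before we turn to the simr package, here is a minimal by-hand sketch of a single simulation iteration for a randomized complete block design, using the lme4 and emmeans packages loaded earlier. All of the values (treatment means, block SD, residual SD) are made up; the point is to show where each variance component enters the simulated data.
set.seed(222)

# Hypothetical assumptions: 4 treatments, 6 blocks, block SD = 2, residual SD = 4
trt_means <- c(20, 22, 24, 25)
n_blocks <- 6
sd_block <- 2
sd_resid <- 4

# Simulate one RCBD dataset: treatment mean + block random effect + residual noise
sim_data <- expand.grid(trt = factor(1:4), block = factor(1:n_blocks))
block_effects <- rnorm(n_blocks, mean = 0, sd = sd_block)
sim_data$y <- trt_means[sim_data$trt] + block_effects[sim_data$block] +
  rnorm(nrow(sim_data), mean = 0, sd = sd_resid)

# Fit the mixed model and extract the p-value for the treatment F-test.
# Repeating this many times and taking the proportion of p < 0.05 gives the power.
sim_fit <- lmer(y ~ trt + (1 | block), data = sim_data)
joint_tests(sim_fit)$p.value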

Example 1. Number of blocks for a randomized complete block design

  • We are planning a multi-environment wheat yield trial
  • Randomized complete block design in each environment, but how many replicates should we include?
  • Imagine we have data from a previous yield trial

Example dataset

  • linder.wheat from agridat package: wheat yield trial in Switzerland
    • 9 genotypes
    • Replicated in 7 environments
    • Each environment has RCBD with 4 replicates

Import dataset and fit mixed model

  • Fixed effects: genotype gen, environment env, G by E interaction gen:env
  • Random effect: intercept for block nested within environment (1 | block:env)
data(linder.wheat)

linder_lmm <- lmer(yield ~ gen + env + gen:env + (1 | block:env), 
                   data = linder.wheat)

Look at model parameter estimates

  • joint_tests(): ANOVA table with F-tests
  • VarCorr(): variances for block and for individual plots (residuals)
joint_tests(linder_lmm)
VarCorr(linder_lmm)

Power simulation with simr

  • simr package lets us simulate new datasets using a model object we fit with lmer()
  • Here we will simulate random effects for new datasets with different numbers of blocks (reps)
  • Repeat many times for each number of blocks
  • Do a statistical test of choice on each simulated dataset
  • Calculate power as % of times the test rejects the null hypothesis

Power sim code: extend()

  • Estimate power for every number of blocks from 3 to 8
  • We will use the F-test for genotype from the ANOVA
  • extend() creates a new model object to simulate from
    • along = 'block' means add new levels to block (n = 8)
    • Extended object doesn’t have new data in it yet but it will be the basis of the simulation
linder_lmm_extended <- extend(linder_lmm, along = 'block', n = 8)

Power sim code: powerCurve()

  • along = 'block' to simulate datasets with different numbers of blocks, up to the extended value 8
  • test specifies we’re testing the fixed effect of genotype using the F-test
  • nsim = 10 means simulate 10 datasets for each number of blocks
linder_pcurve <- powerCurve(linder_lmm_extended, 
                            along = 'block', 
                            test = fixed('gen', method = 'f'), 
                            nsim = 10, 
                            seed = 3)

Look at simulation results

  • 95% confidence intervals reflect the uncertainty due to the finite number of simulations
linder_pcurve

plot(linder_pcurve)

Power curve with more simulations

  • I repeated this simulation with 1000 iterations
  • ~30 minutes (could be sped up by parallelizing)

Example 2. Binomial GLMM

  • We’re testing cattle for a disease: how many herds to sample and how many cattle per herd?
  • Previous data: cbpp dataset from lme4 package
  • Incidence of CBPP, a respiratory disease of zebu cattle in Ethiopia
  • 15 herds with variable numbers of cattle sampled (2 to 34)
  • For each herd, number of cattle with and without CBPP were counted at four time periods


Zebu, photo by Scott Bauer, USDA-ARS

Fit preliminary model to CBPP data

  • First convert data to long form (each row is an individual animal with disease = 0 if uninfected and disease = 1 if infected)
  • Binomial GLMM fit to the data
    • Binary response variable
    • Fixed effect for time period
    • Random intercept for herd (mean probabilities for each herd differ)
    • No random slope term for herd (effect of time period is modeled as being the same for each herd)
data(cbpp)
cbpp <- cbpp %>%
  group_by(herd, period) %>%
  reframe(animal = 1:n(),
          disease = rep(c(0, 1), c(size - incidence, incidence)))

cbpp_glmm <- glmer(disease ~ period + (1 | herd), data = cbpp, family = binomial)

extend() in two different ways

  • Extend number of herds to 20 using along = 'herd'
  • Extend number of cattle within each herd at each time period to 50 using within = 'period+herd'
cbpp_glmm_n_herds <- extend(cbpp_glmm, along = 'herd', n = 20)
cbpp_glmm_n_within_herd <- extend(cbpp_glmm, within = 'period+herd', n = 50)

Power simulation code

  • The second simulation specifies breaks, the specific numbers of cattle per herd to test; otherwise a default sequence is used
  • We use Chi-squared test for the fixed effect of period (appropriate for a binomial model)
cbpp_pcurve_n_herds <- powerCurve(cbpp_glmm_n_herds, 
                                  along = 'herd', 
                                  test = fixed('period', method = 'chisq'), 
                                  nsim = 10, 
                                  seed = 4)

cbpp_pcurve_n_within_herd <- powerCurve(cbpp_glmm_n_within_herd, 
                                        within = 'period+herd', 
                                        test = fixed('period', method = 'chisq'), 
                                        nsim = 10, 
                                        seed = 5, 
                                        breaks = c(3, 5, 10, 15, 25, 50))

Plot the power curves

plot(cbpp_pcurve_n_herds)

plot(cbpp_pcurve_n_within_herd)

Power curves with 1000 iterations

Concluding remarks

What did we learn today about power analysis?

  • Power analysis tells you how much you can trust the answers you get to the questions you care about
  • Determining effect size(s) is the most difficult but most important part of power analysis
  • Power analysis often requires you to simulate data based on assumptions
  • The more effort you put into a power analysis, and the more your assumptions are informed by real data, the better the analysis will be

What practical skills did we learn today?

  • Estimating effect sizes from existing datasets
  • Calculating statistical power, minimum detectable effect size, or sample size for simple study designs using formulas
  • Simulating data from a model to estimate statistical power for common mixed-model designs

Final recommendations

  • Design your study around the weakest effect you plausibly expect (the low end of your estimated range)
  • Think about justifying your design decisions to reviewers, readers, and yourself
  • Ask yourself “How confident will I be in the results of this study?”
  • Please don’t think of power analysis as a box to check, but as a critical part of the scientific method

No power = no new knowledge = no science

  • See text version of lesson for further reading and exercises for additional practice
  • Fill out MS Forms survey
  • Contact me at quentin.read@usda.gov!