Power Analysis Tutorial

Why learn power analysis?

  • The single most-requested topic for SEA stats training
  • Many scientists are required to do power analysis and view it as an annoyance
  • But power analysis is critical when designing studies (experimental or observational) that give us reliable answers to questions we care about
  • If we can’t do that, we can’t do science

Motivation to learn power analysis

When done properly, power analysis:

  • Helps you design your study, foreseeing any issues that might arise
  • Reduces the chance of doing weak and inconclusive studies that are doomed before they even start
  • Elevates your science, and your science makes the world a better place

If that doesn’t motivate you to learn about power analysis, I don’t know what will!

Prerequisites

You will get the most out of this course if you …

  • Know the basic principles of experimental design (randomization and replication)
  • Are familiar with statistical tests and models (t-test, analysis of variance, mixed models)
  • Know a little bit about data analysis using R

How to follow the course

  • Slides and text version of lessons are online
  • Fill in code in the worksheet (replace ... with code)
  • You can always copy and paste code from text version of lesson if you fall behind

At the end of this course, you will understand …

  • What statistical power is and why it’s important
  • Why we have to trade off between false positives and false negatives
  • Why power depends on sample size
  • What effect size is and why statistical power depends on it
  • That doing a power analysis after you’ve collected the data is pointless
  • That the more work you put into a power analysis, the better its quality and realism

At the end of this course, you will be able to …

  • Extract effect sizes from published papers or previous studies
  • Calculate exact statistical power for simple study designs
  • Estimate statistical power using simulation for study designs that require mixed models

Power basics

A framework for statistical power

  • We are working within the classical statistical framework of using evidence from the data to try to reject a null hypothesis
  • We can never be 100% certain of the truth without measuring every individual in the (sometimes hypothetical) population
  • We might be wrong because of natural variation, measurement error, biases in our sample, or any/all of those

Right and wrong in different ways

  • Assume for now we are trying to determine the truth about a binary (yes or no) outcome
  • You can be right or wrong in different ways, depending on what the truth is
    • False positive = Type I error
    • False negative = Type II error

Two ways to be wrong

  • \(Y\) is the truth, \(\hat{Y}\) is the model’s prediction; crossing them gives four possible outcomes: true positive, false positive, false negative, and true negative

Definition of statistical power

  • Power of a statistical test: the probability that the test will detect a phenomenon if the phenomenon is true
  • Follows directly from the idea of true and false positives and negatives
  • Power is the probability of declaring a true positive if the null hypothesis is false
  • In contrast, the significance level \(\alpha\) is the probability of declaring a false positive if the null hypothesis is true

POWER: If an effect exists, the chance that our study will find it

You can’t be right all the time

  • It is impossible to completely eliminate all false positives and all false negatives
  • The only way to be 100% certain you will never get a false positive is for your test to give 100% negative results
  • But a pregnancy test that always says “You’re not pregnant” is completely useless!
  • We have to find the sweet spot that reduces both false positives and false negatives to an acceptably low rate

Magic number: false positive rate

  • Traditional paradigm of Western science is conservative
  • Low probability of false positives, at the cost of a fairly high rate of false negatives
  • Familiar \(p < 0.05\) comes from the significance level \(\alpha = 0.05\): 5% probability of a false positive

Magic number 2: false negative rate

  • What about false negatives? Commonly we target 20% false negative rate, or \(\beta = 0.20\)
  • \(\frac{0.20}{0.05}\) = 4:1 ratio of probability of a false negative to probability of a false positive
  • No reason why we must use a 4:1 ratio

Tradition!

Power is a function of false negative rate

  • We usually express power as the probability of a positive result given that the null hypothesis is false
  • If null hypothesis is false, we can either get a true positive or a false negative
  • Probability of true positive + probability of false negative = 1
  • \(\beta = 0.2\) is the probability of a false negative, so power \(= 1 - 0.2 = 0.8\): 80% power is our magic number (see the formula below)
  • This is the level of statistical power often required by review boards
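
Putting the bullets above together in formula form:

\[\text{Power} = \Pr(\text{true positive} \mid H_0 \text{ false}) = 1 - \Pr(\text{false negative} \mid H_0 \text{ false}) = 1 - \beta = 1 - 0.2 = 0.8\]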

Tradeoff between false positives and false negatives

t-test with \(n = 50\) per group and a medium effect size
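
To get a feel for this tradeoff numerically, here is a minimal sketch using base R’s power.t.test() with the same assumptions as the figure: \(n = 50\) per group and a medium effect, taken here to mean \(d = 0.5\) (per the benchmarks later in the lesson). The particular \(\alpha\) values are chosen only for illustration.
# Power of a two-sample t-test with n = 50 per group and d = 0.5,
# across a range of significance levels (false positive rates)
alphas <- c(0.001, 0.01, 0.05, 0.1)
powers <- sapply(alphas, function(a) power.t.test(n = 50, delta = 0.5, sd = 1, sig.level = a)$power)

# Tightening the false positive rate (smaller alpha) lowers power (more false negatives)
data.frame(alpha = alphas, power = round(powers, 3))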

What’s wrong with an underpowered study: false negatives

  • An underpowered study is one that has a low probability of detecting a significant effect, if the effect truly exists in the world
  • Imagine we do an underpowered study and get a negative result: we can’t reject the null hypothesis
  • Because the power is low, we don’t know if we got a true negative or a false negative
  • No new knowledge was gained, we might as well not have done the study at all

What’s wrong with an underpowered study: false positives

  • But what if the underpowered study gives us a positive (significant) result?
  • Some people say that means the study was powerful enough after all, and power only matters if we don’t get a significant result

False positives in underpowered studies, continued

  • At small sample sizes when power is low, natural variation/measurement error/sampling bias can cause overestimates of the effect size
  • At low power, even though the false positive rate is fixed at 0.05, the ratio of false positives to true positives increases (see the sketch after this list)
  • A single positive result is not very convincing if it comes from an underpowered study
  • Publication bias in favor of positive results from small studies makes the problem worse
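
Here is a back-of-the-envelope sketch of that ratio, with purely hypothetical numbers (not from any study): assume only 10% of the effects we test for are real, keep \(\alpha = 0.05\), and compare an underpowered study to a well-powered one.
# Hypothetical: 10% of tested hypotheses are truly non-null, alpha fixed at 0.05
prior_true <- 0.1
alpha <- 0.05

ratio_fp_tp <- function(power) {
  false_pos <- alpha * (1 - prior_true)  # expected proportion of tests that are false positives
  true_pos <- power * prior_true         # expected proportion of tests that are true positives
  false_pos / true_pos
}

ratio_fp_tp(0.2)  # low power: false positives outnumber true positives
ratio_fp_tp(0.8)  # high power: most positive results are true positives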

Post hoc power analysis is pointless

  • Reviewers in applied science journals often ask scientists to do a power analysis on a study after the fact
  • But if you do a power analysis assuming the observed difference is the true difference, you will always find low power if you didn’t detect the effect, and high power if you did
  • If the sample size is small, we don’t know whether the observed difference is close to the true difference
  • Whether the study was underpowered could already have been determined before starting it
  • The power of a study after the fact is a mathematical function of the p-value
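
To see why, here is a hedged sketch for a two-sided z-test (which won’t exactly match every analysis) showing that “observed power” can be computed from the p-value alone, so it adds no new information: a result at exactly p = 0.05 always corresponds to roughly 50% observed power.
# Post hoc ('observed') power for a two-sided z-test, computed directly from the p-value
posthoc_power <- function(p, alpha = 0.05) {
  z_obs <- qnorm(1 - p / 2)       # test statistic implied by the p-value
  z_crit <- qnorm(1 - alpha / 2)  # critical value at the significance level
  pnorm(z_obs - z_crit) + pnorm(-z_obs - z_crit)
}

posthoc_power(c(0.20, 0.05, 0.01))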

Power isn’t everything

  • It is not wrong to do an underpowered study, if it is the only way we can learn about a system
  • It may be prohibitively expensive to add more experimental units
  • It is especially important to be careful and precise with measurements and to ensure your small sample is representative
  • Draw conclusions with caution, and be explicit about the limitations of your inference

Power analysis is a ballpark figure

  • If you knew exactly how large the effect is in your system, you would know exactly how many replicates you need
  • But if you knew exactly how large the effect is in your system, you wouldn’t need to do the study!
  • Power calculations are at best rough estimates

Image (c) Britannica

Err on the side of caution

  • If you collect too many samples: excess resources wasted/subjects harmed, but you still learn something
  • If you collect the “just right” number: resources used most efficiently to gain knowledge … but we can never know this exact number in advance!
  • If you collect too few samples: study is inconclusive and we do not learn anything. All resources spent are wasted

Image (c) Kidsotic

What do we need to know to do a power analysis?

Some are known and some need to be estimated:

  • Significance level
  • Power
  • Sample size
  • Effect size

(Un)known quantity 1: Significance level

  • Most classical stat analyses in ag science assume \(\alpha = 0.05\)
  • But \(\alpha\) may need to be adjusted if you have multiple comparisons
  • This adjustment is often (wrongly) neglected when doing power analysis

(Un)known quantity 2: Power

  • Sometimes power analysis is done assuming a desired level of power (often 80%, or \(\beta = 0.20\))
  • Or we may estimate the power to detect a given effect at a given sample size

(Un)known quantity 3: Sample size

  • We may know how many samples/replicates we can (afford to) collect or measure
  • Or we may calculate the minimum sample size needed to get a certain power
  • Often there is a range of feasible sample sizes, not just a single number

Sample size: the more the better, up to a point

  • Statistical inference: take a sample, calculate statistics, use statistics to estimate parameters (true values) of the sampled population
  • The bigger the sample, the closer our statistical estimates get to the true parameters

  • The rate at which our accuracy increases slows down as the estimate asymptotically approaches the true value
  • We get diminishing returns of adding more data
    • Power can’t be >100%
  • We need power analysis to figure out where that curve bends and where we should be on that curve
    • Most accurate estimates for the least investment of time and resources

(Un)known quantity 4: Effect size

  • The size of a practically meaningful effect that you want to be able to detect statistically
  • Often not a single number but a range
  • Probably the most important quantity for power analysis
  • Also the one that takes the most effort to determine

What is effect size and why is it important?

  • Measure of the magnitude of a difference between groups or strength of a relationship between variables
    • Difference in mean biomass of a control group and treatment group
    • Slope of the relationship between drug dosage and physiological response

Statistical power increases as effect size increases

  • The stronger the effect, the easier it is to see that effect above the background noise
  • True positives become more likely and false negatives less likely

t-test power depends on effect size

  • \(n = 50\) per group in all three cases
  • It’s easy to get high power at low false positive rate if the effect size is large
  • If the effect size is small, our only hope is to collect more data

Standardized effect sizes

  • Effect sizes are almost always in standardized units
  • Different studies measure response variables in different units (biomass in kg or g or lb, height in cm, m, or in)
  • If we convert to standard deviation units, we can compare results that are on different scales

What are some commonly used effect sizes?

  • Some based on magnitude of difference between means
  • Others based on the proportion of variance explained by a statistical model
  • These are only a small sampler of what’s out there
| Effect | Effect size metric | Description | Example |
| --- | --- | --- | --- |
| Difference of the mean of two groups | Cohen’s d | The difference between the two means divided by their combined standard deviation: how many standard deviations greater is one mean than the other? | The effect of a management intervention on biomass yield |
| Difference in a binary outcome between two groups | Odds ratio | The ratio of the odds of the outcome happening in one group to the odds of it happening in another group | The effect of a seed treatment on seed germination rate |
| Strength of relationship between two continuous variables | Pearson correlation coefficient r | You should know this one! | The effect of temperature on soil respiration |
| Variance explained by treatment group when comparing three or more groups | Cohen’s f | (Square root of) the ratio of variance explained by group to the residual variance not explained by group | The effect of multiple genotypes and environments on a trait |
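
As a quick, hedged illustration of how these metrics might be computed in R (all data below are made up; the effectsize package used here is loaded in the coding section below, and the odds ratio is computed by hand):
library(effectsize)

set.seed(10)
treat <- rnorm(20, mean = 12, sd = 3)    # made-up treatment group
control <- rnorm(20, mean = 9, sd = 3)   # made-up control group

# Cohen's d: standardized difference between two group means
cohens_d(treat, control)

# Odds ratio from a hypothetical 2x2 table of germinated vs. not germinated seeds
germ <- matrix(c(40, 10, 25, 25), nrow = 2,
               dimnames = list(c('germinated', 'not'), c('treated', 'untreated')))
(germ[1, 1] / germ[2, 1]) / (germ[1, 2] / germ[2, 2])

# Pearson correlation between two continuous variables
temperature <- rnorm(30, mean = 20, sd = 5)
respiration <- 0.5 * temperature + rnorm(30, mean = 0, sd = 2)
cor(temperature, respiration)

# Cohen's f from a one-way ANOVA comparing three groups
dat3 <- data.frame(group = rep(c('a', 'b', 'c'), each = 10),
                   y = rnorm(30, mean = rep(c(10, 11, 13), each = 10), sd = 2))
cohens_f(lm(y ~ group, data = dat3))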

What software should I use?

R packages:

  • WebPower (formula-based power calculations)
  • simr (simulation-based power for mixed models)

SAS procedures:

  • proc power
  • proc glmpower
  • proc glimmix

Estimating effect sizes

How to get information about effect size

  • Data from your own lab
  • The literature
  • Prior external information
  • Effect size benchmarks

Effect sizes from data

  • Find datasets that are roughly comparable to what you expect in your proposed study
    • Datasets where your response of interest is measured in a similar system
    • Datasets where a different variable is measured in your study system
  • Calculate effect sizes from the prior datasets, and design your study around that effect size

Let’s finally do some coding!

  • Load necessary packages
library(agridat)
library(effectsize)
library(lme4)
library(emmeans)
library(dplyr)
library(ggplot2)
library(purrr)   # for map_dbl(), used in the MDES curve section
library(WebPower)
library(simr)

Example of effect size calculation from data

Image (c) NCSU Extension
  • besag.beans dataset from agridat package
  • Yield data for different bean genotypes
  • Calculate Cohen’s f effect size
  • We will ignore blocking structure/spatial layout

Load and examine the data

data(besag.beans)

head(besag.beans)

Fit linear model

besag_fit <- lm(yield ~ gen, data = besag.beans)
  • anova() function calculates sums of squares for one-way ANOVA of yield by genotype
besag_anova <- anova(besag_fit)

Calculate Cohen’s f

  • f = square root of the ratio of the treatment sum of squares to the error sum of squares
    • Genotype is the treatment in this case
  • Compare our calculation with cohens_f() from effectsize package
sqrt(besag_anova$`Sum Sq`[1] / besag_anova$`Sum Sq`[2])

cohens_f(besag_fit)

Now what?

  • We get \(f = 0.85\) for the preliminary dataset
  • But this is just a single estimate from a single study
  • The true effect in your future study may differ
  • There is no “rule” about what effect size you should use to design your study
  • 50% of the observed effect size would be a good conservative choice

Effect sizes from the literature

  • Find studies on a similar topic or done in a similar system
  • If they published all their raw data you can proceed to calculating the effect size from the data
  • But you can still calculate effect size from published summary statistics, tables, or figures
  • Essentially a quick and informal meta-analysis

How many studies is enough?

  • Quality/relevance of the studies is more important than quantity
  • Not necessary to do an exhaustive literature search
  • Even a single study can be enough if it’s representative

Example: impact of transgenic Bt crops on abundance of soil organisms

  • We want to measure this well-studied effect in a new system or context
  • We can use previous studies on the same or similar response to get an idea of the range of effect sizes we can expect
  • Design our study to detect an effect on the low end of that range

Image (c) NCSU Entomology

Effect size for Bt impact

  • Difference of two means: mean soil organism abundance under isogenic line and transgenic line
  • Cohen’s d is standardized mean difference
    • Difference between the means divided by their pooled or “average” standard deviation
    • Tells us how far apart the means are in standard deviation units

Examples from the literature

  • Find published estimates of soil organismal abundance compared between transgenic Bt line and isogenic line
  • I selected these haphazardly from a 2022 meta-analysis of the topic
  • These are edited tables with only a small subset of the data

Frouz et al. 2008, Table 1

Frouz et al. 2008, data from Table 1

Cerevkova & Cagan 2015, Table 2

Cerevkova & Cagan 2015, data from Table 2

Arias-Martins et al. 2016, Table 4

Arias-Martins et al. 2016, data from Table 4

Getting from published results to effect sizes

  • Assume that the differences these researchers documented are the “true” differences in those systems
  • Also assume the Bt effect size in our system will be in the same neighborhood as the previous studies

Issues with extracting effect sizes from publications

  • Many if not most papers present more than one comparison
  • Not all papers present the means and standard deviations that we need
  • We can convert standard error of the mean to standard deviation: \(SD = SE\sqrt{n}\) (see the small helper after this list)
  • Again ignore any issues about distribution of the data, replication, etc.
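
A minimal helper for that conversion (the function name is mine, not part of the lesson’s worksheet):
# SD = SE * sqrt(n)
sd_from_se <- function(se, n) se * sqrt(n)

sd_from_se(se = 1.2, n = 10)  # e.g., a reported SE of 1.2 with n = 10 implies SD of about 3.8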

Cohen’s d formulas

Formula for Cohen’s d effect size:

\[d = \frac{\bar{x}_1 - \bar{x}_2}{s}\]

Formula for \(s\), the pooled standard deviation:

\[s = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2} {n_1 + n_2 - 2}}\]

  • Essentially a weighted average of the two standard deviations, with a correction for small sample size

R function for Cohen’s d

# Cohen's d from the two group means, standard deviations, and sample sizes
calc_Cohen_d <- function(xbar1, xbar2, s1, s2, n1, n2) {
  s <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))  # pooled SD
  (xbar1 - xbar2) / s
}

Import data transcribed from the published tables

Bt_means <- read.csv('https://usda-ree-ars.github.io/SEAStats/power_analysis_tutorial/Bt_effect_size_data.csv')

# If on the Posit Cloud server, you can instead import from the data folder:
# Bt_means <- read.csv('data/Bt_effect_size_data.csv')

Examine the first few rows of the data

head(Bt_means)

Calculate Cohen’s d for each row

Bt_Cohen_d <- with(Bt_means, calc_Cohen_d(xbar1 = Bt_mean, xbar2 = nonBt_mean, 
                                          s1 = Bt_SD, s2 = nonBt_SD, 
                                          n1 = Bt_n, n2 = nonBt_n))
  • Some effect sizes are undefined because, for a few cases, all observed values were zero (so the pooled standard deviation is zero)
  • Again we can get away with ignoring those because this is a ballpark estimate
Bt_Cohen_d <- na.omit(Bt_Cohen_d)

Examine effect size distribution

  • Take absolute value abs() of effect sizes because we are interested in detecting either direction of effect with a two-sided test
  • Histogram of effect sizes annotated with lines for median, 25% quantile, and 75% quantile
Bt_d_quantiles <- quantile(abs(Bt_Cohen_d), probs = c(.25, .5, .75))

ggplot(data.frame(d = abs(Bt_Cohen_d)), aes(x = d)) +
  geom_histogram(bins = 10) +
  geom_vline(xintercept = Bt_d_quantiles[1], color = 'forestgreen', linewidth = 1) +
  geom_vline(xintercept = Bt_d_quantiles[2], color = 'darkorange', linewidth = 2) +
  geom_vline(xintercept = Bt_d_quantiles[3], color = 'forestgreen', linewidth = 1) +
  theme_bw()

round(Bt_d_quantiles, 3)

What effect size to use?

  • There is no single correct answer; it depends what you can justify
  • We could use the median (\(d = 0.46\), a medium effect size), or a lower quantile
  • Or you could use this distribution to get an idea of the power for a predetermined design
    • For example, if your study has 80% power to detect \(d \geq 0.3\), you’d be happy
    • If your study has 80% power to detect \(d \geq 0.6\), back to the drawing board!
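
As a hedged sketch of that last idea, using base R’s power.t.test() with made-up group sizes (not an actual proposed design), we can compute the minimum detectable d at 80% power and compare it to the quantiles above:
# Minimum detectable Cohen's d at 80% power and alpha = 0.05 (sd = 1, so delta is in SD units)
power.t.test(n = 40, power = 0.8, sig.level = 0.05)$delta  # hypothetical 40 animals per group
power.t.test(n = 15, power = 0.8, sig.level = 0.05)$delta  # hypothetical 15 animals per group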

Effect sizes based on external criteria

  • You are a domain-knowledge expert so you can justify effect sizes without any prior data
  • What effect size is biologically/practically/economically relevant?
  • You can consult with stakeholders to determine this
    • Example: a stakeholder is interested in a treatment that will increase yield by 2 bushels/acre, or decrease mortality by 25%
  • The problem is often coming up with a way to convert the stakeholder’s wishes to a standardized metric
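
A minimal sketch of that conversion, with entirely hypothetical numbers: divide the stakeholder’s raw difference by an estimate of the standard deviation of the response, taken from prior data or the literature.
raw_difference <- 2  # stakeholder's target: 2 bushels/acre yield increase
sd_yield <- 5        # assumed SD of yield among plots (hypothetical; would come from prior data)

d_target <- raw_difference / sd_yield  # standardized effect size implied by the target
d_target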

Effect size based on no prior information

  • I often hear “Nothing like this has ever been done before, so I have no idea what the effect size should be”
  • Usually this indicates a failure to think creatively
  • You don’t need an identical previous study: anything with a similar response variable, or a different variable measured in a similar system, will do

Conventional effect size benchmarks

  • If you truly have no reasonable way of estimating effect size, we still have to assume one to calculate power
  • Most standardized effect size metrics have conventional benchmarks or thresholds
| d | f | Effect size |
| --- | --- | --- |
| 0.20 | 0.10 | Small |
| 0.50 | 0.25 | Moderate |
| 0.80 | 0.40 | Large |

Happy medium

  • Common practice: target medium effect size
  • Designing a study to detect a small or weak effect requires a huge sample size
  • Designing a study that can only detect a large effect is usually not worthwhile either, because it will likely miss smaller but still meaningful effects
  • I do not recommend blindly using the medium effect size all the time
  • The benchmarks are very general across fields and may not be appropriate for your study system
  • No substitute for finding relevant data or consulting experts/stakeholders

Power analysis using formulas

Plug-and-chug power analysis

  • Certain simple experimental designs have analytical equations for statistical power, with these variables:
    • Sample size
    • Desired false positive rate (alpha, often set to 0.05)
    • Effect size
    • Statistical power
  • Plug all but one into our formula and solve for the unknown

  • We can input alpha, the effect size we want to be able to detect, and the sample size we can realistically get, and estimate statistical power
  • We can input alpha, the effect size we want to be able to detect, and the target power, and estimate the minimum sample size needed to detect that effect size with the desired level of power
  • We can input alpha, power, and sample size, and estimate the smallest effect we can detect given that study design at the desired power level
    • minimum detectable effect size (MDES)

Simple power calculations with WebPower

  • WebPower package has many functions for different study designs
    • See help(package = 'WebPower')
  • Here we will use wp.anova() to calculate power for a one-way ANOVA design

Calculations with wp.anova()

In all cases assume k = 4 groups being compared

  • Calculate required sample size for 80% power, given effect size
  • Calculate minimum detectable effect size for 80% power, given sample size
  • Calculate power, given sample size and effect size

Sample size given power and effect size

  • k = 4: number of groups
  • f = 0.25: moderate Cohen’s f effect size
  • alpha = 0.05: typical significance level
  • power = 0.8: typical power to target
  • type = 'overall': power for the overall F-statistic in the ANOVA
    • Other type arguments are possible, for example the power to detect a difference in two specific group means
wp.anova(k = 4, f = 0.25, alpha = 0.05, power = 0.8, type = 'overall')

Effect size given power and sample size

  • What if we can only afford 60 total samples (15 per group)?
  • Do not specify f this time, but specify n = 60
wp.anova(k = 4, n = 60, alpha = 0.05, power = 0.8, type = 'overall')

Power given effect size and sample size

  • Our preliminary analysis tells us we can expect effects around f = 0.4
  • We can afford n = 60 samples
  • Specify all arguments except power
wp.anova(k = 4, n = 60, f = 0.4, alpha = 0.05, type = 'overall')

Running WebPower iteratively to generate a power curve

  • Graphically show how power or MDES changes over a range of assumptions
  • Define a number of sample sizes ns
  • Create all possible combinations of sample size and a range of effect sizes
ns <- c(20, 40, 60, 80, 100, 120, 140, 160, 180, 200)
n_f_grid <- expand.grid(n = ns, f = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6))

Power curve, continued

  • Call wp.anova() passing the vector of many n and f values to the function
  • Extract the $power column from the output of wp.anova() and add it to our original data frame
n_f_power <- wp.anova(k = 4, n = n_f_grid$n, f = n_f_grid$f, alpha = 0.05, type = 'overall')

n_f_grid$power <- n_f_power$power
  • Plot the results, with a dashed line drawn at 80% power
ggplot(n_f_grid, aes(x = n, y = power)) +
  geom_line(aes(group = f, color = factor(f)), linewidth = 0.8) +
  geom_point(aes(fill = factor(f)), size = 3, shape = 21) +
  geom_hline(yintercept = 0.8, linetype = 'dashed', linewidth = 1) +
  scale_color_brewer(name = "Cohen's f", palette = 'Reds', aesthetics = c('color', 'fill')) +
  scale_x_continuous(breaks = ns) +
  theme_bw()

Minimum detectable effect size curve

  • Use the same sample sizes as before
  • Target 80% power in all cases
  • Plot with a dashed line at \(f = 0.25\) (moderate effect size)
anova_MDES <- expand.grid(n = ns)
anova_MDES$MDES <- map_dbl(anova_MDES$n, ~ wp.anova(k = 4, n = ., alpha = 0.05, power = 0.8, type = 'overall')$f)

ggplot(anova_MDES, aes(x = n, y = MDES)) +
  geom_line(linewidth = 0.8) + geom_point(size = 3) + 
  geom_hline(yintercept = 0.25, linetype = 'dashed', linewidth = 1) +
  scale_x_continuous(breaks = ns) +
  theme_bw()

Power analysis using simulation

Why do we need to do power analysis by simulation?

  • Plug-and-chug power analysis is the exception, not the rule
  • Most experimental and observational study designs in ag science require mixed models to analyze
  • There is not a simple equation for power
  • Simulation to the rescue!

Basic power simulation workflow

  1. Generate a dataset using a given set of assumptions (effect size, sample size, variance components, etc.)
  2. Fit a statistical model to the dataset and calculate a p-value for your hypothesis of interest
  3. Repeat n times
  4. Your power is the proportion of times, out of n, that you observed a significant p-value

A simple simulation example: t-test

  • Assume the following:
    • Treatment mean = 12
    • Control mean = 9
    • Standard deviation of both groups = 3
    • Sample size of both groups = 30
  • t-test always assumes that individuals within each group have values normally distributed around a group mean

Simulate a single dataset

  • Define means, standard deviations, and sample sizes
  • Set random seed so we all get the same results
mu_t <- 12
mu_c <- 9
sd_t <- 3
sd_c <- 3
n_t <- 30
n_c <- 30

set.seed(1)
  • Draw random numbers from normal distributions with rnorm()
  • t.test() with all default options to compare means of the two random samples
x_t <- rnorm(n = n_t, mean = mu_t, sd = sd_t)
x_c <- rnorm(n = n_c, mean = mu_c, sd = sd_c)

t.test(x_t, x_c)

It’s significant … so what?

  • Result of a single simulation cannot give us power
  • Power is the probability of detecting an effect if it’s real
  • We cannot estimate a probability from a single yes/no outcome
  • We need to simulate many datasets and evaluate the significance for each one

Simulate 10000 datasets with a for loop

  • Initialize p_vals, a vector of zeros
  • for loop executes the same code 10000 times, increasing the value of i by 1 each time
    • Each time through the loop we draw rnorm()
    • We store the p-value in the ith element of p_vals
  • Proportion of the time that p < 0.05 = power!
n_sim <- 10000
p_vals <- rep(0, n_sim)

for (i in 1:n_sim) {
  x_t <- rnorm(n_t, mu_t, sd_t)
  x_c <- rnorm(n_c, mu_c, sd_c)
  p_vals[i] <- t.test(x_t, x_c)$p.value
}

mean(p_vals < 0.05)

Check our work

  • Compare simulation results with wp.t()
  • We need to supply effect size d
    • Means are 12 and 9, standard deviations are both 3, \(d = (12-9)/3 = 1\)
wp.t(n1 = 30, n2 = 30, d = 1, alpha = 0.05)
  • But we can’t always check our work like this

Power by simulation for mixed models

  • Mixed models: linear models that include random effects to account for grouping in the data
  • Also called multilevel models because they have more than one level of variance
    • Example: Individuals nested within blocks nested within environments

Mixed models have more unknowns

  • In addition to effect size and sample size, you also need variance components for each level
  • Statistical power depends on the variance at all levels
  • Split-split-plot or repeated measures in time may have many variance components
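
Before we turn to the simr package, here is a minimal by-hand sketch of a single simulation iteration for a randomized complete block design, using the lme4 and emmeans packages loaded earlier. All of the values (treatment means, block SD, residual SD) are made up; the point is to show where each variance component enters the simulated data.
set.seed(222)

# Hypothetical assumptions: 4 treatments, 6 blocks, block SD = 2, residual SD = 4
trt_means <- c(20, 22, 24, 25)
n_blocks <- 6
sd_block <- 2
sd_resid <- 4

# Simulate one RCBD dataset: treatment mean + block random effect + residual noise
sim_data <- expand.grid(trt = factor(1:4), block = factor(1:n_blocks))
block_effects <- rnorm(n_blocks, mean = 0, sd = sd_block)
sim_data$y <- trt_means[sim_data$trt] + block_effects[sim_data$block] +
  rnorm(nrow(sim_data), mean = 0, sd = sd_resid)

# Fit the mixed model and extract the p-value for the treatment F-test.
# Repeating this many times and taking the proportion of p < 0.05 gives the power.
sim_fit <- lmer(y ~ trt + (1 | block), data = sim_data)
joint_tests(sim_fit)$p.value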

Example 1. Number of blocks for a randomized complete block design

  • We are planning a multi-environment wheat yield trial
  • Randomized complete block design in each environment, but how many replicates should we include?
  • Imagine we have data from a previous yield trial

Example dataset

  • linder.wheat from agridat package: wheat yield trial in Switzerland
    • 9 genotypes
    • Replicated in 7 environments
    • Each environment has RCBD with 4 replicates

Import dataset and fit mixed model

  • Fixed effects: genotype gen, environment env, G by E interaction gen:env
  • Random effect: intercept for block nested within environment (1 | block:env)
data(linder.wheat)

linder_lmm <- lmer(yield ~ gen + env + gen:env + (1 | block:env), 
                   data = linder.wheat)

Look at model parameter estimates

  • joint_tests(): ANOVA table with F-tests
  • VarCorr(): variances for block and for individual plots (residuals)
joint_tests(linder_lmm)
VarCorr(linder_lmm)

Power simulation with simr

  • simr package lets us simulate new datasets using a model object we fit with lmer()
  • Here we will simulate random effects for new datasets with different numbers of blocks (reps)
  • Repeat many times for each number of blocks
  • Do a statistical test of choice on each simulated dataset
  • Calculate power as % of times the test rejects the null hypothesis

Power sim code: extend()

  • Estimate power for every number of blocks from 3 to 8
  • We will use the F-test for genotype from the ANOVA
  • extend() creates a new model object to simulate from
    • along = 'block' means add new levels to block (n = 8)
    • Extended object doesn’t have new data in it yet but it will be the basis of the simulation
linder_lmm_extended <- extend(linder_lmm, along = 'block', n = 8)

Power sim code: powerCurve()

  • along = 'block' to simulate datasets with different numbers of blocks, up to the extended value 8
  • test specifies we’re testing the fixed effect of genotype using the F-test
  • nsim = 10 means simulate 10 datasets for each number of blocks
linder_pcurve <- powerCurve(linder_lmm_extended, 
                            along = 'block', 
                            test = fixed('gen', method = 'f'), 
                            nsim = 10, 
                            seed = 3)

Look at simulation results

  • 95% confidence intervals reflect the uncertainty due to the finite number of simulations
linder_pcurve

plot(linder_pcurve)

Power curve with more simulations

  • I repeated this simulation with 1000 iterations
  • ~30 minutes (could be sped up by parallelizing)

Example 2. Binomial GLMM

  • We’re testing cattle for a disease: how many herds to sample and how many cattle per herd?
  • Previous data: cbpp dataset from lme4 package
  • Incidence of CBPP, a respiratory disease of zebu cattle in Ethiopia
  • 15 herds with variable numbers of cattle sampled (2 to 34)
  • For each herd, number of cattle with and without CBPP were counted at four time periods


Zebu, photo by Scott Bauer, USDA-ARS

Fit preliminary model to CBPP data

  • First convert data to long form (each row is an individual animal with disease = 0 if uninfected and disease = 1 if infected)
  • Binomial GLMM fit to the data
    • Binary response variable
    • Fixed effect for time period
    • Random intercept for herd (mean probabilities for each herd differ)
    • No random slope term for herd (effect of time period is modeled as being the same for each herd)
data(cbpp)
cbpp <- cbpp %>%
  group_by(herd, period) %>%
  reframe(animal = 1:n(),
          disease = rep(c(0, 1), c(size - incidence, incidence)))

cbpp_glmm <- glmer(disease ~ period + (1 | herd), data = cbpp, family = binomial)

extend() in two different ways

  • Extend number of herds to 20 using along = 'herd'
  • Extend number of cattle within each herd at each time period to 50 using within = 'period+herd'
cbpp_glmm_n_herds <- extend(cbpp_glmm, along = 'herd', n = 20)
cbpp_glmm_n_within_herd <- extend(cbpp_glmm, within = 'period+herd', n = 50)

Power simulation code

  • The second simulation specifies breaks, the specific numbers of cattle per herd to test; otherwise a default sequence is used
  • We use Chi-squared test for the fixed effect of period (appropriate for a binomial model)
cbpp_pcurve_n_herds <- powerCurve(cbpp_glmm_n_herds, 
                                  along = 'herd', 
                                  test = fixed('period', method = 'chisq'), 
                                  nsim = 10, 
                                  seed = 4)

cbpp_pcurve_n_within_herd <- powerCurve(cbpp_glmm_n_within_herd, 
                                        within = 'period+herd', 
                                        test = fixed('period', method = 'chisq'), 
                                        nsim = 10, 
                                        seed = 5, 
                                        breaks = c(3, 5, 10, 15, 25, 50))

Plot the power curves

plot(cbpp_pcurve_n_herds)

plot(cbpp_pcurve_n_within_herd)

Power curves with 1000 iterations

Concluding remarks

What did we learn today about power analysis?

  • Power analysis tells you how much you can trust the answers you get to the questions you care about
  • Determining effect size(s) is the most difficult but most important part of power analysis
  • Power analysis often requires you to simulate data based on assumptions
  • The more effort you put into a power analysis, and the more your assumptions are informed by real data, the better the analysis will be

What practical skills did we learn today?

  • Estimating effect sizes from existing datasets
  • Calculating statistical power, minimum detectable effect size, or sample size for simple study designs using formulas
  • Simulating data from a model to estimate statistical power for common mixed-model designs

Final recommendations

  • Design your study around the weakest effect you plausibly expect (the low end of your estimated range)
  • Think about justifying your design decisions to reviewers, readers, and yourself
  • Ask yourself “How confident will I be in the results of this study?”
  • Please don’t think of power analysis as a box to check, but as a critical part of the scientific method

No power = no new knowledge = no science

  • See text version of lesson for further reading and exercises for additional practice
  • Fill out MS Forms survey
  • Contact me at quentin.read@usda.gov!