Introduction and setup

What is this course?

After each stats workshop I’ve taught, I’ve included a question on the post-class survey asking what other topics people would like me to develop a lesson on. The single most requested topic has been power analysis. I agree that this is a crucially important topic. But as important as it is, it is often overlooked or ignored in applied scientists’ statistical training. Those who do know something about it tend to be folks who work on animals and humans. That’s because oversight committees like IACUC often require a formal power analysis that is subject to approval along with the rest of the research plan. But unfortunately, this is often seen as a pesky annoyance to take care of on the way to doing real science, a bureaucratic box to check.

My goal in this lesson is not only to teach you the practical skills needed to do a power analysis using R software. It’s also, hopefully, to convince you that power analysis is not just a hurdle to jump to get your study funded. It is a necessary part of the scientific method. Because science is the best method we humans have come up with so far to learn about the natural world, it is critically important to make sure our scientific studies are properly designed. We can all agree that we want scientific studies to have the highest possible chance of being successful and teaching us something about the world, with the least possible unnecessary waste of time, money, and resources. Doing power analysis before starting a study, whether it’s observational or experimental, is a step toward that goal.

When done properly, power analysis:

  • Helps you design your study, foreseeing any issues that might arise
  • Reduces the chance of doing weak and inconclusive studies that are doomed before they even start
  • Elevates your science, and your science makes the world a better place

If that doesn’t motivate you to learn about power analysis, I don’t know what will!

You may want to refer to my more “big picture” talk on power analysis (video recording on Axon+, PDF of slides).

What will you learn from this course?

Conceptual learning objectives

At the end of this course, you will understand …

  • What statistical power is and why it’s important
  • Why we have to trade off between false positives and false negatives
  • Why power depends on sample size
  • What effect size is and why statistical power depends on it
  • That doing a power analysis after you’ve collected the data is pointless
  • That the more work you put into a power analysis, the higher its quality and realism will be

Practical skills

At the end of this course, you will be able to …

  • Extract effect sizes from published papers or previous studies
  • Calculate exact statistical power for simple study designs
  • Estimate statistical power using simulation for study designs that require mixed models

Power: the basics

What is statistical power?

To be able to understand the results of a power analysis, it’s important that you know what exactly statistical power is, and why we care about it. Here, we’re going to assume we are working in the classical statistical framework: our goal is to use evidence from the data to test a null hypothesis. If we have enough evidence to exceed a pre-defined level of statistical significance, we may reject the null hypothesis; otherwise, we can’t reject it. We are also going to assume there is a “true” value out there in the world somewhere, that we are trying to estimate using statistics. We can never know for sure what that true value is, unless we measure every single individual in the population of interest (and in some cases the population is hypothetical so that wouldn’t even be possible with unlimited resources). So the best we can do is maximize our chance of estimating that value correctly. Sometimes we get it right, and sometimes we get it wrong. That might be because of natural variation, measurement error, unknown biases in our sample, or any of the above . . . the point is that we are either right or wrong, but we are never sure which.

So, we’re considering the black-and-white world where we can either be right or wrong (later in this lesson we will briefly touch on the Bayesian perspective, which treats being right or wrong as more of a continuum). Let’s simplify further and say there is a binary (yes or no) outcome we are trying to estimate. A pregnancy test is a good example: you are either pregnant or you’re not, and the test also returns a binary answer. If you are truly not pregnant and the pregnancy test comes out negative, that’s a “true negative.” If it comes out positive, that’s a “false positive.” If you are truly pregnant and the test is positive, that’s a “true positive.” If negative, it’s a “false negative.” So, true negative and true positive are both correct answers, and false positive and false negative are both errors. You may hear false positive called “Type I error” and false negative called “Type II error.” It’s kind of hard to keep those numbers straight so we will refer to them as false positive and false negative.

The definition of statistical power follows directly from this idea of false positives and false negatives. The power of a statistical test to detect a given phenomenon is the probability that the test will detect the phenomenon if the phenomenon is real. In other words, it is the probability of observing a true positive. (You can only observe a true positive if the null hypothesis is false.) This is, in a way, the “opposite” of a p-value. The p-value is essentially the probability of a false positive: the probability that we would observe data at least as extreme as ours if the null hypothesis were true.

We can state it this way:

If a certain effect exists, power is the chance that we actually find the effect in our study.

Obviously, we want to be right as often as possible and wrong as little as possible. It would be great to completely eliminate all false positives and all false negatives. Unfortunately, that is impossible. The only way to be 100% certain that you will never get a false positive is to have a test that gives negative results 100% of the time. A pregnancy test that always says “You are not pregnant” will never result in a false positive! But it is also completely useless. The same goes for eliminating false negatives: the only way to do that with perfect certainty is to give 100% positive results. Obviously, we need some kind of compromise. We need to find the “sweet spot” in the tradeoff between false positives and false negatives.

The traditional paradigm of Western science is to be conservative when rejecting null hypotheses. We want to look like serious scientists who need a lot of evidence before we are persuaded to accept any new phenomenon. Because of this, we traditionally strike the balance so that we have a low probability of false positives, but a somewhat higher probability of false negatives. You all know the traditional cutoff of \(p < 0.05\), based on the significance level \(\alpha\). If a statistical test has \(\alpha = 0.05\), it has a 5% probability of a false positive when the null hypothesis is in fact true. The 0.05 cutoff, for good or bad, is very widely accepted. But what’s the probability of false negatives? Well, we usually set it at 20%. That’s a 4:1 ratio of the probability of a false negative (when the effect is real) to the probability of a false positive (when it is not). Just like there’s nothing magical about the 0.05 cutoff, there is also nothing special about the 4:1 ratio. We could set it at 1:1, or 1:4, or whatever we want. But we usually target 5% for false positives and 20% for false negatives. We usually express power in terms of true positives: 100% - 20% = 80%. So, 80% power is our “magic number!” (Recall that if the null hypothesis is false, we can either get a true positive or a false negative. Given that the null is false, the probabilities of true positives and false negatives must sum to 100%.)

The above figure illustrates the tradeoff between false positives and false negatives (based on a t-test design with n = 50 per group and a small effect size). Power is 1 minus the false negative rate. You can see that if we reduce our false positive rate to near zero, the false negative rate approaches 100%. In other words, if we are too conservative and refuse to declare even a very small p-value to be significant, we will have almost no power to detect a true effect if it does exist. On the other hand, if we want our power to be high, we have to accept a higher proportion of false positives.
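
If you want to reproduce a curve like this yourself, here is a minimal sketch using base R’s power.t.test(). The exact inputs behind the figure aren’t shown here, so I’m assuming n = 50 per group and a smallish standardized effect of 0.3 SD; the shape of the tradeoff is what matters.

alphas <- seq(0.001, 0.3, by = 0.001)

# False negative rate = 1 - power for a two-sample t-test, at an assumed
# effect of 0.3 SD and n = 50 per group, across a range of alpha values
false_neg <- sapply(alphas, function(a) {
  1 - power.t.test(n = 50, delta = 0.3, sd = 1, sig.level = a)$power
})

plot(alphas, false_neg, type = 'l',
     xlab = 'False positive rate (alpha)', ylab = 'False negative rate')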

Why do we care about power?

If that wasn’t enough to convince you how important statistical power is, let’s think about what happens when we do a study that is underpowered: a study that has only a low probability of detecting a significant effect, if that effect truly exists in the world. Let’s imagine we do such a study and we get a negative result; we do not declare statistical significance and we can’t reject the null hypothesis. Because the power is low, we don’t know whether it is a true negative or a false negative. In other words, we don’t know if the effect really is zero or nearly zero, or if we only failed to detect it because of natural variation/measurement error/biased sampling/etc. Obviously that is not good. We didn’t add anything to humanity’s knowledge. We might as well have not done the study in the first place and saved the time, money, resources, pain, and suffering.

But what if we do an underpowered study and we do get a positive (statistically significant) result? Well, some people would say, then the study was powerful enough. We only need to care about power if we don’t get a significant result. Unfortunately, they’re wrong. Why? Because at small sample sizes when power is low, natural variation and measurement error can cause us to overestimate the effect size just as easily as underestimate it. So if the study is underpowered and we see a significant result, we also don’t know if it’s a true positive or a false positive. As power gets lower, even though the false positive rate is fixed (at, let’s say, 0.05), the proportion of false positives among all the positive results goes up. So if we see a single positive result, we have less chance of knowing whether it’s true if the study has low power. This is a big problem because of publication bias: people are more likely to publish results from underpowered studies if they observe a positive result by chance. The lesson here is that we shouldn’t trust single reports from low-powered studies, whether positive or negative.

But as a counter to everything I just said, low power can sometimes be OK. What if your experimental unit costs $1,000,000 each? What about $1,000,000,000? What if you are studying a global phenomenon? After all, there’s only one Earth, so we cannot sample from a population of Earths. It is not always wrong to do an underpowered study. It may be the only way we can learn about a particular system. But you have to be especially precise with your measurements and especially careful to get a representative sample, because the “costs” of measurement error and sampling bias are that much greater. It also means you have to be very careful about the conclusions you draw, and very explicit about the limitations of your inference.

[Image: What you think power analysis is]
[Image: What power analysis actually is]

Power analysis is iterative

The common conception of power analysis is that it is a simple linear process. You start with a research question, do a power analysis, and out pops the number of samples/participants/replicates you need. In fact, power analysis should be an iterative, cyclical process. You might have to go back and modify your research question based on the results of a power analysis. You might need to change the sample size based on what is feasible. You might need to reconsider what size of effect you need to detect. And so on. Credit to Jessica Logan for the images above.

Power analysis is a ballpark figure

If you knew exactly how large the effect is in your study system, you would know exactly how many replicates you would need to detect that effect with adequate statistical power. But if you knew exactly how large the effect is in your study system, you wouldn’t need to do the study in the first place! This means that power calculations are at best rough estimates.

Consider the possible outcomes:

  • If you collect too many samples, excess resources are wasted and some study subjects may be harmed unnecessarily, because you could have gotten the same knowledge with fewer samples. This is not ideal but at least we learned something.
  • If you collect the “just right” number of samples, you have used exactly the resources necessary to obtain the knowledge. That’s great, but as we’ve seen there is no way to know that number in advance!
  • If you collect too few samples, the study is inconclusive. We don’t gain any knowledge. All the resources spent on it are wasted, and all harm done to study subjects is in vain.

Because the outcome when you collect too few samples is worse than the outcome when you collect too many samples, you should proceed conservatively with power analysis. Erring on the side of collecting too much data is better than too little.

There’s no free lunch in power analysis

The amount of work you put into a power analysis determines the quality of what you get out of it. As we will see, you can do a decent, credible-looking power analysis simply by making up plausible numbers and “selling” them. But the more background research you do, the higher the quality of the power analysis . . . and the higher the quality of the resulting science. This may be hard work!

When a scientist approaches me, the statistician, and says “Do a power analysis for me” without giving any context, I can give them numbers, but they may not be good quality numbers. There is no guarantee that the power numbers I give them will represent the true power of the study. But if the scientist provides information like the magnitude of difference they want to detect, pilot data from their own lab, or relevant studies from the literature, I can give them a much better power analysis that truly reflects their specific question in the context of their specific study system.

What do we need to know to do a power analysis?

Now that you are (hopefully) convinced that power analysis is a good idea, let’s learn how to do it. What numbers do we need to know to do a power analysis? It depends on what our knowns and unknowns are. What are the key quantities that we may or may not know as we begin to design a study?

  • Significance level: In most classical statistical analyses done in ag science, we assume \(\alpha = 0.05\). But the \(\alpha\) level may have to be adjusted if we are doing multiple comparisons. Doing that adjustment is something that’s often (wrongly) neglected in power analysis.
  • Power: Sometimes we do a power analysis assuming a desired level of statistical power, often something in the neighborhood of 80%. Or we may calculate the statistical power to detect a given effect at a given sample size.
  • Sample size: We may know exactly how many samples or replicates we can afford to collect data on, based on the limitations of our experimental setup or study system. In that case, we can calculate one or more of the other quantities, taking sample size as known. Or alternatively, we may calculate the minimum sample size required to get a certain level of statistical power. Usually, we have a range of feasible sample sizes, not just a single possible number.
  • Effect size: The size of the effect that is practically meaningful and that you want to be able to distinguish statistically. As with sample size, usually this is not a single number but a range. This is probably the most important of all the quantities, and the one that takes the most effort to pin down.

We’ve already gone into significance level and power, which are related to the false positive and false negative rates, respectively. Let’s talk a little more about sample size and effect size.

Sample size: the more the better, up to a point

When doing statistical inference, we take a sample, calculate some statistics on it, and use those statistics to estimate the parameters (true values) of the population we took the sample from. The laws of statistics tell us that the more samples we take, the closer those statistical estimates get to the true population parameters. If the accuracy of our estimates always kept increasing at the same steady rate, we would never need to do a power analysis. Just collect as much data as you can afford to, and the more data, the more accurate the estimate! But in reality, the rate at which our estimates get more accurate starts to slow down as we collect more and more data. Eventually, we asymptotically approach the true population parameter. That applies to statistical power as well. At some point, the power gets so close to 100%, or alternatively the false negative rate approaches so close to 0%, that we can’t get any more benefit by adding more data.

The above figure illustrates the concept. Again I am showing the tradeoff curve for the same t-test design as before. Now we not only show the curve for n = 50 per group, but also n = 100, 150, and 200. As the sample size increases, the curve changes less and less with each additional 50 samples collected. For a given false positive rate, once the false negative rate is near zero, adding more data can’t make it go down any more. So, as we collect more data, we get diminishing returns in how much each new data point contributes to the accuracy of our estimates. This is good because it means we can figure out where along the curve we want to (or can afford to) position ourselves to get the most accurate estimates for the least investment of time and resources. All the power analysis we’ll do today depends on that basic insight.
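
Here is a rough sketch of the same idea using base R’s power.t.test(), assuming a moderate effect of 0.5 SD (the figure’s exact inputs may differ): power climbs quickly at first and then flattens out as n per group grows.

ns_sketch <- seq(10, 200, by = 10)

# Power for a two-sample t-test with an assumed effect of 0.5 SD,
# at increasing sample sizes per group
power_by_n <- sapply(ns_sketch, function(n) {
  power.t.test(n = n, delta = 0.5, sd = 1, sig.level = 0.05)$power
})

plot(ns_sketch, power_by_n, type = 'b', xlab = 'n per group', ylab = 'Power')
abline(h = 0.8, lty = 2)  # the conventional 80% target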

Effect size: what is it and why is it so important?

Effect size is a measure of the magnitude of some effect or relationship between variables. For example, it could be the difference in the mean biomass of the control group and the treatment group. It could be the slope of the relationship between dose of a drug and some physiological response. It could be the ratio of the odds of mortality between two different treatments. We can’t talk about statistical power without knowing something about, or making some assumptions about, effect size. Statistical power is the probability of declaring an effect to be statistically significant, if the true effect is of a specific size. Thus, a particular study design has different power, depending on how big the effect is. The stronger the effect, the easier it is to see that effect above the random noise and variation of the system. That means that statistical power increases as the size of the effect increases.

The above figure shows yet again the power for a t-test design. This time we have n = 50 per group in all three cases, but we have a separate curve for small, medium, and large effect size. You can see that if the effect we’re trying to detect is large, we have it easy. We can set our false positive rate to a very low value, even lower than 0.05 if we want, and we will still have a very low rate of false negatives. Low rate of false negatives = high power. But if the effect we’re trying to detect is small, we will have a very high rate of false negatives if the false positive rate is set to 0.05. In other words, most of the time that the effect is really present in the population, we will not find a statistically significant effect. The only way we can get around this, if the effect truly is small, is to collect more data.
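
To put rough numbers on that, here is a quick sketch with power.t.test(), holding n = 50 per group and plugging in Cohen’s benchmark values of d for small, medium, and large effects (0.2, 0.5, 0.8); the figure itself may be based on somewhat different values.

# Power at n = 50 per group for three assumed effect sizes
sapply(c(small = 0.2, medium = 0.5, large = 0.8), function(d) {
  power.t.test(n = 50, delta = d, sd = 1, sig.level = 0.05)$power
})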

Standardized effect sizes

Effect sizes are almost always presented in standardized units. This is because different studies may measure response variables in different units. For example, one study might measure biomass in kilograms and another study might measure height in centimeters. If we convert those response variables to standard deviation units, we can compare results that are otherwise on different scales. If a treatment increases biomass mean by two standard deviations relative to the control, and the same treatment increases height mean by only one standard deviation relative to the control, we can say that the effect size for biomass is twice as big as the one for height.

What are some commonly used effect sizes?

There are many different effect sizes published in the literature. Choosing which one is appropriate for your study depends on what kind of data you have. Here are some effect sizes you’ll probably want to use in typical situations. There are many more but this is a good starting point.

Effect | Effect size metric | Description | Example
Difference of the means of two groups | Cohen’s d | The difference between the two means divided by their combined standard deviation. How many standard deviations greater is one mean than the other? | The effect of a management intervention on biomass yield
Difference in a binary outcome between two groups | Odds ratio | The ratio of the odds of the outcome happening in one group to the odds of it happening in the other group | The effect of a seed treatment on seed germination rate
Strength of relationship between two continuous variables | Pearson correlation coefficient r | You should know this one! | The effect of temperature on soil respiration
Variance explained by treatment group when comparing three or more groups | Cohen’s f | (Square root of) the ratio of the variance explained by group to the residual variance not explained by group | The effect of multiple genotypes and environments on a trait

Some of the effect sizes are based on the magnitude of the difference between means, and some are based on the proportion of variance explained by the statistical model.
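
As a concrete illustration of the odds ratio row, here is a tiny sketch with made-up germination counts (the numbers are purely hypothetical):

# Hypothetical counts: 80 of 100 treated seeds germinate, 60 of 100 untreated
odds_treated   <- 80 / (100 - 80)   # odds of germinating in the treated group
odds_untreated <- 60 / (100 - 60)   # odds of germinating in the untreated group

odds_treated / odds_untreated       # odds ratio: (80/20) / (60/40) = 2.67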

What software should I use for power analysis?

There are a lot of software packages out there to help you do a power analysis. If you use R, the packages I recommend for power analysis are:

  • WebPower: formulas for simple power analyses
  • pwr: formulas for simple power analyses
  • SuperPower: simulation-based power analyses
  • simr: simulation-based power analyses
  • pwr4exp: brand-new package for mixed-model power analyses

WebPower is not only an R package but also an online calculator, and there are quite a few others out there. We’ll use WebPower and simr in this lesson, but I recommend checking the other ones out as well. I became aware of pwr4exp just after finishing these lesson materials, but I really like it and I will incorporate it into future versions of this lesson.

In SAS, you can use proc power and proc glmpower, which are very straightforward. There are also some clever tricks to estimate power for mixed model designs using proc glimmix. My ARS statistician colleagues are working on some training materials for SAS power analysis with glimmix. I will link to them here when they become available.

A note on post hoc power analysis (and why it is completely pointless)

Before we get into the nitty gritty of actually doing power analysis, I want to make one final point because it addresses a question I often field from scientists.

Reviewers in applied science journals often ask scientists to do a power analysis on a study after the fact, especially if no significant difference was observed. This makes no sense whatsoever. Why? Well, if you do a power analysis assuming the observed difference is the true difference, then whenever you failed to detect that difference, the calculated power will necessarily come out low, and whenever you did detect it, the power will come out adequate. The post hoc power tells you nothing you didn’t already know from the significance test itself. And if the sample size is low, we don’t know whether the observed difference is close to the truth anyway. If the study was underpowered to detect the effect of interest, you could have known that before you started the study: you already knew the chance that your negative result would be a false negative, given that particular effect size. That’s it and that’s all.

We can also show this from a mathematical standpoint. Post hoc power calculations are completely determined by the p-value. See this blog post by Daniel Lakens with a simulation demonstrating that relationship. That means if we have the p-value we learn nothing additional from the power analysis. So next time you are asked to do a retrospective power analysis on a study you already did, tell ’em no, and tell ’em I sent you!
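
If you want to see this for yourself, here is a rough sketch of that kind of simulation (my own quick version, not Lakens’ code): simulate many two-group studies with a fixed true effect, and for each one compute both the p-value and the “observed” power based on the estimated effect. The two track each other almost perfectly.

set.seed(1)

n_sim <- 1000
p_vals <- obs_power <- rep(NA, n_sim)

for (i in 1:n_sim) {
  x <- rnorm(20, mean = 0.5, sd = 1)   # treatment group, true d = 0.5
  y <- rnorm(20, mean = 0, sd = 1)     # control group
  p_vals[i] <- t.test(x, y)$p.value
  # 'Observed' post hoc power: plug the estimated effect back into the power formula
  d_obs <- abs(mean(x) - mean(y)) / sqrt((var(x) + var(y)) / 2)
  obs_power[i] <- power.t.test(n = 20, delta = d_obs, sd = 1, sig.level = 0.05)$power
}

plot(p_vals, obs_power, xlab = 'p-value', ylab = "Post hoc 'observed' power")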

Estimating effect sizes

The first step in our hands-on demonstration of power analysis is also the first step in a real-life power analysis: estimating effect size.

How to get information about effect size

There are several potential ways to get information you can use to estimate effect sizes.

  • Data from your own lab: Calculate effect sizes from your own research group’s data, either from previous studies in a similar system, or pilot data collected in the planning stages of a study
  • The literature: Calculate effect sizes from the results in published literature or white papers, or from raw data accompanying publications
  • Prior external information: Use knowledge or judgment about what level of difference is biologically/practically/economically important (such as from stakeholders)
  • Effect size benchmarks: If all else fails, there are conventional thresholds for small, medium, and large effect sizes

Let’s go through examples of these. As we go along, keep in mind that a typical power analysis won’t result in one single number for effect size or power. You will usually want to construct power curves that cover a range of assumptions.

Effect sizes from data

A good way to get an idea of what effect size you’d like to detect in the study you’re planning is to calculate some effect sizes from data you already have. If you can argue that those datasets are at least roughly comparable to what you’ll get in your proposed study, then you can design that study around that effect size. For example you might have a dataset where your variable of interest is measured in a similar system. Or you might have a dataset where a similar variable is measured in your study system of choice. Both of those might be useful to get a rough estimate of what effect size you should target.

Let’s finally start doing some coding. Load the necessary packages for this lesson. Some of the packages have example data that we will be using, and others actually implement different aspects of power analysis.

library(agridat)
library(effectsize)
library(lme4)
library(emmeans)
library(dplyr)
library(ggplot2)
library(WebPower)
library(simr)
library(purrr)

Our first example dataset is from an experiment in beans, the dataset besag.beans from the agridat package. The actual study is more complicated, but for this example just think of it as measuring yield of a bunch of different bean genotypes.

In this example, we’re going to ignore the blocking structure of the design, and the spatial layout of the plots, to estimate the effect size. That’s actually what I might often do in practice as well. Remember, power analysis is only a ballpark estimate. If our primary research goal were to precisely estimate the mean yields of each bean genotype from this bean yield dataset, we would want to carefully construct a statistical model that reflects the experimental design as well as possible. But our goal here is just to get a crude estimate of the effect size so that we can design another experiment. We’ll likely add a “fudge factor” to the effect size estimate we get from looking at this pilot dataset. As a result, we don’t have to worry too much about building a perfect statistical model. Close is good enough.

Looking at the first few rows of data, we can see there is a factor column called gen with genotype ID, and a numeric column called yield.

data(besag.beans)

head(besag.beans)
##       gen height yield row rep col
## 1   Maris     92   395   1  R0   1
## 2   Maris     90   350   1  R1   2
## 3   Dwarf     37   230   1  R1   3
## 4  Minica     66   355   1  R1   4
## 5  Stella     98   370   1  R1   5
## 6 Topless     66   280   1  R1   6

Fit a simple linear model, and use the anova() function to get the sums of squares for the one-way ANOVA of yield by genotype.

besag_fit <- lm(yield ~ gen, data = besag.beans)

besag_anova <- anova(besag_fit)

We will manually calculate Cohen’s f statistic. It is the square root of the ratio of the treatment sum of squares to the error sum of squares. Here the “treatment” is genotype.

sqrt(besag_anova$`Sum Sq`[1] / besag_anova$`Sum Sq`[2])
## [1] 0.8466414

We don’t have to remember that derivation every time, because the function cohens_f() from the effectsize package will calculate the effect size for each parameter in a linear model. As you can see we get the same result of 0.85, with a one-sided 95% confidence interval thrown in.

cohens_f(besag_fit)
## # Effect Size for ANOVA
## 
## Parameter | Cohen's f |      95% CI
## -----------------------------------
## gen       |      0.85 | [0.67, Inf]
## 
## - One-sided CIs: upper bound fixed at [Inf].

As I mentioned above, we probably shouldn’t design the study to yield exactly 80% power for f = 0.85. That’s because we just have a single estimate of effect size. The true effect in your future study may be smaller than that one specific estimate. You have a lot of leeway in what effect size to pick. You could probably justify the lower confidence bound of f, which is 0.67. To be more conservative than that, you could design the study to detect an effect about half as large as the effect observed in the previous study (0.4 to 0.45). There is no single correct answer here.

Effect sizes from the literature

Often, you may not have any of your own pilot data from a similar system to get an estimate of effect size for a power analysis. In that case, the next best thing is to find some studies from the literature that are on a similar topic or done in a similar system. If those studies have published all their raw data (as everyone should be doing by now) you can just proceed as above and calculate the effect size from the data. But if all you have access to are their summary statistics or figures that are published in a paper, you will have to use those to estimate the effect size. By the way, if you’ve ever done a meta-analysis, you will be familiar with trying to extract effect sizes from the literature. This is nothing more than a quick and informal meta-analysis to get a rough range of effect sizes to inform a power analysis.

How many studies is enough? It’s quality over quantity. Two or three studies measuring a response similar to your target response, in a system similar to your study system, are better than many studies that aren’t so closely related. There is no need to exhaustively search the literature. I think it is fine to base your preliminary estimates of effect size even on one single study, as long as you have good justification for saying it is “typical” or representative of what you are expecting to encounter. If you are conservative about your choice of sample size, you are on solid ground.

Here’s an example of how we might go about getting effect sizes from a few studies on a similar topic. Let’s say we are interested in testing, in our particular study system, whether transgenic Bt crops have different abundance of soil organisms than isogenic lines. Even though this is a very well-studied topic, we might want to measure that response in a new context where it hasn’t been measured before. So, because there are so many previous studies of the same or similar response, in similar systems to the one we are planning to do our new study in, we can get a good idea of the range of effect sizes we can expect. We’ll design our study to detect an effect in that range.

In this example, we are looking at the difference of two means. Therefore we will use Cohen’s d as our effect size. This is the standardized mean difference, which is just the difference between the two means, divided by the pooled or “average” standard deviation of the two means. It’s roughly how far apart the two means are, in standard deviation units. The units of Cohen’s d are always the same regardless of what units the original data were collected in. We can gauge the size of an effect for each study and see how they compare.

Imagine that you combed the literature and found published estimates of soil organismal abundance compared between transgenic Bt lines and isogenic lines of the same crop. Here are a few tables where these estimates are published. (I selected these haphazardly from a much longer list that was compiled in a 2022 meta-analysis of the topic, and I’m only showing a small subset of the data in the tables.)

[Image: Frouz et al. 2008, data from Table 1]

[Image: Cerevkova & Cagan 2015, data from Table 2]

[Image: Arias-Martins et al. 2016, data from Table 4]

We’re going to make the assumption that the differences these researchers documented are close to the “true” differences. We are going to design our study with this rationale: if the true size of the Bt effect in our study system is roughly close to the observed effect size from similar systems in the past, we will design our study with enough replication to detect an effect size in that range with high (ideally >80%) power.

How do we go from these numbers to effect sizes? The first thing to notice is that some papers present more than one comparison because they measured abundance of multiple taxa on Bt and non-Bt maize. The second thing to notice is that for some of the comparisons, we have means and standard deviation. For others we have means and standard errors of the means. I’ve used the formula \(SE = \frac{SD}{\sqrt{n}}\) to convert the standard errors to standard deviations. We are ignoring any potential issues with distribution of the data, whether or not the studies are properly replicated, and that kind of thing. Because we are only mining these studies for a very rough estimate of effect size, it is not worth it to subject them to that much scrutiny.
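
For example, here is the conversion for one hypothetical table entry with a reported SE of 1.2 based on n = 4 replicates (made-up numbers, just to show the arithmetic):

se_reported <- 1.2
n_reps <- 4

se_reported * sqrt(n_reps)   # SD = SE * sqrt(n) = 2.4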

We use this formula for Cohen’s d effect size:

\[d = \frac{\bar{x}_1 - \bar{x}_2}{s}\]

Here, \(\bar{x}_1\) is the mean of one group, \(\bar{x}_2\) is the mean of the other group, and \(s\) is the pooled standard deviation, which can be calculated from each group’s standard deviations \(s_1\) and \(s_2\) and their sample sizes \(n_1\) and \(n_2\). This is a way of averaging the two standard deviations, weighted by the sample size of each group, with a correction for small sample sizes:

\[s = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2} {n_1 + n_2 - 2}}\]

Here’s a function I wrote that calculates the Cohen’s d given the two group means, standard deviations, and sizes.

calc_Cohen_d <- function(xbar1, xbar2, s1, s2, n1, n2) {
  s <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
  (xbar1 - xbar2) / s
}

I’ve saved you some of the work by transcribing the means, standard deviations, and sample sizes from these tables into a spreadsheet format. Note that you may import the data from a web URL or, if you are on the Posit Cloud server, from the data folder.

Bt_means <- read.csv('https://usda-ree-ars.github.io/SEAStats/power_analysis_tutorial/Bt_effect_size_data.csv')

After inspecting the data, use our calc_Cohen_d function to calculate Cohen’s d effect size for each row of the data frame.

Bt_Cohen_d <- with(Bt_means, calc_Cohen_d(xbar1 = Bt_mean, xbar2 = nonBt_mean, s1 = Bt_SD, s2 = nonBt_SD, n1 = Bt_n, n2 = nonBt_n))

First of all, notice that some of the effect sizes are undefined because we had a few variables where all zeros were observed for both groups. If this were our own data, we would worry about that. But again, because power analysis is at best a rough guess, we will just ignore those!

Bt_Cohen_d <- na.omit(Bt_Cohen_d)

We will take the absolute value of the effect sizes (positive means greater mean in Bt, negative means greater mean in isogenic) because we are interested in what size of an effect we can detect with a two-sided test. Make a histogram of the effect sizes, with annotations for the median, 25%, and 75% quantiles.

Bt_d_quantiles <- quantile(abs(Bt_Cohen_d), probs = c(.25, .5, .75))

ggplot(data.frame(d = abs(Bt_Cohen_d)), aes(x = d)) +
  geom_histogram(bins = 10) +
  geom_vline(xintercept = Bt_d_quantiles[1], color = 'forestgreen', linewidth = 1) +
  geom_vline(xintercept = Bt_d_quantiles[2], color = 'darkorange', linewidth = 2) +
  geom_vline(xintercept = Bt_d_quantiles[3], color = 'forestgreen', linewidth = 1) +
  theme_bw()

Here are the quantiles:

round(Bt_d_quantiles, 3)
##   25%   50%   75% 
## 0.266 0.460 0.544

So, what do we use as the effect size to design our study around? Well, that’s a good question. We could design a study that has 80% power to detect an effect size that’s roughly the median of the effect sizes we estimated from our meta-analysis (\(d = 0.46\), which is considered a medium effect size), or to be more conservative, we could target a lower quantile. Keep in mind that this is only a rough guide to study design. Even if you are already locked in to a study design, this kind of meta-analysis can be helpful. For example, if you determined that your study has 80% power to detect effects of \(d \geq 0.3\), you would be happy with these meta-analysis results. About 75% of the effect sizes from the literature were at least 0.3. If they are any indication of what you’re likely to observe, it means you have a good chance that the true difference will be detectable by your study. But if you determined your study can only detect \(d \geq 0.6\) with adequate power, this meta-analysis should make you unhappy. It indicates that if the literature results are anything like your system, you probably won’t have enough power to detect the true difference. Back to the drawing board!

Effect sizes based on external criteria

Because you are an expert with a lot of domain knowledge about your study system, you may not need to derive effect sizes from your data or from others’ data. You may have an idea about what effect size, or range of effect sizes, are biologically/practically/economically relevant to detect. If you can justify that effect size, it’s reasonable, and even recommended, to design your study accordingly.

Often you can consult with stakeholders to determine effect size. For example, a stakeholder might say they would invest in a treatment that could increase yield by 2 bushels per acre, or decrease mortality by 25%. The main problem is often figuring out how to translate the stakeholder’s wishes into a standardized effect size metric. You still might need to delve into your own data to estimate the standard deviation needed to convert the effect from the original units into standardized units.
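
For example, here is a minimal sketch of that conversion, assuming (hypothetically) that the stakeholder-relevant difference is 2 bushels per acre and that pilot data suggest a within-group standard deviation of about 5 bushels per acre:

stakeholder_difference <- 2   # bushels per acre the stakeholder cares about
pilot_sd <- 5                 # within-group SD estimated from pilot data

stakeholder_difference / pilot_sd   # Cohen's d = 0.4, a small-to-medium effect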

Effect size based on no prior information

People often tell me they do not have any pilot data, nor could they find any sufficiently relevant studies in the literature to estimate effect size. Usually, I would chalk this up to either unwillingness to do the work of searching the lit, or failure to think creatively enough, or both. To get a representative effect size, you don’t need to find a study that’s exactly the same as yours. It could be the same response variable measured on a different species, or a different variable measured on your same study species, etc. Either way it is close enough to get you in the ballpark.

But if we truly cannot find any reasonable prior data to use as an effect size, we still have to assume one, because power analysis requires an effect size. Even if you know the smallest possible difference you would be interested in detecting, in the units you are going to measure the response in, that still isn’t an effect size because effect sizes are standardized (normalized to be free of units). You might not know what the within-group variance is likely to be, relative to the size of the difference between groups. So we just have to make an assumption and stick with it. What most people do is consult a “benchmark” table for their effect size of choice. For example, here are the benchmarks that Cohen suggested in the original publication of Cohen’s d and f effect sizes from 1988:

d | f | Effect size
0.20 | 0.10 | Small
0.50 | 0.25 | Moderate
0.80 | 0.40 | Large

The common practice is to target a “medium” or “moderate” effect size. This is generally a justifiable assumption. Designing a study to detect a small or weak effect requires a large sample size, and weak effects are by definition less important to detect anyway. Designing a study that can only detect large effects seems like a bad idea because it will miss anything that is not large. So medium or moderate it is!

Even though it seems tempting to save yourself a lot of work and use the benchmark effect size value all the time, I don’t recommend that. That’s because the benchmarks are not field-specific. It may not be appropriate to apply them indiscriminately to disparate fields of study. One field’s moderate effect might be another field’s huge effect. This is not a substitute for determining what effect size is practically meaningful through reanalyzing your own data, searching the literature, or consulting with stakeholders.

Power analysis using formulas

Now that we have an idea about effect size, let’s move forward with calculating statistical power. For certain simple experimental designs, there are analytical formulas for statistical power. The variables in the formula are:

  • Sample size
  • Number of groups being compared
  • Desired false positive rate (alpha, often set to 0.05)
  • Effect size
  • Statistical power

If we know all but one of those things, we can plug them into our formula and boom, out comes an estimate for the single unknown! For example:

  • We can input alpha, the effect size we want to be able to detect, and the sample sizes we can realistically get, and get an estimate of the statistical power of our study design
  • We can input alpha, the effect size we want to be able to detect, and the power level we’re targeting, and get an estimate of the minimum sample size needed to detect that effect size with the desired level of power
  • We can input alpha, power, and sample size and get an estimate of the smallest effect we can detect given that study design at the desired power level. This is called the minimum detectable effect size or MDES.

I call this approach “plug and chug” power analysis. It is often a good way to proceed. We’ll go through a few examples together.

Simple power calculations with WebPower

Let’s use WebPower to do some power calculations. WebPower has a lot of different functions that implement formulas for calculating power for different study designs. We will start out by playing with wp.anova(), which calculates power for a one-way ANOVA design. You can look at the other functions in the package by calling help(package = 'WebPower').

We’ll use wp.anova() to do the following, assuming we have an experimental design with four different groups that we are comparing (k = 4).

  • Calculate required sample size for 80% power, given effect size
  • Calculate minimum detectable effect size for 80% power, given sample size
  • Calculate power, given sample size and effect size

Here, we specify number of groups k = 4, a moderate Cohen’s f effect size f = 0.25, the standard significance level alpha = 0.05 and 80% power using power = 0.8. We specify type = 'overall' which will give us the power for the overall F-statistic in the ANOVA.

Note: You may also specify different arguments for type to get the power to detect a difference of a given size between two specific groups; see the help documentation.

wp.anova(k = 4, f = 0.25, alpha = 0.05, power = 0.8, type = 'overall')
## Power for One-way ANOVA
## 
##     k        n    f alpha power
##     4 178.3971 0.25  0.05   0.8
## 
## NOTE: n is the total sample size (overall)
## URL: http://psychstat.org/anova

We see that we need about 178 samples total, meaning 178/4 or ~45 samples per group, to have 80% power to detect a moderate difference between groups.

What if we can only afford 60 total, or 15 per group? How large of an effect can we detect with 80% power? We no longer specify f but we do specify n = 60.

wp.anova(k = 4, n = 60, alpha = 0.05, power = 0.8, type = 'overall')
## Power for One-way ANOVA
## 
##     k  n         f alpha power
##     4 60 0.4414577  0.05   0.8
## 
## NOTE: n is the total sample size (overall)
## URL: http://psychstat.org/anova

Now we get an estimate of ~0.44 for f. This means we only have >80% power to detect fairly large effects with that sample size.

Finally, let’s say that based on our exhaustive search of the literature and preliminary analyses of our pilot data, we find that we can expect effects in about the f = 0.4 range, and we can afford 60 samples total. How much power will we have? This time we specify every argument except power:

wp.anova(k = 4, n = 60, f = 0.4, alpha = 0.05, type = 'overall')
## Power for One-way ANOVA
## 
##     k  n   f alpha     power
##     4 60 0.4  0.05 0.7087216
## 
## NOTE: n is the total sample size (overall)
## URL: http://psychstat.org/anova

We see that we have ~71% power in this case, which is reasonably good.

Running WebPower iteratively to generate a power curve

As we’ve discussed, often sample size and/or effect size are not a single number you know in advance. There may be a range of feasible and justifiable sample sizes you can consider collecting. There may also be a range of biologically/practically/economically relevant effect sizes. To examine how our power or minimum detectable effect size varies over a range of assumptions, we can construct a power curve.

Let’s calculate the power for a number of combinations of different sample sizes and minimum detectable effect sizes, and plot a power curve for each of the effect sizes, going from low sample size to high. Put a dashed line at the “magic number” of 80% power.

The first two lines of code generate all combinations of effect sizes from 0.1 to 0.6 and numbers of replicates from 20 to 200, and put them in a data frame. In the next line, we call wp.anova(), but instead of passing a single value to n and f, we pass the columns of the data frame of combinations. Then we extract the element called power from the object returned by the function and add it to our data frame as a column of numeric power values ranging between 0 and 1.

The rest of the code just makes the plot.

ns <- c(20, 40, 60, 80, 100, 120, 140, 160, 180, 200)
n_f_grid <- expand.grid(n = ns, f = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6))

n_f_power <- wp.anova(k = 4, n = n_f_grid$n, f = n_f_grid$f, alpha = 0.05, type = 'overall')

n_f_grid$power <- n_f_power$power

ggplot(n_f_grid, aes(x = n, y = power)) +
  geom_line(aes(group = f, color = factor(f)), linewidth = 0.8) +
  geom_point(aes(fill = factor(f)), size = 3, shape = 21) +
  geom_hline(yintercept = 0.8, linetype = 'dashed', linewidth = 1) +
  scale_color_brewer(name = "Cohen's f", palette = 'Reds', aesthetics = c('color', 'fill')) +
  scale_x_continuous(breaks = ns) +
  theme_bw()

This is an interesting figure. It shows us that we can basically give up on detecting effects \(f = 0.2\) or smaller. Even at large sample sizes, we have low power to detect those weak effects. As for the largest effects, \(f = 0.5\) and greater, we have high power to detect them at even the smallest sample sizes we’re considering. But the scientifically exciting effects are in the middle! Depending on how much time, money, and resources we are willing to allocate, we can target effects in the range of \(f = 0.3\) or \(f = 0.4\).

Flipping the script, let’s calculate the minimum detectable effect size for a number of different sample sizes, targeting 80% power, and plot it. Put a dashed line at \(f = 0.25\), the so-called “moderate” effect size. As we’ve discussed, what is “moderate” depends on your field of study! For consistency, I will still use the expand.grid() function, though it isn’t strictly necessary because we only have one variable to apply the wp.anova() function to.

anova_MDES <- expand.grid(n = ns)
anova_MDES$MDES <- map_dbl(anova_MDES$n, ~ wp.anova(k = 4, n = ., alpha = 0.05, power = 0.8, type = 'overall')$f)

ggplot(anova_MDES, aes(x = n, y = MDES)) +
  geom_line(linewidth = 0.8) + geom_point(size = 3) + 
  geom_hline(yintercept = 0.25, linetype = 'dashed', linewidth = 1) +
  scale_x_continuous(breaks = ns) +
  theme_bw()

We see that at \(n = 20\) total, we only have 80% power to detect any effect \(f = 0.84\) or greater. That’s an underpowered design for any effect that isn’t huge and obvious. This drops to \(f = 0.31\) by the time we get to \(n = 120\). The smaller the MDES, the better! We also see that the curve is flattening, indicating that as we add more reps, we get diminishing returns. Going from 120 to 200 only drops the MDES from 0.31 to 0.24. Ultimately, we can conclude that to detect something in the neighborhood of a moderate effect size, we will need somewhere between 120 to 180 total samples, or 30 to 45 for each of the four groups.

As I’ve said a few times already, power analysis is only a ballpark figure. So there’s no point in going out to too many significant figures when reporting a MDES or power value. These are rough estimates at best. The precise analytical value of \(f = 0.3065215\) we get at \(n = 120\) in the figure above looks nice, but when you consider the thin ice of assumptions it’s based on, the precision is meaningless. I would say something like “about 0.3.”

Power analysis using simulation

For many researchers, their experience with power analysis consists entirely of what we’ve done above. Pick an effect size (whether pulled out of thin air or derived from previous data or a meta-analysis of the literature) and plug-and-chug. But this type of quick and easy power analysis is the exception rather than the rule. Most experimental and observational study designs we work with in ag science require mixed models. When you have random effects, there are no longer simple analytical formulas that you can plug numbers into to get power out. This doesn’t mean we have to give up. It just means we have to do power analysis by simulation.

The basic idea of simulation-based power analysis is fairly simple.

  1. Generate a dataset using a given set of assumptions (effect size, sample size, etc.)
  2. Fit a statistical model to the dataset and calculate a p-value for your hypothesis of interest
  3. Repeat many times
  4. Your power is the proportion of those simulated datasets for which you observed a significant p-value

The hard part is coming up with the assumptions.

A simple simulation example: t-test design

Let’s start with a simple example. Let’s simulate data from a simple study design where we’re trying to test whether a control group and a treatment group have a different mean. We would use a t-test to test whether the difference is significantly different from zero. Let’s assume that individuals in each group have values that are normally distributed around the group’s mean. We also assume the following:

  • Treatment mean = 12
  • Control mean = 9
  • Standard deviation of both groups = 3
  • Sample size of both groups = 30

What is our power? First let’s simulate one dataset. We define the means, standard deviations, and sample sizes, set a random seed so we will get the same result every time, and then draw random numbers from two normal distributions. The function rnorm() has three arguments, n, mean, and sd. We do a t-test to compare the means of those two random samples, using the function t.test() with all default options.

mu_t <- 12
mu_c <- 9
sd_t <- 3
sd_c <- 3
n_t <- 30
n_c <- 30

set.seed(1)

x_t <- rnorm(n = n_t, mean = mu_t, sd = sd_t)
x_c <- rnorm(n = n_c, mean = mu_c, sd = sd_c)

t.test(x_t, x_c)
## 
##  Welch Two Sample t-test
## 
## data:  x_t and x_c
## t = 4.2663, df = 56.741, p-value = 7.616e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.511666 4.186435
## sample estimates:
## mean of x mean of y 
## 12.247375  9.398324

Hey, it’s significant at alpha = 0.05! Cool … I guess?

But what’s the power? Remember that power is the probability of detecting an effect, if the effect exists in reality. We can’t get a probability out of a single yes-or-no result. We can’t estimate the probability until we simulate many times. Let’s repeat all of that 10000 times, and save the p-value each time. We do this by creating p_vals, a vector of zeros that is 10000 elements long. Then we use a for loop, which repeatedly executes the same code 10000 times, increasing the value of i by 1 each time. Each time, we generate a new set of random numbers and store the p-value in element i of the vector p_vals. After the loop is done, we calculate the proportion of the time that \(p < 0.05\), which is our power.

n_sim <- 10000
p_vals <- rep(0, n_sim)

for (i in 1:n_sim) {
  x_t <- rnorm(n_t, mu_t, sd_t)
  x_c <- rnorm(n_c, mu_c, sd_c)
  p_vals[i] <- t.test(x_t, x_c)$p.value
}

mean(p_vals < 0.05)
## [1] 0.9682

We have 96.8% power! How does this compare to the analytical result of wp.t() from the WebPower package? Remember that Cohen’s d is the difference between the two means divided by their pooled standard deviation. Because the standard deviation is 3 for both groups, the pooled SD is 3 as well. So \(d = (12-9)/3 = 1\). In other words, the two means are one standard deviation apart. This is a “large” effect by Cohen’s benchmarks.

wp.t(n1 = 30, n2 = 30, d = 1, alpha = 0.05)
## Two-sample t-test
## 
##      n d alpha     power
##     30 1  0.05 0.9677083
## 
## NOTE: n is number in *each* group
## URL: http://psychstat.org/ttest

We get ~96.8% here too. The simulation-based power analysis agrees with the analytical formula, because we did enough simulations for the empirical power estimate to approach the “true” value. This should be reassuring. But we can’t always check our work so easily against an analytical formula, because they don’t exist for most of the models we use every day in ag science.

Power by simulation for mixed models

Many observational and experimental studies in ag science need to be analyzed with mixed models: statistical models that include random effects to account for groups in the data. But if you have a study design with random effects, you need to know more than the effect size, sample size, and false positive rate to calculate power. Models with random effects are also called “multilevel” models because you don’t just have one variance parameter for the residuals. You also have the variance of the random effects, which you have to estimate if you want to know the power of your study design. For example, if you have a randomized complete block design (RCBD), there is variance between the different blocks (each block has its own mean) and variance between the individual measurements within a block (this is captured by the residuals of the model). Your statistical power will depend not only on the variation between individuals, but also on how much variation there is between the blocks. All else being equal, if the variance between blocks increases, your power will decrease.

If you have a blocked design or otherwise require random effects for your statistical model, you can’t do a power analysis without having some idea, or making some assumptions, about how different the blocks are from each other. And if you have a complex design like a split-split-plot or an RCBD with repeated measures in time, you will have more than one random effect variance component that needs to be accounted for!

What do we do? Well, this should make it even more obvious to you that there is no real substitute for getting good pilot data from the literature or your own previous work. If you have done similarly designed studies in your own system in the past, you can just get the variance components from those old datasets and use them in your simulation. To get that information from the literature, you’ll probably need to find published datasets. Thankfully, as the momentum toward open science grows, this is becoming easier and easier.

Again, there are no easy formulas for calculating power for mixed-model designs. We have no choice but to do power analysis by simulation. Let’s go through a few examples of how we might design an experiment, using simulation-based power analysis to determine the optimal number of samples.

Example 1. Number of blocks for a randomized complete block design

Let’s say we are planning a multi-environment wheat yield trial and we want to know how many replicates to include in our randomized complete block design. We will imagine we have some previous data: we’ll use an example dataset from the agridat package, linder.wheat. It is a wheat yield trial replicated in seven environments. In each environment, a randomized complete block design was planted, with four replicates (blocks) of nine genotypes.

Let’s fit a linear mixed-effects model to the data with genotype gen, environment env, and their interaction gen:env as fixed effects, a random intercept for block nested within environment, and yield as the response variable.

data(linder.wheat)

linder_lmm <- lmer(yield ~ gen + env + gen:env + (1 | block:env), data = linder.wheat)

Now we can take a look at the relevant parameters of the model. joint_tests() gives us the analysis of variance table with F-tests, which shows that environment explains much more variation than genotype or the G by E interaction. VarCorr() gives us the variance components: one for block nested within environment and one for the residual variance. The variance due to block is substantial.

joint_tests(linder_lmm)
##  model term df1 df2 F.ratio p.value
##  gen          8 168   2.119  0.0366
##  env          6  21  43.917  <.0001
##  gen:env     48 168   2.837  <.0001
VarCorr(linder_lmm)
##  Groups    Name        Std.Dev.
##  block:env (Intercept) 22.059  
##  Residual              37.176

Now let’s say we wanted to determine how much power we have to detect a genotype effect in the model, if we varied the number of replicates (blocks) per environment. We can do this by using the statistical model we fit already as the basis for a simulation. Using the simr package, we can simulate random effects for new datasets with different numbers of blocks. We do this many times for each number of blocks we want to test. Then, we perform our statistical test of choice on each of those simulated datasets and use this to calculate power. Let’s say we wanted to estimate power for every number of blocks from 3 through 8. We want to estimate the power for the F-test for genotype that we saw in the ANOVA table just now.

First we use the function extend() to create a new model object that can be used for simulation. The argument along = 'block' specifies that we want to add new levels to the block variable, and n = 8 says we want 8 levels of block. This new object doesn’t contain any new data, but it can be used to simulate data with the same structure as the original dataset for any number of blocks up to 8.

Next, we use the powerCurve() function to do the simulation. As a toy example, we will do only 10 simulations per number of blocks (nsim = 10). Again, we specify along = 'block' to tell it to simulate datasets with different numbers of levels of block. The test argument says we’re testing the fixed effect of genotype using the F-test. We set a random seed so that we all get the same results.

linder_lmm_extended <- extend(linder_lmm, along = 'block', n = 8)

linder_pcurve <- powerCurve(linder_lmm_extended, 
                            along = 'block', 
                            test = fixed('gen', method = 'f'), 
                            nsim = 10, 
                            seed = 3)

We can display the calculated power numbers as well as their 95% confidence intervals, either as text output or with a default plot.

linder_pcurve
## Power for predictor 'gen', (95% confidence interval),
## by number of levels in block:
##       3: 80.00% (44.39, 97.48) - 189 rows
##       4: 90.00% (55.50, 99.75) - 252 rows
##       5: 100.0% (69.15, 100.0) - 315 rows
##       6: 100.0% (69.15, 100.0) - 378 rows
##       7: 100.0% (69.15, 100.0) - 441 rows
##       8: 100.0% (69.15, 100.0) - 504 rows
## 
## Time elapsed: 0 h 0 m 30 s
plot(linder_pcurve)

I would not recommend doing only 10 simulations per sample size in practice. Here is what it looks like if you do 1000 simulations per sample size. This takes about half an hour on my laptop, though it could be sped up by running the simulations in parallel (see the sketch after these results).

linder_pcurve_1000 <- powerCurve(linder_lmm_extended, 
                                 along = 'block', 
                                 test = fixed('gen', method = 'f'), 
                                 nsim = 1000, 
                                 seed = 3)
plot(linder_pcurve_1000)

Doing more simulations gives us narrower confidence intervals on our power estimates. We see that three blocks is probably too few, and we still get a substantial boost in power going from four to five. Above five, we reach the point of diminishing returns.
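The parallel speed-up mentioned above can be done in several ways; here is one possible sketch using base R’s parallel package, running a separate powerSim() call for each number of blocks. Note that mclapply() only forks on Mac/Linux (on Windows you would use parLapply() or the future framework instead), and the estimates won’t be numerically identical to powerCurve(), because each block count gets its own independent set of simulations, but they should tell the same story.

library(parallel)

block_counts <- 3:8

power_by_blocks <- mclapply(block_counts, function(b) {
  mod_b <- extend(linder_lmm, along = 'block', n = b)   # extend the fitted model to b blocks
  powerSim(mod_b,
           test = fixed('gen', method = 'f'),           # same F-test for genotype as before
           nsim = 1000,
           seed = 3)
}, mc.cores = max(1, min(length(block_counts), detectCores() - 1)))

power_by_blocks[[3]]   # power estimate and confidence interval for 5 blocks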

Bonus exercise: try substituting 'gen:env' for 'gen' in the code to estimate power for the G by E interaction effect.

Example 2. Number of replicates and number of animals per replicate for a binomial GLMM

Here, we’ll imagine we are interested in testing cattle for some kind of disease. We are going to sample herds of cattle and test them for the disease, but we don’t know how many herds to test or how many cattle per herd. For this, we are going to use another example dataset: the cbpp dataset from the lme4 package. It gives the incidence of contagious bovine pleuropneumonia (CBPP), a respiratory disease, in zebu cattle in Ethiopia. There are 15 different herds with variable numbers of cattle, each of which was sampled at four different time periods. At each time period, the numbers of cattle with and without symptoms of CBPP were counted. The research question is whether the proportion of cattle with CBPP differs between time periods. The mean probabilities for each herd are considered to be drawn from a normal distribution (herd has a random intercept), but the effect of time period is considered to be the same in each herd (there is no random slope term).

First, we need to convert the data to long form so that each row is an individual animal at a single time point, with 0 in the disease column if it is uninfected and 1 if it is infected.

data(cbpp)
cbpp <- cbpp %>%
  group_by(herd, period) %>%
  reframe(animal = 1:n(),
          disease = rep(c(0, 1), c(size - incidence, incidence)))

Next, we fit a binomial generalized linear mixed model (GLMM) to the data, because the response variable is binary (0s and 1s).

cbpp_glmm <- glmer(disease ~ period + (1 | herd), data = cbpp, family = binomial)

We will extend the model in two different ways. First, we will extend the number of herds up to 20. The second extension uses the within argument instead of along. Now instead of increasing the number of herds, we are increasing the number of cattle in each combination of period and herd, all the way up to 50. Note that if you want to simulate increasing numbers of samples within combinations of grouping variables, you need to separate the different group names with a + as we do here: 'period+herd'.

cbpp_glmm_n_herds <- extend(cbpp_glmm, along = 'herd', n = 20)
cbpp_glmm_n_within_herd <- extend(cbpp_glmm, within = 'period+herd', n = 50)

Now we do two simulations: one to test the effect of increasing the number of herds, and the other to test the effect of increasing the number of cattle within herds, keeping the number of herds fixed at the original 15. In both cases we use the chi-squared test for the fixed effect of time period. For the second simulation, I have specified the particular numbers of cattle to test using the breaks argument. In the previous simulations we used the default breaks, which are at most 10 evenly spaced values going up to the maximum.

cbpp_pcurve_n_herds <- powerCurve(cbpp_glmm_n_herds, 
                                  along = 'herd', 
                                  test = fixed('period', method = 'chisq'), 
                                  nsim = 10, 
                                  seed = 4)

cbpp_pcurve_n_within_herd <- powerCurve(cbpp_glmm_n_within_herd, 
                                        within = 'period+herd', 
                                        test = fixed('period', method = 'chisq'), 
                                        nsim = 10, 
                                        seed = 5, 
                                        breaks = c(3, 5, 10, 15, 25, 50))

Take a look at the power curves.

plot(cbpp_pcurve_n_herds)

plot(cbpp_pcurve_n_within_herd)

Again, this gives the basic idea, but 10 simulations per number of levels is not really adequate. Here are the results with 1000 simulations (don’t run this code unless you want to sit there for an hour or two).

cbpp_pcurve_n_herds_1000 <- powerCurve(cbpp_glmm_n_herds, 
                                       along = 'herd', 
                                       test = fixed('period', method = 'chisq'), 
                                       nsim = 1000, 
                                       seed = 4)

cbpp_pcurve_n_within_herd_1000 <- powerCurve(cbpp_glmm_n_within_herd, 
                                             within = 'period+herd', 
                                             test = fixed('period', method = 'chisq'), 
                                             nsim = 1000, 
                                             seed = 5, 
                                             breaks = c(3, 5, 10, 15, 25, 50))
plot(cbpp_pcurve_n_herds_1000)

plot(cbpp_pcurve_n_within_herd_1000)

Here, we see that statistical power increases at the expected diminishing rate with increased number of herds. I also wanted to demonstrate that the number of cattle per herd is less important than the number of herds. We see that for the fixed number of 15 herds, we quickly reach maximum power at all but the very smallest number of cattle per herd. This illustrates the point that it is usually more cost-effective to increase the number of replicates/blocks/herds/groups, than the number of individuals within the replicate or group. The replicates or groups are independent experimental units. Getting more measurements within each of those groups improves the precision of the estimate for each group but doesn’t increase the number of independent samples. It is the number of independent samples that really determines power.
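A standard back-of-the-envelope calculation makes the same point. Under the usual design-effect approximation for clustered data (a general rule of thumb, not something specific to simr, and one that ignores the period structure here), the effective sample size is the total number of animals divided by 1 + (m - 1) × ICC, where m is the number of animals per herd and ICC is the intraclass correlation within a herd. The ICC of 0.2 below is hypothetical, chosen just to illustrate the shape of the relationship.

icc      <- 0.2                    # hypothetical intraclass correlation within a herd
n_herds  <- 15
per_herd <- c(3, 5, 10, 25, 50)    # animals sampled per herd

n_total <- n_herds * per_herd
n_eff   <- n_total / (1 + (per_herd - 1) * icc)   # design-effect approximation

data.frame(per_herd, n_total, n_eff = round(n_eff, 1))

No matter how many animals you add per herd, the effective sample size can never exceed n_herds / icc (75 in this example), whereas every additional herd contributes genuinely independent information.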

Bayesian power analysis

Our hands-on demonstration of power analysis is now complete. Before concluding, here is a quick note on Bayesian power analysis. If you know me, you know I am a strong proponent of Bayesian methods. But I have shown you classical or frequentist power analysis methods here. If you do go the Bayesian route, can you still do power analysis? The answer is yes, but it’s different. I know many of you aren’t into Bayesian analysis, so I will not go into great detail here.

In some sense, power analysis does not make sense in a Bayesian context, because the goal of pure Bayesian analysis is not to reject a null hypothesis; it is to update our belief about a parameter. We are just looking to estimate the posterior distribution of an effect size of interest, not to say whether it is significantly different from zero or any other arbitrary number. Within that framework, concepts like true positive and false positive don’t have any meaning.

But obviously, regardless of what statistical methodology we’re using, we want to design studies with a high probability of giving us a useful answer, with a reasonable use of time, money, and resources. Whether Bayesian or not, this requires making an informed decision about sample size ahead of time. For a Bayesian analysis, we can base that decision on the precision we want from the posterior distribution. For example, we could choose the sample size that gives a target width for the 95% credible interval of the parameter estimate.

And a second point is that many people do indeed use Bayesian analysis to ask frequentist-style questions and test hypotheses. If you have done any of my Bayesian workshops, you will remember us talking about “maximum a posteriori p-values.” Those are nothing more than Bayesian null hypothesis tests. So if you are willing to test a null hypothesis using a Bayesian model, then you can meaningfully talk about power, and it is no problem to do a simulation-based power analysis in that framework. However, it can be computationally intensive. We won’t go through any examples of that today, but I would be willing to work with scientists individually to do a custom Bayesian power analysis.

Conclusion

Today we’ve learned some important big-picture ideas about power analysis:

  • Power analysis is essential for doing good science because it tells you how much you can trust the answers you get to the questions you care about
  • Determining the relevant effect size that is practically meaningful is the most difficult part of power analysis, but also the most important
  • The effort you put into a power analysis determines the quality of what you get out of it: the better informed your effect sizes are by real data, and the more explicit you are about your assumptions, the more realistic your power analysis will be
  • Power analysis can be done using special analytical formulas in some simple cases, but often requires you to simulate data based on assumptions

We’ve also learned some practical skills for doing power analysis using R:

  • Estimating effect sizes from existing datasets
  • Calculating statistical power, minimum detectable effect size, or sample size for simple study designs using formulas
  • Simulating data from a model to estimate statistical power for common mixed-model designs

I will leave you with some final recommendations.

First, I’ve said this already today in a lot of different ways, but there is no magic shortcut in power analysis. The more effort you put into it, and the more it is informed by the literature and by your own prior data, the better your results will be.

Another important point to keep in mind is that you have to design your study around the weakest effect you want to be able to detect. You might have adequate power to detect differences in some of the variables you’re measuring, but not others. Of course, you may not always know everything you are going to measure ahead of time. It’s fine to decide mid-study that you want to measure some new responses, but you still have to consider power in that case.

I want to also briefly address something that people ask me about a lot, which is how to do power analysis for omics and other multivariate datasets. A lot of multivariate analyses that we do are more about describing variation than testing a specific hypothesis, so it’s kind of hard to do power analysis. Again, having previous data helps a lot. You could use previous data to get a rough idea of the variation you expect to see when you collect your new data, or as a basis for simulation. You could also base the power of your design on detecting differences in just one of the many response variables in the ’omics dataset (e.g., changes in the abundance of one metabolite in a metabolomics dataset), with appropriate correction for false discovery rate. There are many ways to go about it, any of which you could potentially justify with some quantitative determination of power. In the end, you will have to “sell” your justification for sample size not only to reviewers but also to yourself. Whether it’s omics or anything else, the question you should be asking is, how confident will I be in the results of this study?
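For instance, if you plan to power the study on a single metabolite while controlling for multiplicity, one crude way to see the cost is to rerun the wp.t() calculation from earlier with a Bonferroni-adjusted alpha, used here as a conservative stand-in for a false discovery rate procedure. The 200 metabolites below is a made-up number purely for illustration.

n_tests <- 200   # hypothetical number of metabolites being tested

wp.t(n1 = 30, n2 = 30, d = 1, alpha = 0.05)             # power at the nominal alpha
wp.t(n1 = 30, n2 = 30, d = 1, alpha = 0.05 / n_tests)   # power at the Bonferroni-adjusted alpha

The drop in power at the stricter threshold shows why multivariate studies often need considerably larger sample sizes than a single-response calculation would suggest.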

I’ll end with this: power analysis can seem like a bureaucratic requirement, or something only statisticians care about. I beg you not to think of it as a box you have to check. In fact, it is a critical part of the scientific method. After all, power means getting reliable answers to the scientific and practical questions you care about. No power = no new knowledge = no science.

Thank you.

Further reading

  • Methods paper on simr, Green & MacLeod 2015. This paper in Methods in Ecology and Evolution goes through the main functions of the simr package, with examples
  • Power Analysis with Superpower, Caldwell et al. 2022. This ebook explains how to use the Superpower package to do simulation-based power analysis for a wide range of study designs

Exercises

Here are a few optional exercises if you would like to practice what you’ve learned about power analysis.

Exercise 1

Calculate the Cohen’s d effect size given the following:

  • Mean of control group = 45
  • Mean of treatment group = 35
  • Standard deviation of control group = 10
  • Standard deviation of treatment group = 5
  • Size of control group = 50
  • Size of treatment group = 100

Hint: Use the calc_Cohen_d() function defined earlier in this lesson.

Exercise 2

Use the function wp.correlation() from WebPower to calculate the statistical power to detect correlations of 0.2, 0.4, and 0.6 at a sample size of 50, with significance level 0.05 and a two-sided test. Leave all other options at their default values. You may need to look at the function documentation by typing ?wp.correlation.

Exercise 3

In this exercise, we will do a simulation-based power analysis using the yates.oats dataset from agridat. These are the results of a split-plot field trial with 3 genotypes (gen) in the main plots and 4 levels of manure nitrogen (nitro) in the subplots, replicated in 6 blocks. First, convert nitro to a factor variable and fit a linear mixed model with genotype, nitrogen, and their interaction as fixed effects, and random intercepts for block and for main plot (gen:block).

data(yates.oats)
yates.oats$nitro <- factor(yates.oats$nitro)

yates_lmm <- lmer(yield ~ gen + nitro + gen:nitro + (1 | block) + (1 | gen:block), data = yates.oats)

  • Generate a power curve showing the power to detect a significant main effect of nitrogen, using an F-test, for numbers of blocks up to 50. Use 10 simulations per number of blocks.
  • Generate another power curve, but this time show the power to detect a significant genotype by nitrogen interaction effect, using an F-test. Hint: 'gen:nitro' should be used as the name of the fixed effect you are testing.
  • What can you conclude?

Click here for answers