After each stats workshop I’ve taught, I’ve included a question on the post-class survey asking whether there are any other topics people want me to develop a new lesson about. The single most requested topic has been power analysis. I agree that this is a crucially important topic. But as important as it is, it is often overlooked or ignored in applied scientists’ statistical training. Those who do know something about it tend to be folks who work on animals and humans. That’s because oversight committees like IACUC often require a formal power analysis that is subject to approval along with the rest of their research plans. But unfortunately, this is often seen as a pesky annoyance to take care of on the way to doing real science, a bureaucratic box to check off.
My goal in this lesson is not only to teach you the practical skills needed to do a power analysis using R software. It’s also, hopefully, to convince you that power analysis is not just a hurdle to jump to get your study funded. It is a necessary part of the scientific method. Because science is the best method that we humans have come up with so far to learn about the natural world, making sure our scientific studies are properly designed is so important. We can all agree that we want scientific studies to have the highest possible chance of being successful and teaching us something about the world, with the least possible unnecessary waste of time, money, and resources. Doing power analysis before starting a study, whether it’s observational or experimental, is a step toward that goal.
When done properly, power analysis:

- increases the chance that your study will actually be able to answer the question it was designed to answer
- helps you avoid wasting time, money, and resources on a study that cannot detect the effect you care about
- makes your results, whether positive or negative, more trustworthy and more interpretable
If that doesn’t motivate you to learn about power analysis, I don’t know what will!
You may want to refer to my more “big picture” talk on power analysis (video recording on Axon+, PDF of slides).
At the end of this course, you will understand …
At the end of this course, you will be able to …
To be able to understand the results of a power analysis, it’s important that you know what exactly statistical power is, and why we care about it. Here, we’re going to assume we are working in the classical statistical framework: our goal is to use evidence from the data to test a null hypothesis. If we have enough evidence to exceed a pre-defined level of statistical significance, we may reject the null hypothesis; otherwise, we can’t reject it. We are also going to assume there is a “true” value out there in the world somewhere, that we are trying to estimate using statistics. We can never know for sure what that true value is, unless we measure every single individual in the population of interest (and in some cases the population is hypothetical so that wouldn’t even be possible with unlimited resources). So the best we can do is maximize our chance of estimating that value correctly. Sometimes we get it right, and sometimes we get it wrong. That might be because of natural variation, measurement error, unknown biases in our sample, or any of the above . . . the point is that we are either right or wrong, but we are never sure which.
So, we’re considering the black-and-white world where we can either be right or wrong (later in this lesson we will briefly touch on the Bayesian perspective, which treats being right or wrong as more of a continuum). Let’s simplify further and say there is a binary (yes or no) outcome we are trying to estimate. A pregnancy test is a good example: you are either pregnant or you’re not, and the test also returns a binary answer. If you are truly not pregnant and the pregnancy test comes out negative, that’s a “true negative.” If it comes out positive, that’s a “false positive.” If you are truly pregnant and the test is positive, that’s a “true positive.” If negative, it’s a “false negative.” So, true negative and true positive are both correct answers, and false positive and false negative are both errors. You may hear false positive called “Type I error” and false negative called “Type II error.” It’s kind of hard to keep those numbers straight so we will refer to them as false positive and false negative.
The definition of statistical power follows directly from this idea of false positives and false negatives. The power of a statistical test to detect a given phenomenon is the probability that the test will detect the phenomenon if the phenomenon is real. In other words, it is the probability of observing a true positive. You can only observe a true positive if the null hypothesis is false. This is, in a way, the “opposite” of a p-value. The p-value runs in the other direction: it is the probability that we would observe data at least as extreme as what we did observe, if the null hypothesis were true.
We can state it this way:
If a certain effect exists, power is the chance that we actually find the effect in our study.
Obviously, we want to be right as often as possible and wrong as little as possible. It would be great to completely eliminate all false positives and all false negatives. Unfortunately, that is impossible. The only way to be 100% certain that you will never get a false positive is to have a test that gives negative results 100% of the time. A pregnancy test that always says “You are not pregnant” will never result in a false positive! But it is also completely useless. The same goes for eliminating false negatives: the only way to do that with perfect certainty is to give 100% positive results. Obviously, we need some kind of compromise. We need to find the “sweet spot” in the tradeoff between false positives and false negatives.
The traditional paradigm of Western science is to be conservative when rejecting null hypotheses. We want to look like serious scientists who need a lot of evidence before we are persuaded to accept any new phenomenon. Because of this, traditionally we strike the balance so that we have a low probability of false positives and a somewhat higher probability of false negatives. You all know the traditional cutoff of \(p < 0.05\), based on the significance level \(\alpha\). If a statistical test uses \(\alpha = 0.05\), it has a 5% probability of a false positive when the null hypothesis is true. The 0.05 cutoff, for good or bad, is very widely accepted. But what’s the probability of a false negative? Well, we usually set it at 20%. That’s a 4:1 ratio of the probability of a false negative (when the effect is real) to the probability of a false positive (when it is not). Just like there’s nothing magical about the 0.05 cutoff, there is also nothing special about the 4:1 ratio. We could set it at 1:1, or 1:4, or whatever we want. But by convention we usually target 5% for false positives and 20% for false negatives. We usually express the power in terms of true positives, and 100% - 20% = 80%. So, 80% power is our “magic number!” (Recall that if the null hypothesis is false, we can either get a true positive or a false negative. Given that the null is false, the probabilities of true positives and false negatives must sum to 100%.)
The above figure illustrates the tradeoff between false positives and false negatives (based on a t-test design with n = 50 per group and a small effect size). Power is 1 minus the false negative rate. You can see that if we reduce our false positive rate to near zero, the false negative rate approaches 100%. In other words, if we are too conservative and refuse to declare even a very small p-value to be significant, we will have almost no power to detect a true effect if it does exist. On the other hand, if we want our power to be high, we have to accept a higher proportion of false positives.
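If you’d like to trace out a curve like this yourself, here is a minimal sketch using wp.t() from the WebPower package (which we load later in this lesson). I’m assuming n = 50 per group and a small effect (Cohen’s d = 0.2); the exact values behind the figure above may differ.

library(WebPower)

# Compute power over a grid of significance levels (false positive rates)
alphas <- seq(0.001, 0.5, by = 0.001)
power_at_alpha <- sapply(alphas, function(a) wp.t(n1 = 50, n2 = 50, d = 0.2, alpha = a)$power)

# The false negative rate is 1 minus power
plot(alphas, 1 - power_at_alpha, type = 'l',
     xlab = 'False positive rate (alpha)', ylab = 'False negative rate (1 - power)')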
If that wasn’t enough to convince you how important statistical power is, let’s think about what happens when we do a study that is underpowered: a study that has only a low probability of detecting a significant effect, if that effect truly exists in the world. Let’s imagine we do such a study and we get a negative result; we do not declare statistical significance and we can’t reject the null hypothesis. Because the power is low, we don’t know whether it is a true negative or a false negative. In other words, we don’t know if the effect really is zero or nearly zero, or if we only failed to detect it because of natural variation/measurement error/biased sampling/etc. Obviously that is not good. We didn’t add anything to humanity’s knowledge. We might as well have not done the study in the first place and saved the time, money, resources, pain, and suffering.
But what if we do an underpowered study and we do get a positive (statistically significant) result? Well, some people would say, then the study was powerful enough. We only need to care about power if we don’t get a significant result. Unfortunately, they’re wrong. Why? Because at small sample sizes when power is low, we can actually overestimate the effect size because of natural variation and measurement error, just as easily as we can underestimate it. So if the study is underpowered and we see a significant result, we also don’t know if it’s a true positive or a false positive. As power gets lower, even though the rate of false positives is fixed (at let’s say 0.05), the proportion of false positives among all the statistically significant results goes up. So if we see a single positive result, we have less chance of knowing whether it’s true if the study has low power. This is a big problem because of publication bias, where people are more likely to publish results from underpowered studies if they observe a positive result by chance. The lesson here is that we shouldn’t trust single reports from low-powered studies, whether positive or negative.
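To make that concrete, here is a back-of-the-envelope calculation (the 50:50 split of real and null effects is purely an illustrative assumption). The proportion of statistically significant results that are false positives is

\[\frac{\alpha \, P(\text{no effect})}{\alpha \, P(\text{no effect}) + \text{power} \times P(\text{effect})}.\]

With \(\alpha = 0.05\) and half of the tested effects being real, a study with 80% power gives \(\frac{0.025}{0.025 + 0.4} \approx 6\%\) false positives among its significant results, while a study with 20% power gives \(\frac{0.025}{0.025 + 0.1} = 20\%\).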
But as a counter to everything I just said, low power can be OK. What if your experimental unit costs $1,000,000 each? What about $1,000,000,000? What if you are studying a global phenomenon? After all there’s only one Earth, so we cannot sample from a population of Earths. It is not always wrong to do an underpowered study. It may be the only way we can learn about a particular system. You have to be especially precise with your measurements and especially careful to be sure that you have a representative sample, because the “costs” of measurement error and sampling bias are that much greater. It also means you have to be very careful about the conclusions you draw, and very explicit about the limitations of your inference.
The common conception of power analysis is that it is a simple linear process. You start with a research question, do a power analysis, and out pops the number of samples/participants/replicates you need. In fact, power analysis should be an iterative, cyclical process. You might have to go back and modify your research question based on the results of a power analysis. You might need to change the sample size based on what is feasible. You might need to reconsider what size of effect you need to detect. And so on. Credit to Jessica Logan for the images above.
If you knew exactly how large the effect is in your study system, you would know exactly how many replicates you would need to detect that effect with adequate statistical power. But if you knew exactly how large the effect is in your study system, you wouldn’t need to do the study in the first place! This means that power calculations are at best rough estimates.
Consider the possible outcomes:

- If you collect too few samples, your study is underpowered. You may fail to detect an effect that really exists, and as we just discussed, even a significant result will be hard to trust.
- If you collect too many samples, you will have spent more time, money, and resources than you strictly needed to, but you will still get a trustworthy answer.
Because the outcome when you collect too few samples is worse than the outcome when you collect too many samples, you should proceed conservatively with power analysis. Erring on the side of collecting too much data is better than too little.
The amount of work you put into a power analysis determines the quality of what you get out of it. As we will see, you can do a decent, credible power analysis simply by making up plausible numbers and “selling” them. But the more background research you do, the higher the quality of the power analysis . . . and the higher the quality of the resulting science. This may be hard work!
When a scientist approaches me, the statistician, and says “Do a power analysis for me,” without giving any context, I can give them numbers but they may not be good quality numbers. There is no guarantee that the power numbers I give them will represent the true power of the study. But if the scientist provides information like what magnitude of difference they want to detect, pilot data from their own lab, or relevant studies from the literature, I can give them a much better power analysis that truly reflects their specific question in the context of their specific study system.
Now that you are (hopefully) convinced that power analysis is a good idea, let’s learn how to do it. What numbers do we need to know to do a power analysis? It depends on what our knowns and unknowns are. What are the key quantities that we may or may not know as we begin to design a study?
We’ve already gone into significance level and power, which are related to the false positive and false negative rates, respectively. Let’s talk a little more about sample size and effect size.
When doing statistical inference, we take a sample, calculate some statistics on it, and use those statistics to estimate the parameters (true values) of the population we took the sample from. The laws of statistics tell us that the more samples we take, the closer those statistical estimates get to the true population parameters. If the accuracy of our estimates always kept increasing at the same steady rate, we would never need to do a power analysis. Just collect as much data as you can afford to, and the more data, the more accurate the estimate! But in reality, the rate at which our estimates get more accurate starts to slow down as we collect more and more data. Eventually, we asymptotically approach the true population parameter. That applies to statistical power as well. At some point, the power gets so close to 100%, or alternatively the false negative rate approaches so close to 0%, that we can’t get any more benefit by adding more data.
The above figure illustrates the concept, using the same t-test design as before. Now we show the tradeoff curve not only for n = 50 per group, but also for n = 100, 150, and 200. As the sample size increases, the curve changes less and less with each additional 50 samples collected. For a given false positive rate, once the false negative rate is near zero, adding more data can’t make it go down any further. So, as we collect more data, we get diminishing returns in how much each new data point contributes to the accuracy of our estimates. This is good news, because it means we can figure out where along the curve we want to (or can afford to) position ourselves to get the most accurate estimates for the least investment of time and resources. All the power analysis we’ll do today depends on that basic insight.
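As a quick illustration of those diminishing returns, here is a small sketch using wp.t() from WebPower (loaded later in this lesson). The numbers are assumptions chosen for illustration: a moderate effect (Cohen’s d = 0.5) and alpha = 0.05, with the sample size per group increasing.

library(WebPower)

# Power of a two-sample t-test as n per group grows, at fixed effect size and alpha
ns_demo <- seq(10, 200, by = 10)
power_by_n <- sapply(ns_demo, function(n) wp.t(n1 = n, n2 = n, d = 0.5, alpha = 0.05)$power)

# Each additional 10 samples per group adds less and less power
plot(ns_demo, power_by_n, type = 'b', xlab = 'n per group', ylab = 'Power')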
Effect size is a measure of the magnitude of some effect or relationship between variables. For example, it could be the difference in the mean biomass of the control group and the treatment group. It could be the slope of the relationship between dose of a drug and some physiological response. It could be the ratio of the odds of mortality between two different treatments. We can’t talk about statistical power without knowing something about, or making some assumptions about, effect size. Statistical power is the probability of declaring an effect to be statistically significant, if the true effect is of a specific size. Thus, a particular study design has different power, depending on how big the effect is. The stronger the effect, the easier it is to see that effect above the random noise and variation of the system. That means that statistical power increases as the size of the effect increases.
The above figure shows yet again the power for a t-test design. This time we have n = 50 per group in all three cases, but we have a separate curve for small, medium, and large effect size. You can see that if the effect we’re trying to detect is large, we have it easy. We can set our false positive rate to a very low value, even lower than 0.05 if we want, and we will still have a very low rate of false negatives. Low rate of false negatives = high power. But if the effect we’re trying to detect is small, we will have a very high rate of false negatives if the false positive rate is set to 0.05. In other words, most of the time that the effect is really present in the population, we will not find a statistically significant effect. The only way we can get around this, if the effect truly is small, is to collect more data.
Effect sizes are almost always presented in standardized units. This is because different studies may measure response variables in different units. For example, one study might measure biomass in kilograms and another study might measure height in centimeters. If we convert those response variables to standard deviation units, we can compare results that are otherwise on different scales. If a treatment increases biomass mean by two standard deviations relative to the control, and the same treatment increases height mean by only one standard deviation relative to the control, we can say that the effect size for biomass is twice as big as the one for height.
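To make the standardization concrete, here is a hypothetical worked example (the numbers are made up). Suppose the treatment raises mean biomass from 10 to 14 kg where the standard deviation of biomass is 2 kg, and raises mean height from 100 to 105 cm where the standard deviation of height is 5 cm:

\[\frac{14 - 10}{2} = 2 \text{ SDs (biomass)}, \qquad \frac{105 - 100}{5} = 1 \text{ SD (height)}\]

On the original scales the two changes aren’t directly comparable, but in standard deviation units the biomass effect is twice as large.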
There are many different effect sizes published in the literature. Choosing which one is appropriate for your study depends on what kind of data you have. Here are some effect sizes you’ll probably want to use in typical situations. There are many more but this is a good starting point.
Effect | Effect size metric | Description | Example |
---|---|---|---|
Difference of the mean of two groups | Cohen’s d | The difference between the two means divided by their combined standard deviation. How many standard deviations greater is one mean than the other? | The effect of a management intervention on biomass yield |
Difference in a binary outcome between two groups | Odds ratio | The ratio of the odds of the outcome happening in one group, to the odds of it happening in another group | The effect of a seed treatment on seed germination rate |
Strength of relationship between two continuous variables | Pearson correlation coefficient r | You should know this one! | The effect of temperature on soil respiration |
Variance explained by treatment group when comparing three or more groups | Cohen’s f | (Square root of) the ratio of variance explained by group, to the residual variance not explained by group | The effect of multiple genotypes and environments on a trait |
Some of the effect sizes are based on the magnitude of the difference between means, and some are based on the proportion of variance explained by the statistical model.
There are a lot of software packages out there to help you do a power analysis. If you use R, the packages I recommend for power analysis are:
WebPower is not only an R package but also an online calculator, and there are quite a few others out there. We’ll use WebPower and simr in this lesson, but I recommend checking the other ones out as well. I became aware of pwr4exp just after finishing these lesson materials, but I really like it and I will incorporate it into future versions of this lesson.
In SAS, you can use proc power and proc glmpower, which are very straightforward. There are also some clever tricks to estimate power for mixed model designs using proc glimmix. My ARS statistician colleagues are working on some training materials for SAS power analysis with glimmix. I will link to them here when they become available.
Before we get into the nitty gritty of actually doing power analysis, I want to make one final point because it addresses a question I often field from scientists.
Reviewers in applied science journals often ask scientists to do a power analysis on a study after the fact, especially if no significant difference was observed. This makes no sense whatsoever. Why? Well, if you do a power analysis assuming the observed difference is the true difference, then by construction a non-significant result implies your power was too low, and a significant result implies you had enough power. The “analysis” tells you nothing you didn’t already know from the test result. And if the sample size is low, we don’t know whether the observed difference is close to the true difference in the first place. So doing a power analysis after the fact makes no sense. If the study was underpowered to detect the effect size of interest, you already knew that before you started the study: you already knew the chance that your negative result would be a false negative, given that particular effect size. That’s it and that’s all.
We can also show this from a mathematical standpoint. Post hoc power calculations are completely determined by the p-value. See this blog post by Daniel Lakens with a simulation demonstrating that relationship. That means if we have the p-value we learn nothing additional from the power analysis. So next time you are asked to do a retrospective power analysis on a study you already did, tell ’em no, and tell ’em I sent you!
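If you want to see this for yourself, here is a rough simulation sketch (my own, not the one from the Lakens post). It assumes a two-sample t-test with n = 30 per group and a true effect of d = 0.5, computes “observed power” from each simulated study’s observed effect size, and plots it against the p-value; the two fall on a single curve.

library(WebPower)

set.seed(100)
sim <- replicate(1000, {
  x <- rnorm(30, 0.5, 1)
  y <- rnorm(30, 0, 1)
  # "Observed" Cohen's d from this simulated study
  d_obs <- (mean(x) - mean(y)) / sqrt((var(x) + var(y)) / 2)
  c(p_value = t.test(x, y)$p.value,
    post_hoc_power = wp.t(n1 = 30, n2 = 30, d = abs(d_obs), alpha = 0.05)$power)
})

# Post hoc "observed power" is just a transformation of the p-value
plot(sim['p_value', ], sim['post_hoc_power', ],
     xlab = 'p-value', ylab = 'post hoc "observed power"')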
The first step in our hands-on demonstration of power analysis is also the first step in a real-life power analysis: estimating effect size.
There are several potential ways to get information you can use to estimate effect sizes:

- your own pilot data, or previous data from a similar system
- published studies from the literature (a quick, informal meta-analysis)
- domain knowledge and stakeholder input about what size of effect is practically meaningful
- as a last resort, published benchmark values for “small,” “medium,” and “large” effects
Let’s go through examples of these. As we go along, keep in mind that a typical power analysis won’t result in one single number for effect size or power. You will usually want to construct power curves that cover a range of assumptions.
A good way to get an idea of what effect size you’d like to detect in the study you’re planning is to calculate some effect sizes from data you already have. If you can argue that those datasets are at least roughly comparable to what you’ll get in your proposed study, then you can design that study around that effect size. For example you might have a dataset where your variable of interest is measured in a similar system. Or you might have a dataset where a similar variable is measured in your study system of choice. Both of those might be useful to get a rough estimate of what effect size you should target.
Let’s finally start doing some coding. Load the necessary packages for this lesson. Some of the packages have example data that we will be using, and others actually implement different aspects of power analysis.
library(agridat)
library(effectsize)
library(lme4)
library(emmeans)
library(dplyr)
library(purrr)
library(ggplot2)
library(WebPower)
library(simr)
Our first example dataset is from an experiment in beans, the dataset besag.beans from the agridat package. The actual study is more complicated, but for this example just think of it as measuring yield of a bunch of different bean genotypes.
In this example, we’re going to ignore the blocking structure of the design, and the spatial layout of the plots, to estimate the effect size. That’s actually what I might often do in practice as well. Remember, power analysis is only a ballpark estimate. If our primary research goal were to precisely estimate the mean yields of each bean genotype from this bean yield dataset, we would want to carefully construct a statistical model that reflects the experimental design as well as possible. But our goal here is just to get a crude estimate of the effect size so that we can design another experiment. We’ll likely add a “fudge factor” to the effect size estimate we get from looking at this pilot dataset. As a result, we don’t have to worry too much about building a perfect statistical model. Close is good enough.
Looking at the first few rows of data, we can see there is a factor column called gen with genotype ID, and a numeric column called yield.
data(besag.beans)
head(besag.beans)
## gen height yield row rep col
## 1 Maris 92 395 1 R0 1
## 2 Maris 90 350 1 R1 2
## 3 Dwarf 37 230 1 R1 3
## 4 Minica 66 355 1 R1 4
## 5 Stella 98 370 1 R1 5
## 6 Topless 66 280 1 R1 6
Fit a simple linear model, and use the anova() function to get the sums of squares for the one-way ANOVA of yield by genotype.
besag_fit <- lm(yield ~ gen, data = besag.beans)
besag_anova <- anova(besag_fit)
We will manually calculate Cohen’s f statistic. It is the square root of the ratio of the treatment sum of squares to the error sum of squares. Here the “treatment” is genotype.
sqrt(besag_anova$`Sum Sq`[1] / besag_anova$`Sum Sq`[2])
## [1] 0.8466414
We don’t have to remember that derivation every time, because the function cohens_f() from the effectsize package will calculate the effect size for each parameter in a linear model. As you can see, we get the same result of 0.85, with a one-sided 95% confidence interval thrown in.
cohens_f(besag_fit)
## # Effect Size for ANOVA
##
## Parameter | Cohen's f | 95% CI
## -----------------------------------
## gen | 0.85 | [0.67, Inf]
##
## - One-sided CIs: upper bound fixed at [Inf].
As I mentioned above, we probably shouldn’t design the study to yield exactly 80% power for f = 0.85. That’s because we just have a single estimate of effect size. The true effect in your future study may be smaller than that one specific estimate. You have a lot of leeway in what effect size to pick. You could probably justify the lower confidence bound of f, which is 0.67. To be more conservative than that, you could design the study to detect an effect about half as large as the effect observed in the previous study (0.4 to 0.45). There is no single correct answer here.
Often, you may not have any of your own pilot data from a similar system to get an estimate of effect size for a power analysis. In that case, the next best thing is to find some studies from the literature that are on a similar topic or done in a similar system. If those studies have published all their raw data (as everyone should be doing by now) you can just proceed as above and calculate the effect size from the data. But if all you have access to are their summary statistics or figures that are published in a paper, you will have to use those to estimate the effect size. By the way, if you’ve ever done a meta-analysis, you will be familiar with trying to extract effect sizes from the literature. This is nothing more than a quick and informal meta-analysis to get a rough range of effect sizes to inform a power analysis.
How many studies is enough? It’s quality over quantity. Two or three studies measuring a response similar to your target response, in a system similar to your study system, are better than many studies that aren’t so closely related. There is no need to exhaustively search the literature. I think it is fine to base your preliminary estimates of effect size even on one single study, as long as you have good justification for saying it is “typical” or representative of what you are expecting to encounter. If you are conservative about your choice of sample size, you are on solid ground.
Here’s an example of how we might go about getting effect sizes from a few studies on a similar topic. Let’s say we are interested in testing, in our particular study system, whether transgenic Bt crops have different abundance of soil organisms than isogenic lines. Even though this is a very well-studied topic, we might want to measure that response in a new context where it hasn’t been measured before. So, because there are so many previous studies of the same or similar response, in similar systems to the one we are planning to do our new study in, we can get a good idea of the range of effect sizes we can expect. We’ll design our study to detect an effect in that range.
In this example, we are looking at the difference of two means. Therefore we will use Cohen’s d as our effect size. This is the standardized mean difference, which is just the difference between the two means, divided by the pooled or “average” standard deviation of the two means. It’s roughly how far apart the two means are, in standard deviation units. The units of Cohen’s d are always the same regardless of what units the original data were collected in. We can gauge the size of an effect for each study and see how they compare.
Imagine that you combed the literature and found published estimates of soil organismal abundance compared between transgenic Bt lines and isogenic lines of the same crop. Here are a few tables where these estimates are published. (I selected these haphazardly from a much longer list that was compiled in a 2022 meta-analysis of the topic, and I’m only showing a small subset of the data in the tables.)
We’re going to make the assumption that the differences these researchers documented are close to the “true” differences. We are going to design our study with this rationale: if the true size of the Bt effect in our study system is roughly close to the observed effect size from similar systems in the past, we will design our study with enough replication to detect an effect size in that range with high (ideally >80%) power.
How do we go from these numbers to effect sizes? The first thing to notice is that some papers present more than one comparison because they measured abundance of multiple taxa on Bt and non-Bt maize. The second thing to notice is that for some of the comparisons, we have means and standard deviation. For others we have means and standard errors of the means. I’ve used the formula \(SE = \frac{SD}{\sqrt{n}}\) to convert the standard errors to standard deviations. We are ignoring any potential issues with distribution of the data, whether or not the studies are properly replicated, and that kind of thing. Because we are only mining these studies for a very rough estimate of effect size, it is not worth it to subject them to that much scrutiny.
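For instance, a tiny helper for that conversion might look like this (the function name and numbers are mine, just for illustration):

# Convert a reported standard error back to a standard deviation, given group size n
se_to_sd <- function(se, n) se * sqrt(n)
se_to_sd(se = 1.2, n = 10)  # SE of 1.2 with n = 10 corresponds to an SD of about 3.8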
We use this formula for Cohen’s d effect size:
\[d = \frac{\bar{x}_1 - \bar{x}_2}{s}\]
Here, \(\bar{x}_1\) is the mean of one group, \(\bar{x}_2\) is the mean of the other group, and \(s\) is the pooled standard deviation, which can be calculated from each group’s standard deviations \(s_1\) and \(s_2\) and their sample sizes \(n_1\) and \(n_2\). This is a way of averaging the two standard deviations, weighted by the sample size of each group, with a correction for small sample sizes:
\[s = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2} {n_1 + n_2 - 2}}\]
Here’s a function I wrote that calculates the Cohen’s d given the two group means, standard deviations, and sizes.
calc_Cohen_d <- function(xbar1, xbar2, s1, s2, n1, n2) {
s <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
(xbar1 - xbar2) / s
}
I’ve saved you some of the work of transcribing the means, standard deviations, and sample sizes from these tables into a spreadsheet format. Note that you may import the data from a web URL or, if you are on the Posit Cloud server, from the data folder.
Bt_means <- read.csv('https://usda-ree-ars.github.io/SEAStats/power_analysis_tutorial/Bt_effect_size_data.csv')
After inspecting the data, use our calc_Cohen_d function to calculate Cohen’s d effect size for each row of the data frame.
Bt_Cohen_d <- with(Bt_means, calc_Cohen_d(xbar1 = Bt_mean, xbar2 = nonBt_mean, s1 = Bt_SD, s2 = nonBt_SD, n1 = Bt_n, n2 = nonBt_n))
First of all, notice that some of the effect sizes are undefined because we had a few variables where all zeros were observed for both groups. If this were our own data, we would worry about that. But again, because power analysis is at best a rough guess, we will just ignore those!
Bt_Cohen_d <- na.omit(Bt_Cohen_d)
We will take the absolute value of the effect sizes (positive means greater mean in Bt, negative means greater mean in isogenic) because we are interested in what size of an effect we can detect with a two-sided test. Make a histogram of the effect sizes, with annotations for the median, 25%, and 75% quantiles.
Bt_d_quantiles <- quantile(abs(Bt_Cohen_d), probs = c(.25, .5, .75))
ggplot(data.frame(d = abs(Bt_Cohen_d)), aes(x = d)) +
geom_histogram(bins = 10) +
geom_vline(xintercept = Bt_d_quantiles[1], color = 'forestgreen', linewidth = 1) +
geom_vline(xintercept = Bt_d_quantiles[2], color = 'darkorange', linewidth = 2) +
geom_vline(xintercept = Bt_d_quantiles[3], color = 'forestgreen', linewidth = 1) +
theme_bw()
Here are the quantiles:
round(Bt_d_quantiles, 3)
## 25% 50% 75%
## 0.266 0.460 0.544
So, what do we use as the effect size to design our study around? Well, that’s a good question. We could design a study that has 80% power to detect an effect size that’s roughly the median of the effect sizes we estimated from our meta-analysis (\(d = 0.46\), which is considered a medium effect size), or to be more conservative, we could target a lower quantile. Keep in mind that this is only a rough guide to study design. Even if you are already locked in to a study design, this kind of meta-analysis can be helpful. For example, if you determined that your study has 80% power to detect effects of \(d \geq 0.3\), you would be happy with these meta-analysis results. About 75% of the effect sizes from the literature were at least 0.3. If they are any indication of what you’re likely to observe, it means you have a good chance that the true difference will be detectable by your study. But if you determined your study can only detect \(d \geq 0.6\) with adequate power, this meta-analysis should make you unhappy. It indicates that if the literature results are anything like your system, you probably won’t have enough power to detect the true difference. Back to the drawing board!
Because you are an expert with a lot of domain knowledge about your study system, you may not need to derive effect sizes from your data or from others’ data. You may have an idea about what effect size, or range of effect sizes, are biologically/practically/economically relevant to detect. If you can justify that effect size, it’s reasonable, and even recommended, to design your study accordingly.
Often you can consult with stakeholders to determine effect size. For example, a stakeholder might say they would invest in a treatment that could increase yield by 2 bushels per acre, or decrease mortality by 25%. The main problem is often figuring out how to translate the stakeholder’s wishes into a standardized effect size metric. You still might need to delve into your own data to estimate the standard deviation needed to convert the effect from the original units into standardized units.
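For example, the translation might look like this (all numbers are hypothetical):

target_difference <- 2  # stakeholder cares about a 2 bushel/acre yield increase
yield_sd <- 5           # within-group SD of yield, estimated from your own historical data
target_difference / yield_sd  # Cohen's d to design around: 0.4, a moderate effect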
People often tell me they do not have any pilot data, nor could they find any sufficiently relevant studies in the literature to estimate effect size. Usually, I would chalk this up to either unwillingness to do the work of searching the lit, or failure to think creatively enough, or both. To get a representative effect size, you don’t need to find a study that’s exactly the same as yours. It could be the same response variable measured on a different species, or a different variable measured on your same study species, etc. Either way it is close enough to get you in the ballpark.
But if we truly cannot find any reasonable prior data to use as an effect size, we still have to assume one, because power analysis requires an effect size. Even if you know the smallest possible difference you would be interested in detecting, in the units you are going to measure the response in, that still isn’t an effect size because effect sizes are standardized (normalized to be free of units). You might not know what the within-group variance is likely to be, relative to the size of the difference between groups. So we just have to make an assumption and stick with it. What most people do is consult a “benchmark” table for their effect size of choice. For example, here are the benchmarks that Cohen suggested in the original publication of Cohen’s d and f effect sizes from 1988:
d | f | Effect size |
---|---|---|
0.20 | 0.10 | Small |
0.50 | 0.25 | Moderate |
0.80 | 0.40 | Large |
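As far as I know, the effectsize package also encodes these benchmarks as interpretation helpers, so you can check where a value falls without memorizing the table (treat the exact function name and arguments as an assumption if your package version differs):

# Classify a few Cohen's d values using Cohen's 1988 benchmarks
interpret_cohens_d(c(0.2, 0.5, 0.8), rules = 'cohen1988')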
The common practice is to target a “medium” or “moderate” effect size. This is generally a justifiable assumption. Designing a study to detect a small or weak effect requires a large sample size, and weak effects are by definition less important to detect anyway. Designing a study that can only detect large effects seems like a bad idea because it will miss anything that is not large. So medium or moderate it is!
Even though it seems tempting to save yourself a lot of work and use the benchmark effect size value all the time, I don’t recommend that. That’s because the benchmarks are not field-specific. It may not be appropriate to apply them indiscriminately to disparate fields of study. One field’s moderate effect might be another field’s huge effect. This is not a substitute for determining what effect size is practically meaningful through reanalyzing your own data, searching the literature, or consulting with stakeholders.
Now that we have an idea about effect size, let’s move forward with calculating statistical power. For certain simple experimental designs, there are analytical formulas for statistical power. The variables in the formula are:

- the significance level (\(\alpha\), the false positive rate)
- the statistical power (1 minus the false negative rate)
- the sample size
- the effect size
If we know all but one of those things, we can plug them into our formula and boom, out comes an estimate for the single unknown! For example:

- If we know the effect size we want to detect, the significance level, and the target power, we can solve for the required sample size.
- If we know the sample size we can afford, the significance level, and the target power, we can solve for the minimum detectable effect size.
- If we know the sample size, the expected effect size, and the significance level, we can solve for the power.
I call this approach “plug and chug” power analysis. It is often a good way to proceed. We’ll go through a few examples together.
Let’s use WebPower to do some power calculations. WebPower has a lot of different functions that implement formulas for calculating power for different study designs. We will start out by playing with wp.anova(), which calculates power for a one-way ANOVA design. You can look at the other functions in the package by calling help(package = 'WebPower').
We’ll use wp.anova() for three different calculations, assuming we have an experimental design with four different groups that we are comparing (k = 4).
Here, we specify the number of groups k = 4, a moderate Cohen’s f effect size f = 0.25, the standard significance level alpha = 0.05, and 80% power using power = 0.8. We specify type = 'overall', which will give us the power for the overall F-statistic in the ANOVA.

Note: You may also specify different arguments for type to get the power to detect a difference of a given size between two specific groups; see the help documentation.
wp.anova(k = 4, f = 0.25, alpha = 0.05, power = 0.8, type = 'overall')
## Power for One-way ANOVA
##
## k n f alpha power
## 4 178.3971 0.25 0.05 0.8
##
## NOTE: n is the total sample size (overall)
## URL: http://psychstat.org/anova
We see that we need about 178 samples total, meaning 178/4 or ~45 samples per group, to have 80% power to detect a moderate difference between groups.
What if we can only afford 60 samples total, or 15 per group? How large of an effect can we detect with 80% power? We no longer specify f, but we do specify n = 60.
wp.anova(k = 4, n = 60, alpha = 0.05, power = 0.8, type = 'overall')
## Power for One-way ANOVA
##
## k n f alpha power
## 4 60 0.4414577 0.05 0.8
##
## NOTE: n is the total sample size (overall)
## URL: http://psychstat.org/anova
Now we get an estimate of ~0.44 for f. This means that with this sample size, we only have 80% or greater power for fairly large effects (f of roughly 0.44 or more).
Finally, let’s say that based on our exhaustive search of the literature and preliminary analyses of our pilot data, we find that we can expect effects in about the f = 0.4 range, and we can afford 60 samples total. How much power will we have? This time we specify every argument except power:
wp.anova(k = 4, n = 60, f = 0.4, alpha = 0.05, type = 'overall')
## Power for One-way ANOVA
##
## k n f alpha power
## 4 60 0.4 0.05 0.7087216
##
## NOTE: n is the total sample size (overall)
## URL: http://psychstat.org/anova
We see that we have ~71% power in this case, which is reasonably good.
As we’ve discussed, often sample size and/or effect size are not a single number you know in advance. There may be a range of feasible and justifiable sample sizes you can consider collecting. There may also be a range of biologically/practically/economically relevant effect sizes. To examine how our power or minimum detectable effect size varies over a range of assumptions, we can construct a power curve.
Let’s calculate the power for a number of combinations of different sample sizes and minimum detectable effect sizes, and plot a power curve for each of the effect sizes, going from low sample size to high. Put a dashed line at the “magic number” of 80% power.
The first two lines of code generate all combinations of effect sizes from 0.1 to 0.6 and total sample sizes from 20 to 200, and put them in a data frame. In the next line, we call wp.anova(), but instead of passing a single value to n and f, we pass the columns of the data frame of combinations. Then we extract only the element called power from the object returned by the function and add it as a column of numeric power values, ranging between 0 and 1, to our data frame. The rest of the code just makes the plot.
ns <- c(20, 40, 60, 80, 100, 120, 140, 160, 180, 200)
n_f_grid <- expand.grid(n = ns, f = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6))
n_f_power <- wp.anova(k = 4, n = n_f_grid$n, f = n_f_grid$f, alpha = 0.05, type = 'overall')
n_f_grid$power <- n_f_power$power
ggplot(n_f_grid, aes(x = n, y = power)) +
geom_line(aes(group = f, color = factor(f)), linewidth = 0.8) +
geom_point(aes(fill = factor(f)), size = 3, shape = 21) +
geom_hline(yintercept = 0.8, linetype = 'dashed', linewidth = 1) +
scale_color_brewer(name = "Cohen's f", palette = 'Reds', aesthetics = c('color', 'fill')) +
scale_x_continuous(breaks = ns) +
theme_bw()
This is an interesting figure. It shows us that we can basically give up on detecting effects \(f = 0.2\) or smaller. Even at large sample sizes, we have low power to detect those weak effects. As for the largest effects, \(f = 0.5\) and greater, we have high power to detect them at even the smallest sample sizes we’re considering. But the scientifically exciting effects are in the middle! Depending on how much time, money, and resources we are willing to allocate, we can target effects in the range of \(f = 0.3\) or \(f = 0.4\).
Flipping the script, let’s calculate the minimum detectable effect size for a number of different sample sizes, targeting 80% power, and plot it. Put a dashed line at \(f = 0.25\), the so-called “moderate” effect size. As we’ve discussed, what is “moderate” depends on your field of study! For consistency, I will still use the expand.grid() function, though it isn’t strictly necessary because we only have one variable to apply the wp.anova() function to.
anova_MDES <- expand.grid(n = ns)
anova_MDES$MDES <- map_dbl(anova_MDES$n, ~ wp.anova(k = 4, n = ., alpha = 0.05, power = 0.8, type = 'overall')$f)
ggplot(anova_MDES, aes(x = n, y = MDES)) +
geom_line(linewidth = 0.8) + geom_point(size = 3) +
geom_hline(yintercept = 0.25, linetype = 'dashed', linewidth = 1) +
scale_x_continuous(breaks = ns) +
theme_bw()
We see that at \(n = 20\) total, we only have 80% power to detect any effect \(f = 0.84\) or greater. That’s an underpowered design for any effect that isn’t huge and obvious. This drops to \(f = 0.31\) by the time we get to \(n = 120\). The smaller the MDES, the better! We also see that the curve is flattening, indicating that as we add more reps, we get diminishing returns. Going from 120 to 200 only drops the MDES from 0.31 to 0.24. Ultimately, we can conclude that to detect something in the neighborhood of a moderate effect size, we will need somewhere between 120 to 180 total samples, or 30 to 45 for each of the four groups.
As I’ve said a few times already, power analysis is only a ballpark figure. So there’s no point in going out to too many significant figures when reporting a MDES or power value. These are rough estimates at best. The precise analytical value of \(f = 0.3065215\) we get at \(n = 120\) in the figure above looks nice, but when you consider the thin ice of assumptions it’s based on, the precision is meaningless. I would say something like “about 0.3.”
For many researchers, their experience with power analysis consists entirely of what we’ve done above. Pick an effect size (whether pulled out of thin air or derived from previous data or a meta-analysis of the literature) and plug-and-chug. But this type of quick and easy power analysis is really an anomaly. Most experimental and observational study designs we work with in ag science require mixed models. When you have random effects, there are no longer simple analytical formulas that you can plug numbers into and get power out of. This doesn’t mean we have to give up. It just means we have to do power analysis by simulation.
The basic idea of simulation-based power analysis is fairly simple:

- Assume a statistical model for the data-generating process, including the size of the effect and the amount of variation.
- Simulate many datasets from that assumed model at a given sample size.
- Run your planned statistical test on each simulated dataset.
- The proportion of simulations that return a significant result is your estimated power.

The hard part is coming up with the assumptions.
Let’s start with a simple example. Let’s simulate data from a simple study design where we’re trying to test whether a control group and a treatment group have a different mean. We would use a t-test to test whether the difference is significantly different from zero. Let’s assume that individuals in each group have values that are normally distributed around the group’s mean. We also assume the following:

- The true mean of the treatment group is 12 and the true mean of the control group is 9.
- The standard deviation is 3 in both groups.
- We sample 30 individuals per group.
- We declare statistical significance at \(\alpha = 0.05\).
What is our power? First let’s simulate one dataset. We define the means, standard deviations, and sample sizes, set a random seed so we will get the same result every time, and then draw random numbers from two normal distributions. The function rnorm() has three arguments: n, mean, and sd. We do a t-test to compare the means of those two random samples, using the function t.test() with all default options.
mu_t <- 12
mu_c <- 9
sd_t <- 3
sd_c <- 3
n_t <- 30
n_c <- 30
set.seed(1)
x_t <- rnorm(n = n_t, mean = mu_t, sd = sd_t)
x_c <- rnorm(n = n_c, mean = mu_c, sd = sd_c)
t.test(x_t, x_c)
##
## Welch Two Sample t-test
##
## data: x_t and x_c
## t = 4.2663, df = 56.741, p-value = 7.616e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 1.511666 4.186435
## sample estimates:
## mean of x mean of y
## 12.247375 9.398324
Hey, it’s significant at alpha = 0.05! Cool … I guess?
But what’s the power? Remember that power is the probability of detecting an effect, if the effect exists in reality. We can’t get a probability out of a single yes-or-no result. We can’t estimate the probability until we simulate many times. Let’s repeat all of that 10000 times, and save the p-value each time. We do this by creating p_vals, a vector of zeros that is 10000 elements long. Then we use a for loop, which repeatedly executes the same code 10000 times, increasing the value of i by 1 each time. Each time, we generate a new set of random numbers and store the p-value in element i of the vector p_vals. After the loop is done, we calculate the proportion of the time that \(p < 0.05\), which is our power.
n_sim <- 10000
p_vals <- rep(0, n_sim)
for (i in 1:n_sim) {
x_t <- rnorm(n_t, mu_t, sd_t)
x_c <- rnorm(n_c, mu_c, sd_c)
p_vals[i] <- t.test(x_t, x_c)$p.value
}
mean(p_vals < 0.05)
## [1] 0.9682
We have 96.8% power! How does this compare to the analytical result of wp.t() from the WebPower package? Remember that Cohen’s d is the difference between the two means divided by their pooled standard deviation. Because the standard deviation is 3 for both groups, the pooled SD is 3 as well. So \(d = (12-9)/3 = 1\). In other words, the two means are one standard deviation apart. This is a “large” effect by Cohen’s benchmarks.
wp.t(n1 = 30, n2 = 30, d = 1, alpha = 0.05)
## Two-sample t-test
##
## n d alpha power
## 30 1 0.05 0.9677083
##
## NOTE: n is number in *each* group
## URL: http://psychstat.org/ttest
We get ~96.8% here too. The simulation-based power analysis agrees with the analytical formula, because we did enough simulations for the empirical power estimate to approach the “true” value. This should be reassuring. But we can’t always check our work so easily against an analytical formula, because they don’t exist for most of the models we use every day in ag science.
Many observational and experimental studies in ag science need to be analyzed with mixed models: statistical models that include random effects to account for groups in the data. But if you have a study design with random effects, you don’t just have to know effect size, sample size, and Type I error rate to calculate power. Models with random effects are also called “multilevel” models because you don’t just have one variance parameter for the residuals. You also have variance of the random effects, which you have to estimate if you want to know the power of your study design. For example, if you have a randomized complete block design (RCBD), there is variance between the different blocks (each block has its own mean) and variance between the individual measurements within a block (this is captured by the residuals of the model). Your statistical power will depend not only on the variation between individuals, but also on how much variation there is between the blocks. All else being equal, if the variance between blocks increases, your power will decrease.
If you have a blocked design or otherwise require random effects for your statistical model, you can’t do a power analysis without having some idea, or making some assumptions, about how different the blocks are from each other. And if you have a complex design like a split-split-plot or an RCBD with repeated measures in time, you will have more than one random effect variance component that needs to be accounted for!
What do we do? Well, this should make it even more obvious to you that there is no real substitute for getting good pilot data from the literature or your own previous work. If you have done similarly designed studies in your own system in the past, you can just get the variance components from those old datasets and use them in your simulation. To get that information from the literature, you’ll probably need to find published datasets. Thankfully, as the momentum toward open science grows, this is becoming easier and easier.
Again, there are no easy formulas for calculating power for mixed-model designs. We have no choice but to do power analysis by simulation. Let’s go through a few examples of how we might design an experiment, using simulation-based power analysis to determine the optimal number of samples.
Let’s say we are planning a multi-environment wheat yield trial and we want to know how many replicates to include in our randomized complete block design. We will imagine we have some previous data: we’ll use an example dataset from the agridat package, linder.wheat. It is a wheat yield trial replicated in seven environments. At each environment, a randomized complete block design was planted, with four replicates of nine genotypes.
Let’s fit a linear mixed-effects model to the data with genotype gen, environment env, and their interaction gen:env as fixed effects, a random intercept for block nested within environment, and yield as the response variable.
data(linder.wheat)
linder_lmm <- lmer(yield ~ gen + env + gen:env + (1 | block:env), data = linder.wheat)
Now, we can take a look at the relevant parameters of the model. joint_tests() gives us the analysis of variance table with F-tests, which shows us that environment explains much more variation than genotype or the G by E interaction. VarCorr() gives us the variance components, one for block and one for the residual variance. The variance due to block is substantial.
joint_tests(linder_lmm)
## model term df1 df2 F.ratio p.value
## gen 8 168 2.119 0.0366
## env 6 21 43.917 <.0001
## gen:env 48 168 2.837 <.0001
VarCorr(linder_lmm)
## Groups Name Std.Dev.
## block:env (Intercept) 22.059
## Residual 37.176
Now let’s say we wanted to determine how much power we have to detect a genotype effect in the model, if we varied the number of replicates (blocks) per environment. We can do this by using the statistical model we fit already as the basis for a simulation. Using the simr package, we can simulate random effects for new datasets with different numbers of blocks. We do this many times for each number of blocks we want to test. Then, we perform our statistical test of choice on each of those simulated datasets and use this to calculate power. Let’s say we wanted to estimate power for every number of blocks from 3 through 8. We want to estimate the power for the F-test for genotype that we saw in the ANOVA table just now.
First we use the function extend() to create a new model object that can be used to simulate. We use the argument along = 'block' to specify that we want to add new levels to the block variable, and n = 8 to say we want 8 levels of block. This new object doesn’t contain new data, but it can be used to simulate data with the same structure as the original data, but with variable numbers of blocks all the way up to 8.
Next, we use the powerCurve() function to do the simulation. We will do this with 10 simulations per number of blocks as a toy example (nsim = 10). Again, we specify along = 'block' to tell the simulation to simulate datasets with different numbers of levels of block. The test argument says we’re testing the fixed effect of genotype using the F-test. We set a random seed so that we all get the same results.
linder_lmm_extended <- extend(linder_lmm, along = 'block', n = 8)
linder_pcurve <- powerCurve(linder_lmm_extended,
along = 'block',
test = fixed('gen', method = 'f'),
nsim = 10,
seed = 3)
We can display the calculated power numbers as well as their 95% confidence intervals, either as text output or with a default plot.
linder_pcurve
## Power for predictor 'gen', (95% confidence interval),
## by number of levels in block:
## 3: 80.00% (44.39, 97.48) - 189 rows
## 4: 90.00% (55.50, 99.75) - 252 rows
## 5: 100.0% (69.15, 100.0) - 315 rows
## 6: 100.0% (69.15, 100.0) - 378 rows
## 7: 100.0% (69.15, 100.0) - 441 rows
## 8: 100.0% (69.15, 100.0) - 504 rows
##
## Time elapsed: 0 h 0 m 30 s
plot(linder_pcurve)
I would not recommend doing only 10 simulations per sample size in practice. Here is what it looks like if you do 1000 simulations per sample size (this takes about half an hour on my laptop, though this could be sped up with some tricks to run the simulations in parallel).
linder_pcurve_1000 <- powerCurve(linder_lmm_extended,
along = 'block',
test = fixed('gen', method = 'f'),
nsim = 1000,
seed = 3)
plot(linder_pcurve_1000)
Doing more simulations gives us narrower confidence intervals on our power estimates. We see that three blocks is probably too few, and we still get a substantial boost in power going from four to five. Above five, we reach the point of diminishing returns.
Bonus exercise: try substituting 'gen:env' for 'gen' in the code to estimate power for the G by E interaction effect.
Here, we’ll imagine we are interested in testing cattle for some kind of disease. We are going to sample herds of cattle and test them for the disease, but we don’t know how many herds to test or how many cattle per herd. For this, we are going to use another example dataset. This is the cbpp dataset from the lme4 package. It gives us the incidence of a respiratory disease of zebu cattle in Ethiopia. There are 15 different herds with variable numbers of cattle, each of which was sampled at four different time periods. At each time period, the number of cattle with and without symptoms of the CBPP disease was counted. The research question is whether there are differences in the proportion of cattle with CBPP between time periods. The herd-level means are assumed to vary around an overall mean (herd has a random intercept, on the log-odds scale), but the effect of time period is considered to be the same in each herd (there is no random slope term).
First, we need to convert the data to long form so that each row is an individual animal at a single time point, with 0 in the disease column if it is uninfected and 1 if it is infected.
data(cbpp)
cbpp <- cbpp %>%
group_by(herd, period) %>%
reframe(animal = 1:n(),
disease = rep(c(0, 1), c(size - incidence, incidence)))
Next, we fit a binomial generalized linear mixed model (GLMM) to the data, because the response variable is binary (0s and 1s).
cbpp_glmm <- glmer(disease ~ period + (1 | herd), data = cbpp, family = binomial)
We will extend the model in two different ways. First, we will extend the number of herds up to 20. The second extension uses the within argument instead of along. Now, instead of increasing the number of herds, we are increasing the number of cattle in each combination of period and herd, all the way up to 50. Note that if you want to simulate increasing numbers of samples within combinations of grouping variables, you need to separate the different group names with a + as we do here: 'period+herd'.
cbpp_glmm_n_herds <- extend(cbpp_glmm, along = 'herd', n = 20)
cbpp_glmm_n_within_herd <- extend(cbpp_glmm, within = 'period+herd', n = 50)
Do two simulations: one to test the effect of increasing the number of herds, and the other to test the effect of increasing the number of cattle within herds, keeping the number of herds fixed at the original 15. In both cases we are using the Chi-squared test for the fixed effect of time period. For the second simulation, I have specified the exact numbers of cattle to test using breaks. In the previous simulations we used the default breaks, which are no more than 10 evenly spaced values going up to the maximum.
cbpp_pcurve_n_herds <- powerCurve(cbpp_glmm_n_herds,
                                  along = 'herd',
                                  test = fixed('period', method = 'chisq'),
                                  nsim = 10,
                                  seed = 4)
cbpp_pcurve_n_within_herd <- powerCurve(cbpp_glmm_n_within_herd,
                                        within = 'period+herd',
                                        test = fixed('period', method = 'chisq'),
                                        nsim = 10,
                                        seed = 5,
                                        breaks = c(3, 5, 10, 15, 25, 50))
Take a look at the power curves.
plot(cbpp_pcurve_n_herds)
plot(cbpp_pcurve_n_within_herd)
Again, this gives the basic idea, but 10 simulations per number of levels is not really adequate. Here are the results with 1000 simulations (don't run this code unless you want to sit there for an hour or two).
cbpp_pcurve_n_herds_1000 <- powerCurve(cbpp_glmm_n_herds,
                                       along = 'herd',
                                       test = fixed('period', method = 'chisq'),
                                       nsim = 1000,
                                       seed = 4)
cbpp_pcurve_n_within_herd_1000 <- powerCurve(cbpp_glmm_n_within_herd,
                                             within = 'period+herd',
                                             test = fixed('period', method = 'chisq'),
                                             nsim = 1000,
                                             seed = 5,
                                             breaks = c(3, 5, 10, 15, 25, 50))
plot(cbpp_pcurve_n_herds_1000)
plot(cbpp_pcurve_n_within_herd_1000)
Here, we see that statistical power increases, with the expected diminishing returns, as the number of herds increases. I also wanted to demonstrate that the number of cattle per herd matters less than the number of herds. With the number of herds fixed at 15, we quickly reach maximum power at all but the very smallest numbers of cattle per herd. This illustrates the point that it is usually more cost-effective to increase the number of replicates/blocks/herds/groups than the number of individuals within each replicate or group. The replicates or groups are independent experimental units. Getting more measurements within each of those groups improves the precision of the estimate for each group, but it doesn't increase the number of independent samples, and it is the number of independent samples that really determines power.
Our hands-on demonstration of power analysis is now complete. Before concluding, here is a quick note on Bayesian power analysis. If you know me, you know I am a strong proponent of Bayesian methods. But I have shown you classical or frequentist power analysis methods here. If you do go the Bayesian route, can you still do power analysis? The answer is yes, but it’s different. I know many of you aren’t into Bayesian analysis, so I will not go into great detail here.
In some sense, power analysis does not make sense in a Bayesian context because the goal of pure Bayesian analysis is not to reject a null hypothesis, it is to update our belief about a parameter. We are just looking to estimate the posterior distribution of an effect size of interest, not to say whether it’s significantly different from zero or any other arbitrary number. Within that framework, concepts like true positive and false positive don’t have any meaning.
But obviously, regardless of what statistical methodology we're using, we want to design studies with a high probability of giving us an informative answer, with a reasonable use of time, money, and resources. Whether Bayesian or not, this requires making an informed decision about sample size ahead of time. For Bayesian power analysis, we can base that decision on the kind of posterior estimate we want. For example, we could choose the sample size that gives us an acceptably narrow 95% credible interval for the parameter estimate.
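To make that idea a little more concrete, here is a minimal sketch of what such a precision-based sample size calculation could look like. It assumes the brms package (which we have not used in this lesson) and a simple two-group design with a hypothetical true effect of 0.5 standard deviations; all of the names and numbers are illustrative assumptions, and because every step involves a full MCMC fit this is slow even as a toy example.
library(brms)

# Simulate a simple two-group dataset with a hypothetical true difference
# of `effect` standard deviations between treatment and control.
simulate_data <- function(n_per_group, effect = 0.5) {
  data.frame(
    group = rep(c('control', 'treatment'), each = n_per_group),
    y = c(rnorm(n_per_group, mean = 0, sd = 1),
          rnorm(n_per_group, mean = effect, sd = 1))
  )
}

# Compile the Stan model once, then refit it to each simulated dataset.
template_fit <- brm(y ~ group, data = simulate_data(10), refresh = 0)

mean_ci_width <- function(n_per_group, nsim = 20) {
  widths <- replicate(nsim, {
    fit <- update(template_fit, newdata = simulate_data(n_per_group),
                  refresh = 0)
    est <- fixef(fit)['grouptreatment', ]    # posterior summary of the group effect
    unname(est['Q97.5'] - est['Q2.5'])       # width of the 95% credible interval
  })
  mean(widths)
}

# Average credible interval width at a few candidate sample sizes; choose the
# smallest n whose precision you would be happy with.
sapply(c(10, 25, 50), mean_ci_width)
Compiling the model once and refitting it with update() avoids recompiling the Stan program for every simulated dataset, which is where most of the time would otherwise go.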
A second point is that many people do use Bayesian analysis to ask frequentist-style questions and test hypotheses. If you have done any of my Bayesian workshops, you will remember us talking about “maximum a posteriori p-values.” Those are nothing more than Bayesian null hypothesis tests. So if you are willing to test a null hypothesis using a Bayesian model, then you can meaningfully talk about power, and a simulation-based power analysis works fine in that framework, although it can be computationally intensive. We won’t go through any examples of that today, but I would be willing to work with scientists individually to do a custom Bayesian power analysis.
Today we’ve learned some important big-picture ideas about power analysis:
We’ve also learned some practical skills for doing power analysis using R:
I will leave you with some final recommendations.
First, I've said this already today in a lot of different ways, but there is no magic shortcut in power analysis. The more effort you put into it, and the more it is informed by previous data and the literature, the better your results will be.
Another important point to keep in mind is that you have to design your study around the weakest effect you want to be able to detect. You might have adequate power to detect differences in some of the variables you’re measuring, but not others. Of course, you may not always know everything you are going to measure ahead of time. It’s fine to decide mid-study that you want to measure some new responses, but you still have to consider power in that case.
I want to also briefly address something that people ask me about a lot, which is how to do power analysis for omics and other multivariate datasets. A lot of multivariate analyses that we do are more about describing variation than testing a specific hypothesis, so it’s kind of hard to do power analysis. Again, having previous data helps a lot. You could use previous data to get a rough idea of the variation you expect to see when you collect your new data, or as a basis for simulation. You could also base the power of your design on detecting differences in just one of the many response variables in the ’omics dataset (e.g., changes in the abundance of one metabolite in a metabolomics dataset), with appropriate correction for false discovery rate. There are many ways to go about it, any of which you could potentially justify with some quantitative determination of power. In the end, you will have to “sell” your justification for sample size not only to reviewers but also to yourself. Whether it’s omics or anything else, the question you should be asking is, how confident will I be in the results of this study?
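To make that last option a little more concrete, here is a very rough sketch of the kind of calculation I have in mind, using the wp.t() function from WebPower (check ?wp.t, since we have not used it today). The effect size, the number of metabolites, and the use of a Bonferroni-adjusted alpha in place of a proper FDR-based calculation are all illustrative assumptions, not recommendations.
library(WebPower)

# Hypothetical scenario: key the design to detecting a 1-SD change in a single
# metabolite out of 500 measured, crudely approximating the multiple-testing
# penalty with a Bonferroni-adjusted significance level.
wp.t(d = 1,                  # assumed effect size in standard deviation units
     alpha = 0.05 / 500,     # significance level adjusted for 500 tests
     power = 0.8,            # target power
     type = 'two.sample')    # leaving n blank solves for the required n per group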
I’ll end with this: power analysis can seem like a bureaucratic requirement, or something only statisticians care about, but I beg you not to think of it as a box you have to check. It is a critical part of the scientific method. After all, power means getting reliable answers to the scientific and practical questions you care about. No power = no new knowledge = no science.
Thank you.
Here are a few optional exercises if you would like to practice what you’ve learned about power analysis.
Calculate the Cohen’s d effect size given the following:
Hint: Use the calc_Cohen_d() function defined earlier in this lesson.
Use the function wp.correlation() from WebPower to calculate the statistical power to detect correlations of 0.2, 0.4, and 0.6 at a sample size of 50, with significance level 0.05 and a two-sided test. All other default options should not be changed. You may need to look at the function documentation by typing ?wp.correlation.
In this exercise, we will do a simulation-based power analysis using the yates.oats dataset from agridat. These are the results of a split-plot field trial with 3 different genotypes (gen) in the main plots and 4 different levels of manure nitrogen (nitro) in the subplots. There are 6 replicated blocks. First, convert nitro to a factor variable and fit a linear mixed model with genotype, nitrogen, and their interaction as fixed effects, and random intercepts for block and for main plot (gen:block).
data(yates.oats)
yates.oats$nitro <- factor(yates.oats$nitro)
yates_lmm <- lmer(yield ~ gen + nitro + gen:nitro + (1 | block) + (1 | gen:block), data = yates.oats)
'gen:nitro' should be used as the name of the fixed effect you are testing.