Everything you ever wanted to know about means comparisons but were afraid to ask

Quentin D. Read

Who is this talk for?

  • Some practical experience with statistics is assumed (but not too much!)
  • The concepts discussed are independent of what software platform you prefer
    • A few examples are provided in R and SAS code, but those aren’t the focus of this talk

Informal poll

Who has done a “means separation” or “pairwise means comparison” or “post hoc test” before?

  • In many fields across agricultural science, we want to compare mean values among groups in experimental or observational studies
  • This talk is all about comparing means, and common issues that arise in the process

Estimation of marginal means

  • Some call them least-squares means (after the LSMEANS statement in SAS)
  • We can call them modeled means, or predicted means
  • I like to call them marginal means because they are the means in the “margins” of the table, averaging over all other fixed and random effects in the model
  • We also estimate their confidence intervals. For mixed models, this requires an approximation of the degrees of freedom

The eyeball test

  • Are the means of treatments A and B significantly different?
  • What about A and C? B and C?
  • ANSWER: Maybe, and maybe not!
  • “If two means’ error bars overlap, they aren’t significantly different.”
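  • NOPE! Two 95% confidence intervals may overlap even when the 95% confidence interval of the difference excludes zero, meaning the difference is significant
  • The opposite is also possible: error bars that do not overlap do not guarantee a significant difference, for example once the p-values are adjusted for multiple comparisons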
  • You must test whether the difference between each pair of means is significantly different from zero!
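
A small numerical sketch of the overlap point above. The means (10 and 13) and standard errors (1 each) are made-up values, the two estimates are assumed to be independent, and a large-sample normal approximation stands in for the t distribution so that only the Python standard library is needed.

```python
from math import sqrt
from statistics import NormalDist

z = NormalDist()            # standard normal, a large-sample approximation
crit = z.inv_cdf(0.975)     # about 1.96 for a 95% interval

# Hypothetical estimated means and standard errors
mean_a, se_a = 10.0, 1.0
mean_b, se_b = 13.0, 1.0

# 95% confidence interval of each mean
ci_a = (mean_a - crit * se_a, mean_a + crit * se_a)
ci_b = (mean_b - crit * se_b, mean_b + crit * se_b)
intervals_overlap = ci_a[1] > ci_b[0] and ci_b[1] > ci_a[0]

# The difference has its own standard error (independence assumed)
diff = mean_b - mean_a
se_diff = sqrt(se_a**2 + se_b**2)
ci_diff = (diff - crit * se_diff, diff + crit * se_diff)
p_diff = 2 * (1 - z.cdf(abs(diff) / se_diff))

print(f"CI of A: {ci_a[0]:.2f} to {ci_a[1]:.2f}")
print(f"CI of B: {ci_b[0]:.2f} to {ci_b[1]:.2f}")
print(f"Intervals overlap: {intervals_overlap}")
print(f"CI of B - A: {ci_diff[0]:.2f} to {ci_diff[1]:.2f}, p = {p_diff:.3f}")
```

With these values the two intervals overlap, yet the interval of the difference excludes zero.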

Significant differences from zero

  • The 95% confidence interval of the effect size of treatment A does not contain 0, and the 95% confidence interval of the effect size of treatment B contains 0.
  • Are these effect sizes significantly different?
  • ANSWER: Maybe, and maybe not!
  • “If A is significantly different from zero, and B is not significantly different from zero, then A is significantly different from B.”
  • NOPE! Zero does not have a “privileged” status when comparing two means
  • One confidence interval may contain zero and the other not, but the two means might not be significantly different from each other
  • You must test whether the difference between each pair of means is significantly different from zero!
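
The same point in numbers, as a rough sketch. The effect sizes (2.5 and 1.0) and standard errors (1 each) are invented for illustration, the estimates are assumed to be independent, and `two_sided_p` is a hypothetical helper that uses a normal approximation rather than a t distribution.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(estimate, se):
    """Two-sided p-value for H0: estimate = 0 (normal approximation)."""
    return 2 * (1 - NormalDist().cdf(abs(estimate) / se))


# Hypothetical effect sizes and standard errors
effect_a, se_a = 2.5, 1.0
effect_b, se_b = 1.0, 1.0

p_a = two_sided_p(effect_a, se_a)    # A compared with zero
p_b = two_sided_p(effect_b, se_b)    # B compared with zero
# A compared with B: test the difference, with its own standard error
p_a_vs_b = two_sided_p(effect_a - effect_b, sqrt(se_a**2 + se_b**2))

print(f"A vs zero: p = {p_a:.3f}")
print(f"B vs zero: p = {p_b:.3f}")
print(f"A vs B:    p = {p_a_vs_b:.3f}")
```

Here A differs significantly from zero and B does not, but the test of A against B is not significant.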

What are means comparisons or means separations?

  • Basically, they are t-tests comparing means two by two
  • Take the difference by subtracting one mean from the other
  • The p-value is calculated for the null hypothesis that the difference is equal to zero, using the degrees of freedom of the estimated marginal means
  • Then the p-value and confidence interval of the difference are adjusted to account for multiple comparisons
  • The tests differ in how degrees of freedom are estimated and how the p-value is adjusted for multiple comparisons
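
Written out, the test statistic for one pairwise comparison looks roughly like this. The notation (\(\bar{y}_A\), \(\bar{y}_B\) for the two estimated marginal means and \(SE\) for a standard error) is introduced here for illustration and is not from the slides:

\[
t = \frac{\bar{y}_A - \bar{y}_B}{SE(\bar{y}_A - \bar{y}_B)}, \qquad
SE(\bar{y}_A - \bar{y}_B) = \sqrt{SE_A^2 + SE_B^2 - 2\,\mathrm{Cov}(\bar{y}_A, \bar{y}_B)}
\]

The covariance term drops out when the two estimates are independent. The unadjusted p-value comes from comparing \(t\) to a t distribution with the model's degrees of freedom, and the multiple comparison adjustment is applied after that.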

Before we get into specifics, let’s talk about tradeoffs!

  • False positives (saying there is a difference when there’s not) and false negatives (failing to find a true difference) are both bad
  • This is especially an issue when you are doing lots of comparisons
  • For a given experiment, if you reduce the rate of false positives, you must increase the rate of false negatives, and vice versa. You can’t avoid this tradeoff!
  • If the multiple comparison correction is not very strict, you have high power to detect differences, but are you “dredging” or “fishing” in your data?
  • But if the correction is too strict, your test is too weak and you might be missing important and valuable information
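
To put rough numbers on the tradeoff, here is a sketch that computes the power of a single comparison at a lenient and a strict threshold. It assumes a two-sided z-test and a true difference equal to 3 standard errors; both are illustrative choices, and `power_two_sided` is a hypothetical helper.

```python
from statistics import NormalDist

z = NormalDist()


def power_two_sided(alpha, standardized_diff):
    """Approximate power of a two-sided z-test when the true difference
    equals `standardized_diff` standard errors."""
    crit = z.inv_cdf(1 - alpha / 2)
    # Probability that the test statistic falls beyond either critical value
    upper = 1 - z.cdf(crit - standardized_diff)
    lower = z.cdf(-crit - standardized_diff)
    return upper + lower


for alpha in (0.05, 0.005):
    power = power_two_sided(alpha, standardized_diff=3.0)
    print(f"alpha = {alpha}: power = {power:.2f}, "
          f"false negative rate = {1 - power:.2f}")
```

Moving from the lenient to the strict threshold lowers the false positive rate but raises the false negative rate for the same true difference.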

Multiple comparisons

  • The \(p < 0.05\) threshold means that, if the null hypothesis were true (the difference between the two populations’ means is zero), the probability of observing a difference between sample means at least as big as the one you observed, due to random sampling variation alone, is less than 0.05
  • The more pairs of means you compare, the more likely you are to incorrectly reject one or more of the null hypotheses of no difference
  • We adjust the p-value threshold for multiple comparisons
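
A quick sketch of how fast the problem grows. It assumes the tests are independent and that every null hypothesis is true, which is a simplification (pairwise comparisons among the same means are not independent); `familywise_error_rate` is a name chosen here for illustration.

```python
def familywise_error_rate(m, alpha=0.05):
    """Chance of at least one false positive across m independent tests,
    each run at level alpha, when every null hypothesis is true."""
    return 1 - (1 - alpha) ** m


# 45 is the number of pairwise comparisons among 10 means
for m in (1, 3, 10, 45):
    print(f"{m:>2} comparisons: {familywise_error_rate(m):.2f}")
```

With 10 unadjusted comparisons, the chance of at least one false positive is already about 0.40 under these assumptions.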

Planned and unplanned comparisons

  • Planned: selecting only a subset of the possible comparisons, chosen in advance
  • Unplanned: doing all the comparisons

Lots of ways to adjust for multiple comparisons!

  • Bonferroni
  • Šidák
  • Tukey HSD
  • Holm
  • FDR (Benjamini-Hochberg)
  • and more . . .

Simple adjustments

  • Bonferroni: If you have \(m\) comparisons, test each one at significance level \(\alpha_{adjusted}=\frac{\alpha}{m}\)
    • example: \(m=10\), \(\alpha=0.05\). Use threshold \(\alpha_{adjusted}=0.05/10=0.005\) to reject null hypotheses
  • Šidák: If you have \(m\) comparisons, test each one at significance level \(\alpha_{adjusted}=1-(1-\alpha)^{\frac{1}{m}}\)
    • example: \(m=10\), \(\alpha=0.05\). Use threshold \(\alpha_{adjusted}=1-(1-0.05)^{\frac{1}{10}}\approx0.0051\) to reject null hypotheses
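
The two formulas above are short enough to check directly. This sketch simply evaluates them for the \(m=10\), \(\alpha=0.05\) example; the function names are chosen here for illustration.

```python
def bonferroni_threshold(m, alpha=0.05):
    """Per-comparison significance level under the Bonferroni adjustment."""
    return alpha / m


def sidak_threshold(m, alpha=0.05):
    """Per-comparison significance level under the Sidak adjustment."""
    return 1 - (1 - alpha) ** (1 / m)


m = 10
print(f"Bonferroni: {bonferroni_threshold(m):.5f}")
print(f"Sidak:      {sidak_threshold(m):.5f}")
```

The Šidák threshold is always slightly larger (less strict) than the Bonferroni threshold, but the two are very close in practice.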

Stepdown adjustments

  • Holm (a.k.a. stepdown or sequential Bonferroni): If you have \(m\) comparisons, sort them in order of increasing p-value and test them at significance levels \(\frac{\alpha}{m}, \frac{\alpha}{m-1}, \frac{\alpha}{m-2}, \ldots, \frac{\alpha}{2}, \frac{\alpha}{1}\), respectively.
  • Go through the comparisons in order. As soon as you fail to reject a null hypothesis, stop; none of the remaining null hypotheses are rejected either.
    • example: \(m=10\), \(\alpha=0.05\). Thresholds are \(0.05/10=0.005, 0.05/9\approx0.0056\), etc.
  • The stepdown technique can be applied to other adjustments besides Bonferroni
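
The Holm steps above can be sketched in a few lines. The p-values are invented for illustration, and `holm_reject` is a hypothetical helper rather than a function from any statistics package.

```python
def holm_reject(p_values, alpha=0.05):
    """Return a list of booleans (in the original order) saying which
    null hypotheses the Holm step-down procedure rejects."""
    m = len(p_values)
    # Indices of the comparisons, from smallest to largest p-value
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, i in enumerate(order):
        threshold = alpha / (m - step)   # alpha/m, alpha/(m-1), ...
        if p_values[i] <= threshold:
            reject[i] = True
        else:
            break                        # first failure: stop here
    return reject


# Hypothetical unadjusted p-values from five comparisons
p_values = [0.03, 0.001, 0.2, 0.012, 0.004]

holm = holm_reject(p_values)
bonferroni = [p <= 0.05 / len(p_values) for p in p_values]
print("Holm:      ", holm)
print("Bonferroni:", bonferroni)
```

In this example Holm rejects three null hypotheses while plain Bonferroni rejects two, which shows why the stepdown version has more power.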

More complex procedures

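  • Tukey HSD: compares each standardized pairwise difference to the studentized range distribution, which describes the largest standardized difference you would expect among a set of means if there were no true differences
  • FDR (false discovery rate) adjustment: most common is Benjamini-Hochberg but there are others; these methods scale the p-values so that the expected proportion of false discoveries among the rejected null hypotheses stays at or below \(\alpha\)
  • These procedures are generally slightly less strict (higher power to detect differences, but higher rate of false positives) than the simpler methods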
  • I won’t go into the math here
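
For anyone who does want to see the mechanics, the Benjamini-Hochberg procedure is short: sort the p-values, compare the \(k\)-th smallest to \(\frac{k}{m}\alpha\), and reject everything up to the largest \(k\) that passes. The sketch below uses invented p-values, and `benjamini_hochberg_reject` is a hypothetical helper, not a function from any package.

```python
def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Return a list of booleans (in the original order) saying which
    null hypotheses the Benjamini-Hochberg procedure rejects."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value is at or below (k / m) * alpha
    largest_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            largest_rank = rank
    # Reject every hypothesis ranked at or below that k
    reject = [False] * m
    for i in order[:largest_rank]:
        reject[i] = True
    return reject


# Hypothetical unadjusted p-values from five comparisons
p_values = [0.03, 0.001, 0.2, 0.012, 0.004]
bh = benjamini_hochberg_reject(p_values)
print("Benjamini-Hochberg:", bh)
```

With these p-values, four of the five null hypotheses are rejected, more than a Bonferroni threshold of \(0.05/5=0.01\) would allow.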

Subsets of comparisons

  • Compare each of the other means to the control group
    • Dunnett adjustment is often used here
  • Consecutive (sort the means and compare adjacent means)
  • Compare each of the other means to the lowest or highest mean
  • Or choose a subset of scientifically interesting comparisons
  • It’s all good, as long as you properly correct for multiple comparisons!
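
One reason to choose a subset is that it shrinks the number of comparisons you have to correct for. The sketch below counts comparisons for three of the schemes above, using invented treatment names and means. Bonferroni thresholds are printed only as a simple yardstick; a Dunnett adjustment would normally be used for comparisons against a control.

```python
from itertools import combinations

# Hypothetical estimated marginal means for five treatments
means = {"control": 4.1, "A": 5.3, "B": 4.8, "C": 6.0, "D": 5.1}
treatments = list(means)

# Every pair of treatments
all_pairs = list(combinations(treatments, 2))

# Each non-control treatment against the control
versus_control = [("control", t) for t in treatments if t != "control"]

# Adjacent means after sorting from lowest to highest
ranked = sorted(means, key=means.get)
consecutive = list(zip(ranked, ranked[1:]))

alpha = 0.05
for label, pairs in [("all pairwise", all_pairs),
                     ("versus control", versus_control),
                     ("consecutive", consecutive)]:
    print(f"{label}: {len(pairs)} comparisons, "
          f"Bonferroni threshold {alpha / len(pairs):.4f}")
```

With five treatments, comparing only against the control cuts the number of tests from 10 to 4, so each test faces a less strict threshold.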

ANOVA versus post hoc test

  • It is a common misconception that you can only compare means if you get a significant F-test in your ANOVA … NOPE!
  • This is only true for the Scheffé procedure and Fisher’s LSD
  • Otherwise, the multiple comparison correction already gives at least as much protection against false positives as the “two step” procedure, and often more

Means comparison doesn’t have to be a “two step” process

“OMG my ANOVA and post hoc test don’t match!”

[Three figures: examples where the post hoc test and the ANOVA do not match]

  • This is perfectly normal
  • There are two different null hypotheses being tested, that are not guaranteed to give the same result
  • ANOVA’s null hypothesis: the ratio of among-group variance to within-group variance is no higher than we would expect if none of the group means differ
    • This is a so-called omnibus F-test
  • Post hoc test’s null hypothesis: the difference between this specific pair of means is zero
  • The post hoc test is typically adjusted for multiple comparisons but the ANOVA has no such adjustment
  • More often than not, the ANOVA p-value will be below the significance threshold and the post hoc p-values will not, but the opposite is also possible
  • This usually occurs only in “borderline” cases; a mismatch should make you skeptical that you’ve discovered a very strong pattern

“But what do I do if they don’t match?”

  • Be honest about it!
  • Example: “The F-test indicated significant among-treatment variation in vibranium production, but post hoc comparison using the Tukey multiple comparison adjustment did not identify any pair of treatments that significantly differed from one another.”

Skip the ANOVA

  • It is actually OK to skip the omnibus F-test and go straight to the means comparisons
  • If the biological hypothesis you are testing is just about the differences between means, there is no need to present the F-test results
  • Historically, you couldn’t skip the ANOVA if you were calculating everything by hand because the sums of squares used to calculate the F-ratios are also used for the post hoc tests
  • But if you are using a computer, feel free to skip the ANOVA!