Everything you ever wanted to know about means comparisons but were afraid to ask

Quentin D. Read

Who is this talk for?

  • Some practical experience with statistics is assumed (but not too much!)
  • The concepts discussed are independent of what software platform you prefer
    • A few examples are provided in R and SAS code, but those aren’t the focus of this talk

Informal poll

Who has done a “means separation” or “pairwise means comparison” or “post hoc test” before?

  • In many fields across agricultural science, we want to compare mean values among groups in experimental or observational studies
  • This talk is all about comparing means, and common issues that arise in the process

Estimation of marginal means

  • Some call them least-squares means (a name inspired by the lsmeans statement in SAS)
  • We can call them modeled means, or predicted means
  • I like to call them marginal means because they are the means in the “margins” of the table, averaging over all other fixed and random effects in the model
  • We also estimate the confidence intervals. For mixed models, an approximation of degrees of freedom is calculated
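
For example, a minimal R sketch using the emmeans package (the model, the data frame dat, and the variables yield, treatment, and block are hypothetical):

    library(lme4)      # mixed models
    library(emmeans)   # estimated marginal means

    # Hypothetical data frame 'dat': 'yield' measured under several
    # 'treatment' levels across 'block's
    fit <- lmer(yield ~ treatment + (1 | block), data = dat)

    # Marginal means, confidence intervals, and approximate degrees of
    # freedom (Kenward-Roger by default for lmer fits, if pbkrtest is installed)
    emmeans(fit, ~ treatment)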

The eyeball test

  • Are the means of treatments A and B significantly different?
  • What about A and C? B and C?
  • ANSWER: Maybe, and maybe not!
  • “If two means’ error bars overlap, they aren’t significantly different.”
  • NOPE! Two means’ 95% confidence intervals may overlap even when the 95% confidence interval of their difference excludes zero (i.e., the difference is significant)
  • The opposite is also possible
  • You must test whether the difference between each pair of means is significantly different from zero!
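
A small numeric illustration in base R, using made-up summary statistics: the two 95% confidence intervals overlap, yet the 95% confidence interval of the difference excludes zero.

    # Made-up summary statistics for two large groups (so z-based CIs are fine)
    m_A <- 10.0; se_A <- 1.0   # mean and standard error of treatment A
    m_B <- 13.0; se_B <- 1.0   # mean and standard error of treatment B

    # Individual 95% confidence intervals -- these overlap
    ci_A <- m_A + c(-1, 1) * 1.96 * se_A   #  8.04 to 11.96
    ci_B <- m_B + c(-1, 1) * 1.96 * se_B   # 11.04 to 14.96

    # 95% CI of the difference (independent groups) -- excludes zero
    se_diff <- sqrt(se_A^2 + se_B^2)
    ci_diff <- (m_B - m_A) + c(-1, 1) * 1.96 * se_diff   # 0.23 to 5.77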

Significant differences from zero

  • The 95% confidence interval of the effect size of treatment A does not contain 0, but the 95% confidence interval of the effect size of treatment B does contain 0.
  • Are these effect sizes significantly different?
  • ANSWER: Maybe, and maybe not!
  • “If A is significantly different from zero, and B is not significantly different from zero, then A is significantly different from B.”
  • NOPE! Zero does not have a “privileged” status when comparing two means
  • One confidence interval may contain zero and the other not, but the two means might not be significantly different from each other
  • You must test whether the difference between each pair of means is significantly different from zero!

What are means comparisons or means separations?

  • Basically, they are t-tests comparing means two by two
  • Take the difference by subtracting one mean from the other
  • The p-value is calculated for the null hypothesis that the difference is equal to zero, using the degrees of freedom of the estimated marginal means
  • Then the p-value and confidence interval of the difference are adjusted to account for multiple comparisons
  • The tests differ in how degrees of freedom are estimated and how the p-value is adjusted for multiple comparisons
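
A minimal R sketch of this workflow with the emmeans package (the model and data frame dat are hypothetical):

    library(emmeans)

    # Hypothetical one-way design: response 'yield', factor 'treatment'
    fit <- lm(yield ~ treatment, data = dat)
    emm <- emmeans(fit, ~ treatment)

    # t-tests of all pairwise differences between the marginal means;
    # the p-values (and the CIs from confint) are adjusted for multiple
    # comparisons, using the Tukey method by default
    pairs(emm)
    confint(pairs(emm))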

Before we get into specifics, let’s talk about tradeoffs!

  • False positives (saying there is a difference when there’s not) and false negatives (failing to find a true difference) are both bad
  • This is especially an issue when you are doing lots of comparisons
  • If you reduce the rate of false positives, you must increase the rate of false negatives, and vice versa. You can’t avoid this tradeoff!
  • If the multiple comparison correction is not very strict, you have high power to detect differences, but are you “dredging” or “fishing” in your data?
  • But if the correction is too strict, your test is too weak and you might be missing important and valuable information

Multiple comparisons

  • The \(p < 0.05\) threshold means: if the null hypothesis were true that the two populations’ means are equal, the probability of observing a difference between your sample means at least as big as the one you observed, due to random sampling variation alone, would be less than 0.05
  • The more pairs of means you compare, the more likely you are to incorrectly reject one or more of the null hypotheses of no difference
  • We adjust the p-value threshold for multiple comparisons

Planned and unplanned comparisons

  • Planned: selecting only a subset of the possible comparisons, chosen in advance
  • Unplanned: doing all the possible comparisons

Lots of ways to adjust for multiple comparisons!

  • Bonferroni
  • Šidák
  • Tukey HSD
  • Holm
  • FDR (Benjamini-Hochberg)
  • and more . . .

Simple adjustments

  • Bonferroni: If you have \(m\) comparisons, test each one at significance level \(\alpha_{adjusted}=\frac{\alpha}{m}\)
    • example: \(m=10\), \(\alpha=0.05\). Use threshold \(\alpha_{adjusted}=0.05/10=0.005\) to reject null hypotheses
  • Šidák: If you have \(m\) comparisons, test each one at significance level \(\alpha_{adjusted}=1-(1-\alpha)^{\frac{1}{m}}\)
    • example: \(m=10\), \(\alpha=0.05\). Use threshold \(\alpha_{adjusted}=1-(1-0.05)^{\frac{1}{10}}\approx0.0051\) to reject null hypotheses
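
In R, these thresholds are simple arithmetic (equivalently, p.adjust() can scale the p-values themselves and leave the threshold at \(\alpha\)):

    alpha <- 0.05
    m <- 10   # number of comparisons

    alpha / m                  # Bonferroni threshold: 0.005
    1 - (1 - alpha)^(1 / m)    # Sidak threshold: ~0.0051

    # Equivalently, adjust the p-values instead of the threshold:
    # p.adjust(p, method = "bonferroni")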

Stepdown adjustments

  • Holm (a.k.a. stepdown or sequential Bonferroni): If you have \(m\) comparisons, sort in order of increasing p-value and test each one at significance level \(\frac{\alpha}{m}, \frac{\alpha}{m-1}, \frac{\alpha}{m-2}, ..., \frac{\alpha}{2}, \frac{\alpha}{1}\).
  • Go through the comparisons in order. If you fail to reject a null hypothesis, stop.
    • example: \(m=10\), \(\alpha=0.05\). Thresholds are \(0.05/10=0.005, 0.05/9\approx0.0056\), etc.
  • The stepdown technique can be applied to other adjustments besides Bonferroni
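
A quick R illustration with made-up p-values; p.adjust() returns adjusted p-values that can be compared directly to \(\alpha\), which is equivalent to the stepdown thresholds above:

    # Made-up p-values from m = 10 comparisons
    p <- c(0.001, 0.004, 0.006, 0.012, 0.020, 0.090, 0.150, 0.210, 0.450, 0.800)

    # Holm (stepdown Bonferroni) adjustment
    p.adjust(p, method = "holm")

    # For contrast, the single-step Bonferroni adjustment
    p.adjust(p, method = "bonferroni")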

More complex procedures

  • Tukey HSD: compares each standardized pairwise difference to the studentized range distribution (the null distribution of the largest standardized difference among the group means)
  • FDR (false discovery rate) adjustment: most common is Benjamini-Hochberg but there are others; these methods adjust the p-values so that the expected proportion of false discoveries (false positives among the differences you declare significant) stays at \(\alpha\)
  • These are slightly less strict (higher power to detect differences, but a higher rate of false positives) than the simpler adjustments above
  • I won’t go into the math here
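
For illustration, base R provides both: TukeyHSD() for all pairwise comparisons from an ANOVA fit, and p.adjust() for the Benjamini-Hochberg adjustment (the chickwts data and the p-values below are just examples):

    # Tukey HSD on a built-in data set: chick weight by feed type
    fit <- aov(weight ~ feed, data = chickwts)
    TukeyHSD(fit)   # all pairwise differences with Tukey-adjusted p-values and CIs

    # Benjamini-Hochberg FDR adjustment of a vector of raw p-values
    p <- c(0.001, 0.004, 0.006, 0.012, 0.020, 0.090, 0.150, 0.210, 0.450, 0.800)
    p.adjust(p, method = "BH")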

Subsets of comparisons

  • Compare every other mean to the control group
    • Dunnett adjustment is often used here
  • Consecutive (sort the means and compare adjacent means)
  • Compare every other mean to the lowest or highest mean
  • Or choose a subset of scientifically interesting comparisons
  • It’s all good, as long as you properly correct for multiple comparisons!
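
A sketch of a few of these subsets using contrast methods in the emmeans package (the model and data frame dat are hypothetical):

    library(emmeans)

    # Hypothetical model; the first level of 'treatment' is the control
    fit <- lm(yield ~ treatment, data = dat)
    emm <- emmeans(fit, ~ treatment)

    contrast(emm, method = "trt.vs.ctrl")  # each treatment vs. the control (Dunnett-style adjustment by default)
    contrast(emm, method = "consec")       # adjacent levels, in factor-level order
    pairs(emm)                             # all pairwise (Tukey adjustment by default)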

ANOVA versus post hoc test

  • It is a common misconception that you can only compare means if you get a significant F-test in your ANOVA … NOPE!
  • This is only true for the Scheffé procedure and Fisher’s protected LSD
  • Otherwise, the multiple comparison correction on its own controls the error rate at least as well as (and often more strictly than) the “two step” procedure

Means comparison doesn’t have to be a “two step” process

[Figure slides: three examples in which the post hoc test and the ANOVA do not match]

“OMG my ANOVA and post hoc test don’t match!”

  • This is perfectly normal
  • There are two different null hypotheses being tested, that are not guaranteed to give the same result
  • ANOVA’s null hypothesis: the ratio of among-group variance to within-group variance is no larger than we would expect if there were no differences among the group means
    • This is a so-called omnibus F-test
  • Post hoc test’s null hypothesis: the difference between this specific pair of means is zero
  • The post hoc test is typically adjusted for multiple comparisons but the ANOVA has no such adjustment
  • More often than not the mismatch is a significant ANOVA with no significant pairwise comparisons, but the opposite is also possible
  • This only occurs in “borderline” cases; a mismatch should make you skeptical that you’ve discovered a very strong pattern

“But what do I do if they don’t match?”

  • Be honest about it!
  • Example: “The F-test indicated significant among-treatment variation in vibranium production, but post hoc comparison using the Tukey multiple comparison adjustment did not identify any pair of treatments that significantly differed from one another.”

Skip the ANOVA

  • It is actually OK to skip the omnibus F-test and go straight to the means comparisons
  • If the biological hypothesis you are testing is just about the differences between means, there is no need to present the F-test results
  • Historically, you couldn’t skip the ANOVA if you were calculating everything by hand because the sums of squares used to calculate the F-ratios are also used for the post hoc tests
  • But if you are using a computer, feel free to skip the ANOVA!

Alphabet soup

  • Compact Letter Display (CLD)
  • A concise way to summarize comparisons
  • Hans-Peter Piepho developed an algorithm for constructing these displays, originally a SAS macro and now part of the lsmeans statement; it is also implemented in R (multcomp::cld())
  • This becomes unwieldy and hard to interpret when there are many groups
  • It may still be important to present magnitude and test statistics for individual pairwise differences
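
A minimal sketch of a compact letter display in R, assuming a hypothetical one-way model; emmeans hands the work off to multcomp::cld():

    library(emmeans)
    library(multcomp)   # provides the cld() generic

    # Hypothetical one-way model
    fit <- lm(yield ~ treatment, data = dat)
    emm <- emmeans(fit, ~ treatment)

    # Compact letter display: means sharing a letter are not significantly
    # different after the (default Tukey) adjustment
    cld(emm, Letters = letters)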

Means comparisons with transformed response variables and GLMMs

  • If the response variable is on the log scale, or you have a GLMM with a log link function, the comparisons are ratios and not differences
  • Start with a difference of two logs:

\[\log a - \log b\]

  • Back-transform to linear scale by exponentiating and you get a ratio:

\[e^{\log a - \log b} = \frac{e^{\log a}}{e^{\log b}} = \frac{a}{b}\]

  • Example: “Mean aggression in the electroshock treatment group was 2.54 times higher than in the acupuncture treatment group (95% CI [2.1, 3.0], Tukey-adjusted p = 0.003) and 6.67 times higher than in the untreated control (95% CI [5.95, 7.23], p < 1 × 10⁻⁵).”
  • Always inverse transform your means and confidence intervals — they’re much easier to understand that way!
  • Back-transforming in R and SAS
    • R: emmeans(model, pairwise ~ treatment, type = 'response')
    • SAS: lsmeans treatment / diff ilink
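
A hypothetical R example with a log-transformed response; emmeans detects the log transformation in the model formula and reports the back-transformed comparisons as ratios:

    library(emmeans)

    # Hypothetical model with a log-transformed response
    fit <- lm(log(aggression) ~ treatment, data = dat)

    # type = 'response' back-transforms: the means are reported as geometric
    # means and the pairwise comparisons as ratios, with CIs on the same scale
    emmeans(fit, pairwise ~ treatment, type = "response")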

Means comparisons in binomial GLMMs

  • In binomial GLMMs, the means are probabilities
  • We often use a logit link function so the means are on the log-odds scale
  • The means comparisons are odds ratios, because exponentiating the difference between two log-odds gives an odds ratio
  • Same logic as previous but I won’t show the math
  • Example: “Mean probability of developing Read’s Disease in Genotype A was significantly higher than in Genotype B (OR 5.3, 95% CI [4.3, 6.3], Šidák-adjusted p = 0.0001) but was not different from Genotype C (OR = 0.85, 95% CI [0.55, 1.15], p = 0.46).”
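
A hypothetical R sketch of a binomial GLMM fit with lme4, where the back-transformed comparisons come out as odds ratios:

    library(lme4)
    library(emmeans)

    # Hypothetical binomial GLMM: diseased vs. healthy counts by genotype,
    # with a random block effect and the default logit link
    fit <- glmer(cbind(diseased, healthy) ~ genotype + (1 | block),
                 data = dat, family = binomial)

    # Means back-transformed to probabilities; pairwise comparisons reported
    # as odds ratios
    emmeans(fit, pairwise ~ genotype, type = "response")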

Comparing interaction means

  • Another common misconception: if you have interactions, you must compare every combination to every single other combination … NOPE!
  • This may be way too many comparisons, most of which are scientifically uninteresting
  • For example if you have 2 levels of treatment A, and 4 of treatment B, that is 8 combinations. That would be 7+6+5+4+3+2+1 = 28 pairwise comparisons!
    • Instead, you could do pairwise comparisons of treatment A within each level of treatment B (4 total comparisons)
    • Or you could compare treatment B within each level of treatment A (6 comparisons each, 12 total)
    • It depends which is more scientifically relevant
  • Can be easily done in R and SAS
    • R: emmeans(model, pairwise ~ A | B, adjust = 'tukey')
    • SAS: slice A*B / sliceby=B diff adjust=tukey
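
A slightly fuller (hypothetical) R sketch of the two options above:

    library(emmeans)

    # Hypothetical two-factor model with an interaction
    fit <- lm(yield ~ A * B, data = dat)

    # Compare levels of A within each level of B ...
    emmeans(fit, pairwise ~ A | B, adjust = "tukey")

    # ... or levels of B within each level of A
    emmeans(fit, pairwise ~ B | A, adjust = "tukey")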

Bayesian means comparison

  • We may use Bayesian methods instead
  • We take the posterior distribution of each estimated marginal mean, and take the difference
  • We find the median and credible interval of these differences to assess the strength of evidence for an effect
  • There are methods to calculate analogues to p-values for each of the pairwise comparisons
    • Bayes Factors (BF), maximum a posteriori probabilities, probability based on region of practical equivalence (ROPE)
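
A minimal base-R sketch, assuming you have already extracted a matrix of posterior draws of the group means from a fitted Bayesian model (the draws below are made up):

    # 'draws': posterior samples x groups matrix of marginal means; here the
    # draws are made up, but in practice they would come from a fitted model
    set.seed(1)
    draws <- cbind(A = rnorm(4000, 10, 1), B = rnorm(4000, 12, 1))

    diff_AB <- draws[, "B"] - draws[, "A"]   # posterior of the difference
    median(diff_AB)                          # point estimate of the difference
    quantile(diff_AB, c(0.025, 0.975))       # 95% (quantile) credible interval
    mean(diff_AB > 0)                        # posterior probability that B > A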

Multiple comparisons in Bayesian analysis

  • Post hoc multiple comparison correction is not needed if sufficiently skeptical priors are used
  • Assigning fairly high prior probability to null or very small effect works similarly to adjusting p-value for multiple comparisons; they both make it less likely for you to say there is a big effect when the real effect is small
  • But some people “double up” and adjust BFs for multiple comparisons; I don’t recommend this

Means comparisons don’t scale

  • If you have a whole lot of things you’re comparing, many of these tests become too conservative
  • With lots of comparisons, it’s unreasonable to try to commit zero false positive errors
  • Use an FDR correction: the goal is to keep the false discovery rate (the expected proportion of declared differences that are actually false positives) to a small percentage
  • But it’s important to keep in mind this is still intended for modest numbers of comparisons
  • You may want to move to a machine learning approach

Questions?