Everything you ever wanted to know about means comparisons but were afraid to ask

Quentin D. Read

Who is this talk for?

Some practical experience with statistics is assumed (but not too much!)

The concepts discussed are independent of what software platform you prefer
- A few examples are provided in R and SAS code, but those aren’t the focus of this talk

Informal poll

Who has done a “means separation” or “pairwise means comparison” or “post hoc test” before?

In many fields across agricultural science, we want to compare mean values among groups in experimental or observational studies
This talk is all about comparing means, and common issues that arise in the process

Estimation of marginal means

Some call them least-squares means (after the lsmeans statement in SAS)

We can call them modeled means, or predicted means

I like to call them marginal means because they are the means in the “margins” of the table, averaging over all other fixed and random effects in the model

We also estimate the confidence intervals. For mixed models, an approximation of degrees of freedom is calculated

The eyeball test

Are the means of treatments A and B significantly different?
What about A and C? B and C?

ANSWER: Maybe, and maybe not!

“If two means’ error bars overlap, they aren’t significantly different.”

NOPE! Two 95% confidence intervals may overlap even if the 95% confidence interval of the difference between the means excludes zero

The opposite is also true

You must test whether the difference between each pair of means is significantly different from zero!
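A minimal numeric sketch of why the eyeball test fails. The means and standard errors are made-up values, the two estimates are assumed to be independent, and a normal approximation stands in for the t distribution a real model would use.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()
z_crit = std_normal.inv_cdf(0.975)  # about 1.96 for a 95% interval

# Hypothetical estimates (mean, standard error) for two independent groups
mean_a, se_a = 10.0, 0.8
mean_b, se_b = 12.5, 0.8

def ci(estimate, se):
    """Normal-approximation 95% confidence interval."""
    return (estimate - z_crit * se, estimate + z_crit * se)

ci_a = ci(mean_a, se_a)
ci_b = ci(mean_b, se_b)
intervals_overlap = ci_a[1] > ci_b[0] and ci_b[1] > ci_a[0]

# Standard error of a difference of two independent estimates
diff = mean_b - mean_a
se_diff = sqrt(se_a**2 + se_b**2)
ci_diff = ci(diff, se_diff)
p_value = 2 * (1 - std_normal.cdf(abs(diff) / se_diff))

print(f"CI A: {ci_a}, CI B: {ci_b}, overlap: {intervals_overlap}")
print(f"Difference: {diff}, 95% CI: {ci_diff}, p = {p_value:.3f}")
```

The two intervals overlap, yet the interval of the difference excludes zero. The reason is that the standard error of the difference is the square root of the sum of the squared standard errors, which is smaller than the sum of the two standard errors that the eyeball test implicitly uses.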

Significant differences from zero

The 95% confidence interval of effect size of treatment A does not contain 0, and the 95% confidence interval of effect size of treatment B contains 0.
Are these effect sizes significantly different?

ANSWER: Maybe, and maybe not!

“If A is significantly different from zero, and B is not significantly different from zero, then A is significantly different from B.”

NOPE! Zero does not have a “privileged” status when comparing two means

One confidence interval may contain zero and the other not, but the two means might not be significantly different from each other

You must test whether the difference between each pair of means is significantly different from zero!
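The same kind of sketch for this case, again with made-up effect sizes, independent estimates, and a normal approximation (all of these are assumptions for illustration only).

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()
z_crit = std_normal.inv_cdf(0.975)

# Hypothetical effect sizes (estimate, standard error), assumed independent
effect_a, se_a = 2.0, 0.9
effect_b, se_b = 1.0, 0.9

def excludes_zero(estimate, se):
    """True if the normal-approximation 95% interval does not contain 0."""
    lower, upper = estimate - z_crit * se, estimate + z_crit * se
    return lower > 0 or upper < 0

a_differs_from_zero = excludes_zero(effect_a, se_a)
b_differs_from_zero = excludes_zero(effect_b, se_b)

# The comparison that actually answers the question: A minus B
diff = effect_a - effect_b
se_diff = sqrt(se_a**2 + se_b**2)
a_differs_from_b = excludes_zero(diff, se_diff)
p_diff = 2 * (1 - std_normal.cdf(abs(diff) / se_diff))

print(a_differs_from_zero, b_differs_from_zero, a_differs_from_b)
print(f"p for A - B = {p_diff:.2f}")
```

Here A differs from zero and B does not, but the test of A minus B is nowhere near significant.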

What are means comparisons or means separations?

Basically, they are t-tests comparing means two by two

Take the difference by subtracting one mean from the other

The p-value is calculated for the null hypothesis that the difference is equal to zero, using the degrees of freedom of the estimated marginal means

Then the p-value and confidence interval of the difference are adjusted to account for multiple comparisons

The tests differ in how degrees of freedom are estimated and how the p-value is adjusted for multiple comparisons

Before we get into specifics, let’s talk about tradeoffs!

False positives (saying there is a difference when there’s not) and false negatives (failing to find a true difference) are both bad

This is especially an issue when you are doing lots of comparisons

For a given design and sample size, if you reduce the rate of false positives, you must increase the rate of false negatives, and vice versa. You can’t avoid this tradeoff!

If the multiple comparison correction is not very strict, you have high power to detect differences, but are you “dredging” or “fishing” in your data?

But if the correction is too strict, your test is too weak and you might be missing important and valuable information

Multiple comparisons

The \(p < 0.05\) threshold means that, if the null hypothesis were true (the difference between the two populations’ means is zero), the probability of observing a difference between the sample means at least as big as the one you observed, due to random sampling variation alone, would be less than 0.05

The more pairs of means you compare, the more likely you are to incorrectly reject one or more of the null hypotheses of no difference

We adjust the p-value threshold for multiple comparisons
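A rough sketch of how quickly the problem grows. It assumes the tests are independent, which pairwise comparisons sharing the same groups are not, so treat the numbers as an illustration of the trend and not as exact error rates.

```python
alpha = 0.05

def familywise_error_rate(m, alpha=alpha):
    """Chance of at least one false positive among m independent tests,
    each run at level alpha, when every null hypothesis is true."""
    return 1 - (1 - alpha) ** m

# 45 is the number of pairwise comparisons among 10 groups
for m in (1, 3, 10, 45):
    print(f"{m:>3} comparisons: {familywise_error_rate(m):.3f}")
```

With 10 unadjusted comparisons the chance of at least one false positive is already about 40%.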

Planned and unplanned comparisons

Planned: selecting only a subset of the possible comparisons
Unplanned: doing all the comparisons

Lots of ways to adjust for multiple comparisons!

Bonferroni
Šidák
Tukey HSD
Holm
FDR (Benjamini-Hochberg)
and more . . .

Simple adjustments

Bonferroni: If you have \(m\) comparisons, test each one at significance level \(\alpha_{adjusted}=\frac{\alpha}{m}\)
- example: \(m=10\), \(\alpha=0.05\). Use threshold \(\alpha_{adjusted}=0.05/10=0.005\) to reject null hypotheses

Šidák: If you have \(m\) comparisons, test each one at significance level \(\alpha_{adjusted}=1-(1-\alpha)^{\frac{1}{m}}\)
- example: \(m=10\), \(\alpha=0.05\). Use threshold \(\alpha_{adjusted}=1-(1-0.05)^{\frac{1}{10}}\approx0.0051\) to reject null hypotheses
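The two formulas above can be checked in a few lines. The function names are just labels chosen for this sketch.

```python
def bonferroni_threshold(alpha, m):
    """Per-comparison significance level under Bonferroni."""
    return alpha / m

def sidak_threshold(alpha, m):
    """Per-comparison significance level under Sidak."""
    return 1 - (1 - alpha) ** (1 / m)

alpha, m = 0.05, 10
bonferroni = bonferroni_threshold(alpha, m)
sidak = sidak_threshold(alpha, m)

print(f"Bonferroni: {bonferroni:.4f}")
print(f"Sidak:      {sidak:.4f}")
```

The Šidák threshold is always slightly larger (less strict) than the Bonferroni threshold, but the difference is tiny in practice.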

Stepdown adjustments

Holm (a.k.a. stepdown or sequential Bonferroni): If you have \(m\) comparisons, sort in order of increasing p-value and test each one at significance level \(\frac{\alpha}{m}, \frac{\alpha}{m-1}, \frac{\alpha}{m-2}, ..., \frac{\alpha}{2}, \frac{\alpha}{1}\).

Go through the comparisons in order. If you fail to reject a null hypothesis, stop, and do not reject any of the remaining null hypotheses either.
- example: \(m=10\), \(\alpha=0.05\). Thresholds are \(0.05/10=0.005, 0.05/9\approx0.0056\), etc.

The stepdown technique can be applied to other adjustments besides Bonferroni
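A minimal sketch of the Holm procedure as described above, applied to five made-up p-values and compared with plain Bonferroni. The function name `holm_reject` is an invented label, not part of any library.

```python
def holm_reject(p_values, alpha=0.05):
    """Return a list of booleans: True where Holm's procedure rejects the null."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, idx in enumerate(order):
        threshold = alpha / (m - step)  # alpha/m, alpha/(m-1), ..., alpha/1
        if p_values[idx] <= threshold:
            reject[idx] = True
        else:
            break  # first failure: this and all remaining nulls are retained
    return reject

# Hypothetical p-values from five pairwise comparisons
p_values = [0.030, 0.001, 0.012, 0.200, 0.009]

holm = holm_reject(p_values)
bonferroni = [p <= 0.05 / len(p_values) for p in p_values]

print("Holm:      ", holm)
print("Bonferroni:", bonferroni)
```

In this example Holm rejects one more null hypothesis (the one with p = 0.012) than Bonferroni does, which illustrates why the stepdown version is never less powerful than the simple version.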

More complex procedures

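Tukey HSD: compares each standardized pairwise difference to the studentized range distribution, which describes how big the largest standardized difference among a set of means would be if there were no true differences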

FDR (false discovery rate) adjustment: most common is Benjamini-Hochberg but there are others; these methods scale the p-values so that the expected proportion of false discoveries among the rejected null hypotheses stays at or below \(\alpha\)

Slightly less strict (higher power to detect differences, but higher rate of false positives) than the simpler methods

I won’t go into the math here
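For anyone who does want a peek at the mechanics, here is a minimal sketch of the Benjamini-Hochberg step-up rule on made-up p-values. The function name is an invented label; in practice you would use your software's built-in adjustment.

```python
def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Return a list of booleans: True where the BH procedure rejects the null."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value is <= (k / m) * alpha
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_k = rank
    # Reject the k hypotheses with the smallest p-values
    reject = [False] * m
    for idx in order[:largest_k]:
        reject[idx] = True
    return reject

# Hypothetical p-values from five pairwise comparisons
p_values = [0.030, 0.001, 0.012, 0.200, 0.009]
bh = benjamini_hochberg_reject(p_values)
print(bh)
```

With these five p-values, the rule rejects four of the five null hypotheses at \(\alpha = 0.05\), whereas a Bonferroni threshold of 0.01 would reject only two.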

Other methods: historically interesting but no longer recommended

Fisher’s LSD: Pairwise t-tests using ANOVA’s mean squared error as their variance estimates

Scheffé method: Adjust confidence intervals of all possible contrasts, not just pairwise

Student-Newman-Keuls (SNK) method: Stepwise procedure sorting pairwise differences by size and comparing from biggest difference to smallest

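I don’t recommend these because they either don’t adequately correct for multiple comparisons (Fisher’s LSD, SNK) or are overly conservative when you only want pairwise comparisons (Scheffé)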

Subsets of comparisons

Compare every other mean to the control group
- Dunnett adjustment is often used here

Consecutive (sort the means and compare adjacent means)

Compare every other mean to the lowest or highest mean

Or choose a subset of scientifically interesting comparisons

It’s all good, as long as you properly correct for multiple comparisons!
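A small sketch of how the choice of subset changes which comparisons (and how many) you need to correct for. The group names and means are made up.

```python
from itertools import combinations

# Hypothetical estimated marginal means; "control" is the reference group
means = {"control": 4.1, "low": 5.0, "medium": 6.3, "high": 6.1}

# Every group against every other group
all_pairs = list(combinations(means, 2))

# Every other group against the control
versus_control = [("control", g) for g in means if g != "control"]

# Sort the means, then compare each one with its neighbor
ranked = sorted(means, key=means.get)
consecutive = list(zip(ranked, ranked[1:]))

print(len(all_pairs), "pairwise:", all_pairs)
print(len(versus_control), "versus control:", versus_control)
print(len(consecutive), "consecutive:", consecutive)
```

With four groups there are six pairwise comparisons but only three in either subset, so the correction applied to each comparison can be less severe.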

ANOVA versus post hoc test

It is a common misconception that you can only compare means if you get a significant F-test in your ANOVA … NOPE!

This is only true for the Scheffé procedure and Fisher’s LSD

Otherwise, the multiple comparison correction on its own controls false positives at least as strictly as the “two step” procedure

Means comparison doesn’t have to be a “two step” process

“OMG my ANOVA and post hoc test don’t match!”

[Figures: three examples in which the post hoc test and the ANOVA do not match]

This is perfectly normal

There are two different null hypotheses being tested, that are not guaranteed to give the same result

ANOVA’s null hypothesis: the ratio of among-group variance to within-group variance is no higher than we would expect if all the group means were equal
- This is a so-called omnibus F-test

Post hoc test’s null hypothesis: the difference between this specific pair of means is zero

The post hoc test is typically adjusted for multiple comparisons but the ANOVA has no such adjustment

More often than not, the ANOVA’s p-value will be below the significance threshold and the post hoc tests’ p-values will not, but the opposite is also possible

This only occurs in “borderline” cases; a mismatch should make you skeptical that you’ve discovered a very strong pattern

“But what do I do if they don’t match?”

Be honest about it!
Example: “The F-test indicated significant among-treatment variation in vibranium production, but post hoc comparison using the Tukey multiple comparison adjustment did not identify any pair of treatments that significantly differed from one another.”

Skip the ANOVA

It is actually OK to skip the omnibus F-test and go straight to the means comparisons

If the biological hypothesis you are testing is just about the differences between means, there is no need to present the F-test results

Historically, you couldn’t skip the ANOVA if you were calculating everything by hand because the sums of squares used to calculate the F-ratios are also used for the post hoc tests

But if you are using a computer, feel free to skip the ANOVA!

Alphabet soup

Compact Letter Display (CLD)

A concise way to summarize comparisons

Hans-Peter Piepho developed an algorithm in SAS, originally a macro and now part of the lsmeans statement, which is also implemented in R (multcomp::cld())

A CLD should not be used if there are too many groups

It may still be important to present magnitude and test statistics for individual pairwise differences

Means comparisons with transformed response variables and GLMMs

If the response variable is on the log scale, or you have a GLMM with a log link function, the comparisons are ratios and not differences

Start with a difference of two logs:

\[\log a - \log b\]

Back-transform to linear scale by exponentiating and you get a ratio:

\[e^{\log a - \log b} = \frac{e^{\log a}}{e^{\log b}} = \frac{a}{b}\]

Example: “Mean aggression in the electroshock treatment group was 2.54 times higher than in the acupuncture treatment group (95% CI [2.1, 3.0], Tukey-adjusted p = 0.003) and 6.67 times higher than in the untreated control (95% CI [5.95, 7.23], p < 1 × 10^-5).”

Always inverse transform your means and confidence intervals — they’re much easier to understand that way!
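A minimal sketch of back-transforming a log-scale difference and its confidence interval by hand. The means, the standard error, and the normal-approximation multiplier of 1.96 are all assumed values for illustration.

```python
from math import exp, log

# Hypothetical log-scale means for groups a and b
log_diff = log(12.0) - log(4.0)  # log a - log b
se_log_diff = 0.15               # assumed standard error of the difference
z_crit = 1.96                    # normal approximation for a 95% interval

# Exponentiate the estimate and BOTH interval endpoints
ratio = exp(log_diff)
ci_ratio = (exp(log_diff - z_crit * se_log_diff),
            exp(log_diff + z_crit * se_log_diff))

print(f"ratio a/b = {ratio:.2f}, 95% CI [{ci_ratio[0]:.2f}, {ci_ratio[1]:.2f}]")
```

Two things to notice: the back-transformed interval is not symmetric around the ratio, and the "no difference" value for a ratio is 1, not 0.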

Back-transforming in R and SAS
- R: emmeans(model, pairwise ~ treatment, type = 'response')
- SAS: lsmeans treatment / diff ilink

Means comparisons in binomial GLMMs

In binomial GLMMs, the means are probabilities

We often use a logit link function so the means are on the log-odds scale

The means comparisons are odds ratios, because the difference between two log-odds becomes a ratio of odds when you back-transform by exponentiating

Same logic as previous but I won’t show the math
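For anyone who does want the math, here is a short version. The symbols \(p_a\) and \(p_b\) are introduced here for the probabilities in the two groups being compared. The logit link is the log of the odds:

\[\operatorname{logit}(p) = \log \frac{p}{1-p}\]

The comparison on the link scale is a difference of two log-odds:

\[\operatorname{logit}(p_a) - \operatorname{logit}(p_b) = \log \frac{p_a}{1-p_a} - \log \frac{p_b}{1-p_b}\]

Exponentiating gives the ratio of the two odds, that is, the odds ratio:

\[e^{\operatorname{logit}(p_a) - \operatorname{logit}(p_b)} = \frac{p_a/(1-p_a)}{p_b/(1-p_b)}\]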

Example: “Mean probability of developing Read’s Disease in Genotype A was significantly higher than in Genotype B (OR = 5.3, 95% CI [4.3, 6.3], Šidák-adjusted p = 0.0001) but was not different from Genotype C (OR = 0.85, 95% CI [0.55, 1.15], p = 0.46).”

Comparing interaction means

Another common misconception: if you have interactions, you must compare every combination to every single other combination … NOPE!

This may be way too many comparisons, most of which are scientifically uninteresting

For example if you have 2 levels of treatment A, and 4 of treatment B, that is 8 combinations. That would be 7+6+5+4+3+2+1 = 28 pairwise comparisons!
- Instead, you could do pairwise comparisons of treatment A within each level of treatment B (4 total comparisons)
- Or you could compare treatment B within each level of treatment A (6 comparisons each, 12 total)
- It depends which is more scientifically relevant
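The counts in this example can be checked quickly. The variable names are just labels for this sketch.

```python
from math import comb

levels_a, levels_b = 2, 4

# Every combination against every other combination
all_combinations = comb(levels_a * levels_b, 2)

# Pairwise comparisons of A within each level of B
a_within_b = comb(levels_a, 2) * levels_b

# Pairwise comparisons of B within each level of A
b_within_a = comb(levels_b, 2) * levels_a

print(all_combinations, a_within_b, b_within_a)  # 28 4 12
```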

Can be easily done in R and SAS
- R: emmeans(model, pairwise ~ A | B, adjust = 'tukey')
- SAS: slice A*B / sliceby=B diff adjust=tukey

Bayesian means comparison

We may use Bayesian methods instead

We take the posterior distribution of each estimated marginal mean, and take the difference

We find the median and credible interval of these differences to assess the strength of evidence for an effect
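A minimal sketch of the idea. Real posterior draws would come from a fitted model; here random normal draws stand in for them, so every number is an assumption made purely for illustration.

```python
import random
from statistics import median, quantiles

random.seed(1)
n_draws = 4000

# Stand-ins for posterior draws of two estimated marginal means
draws_a = [random.gauss(10.0, 0.8) for _ in range(n_draws)]
draws_b = [random.gauss(12.5, 0.8) for _ in range(n_draws)]

# Take the difference draw by draw to get the posterior of the difference
diff_draws = [b - a for a, b in zip(draws_a, draws_b)]

post_median = median(diff_draws)
cuts = quantiles(diff_draws, n=40)      # cut points at 2.5%, 5%, ..., 97.5%
credible_interval = (cuts[0], cuts[-1])  # 95% equal-tailed interval
prob_positive = sum(d > 0 for d in diff_draws) / n_draws

print(f"median difference: {post_median:.2f}")
print(f"95% credible interval: [{credible_interval[0]:.2f}, {credible_interval[1]:.2f}]")
print(f"proportion of draws above zero: {prob_positive:.3f}")
```

With a real model, keep the draws paired by posterior iteration when subtracting, as the `zip` call does here, so that any correlation between the two means carries through to the difference.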

There are methods to calculate analogues to p-values for each of the pairwise comparisons
- Bayes Factors (BF), maximum a posteriori probabilities, probability based on region of practical equivalence (ROPE)

Multiple comparisons in Bayesian analysis

Post hoc multiple comparison correction is not needed if sufficiently skeptical priors are used

Assigning fairly high prior probability to null or very small effect works similarly to adjusting p-value for multiple comparisons; they both make it less likely for you to say there is a big effect when the real effect is small

But some people “double up” and adjust BFs for multiple comparisons; I don’t recommend this

Means comparisons don’t scale

If you have a whole lot of things you’re comparing, many of these tests become too conservative

With lots of comparisons, it’s unreasonable to try to commit zero false positive errors

Use an FDR correction: the goal is to keep the false discovery rate (the expected proportion of significant results that are false positives) to a small percentage

But it’s important to keep in mind this is still intended for modest numbers of comparisons

You may want to move to a machine learning approach

Questions?