Bayesian p-values?!?!?

Quentin D. Read

Bayesian hypothesis testing

  • Effect size estimation with uncertainty is really all that is needed when reporting a Bayesian analysis
    • Measure of central tendency of the posterior (median, mean, mode)
    • Characterize uncertainty (credible intervals, either equal-tailed using quantiles, or highest-density interval)
  • But it is often desirable to explicitly test a null hypothesis

Bayesian p-value analogs

  • \(p_d\): Probability of direction
  • \(BF\): Bayes Factor
  • \(p_{MAP}\): Maximum a posteriori (MAP) Bayesian p-value
  • \(p_{ROPE}\): Region of practical equivalence (ROPE) Bayesian p-value

bayestestR package

Probability of direction

  • Take the empirical posterior distribution (MCMC samples) and use its quantiles to find the posterior probability that the parameter lies on a given side of the null value
  • Example: to test a one-sided null hypothesis that a slope parameter is <= 0, we find the proportion of the posterior that is > 0.
  • We can say “there is 0.999 posterior probability that the slope is > 0” or to correspond with the one-sided frequentist p-value “there is 0.001 posterior probability that the slope is <= 0.”
  • p_direction() in bayestestR (see the example after this list)
  • Relationship with one-sided and two-sided p-value:

\[p_{\text{one-sided}} = 1 - p_d\]

\[p_{\text{two-sided}} = 2(1 - p_d)\]

  • Fairly simple and robust
  • Limited to testing “point” null hypotheses
  • Requires adequate sampling of tails of the posterior
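
A minimal sketch of the calculation in R, using simulated draws as a stand-in for real MCMC samples (in practice you would pass a fitted brms or rstanarm model, or its extracted draws, to p_direction()):

```r
library(bayestestR)

# Hypothetical draws standing in for MCMC samples of a slope parameter
set.seed(1)
posterior <- rnorm(4000, mean = 0.8, sd = 0.3)

# Probability of direction: proportion of draws with the same sign as the median
pd <- as.numeric(p_direction(posterior))

# Frequentist-style p-values via the relationships above
p_one_sided <- 1 - pd
p_two_sided <- 2 * (1 - pd)
```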

Bayes Factor (BF)

  • Ratio of the evidence for hypothesis A to the evidence for hypothesis B
  • In this case, it is the ratio of the evidence for the posterior distribution of a parameter (or any quantity derived from the model) to the evidence for the prior distribution of that quantity
  • If the prior can be interpreted as a null hypothesis, then the Bayes Factor measures the ratio of evidence for the alternative to evidence for the null

Bayes Factor: interpretation

  • BF = 1: Our belief about the parameter’s value didn’t change at all after fitting the model to data.
  • BF < 1: Fitting the model to data made us more confident in the prior (the null).
  • BF > 1: There is more evidence for the posterior (the alternative) than for the prior.

Bayes Factor: null region

  • If we define a null region, for example \([-0.5, 0.5]\), then BF is the ratio of the posterior odds that the parameter lies outside the null region to the prior odds that it lies outside the null region:

\[BF = \frac{\text{posterior odds}}{\text{prior odds}} = \frac{\frac{1 - P(null|data)}{P(null|data)}}{\frac{1 - P(null)}{P(null)}}\]
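
  • Worked example: if \(P(null) = 0.5\) a priori and \(P(null|data) = 0.2\), then \(BF = \frac{0.8/0.2}{0.5/0.5} = 4\): the data quadrupled the odds that the parameter lies outside the null region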

Bayes Factor: point null hypothesis

  • If we define a point null hypothesis, then BF is the ratio of the prior density at the null value to the posterior density at the null value (the Savage–Dickey density ratio):

\[BF = \frac{f(null)}{f(null|data)}\]

Jeffreys’ BF thresholds (1961)

  • BF > 3: There is evidence. The data are at least 3 times more likely under the alternative hypothesis than under the null (the prior).
  • BF > 10: There is “strong” evidence.
  • Other sets of cutoffs exist (Kass & Raftery 1995)
  • Is BF > 10 just the new p < 0.05? BF-hacking is just as easy as p-hacking!

BF in bayestestR

  • Use bayesfactor_parameters() to get BFs for any parameter in the model
  • But we can calculate BF for literally any derived quantity from the model
    • Sample the posterior, then “unupdate” the model to sample from priors
    • Then do whatever calculation you want on both the prior and posterior
    • Pass both prior and posterior as arguments to the BF function
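
A minimal sketch with simulated draws standing in for posterior and prior samples of a slope; the model-based workflow is indicated in the comments (the model object and data are hypothetical):

```r
library(bayestestR)

# Hypothetical draws: posterior of a slope and samples from its prior
set.seed(1)
posterior <- rnorm(4000, mean = 0.8, sd = 0.3)
prior     <- rnorm(4000, mean = 0,   sd = 1)

# Savage-Dickey BF against a point null of 0
bayesfactor_parameters(posterior, prior = prior, null = 0)

# BF against a null region instead of a point null
bayesfactor_parameters(posterior, prior = prior, null = c(-0.5, 0.5))

# With a fitted model (e.g. brms), the workflow sketched above is:
# prior_fit <- unupdate(fit)   # "unupdate" to sample from the priors
# <apply the same transformation to draws from fit and prior_fit>
# bayesfactor_parameters(post_draws, prior = prior_draws)
```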

Effect of priors on BF

  • Sensitivity to prior is a major drawback of BF
  • Even if the posterior is completely unaffected by your prior, the prior will still strongly affect BF
  • BF also depends very strongly on the tails of the posterior distribution
  • We often get only ~5000 posterior samples, which is plenty for the median, but 10 times that or more may be needed to characterize the tails
  • Some suggest calculating BF for a range of priors, which may be computationally prohibitive, and is hard to interpret anyway
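
A quick illustration of this sensitivity with simulated draws (hypothetical numbers): the posterior is held fixed while only the prior changes, yet the BF shifts substantially.

```r
library(bayestestR)

set.seed(1)
posterior   <- rnorm(4000, mean = 0.8, sd = 0.3)  # identical posterior draws
prior_tight <- rnorm(4000, mean = 0,   sd = 1)    # tighter prior
prior_vague <- rnorm(4000, mean = 0,   sd = 10)   # much vaguer prior

# The vaguer prior has far lower density at 0, so the BF in favor of the
# alternative drops sharply even though the posterior is unchanged
bayesfactor_parameters(posterior, prior = prior_tight, null = 0)
bayesfactor_parameters(posterior, prior = prior_vague, null = 0)
```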

Maximum a posteriori (MAP) p-value

\[p_{MAP} = \frac{f(null|data)}{f(mode|data)}\]

  • Generate a smoothed kernel density estimate of the posterior
  • Find the mode of this distribution. That’s the MAP estimate of the parameter
  • Pick a null value, such as 0 for a difference or 1 for a ratio
  • Evaluate the kernel density estimate at the null value and at the MAP value, and take the ratio
  • Lower = stronger evidence against the null, just like a p-value
  • p_map() in bayestestR (see the sketch after this list)
  • Not as sensitive to the prior
  • But it is sensitive to the kernel density estimation method
  • Also requires a lot of posterior samples if the null value is near the tail of the posterior distribution
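
A minimal sketch with p_map(), again using simulated draws as a stand-in (p_map() also accepts fitted models):

```r
library(bayestestR)

set.seed(1)
posterior <- rnorm(4000, mean = 0.8, sd = 0.3)  # hypothetical draws

# Ratio of the posterior density at the null (0) to the density at the mode
p_map(posterior)

# The MAP (mode) estimate itself, from the same kernel density estimate
map_estimate(posterior)
```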

Region of practical equivalence (ROPE) p-value

  • Define an effect size or difference that is considered practically/biologically equivalent
    • Example: \([-0.1, +0.1]\)
  • Calculate the proportion of the posterior mass that lies inside the ROPE
    • Some people use proportion of the 95% or 89% highest density interval lying inside the ROPE
  • Again, lower = stronger evidence against the parameter being inside the ROPE, just like a p-value
  • p_rope() in bayestestR (see the sketch after this list)
  • Appealing because it is not directly sensitive to the prior, nor does it need tons of samples to characterize the tails
  • Requires you to explicitly define a practically equivalent effect size so obviously it’s very sensitive to what you choose
    • But is this a drawback or actually an advantage?
  • I would recommend at least listing the values for two or three ROPEs
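
A minimal sketch, using the hypothetical \([-0.1, +0.1]\) ROPE from above and reporting a couple of alternatives alongside it:

```r
library(bayestestR)

set.seed(1)
posterior <- rnorm(4000, mean = 0.2, sd = 0.15)  # hypothetical draws

# Proportion of the full posterior inside the ROPE
p_rope(posterior, range = c(-0.1, 0.1))

# Interval-based variant: share of the 95% HDI inside the ROPE
rope(posterior, range = c(-0.1, 0.1), ci = 0.95, ci_method = "HDI")

# Cheap to report two or three candidate ROPEs
p_rope(posterior, range = c(-0.05, 0.05))
p_rope(posterior, range = c(-0.2, 0.2))
```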

Pros and cons

  • Ones that “quack like a duck,” meaning look like a p-value, are familiar to people. You can even use \(\alpha = 0.05\) as a threshold if you want
    • But is that good or bad?
  • \(p_d\) is limited to point null hypotheses
  • \(BF\) is very sensitive to the choice of prior
  • \(p_{MAP}\) is sensitive to the kernel density estimation method (this is also true of the point-null \(BF\))
  • \(p_{ROPE}\), and \(BF\) if it uses a null region, are sensitive to the choice of the null region
  • Note that all methods require you to sample the tails of the posterior pretty well, some more than others

Order constraints

  • We can make the a priori assumption that the direction of the effect is positive, if we don’t want to consider negative effects
  • Truncate the posterior distribution to retain only values > 0 before further calculations (see the sketch after this list)
  • This will alter the Bayes Factor and \(p_{ROPE}\) to make them more “powerful”
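
A minimal sketch of the truncation approach, again with hypothetical simulated draws:

```r
library(bayestestR)

set.seed(1)
posterior <- rnorm(4000, mean = 0.8, sd = 0.3)  # hypothetical posterior draws
prior     <- rnorm(4000, mean = 0,   sd = 1)    # hypothetical prior draws

# Truncate: keep only draws consistent with the assumed positive direction
posterior_pos <- posterior[posterior > 0]

# p_ROPE computed on the truncated posterior
p_rope(posterior_pos, range = c(-0.1, 0.1))

# bayesfactor_parameters() can encode the constraint directly
# via its `direction` argument instead of manual truncation
bayesfactor_parameters(posterior, prior = prior, null = 0, direction = ">")
```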

Bayesian hypothesis tests and multiple comparisons

  • “But what about multiple comparisons???”
  • Some (Gelman) argue that sufficiently skeptical priors, i.e. priors with most of their mass at small effect sizes, are the Bayesian way of accounting for multiple comparisons. No additional correction necessary! (see the sketch below)
  • Others argue that this is not a strong enough correction. They “double dip” and adjust BFs or Bayesian p-values for multiple comparisons, using FWER/FDR corrections or homegrown methods
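
A minimal sketch of the skeptical-prior idea in brms (the model, variables, and prior scale are hypothetical):

```r
library(brms)

# Put most prior mass at small effect sizes, shrinking all slopes toward 0
skeptical <- prior(normal(0, 0.5), class = "b")

# fit <- brm(y ~ x1 + x2 + x3, data = dat, prior = skeptical)
```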

Any questions?