Bayesian p-values?!?!?

Quentin D. Read

Bayesian hypothesis testing

  • Effect size estimation with uncertainty is really all that is needed when reporting a Bayesian analysis
    • Measure of central tendency of the posterior (median, mean, mode)
    • Characterize uncertainty (credible intervals, either equal-tailed using quantiles, or highest-density interval)
  • But it is often desirable to explicitly test a null hypothesis

Bayesian p-value analogs

  • \(p_d\): Probability of direction
  • \(BF\): Bayes Factor
  • \(p_{MAP}\): Maximum a posteriori (MAP) Bayesian p-value
  • \(p_{ROPE}\): Region of practical equivalence (ROPE) Bayesian p-value

bayestestR package

Probability of direction

  • Take the empirical posterior distribution (MCMC samples) and use its quantiles to find the posterior probability that the parameter lies on a given side of the null value
  • Example: to test a one-sided null hypothesis that a slope parameter is <= 0, we find the proportion of the posterior that is > 0.
  • We can say “there is 0.999 posterior probability that the slope is > 0” or to correspond with the one-sided frequentist p-value “there is 0.001 posterior probability that the slope is <= 0.”
  • p_direction() in bayestestR (see the example after this list)
  • Relationship with one-sided and two-sided p-value:

\[p_{\text{one-sided}} = 1 - p_d\]

\[p_{\text{two-sided}} = 2(1 - p_d)\]

  • Fairly simple and robust
  • Limited to testing “point” null hypotheses
  • Requires adequate sampling of tails of the posterior
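
A minimal sketch of the calculation in R, using simulated draws as a stand-in for real MCMC samples (in practice you would pass a fitted brms or rstanarm model, or its extracted draws, to p_direction()):

```r
library(bayestestR)

# Hypothetical draws standing in for MCMC samples of a slope parameter
set.seed(1)
posterior <- rnorm(4000, mean = 0.8, sd = 0.3)

# Probability of direction: proportion of draws with the same sign as the median
pd <- as.numeric(p_direction(posterior))

# Frequentist-style p-values via the relationships above
p_one_sided <- 1 - pd
p_two_sided <- 2 * (1 - pd)
```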

Bayes Factor (BF)

  • Ratio of the evidence for hypothesis A to the evidence for hypothesis B
  • In this case, it is the ratio of the evidence for the posterior distribution of a parameter (or any quantity derived from the model) to the evidence for the prior distribution of that quantity
  • If the prior can be interpreted as a null hypothesis, then the Bayes Factor measures the ratio of evidence for the alternative to evidence for the null

Bayes Factor: interpretation

  • BF = 1: Our belief about the parameter’s value didn’t change at all after fitting the model to data.
  • BF < 1: Fitting the model to data made us more confident in the prior (the null).
  • BF > 1: There is more evidence for the posterior (the alternative) than for the prior.

Bayes Factor: null region

  • If we define a null region, for example \([-0.5, 0.5]\), then BF is the ratio of the posterior odds that the parameter lies outside the null region to the prior odds that it lies outside the null region:

\[BF = \frac{\text{posterior odds}}{\text{prior odds}} = \frac{\frac{1 - P(null|data)}{P(null|data)}}{\frac{1 - P(null)}{P(null)}}\]
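
  • Worked example: if \(P(null) = 0.5\) a priori and \(P(null|data) = 0.2\), then \(BF = \frac{0.8/0.2}{0.5/0.5} = 4\): the data quadrupled the odds that the parameter lies outside the null region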

Bayes Factor: point null hypothesis

  • If we define a point null hypothesis, then BF is the ratio of the prior density at the null value to the posterior density at the null value (the Savage–Dickey density ratio):

\[BF = \frac{f(null)}{f(null|data)}\]

Jeffreys’ BF thresholds (1961)

  • BF > 3: There is evidence. The data are at least 3 times more likely under the alternative hypothesis than under the null (the prior).
  • BF > 10: There is “strong” evidence.
  • Other sets of cutoffs exist (Kass & Raftery 1995)
  • Is BF > 10 just the new p < 0.05? BF-hacking is just as easy as p-hacking!

BF in bayestestR

  • Use bayesfactor_parameters() to get BFs for any parameter in the model
  • But we can calculate BF for literally any derived quantity from the model
    • Sample the posterior, then “unupdate” the model to sample from priors
    • Then do whatever calculation you want on both the prior and posterior
    • Pass both prior and posterior as arguments to the BF function
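
A minimal sketch with simulated draws standing in for posterior and prior samples of a slope; the model-based workflow is indicated in the comments (the model object and data are hypothetical):

```r
library(bayestestR)

# Hypothetical draws: posterior of a slope and samples from its prior
set.seed(1)
posterior <- rnorm(4000, mean = 0.8, sd = 0.3)
prior     <- rnorm(4000, mean = 0,   sd = 1)

# Savage-Dickey BF against a point null of 0
bayesfactor_parameters(posterior, prior = prior, null = 0)

# BF against a null region instead of a point null
bayesfactor_parameters(posterior, prior = prior, null = c(-0.5, 0.5))

# With a fitted model (e.g. brms), the workflow sketched above is:
# prior_fit <- unupdate(fit)   # "unupdate" to sample from the priors
# <apply the same transformation to draws from fit and prior_fit>
# bayesfactor_parameters(post_draws, prior = prior_draws)
```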

Effect of priors on BF

  • Sensitivity to prior is a major drawback of BF
  • Even if the posterior is completely unaffected by your prior, the prior will still strongly affect BF
  • BF also depends very strongly on the tails of the posterior distribution
  • We often get only ~5000 posterior samples, which is plenty for the median, but 10 times that or more may be needed to characterize the tails
  • Some suggest calculating BF for a range of priors, which may be computationally prohibitive, and is hard to interpret anyway
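
A quick illustration of this sensitivity with simulated draws (hypothetical numbers): the posterior is held fixed while only the prior changes, yet the BF shifts substantially.

```r
library(bayestestR)

set.seed(1)
posterior   <- rnorm(4000, mean = 0.8, sd = 0.3)  # identical posterior draws
prior_tight <- rnorm(4000, mean = 0,   sd = 1)    # tighter prior
prior_vague <- rnorm(4000, mean = 0,   sd = 10)   # much vaguer prior

# The vaguer prior has far lower density at 0, so the BF in favor of the
# alternative drops sharply even though the posterior is unchanged
bayesfactor_parameters(posterior, prior = prior_tight, null = 0)
bayesfactor_parameters(posterior, prior = prior_vague, null = 0)
```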

Maximum a posteriori (MAP) p-value

\[p_{MAP} = \frac{f(null|data)}{f(mode|data)}\]

  • Generate a smoothed kernel density estimate of the posterior
  • Find the mode of this distribution. That’s the MAP estimate of the parameter
  • Pick a null value, such as 0 for a difference or 1 for a ratio
  • Evaluate the kernel density estimate at the null value and at the MAP value, and take the ratio
  • Lower = stronger evidence against the null, just like a p-value
  • p_map() in bayestestR (see the sketch after this list)
  • Not as sensitive to the prior
  • But it is sensitive to the kernel density estimation method
  • Also requires a lot of posterior samples if the null value is near the tail of the posterior distribution
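
A minimal sketch with p_map(), again using simulated draws as a stand-in (p_map() also accepts fitted models):

```r
library(bayestestR)

set.seed(1)
posterior <- rnorm(4000, mean = 0.8, sd = 0.3)  # hypothetical draws

# Ratio of the posterior density at the null (0) to the density at the mode
p_map(posterior)

# The MAP (mode) estimate itself, from the same kernel density estimate
map_estimate(posterior)
```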

Region of practical equivalence (ROPE) p-value

  • Define an effect size or difference that is considered practically/biologically equivalent
    • Example: \([-0.1, +0.1]\)
  • Calculate the proportion of the posterior mass that lies inside the ROPE
    • Some people use proportion of the 95% or 89% highest density interval lying inside the ROPE
  • Again, lower = stronger evidence against the parameter being inside the ROPE, just like a p-value
  • p_rope() in bayestestR (see the sketch after this list)
  • Appealing because it is not directly sensitive to the prior, nor does it need tons of samples to characterize the tails
  • Requires you to explicitly define a practically equivalent effect size so obviously it’s very sensitive to what you choose
    • But is this a drawback or actually an advantage?
  • I would recommend at least listing the values for two or three ROPEs
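
A minimal sketch, using the hypothetical \([-0.1, +0.1]\) ROPE from above and reporting a couple of alternatives alongside it:

```r
library(bayestestR)

set.seed(1)
posterior <- rnorm(4000, mean = 0.2, sd = 0.15)  # hypothetical draws

# Proportion of the full posterior inside the ROPE
p_rope(posterior, range = c(-0.1, 0.1))

# Interval-based variant: share of the 95% HDI inside the ROPE
rope(posterior, range = c(-0.1, 0.1), ci = 0.95, ci_method = "HDI")

# Cheap to report two or three candidate ROPEs
p_rope(posterior, range = c(-0.05, 0.05))
p_rope(posterior, range = c(-0.2, 0.2))
```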

Pros and cons

  • Ones that “quack like a duck,” meaning look like a p-value, are familiar to people. You can even use \(\alpha = 0.05\) as a threshold if you want
    • But is that good or bad?
  • \(p_d\) is limited to point null hypotheses
  • \(BF\) is very sensitive to the choice of prior
  • \(p_{MAP}\) is sensitive to the kernel density estimation method (this is also true of the point-null \(BF\))
  • \(p_{ROPE}\), and \(BF\) if it uses a null region, are sensitive to the choice of the null region
  • Note that all methods require you to sample the tails of the posterior pretty well, some more than others

Order constraints

  • We can make the a priori assumption that the direction of the effect is positive, if we don’t want to consider negative effects
  • Truncate the posterior distribution to retain only values > 0 before further calculations (see the sketch after this list)
  • This will alter the Bayes Factor and \(p_{ROPE}\) to make them more “powerful”
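
A minimal sketch of the truncation approach, again with hypothetical simulated draws:

```r
library(bayestestR)

set.seed(1)
posterior <- rnorm(4000, mean = 0.8, sd = 0.3)  # hypothetical posterior draws
prior     <- rnorm(4000, mean = 0,   sd = 1)    # hypothetical prior draws

# Truncate: keep only draws consistent with the assumed positive direction
posterior_pos <- posterior[posterior > 0]

# p_ROPE computed on the truncated posterior
p_rope(posterior_pos, range = c(-0.1, 0.1))

# bayesfactor_parameters() can encode the constraint directly
# via its `direction` argument instead of manual truncation
bayesfactor_parameters(posterior, prior = prior, null = 0, direction = ">")
```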

Bayesian hypothesis tests and multiple comparisons

  • “But what about multiple comparisons???”
  • Some (Gelman) argue that sufficiently skeptical priors, i.e. priors with most of their mass at small effect sizes, are the Bayesian way of accounting for multiple comparisons. No additional correction necessary! (see the sketch below)
  • Others argue that this is not a strong enough correction. They “double dip” and adjust BFs or Bayesian p-values for multiple comparisons, using FWER/FDR corrections or homegrown methods
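
A minimal sketch of the skeptical-prior idea in brms (the model, variables, and prior scale are hypothetical):

```r
library(brms)

# Put most prior mass at small effect sizes, shrinking all slopes toward 0
skeptical <- prior(normal(0, 0.5), class = "b")

# fit <- brm(y ~ x1 + x2 + x3, data = dat, prior = skeptical)
```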

Any questions?