Bayes' Formula

Essential
Tags: Probability, Conditional Probability

Prerequisites

The multiplication rule gives $P(A \cap B) = P(A \mid B) P(B) = P(B \mid A) P(A)$. Bayes’ formula exploits this symmetry to invert a conditional probability: given how likely observed data $B$ is under each hypothesis $A_i$, it computes how likely each hypothesis is given the data.

Setup

Let $\{A_1, A_2, \ldots\}$ be a countable partition of $\Omega$ with each $P(A_i) > 0$. Think of the $A_i$ as competing hypotheses. You know:

  • The prior probabilities $P(A_i)$: your uncertainty before observing $B$.
  • The likelihoods $P(B \mid A_i)$: how probable the observation $B$ is under each hypothesis.

Bayes’ formula computes the posterior probabilities $P(A_i \mid B)$: your updated uncertainty after observing $B$.

Derivation

Applying the multiplication rule to $A_i \cap B$ in two ways:

$$P(B \mid A_i) \, P(A_i) = P(A_i \cap B) = P(A_i \mid B) \, P(B).$$

Solving for $P(A_i \mid B)$ (which requires $P(B) > 0$) and substituting the law of total probability, $P(B) = \sum_j P(B \mid A_j) \, P(A_j)$, for $P(B)$:

$$P(A_i \mid B) = \frac{P(B \mid A_i) \, P(A_i)}{P(B)} = \frac{P(B \mid A_i) \, P(A_i)}{\displaystyle\sum_j P(B \mid A_j) \, P(A_j)}. \tag{1}$$

This is Bayes’ formula. The denominator is the normalising constant that makes the posteriors $P(A_i \mid B)$ sum to $1$.
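As a minimal sketch of equation (1) for a finite partition, the following Python function turns a list of priors and a list of likelihoods into posteriors. The function name `bayes_posterior` and the three-hypothesis numbers are illustrative assumptions, not part of the text above; exact fractions are used so the normalisation can be checked without rounding.

```python
from fractions import Fraction


def bayes_posterior(priors, likelihoods):
    """Return posteriors P(A_i | B) from priors P(A_i) and likelihoods P(B | A_i)."""
    if len(priors) != len(likelihoods):
        raise ValueError("priors and likelihoods must have the same length")
    # Numerators of equation (1): P(B | A_i) * P(A_i) for each hypothesis.
    joint = [lik * p for lik, p in zip(likelihoods, priors)]
    # Law of total probability: P(B) is the sum of the numerators.
    evidence = sum(joint)
    if evidence == 0:
        raise ValueError("P(B) = 0: the posterior is undefined")
    return [j / evidence for j in joint]


# Hypothetical three-hypothesis example (numbers chosen only for illustration).
priors = [Fraction(1, 2), Fraction(3, 10), Fraction(1, 5)]
likelihoods = [Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)]
posteriors = bayes_posterior(priors, likelihoods)
print(posteriors)
print(sum(posteriors))  # the posteriors sum to 1 by construction
```

Here the hypothesis with the smallest prior ends up with the largest posterior, because its likelihood is much higher than the others.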

Bayesian interpretation

Equation (1) is often read as the proportionality

$$\text{posterior} \;\propto\; \text{likelihood} \times \text{prior},$$

or, in words: to update from prior to posterior, multiply each hypothesis’s prior probability by how well it predicts the observed data $B$, then renormalise. The denominator $P(B)$, called the marginal likelihood or evidence, is the same for all hypotheses, so it rescales the posteriors without changing their ratios.

This proportionality is the foundation of Bayesian inference: a prior belief over a parameter, multiplied by the likelihood of observed data under that parameter, yields a posterior belief. As data accumulate, the posterior concentrates around the true parameter largely regardless of the prior, provided the prior does not rule that parameter out and suitable regularity conditions hold.
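The fading influence of the prior can be illustrated with a small sketch. It assumes a hypothetical coin whose heads-probability is one of five candidate values, a fixed made-up sequence of 20 flips, and two different priors; each flip applies one "multiply by the likelihood, then renormalise" step. None of these names or numbers come from the text above.

```python
def update(prior, likelihood):
    """One Bayes step: multiply prior by likelihood, then renormalise."""
    unnormalised = [lik * p for lik, p in zip(likelihood, prior)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


# Hypothetical hypotheses: candidate values for the coin's heads-probability.
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
# Fixed, made-up data: 1 = heads, 0 = tails (14 heads in 20 flips).
flips = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1]


def run(prior):
    """Apply one Bayes update per flip, starting from the given prior."""
    posterior = prior
    for flip in flips:
        likelihood = [t if flip == 1 else 1 - t for t in thetas]
        posterior = update(posterior, likelihood)
    return posterior


uniform = run([0.2, 0.2, 0.2, 0.2, 0.2])
skewed = run([0.6, 0.1, 0.1, 0.1, 0.1])  # prior that favours theta = 0.1

for theta, u, s in zip(thetas, uniform, skewed):
    print(f"theta={theta}: uniform prior -> {u:.4f}, skewed prior -> {s:.4f}")
```

With these flips, both runs put most of their mass on theta = 0.7 and agree closely with each other, even though the skewed prior started with 60% on theta = 0.1.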

The diagnostic test example

A disease affects 1% of a population. A diagnostic test has:

  • Sensitivity $P(+ \mid D) = 0.99$ (true positive rate).
  • Specificity $P(- \mid D^c) = 0.99$, i.e. false positive rate $P(+ \mid D^c) = 0.01$.

A randomly chosen person tests positive. What is the probability they actually have the disease?

Applying Bayes’ formula with $\{D, D^c\}$ as the partition:

$$P(D \mid +) = \frac{P(+ \mid D) \, P(D)}{P(+ \mid D) \, P(D) + P(+ \mid D^c) \, P(D^c)} = \frac{0.99 \times 0.01}{0.99 \times 0.01 + 0.01 \times 0.99} = \frac{0.0099}{0.0198} = 50\%.$$

A 99%-accurate test gives only a 50% posterior for a disease with 1% prevalence. The reason: true positives ($0.99 \times 1\% = 0.99\%$ of the population) and false positives ($0.01 \times 99\% = 0.99\%$ of the population) are equally common, so a positive result is as likely to come from a healthy person as from a diseased one. In odds terms, the test’s 99:1 likelihood ratio is exactly offset by the 1:99 prior odds. The base rate carries enormous weight when it is far from 50%.
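The same calculation can be checked by counting people instead of multiplying probabilities. The sketch below assumes a hypothetical cohort of 10,000 (the cohort size and variable names are illustrative, not from the text) and uses exact fractions so that no rounding is involved.

```python
from fractions import Fraction

population = 10_000                      # hypothetical cohort size
prevalence = Fraction(1, 100)            # P(D)
sensitivity = Fraction(99, 100)          # P(+ | D)
false_positive_rate = Fraction(1, 100)   # P(+ | D^c) = 1 - specificity

diseased = population * prevalence               # 100 people
healthy = population - diseased                  # 9,900 people
true_positives = diseased * sensitivity          # 99 people
false_positives = healthy * false_positive_rate  # 99 people

# Among everyone who tests positive, the fraction who are actually diseased.
posterior = true_positives / (true_positives + false_positives)
print(true_positives, false_positives, posterior)
```

Counting this way shows that the 198 positive results split evenly between diseased and healthy people, which is the 50% posterior obtained from Bayes’ formula.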

Raising the prevalence to 10% (a high-risk sub-population):

$$P(D \mid +) = \frac{0.99 \times 0.10}{0.99 \times 0.10 + 0.01 \times 0.90} = \frac{0.099}{0.108} \approx 91.7\%.$$

The test itself has not changed, but a positive result is far more conclusive in the high-risk group because the prior is less extreme.
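To see how the posterior moves with the base rate, here is a short sketch that evaluates Bayes’ formula over several prevalences. The function name `posterior_given_positive` and the extra prevalence values 0.1% and 50% are assumptions added for illustration; the sensitivity and specificity defaults are the 0.99 values from the example.

```python
def posterior_given_positive(prevalence, sensitivity=0.99, specificity=0.99):
    """P(D | +) for a two-hypothesis partition {D, D^c}."""
    true_pos = sensitivity * prevalence               # P(+ | D) P(D)
    false_pos = (1 - specificity) * (1 - prevalence)  # P(+ | D^c) P(D^c)
    return true_pos / (true_pos + false_pos)


# Sweep the base rate while holding the test's accuracy fixed.
for prevalence in (0.001, 0.01, 0.10, 0.50):
    posterior = posterior_given_positive(prevalence)
    print(f"prevalence {prevalence:.3f} -> P(D | +) = {posterior:.3f}")
```

The 1% and 10% rows reproduce the 50% and 91.7% figures worked out by hand, up to floating-point rounding.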

Summary

  • Bayes’ formula: $P(A_i \mid B) = P(B \mid A_i) P(A_i) \,/\, \sum_j P(B \mid A_j) P(A_j)$, derived from the multiplication rule applied symmetrically and the law of total probability.
  • Bayesian read: posterior $\propto$ likelihood $\times$ prior; the denominator $P(B)$ is a normalising constant shared by all hypotheses.
  • Base-rate sensitivity: a highly accurate test can still give a modest posterior when the prior (prevalence) is very small. This is the key lesson of the diagnostic example.