The Probability Axioms

Tags: Probability, Measure Theory

Now that you have the vocabulary of sample spaces and events, the next step is to assign numbers to events in a coherent way. Kolmogorov’s 1933 axioms do exactly this: they are the minimal conditions that make probability a useful calculus of uncertainty.

The three axioms

Definition. A probability space is a triple $(\Omega, \mathcal{F}, P)$ where $\Omega$ is the sample space, $\mathcal{F}$ is a $\sigma$-algebra of events on $\Omega$, and $P : \mathcal{F} \to \mathbb{R}$ is a function satisfying:

  1. Non-negativity. $P(A) \geq 0$ for all $A \in \mathcal{F}$.
  2. Normalisation. $P(\Omega) = 1$.
  3. Countable additivity ($\sigma$-additivity). For any sequence of pairwise disjoint events $A_1, A_2, \ldots \in \mathcal{F}$,
$$P\!\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{n=1}^{\infty} P(A_n).$$

Such a function $P$ is called a probability measure. Notice that a probability space is simply a measure space in which the measure has total mass $1$: $P(\Omega) = 1$. Every theorem about measures applies directly to probability.
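To make the definition concrete, here is a minimal Python sketch that checks the three axioms on a finite sample space, taking the $\sigma$-algebra to be the full power set. The helper names (`power_set`, `make_measure`, `satisfies_axioms`) and the fair-die example are illustrative assumptions, not part of the definition. On a finite space, countable additivity reduces to additivity over disjoint pairs, which is what the code tests.

```python
from fractions import Fraction
from itertools import chain, combinations


def power_set(omega):
    """Return all subsets of a finite sample space as frozensets."""
    items = list(omega)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return [frozenset(s) for s in subsets]


def make_measure(weights):
    """Build P(A) = sum of the point masses in A from a dict outcome -> weight."""
    def P(event):
        return sum((weights[w] for w in event), Fraction(0))
    return P


def satisfies_axioms(weights):
    """Check non-negativity, normalisation and additivity on the power set."""
    omega = frozenset(weights)
    events = power_set(omega)
    P = make_measure(weights)
    non_negative = all(P(A) >= 0 for A in events)
    normalised = P(omega) == 1
    # Finite space: it is enough to test additivity on disjoint pairs.
    additive = all(
        P(A | B) == P(A) + P(B)
        for A in events
        for B in events
        if not (A & B)
    )
    return non_negative and normalised and additive


# Hypothetical example: a fair six-sided die.
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}
print(satisfies_axioms(fair_die))
```

Exact `Fraction` arithmetic is used so that the equality checks are not disturbed by floating-point rounding.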

Immediate consequences

The three axioms alone imply a rich set of properties.

Probability of the impossible event

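Apply $\sigma$-additivity to the sequence $A_n = \emptyset$ for all $n$, which is pairwise disjoint and has union $\emptyset$:

$$P(\emptyset) = P\!\left(\bigcup_{n=1}^{\infty} \emptyset\right) = \sum_{n=1}^{\infty} P(\emptyset).$$

Since $P(\emptyset)$ is a finite non-negative number, a sum of infinitely many copies of it can equal it only if $P(\emptyset) = 0$.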

Finite additivity

For two disjoint events $A$ and $B$, set $A_1 = A$, $A_2 = B$, $A_n = \emptyset$ for $n \geq 3$. Then $\sigma$-additivity, together with $P(\emptyset) = 0$, gives $P(A \cup B) = P(A) + P(B)$. By induction the same holds for any finite family of pairwise disjoint events.

Complement rule

Since $A$ and $A^c$ are disjoint and $A \cup A^c = \Omega$, finite additivity and normalisation give $P(A) + P(A^c) = P(\Omega) = 1$, so

$$P(A^c) = 1 - P(A).$$

Monotonicity

If $A \subseteq B$, write $B = A \cup (B \setminus A)$ as a disjoint union. Then

$$P(B) = P(A) + P(B \setminus A) \geq P(A).$$

In particular $P(A) \leq P(\Omega) = 1$ for every event $A$.

Inclusion–exclusion

For two events:

$$P(A \cup B) = P(A) + P(B) - P(A \cap B). \tag{1}$$

Proof. Write $A \cup B = A \cup (B \setminus A)$ disjointly, so $P(A \cup B) = P(A) + P(B \setminus A)$. Write $B = (A \cap B) \cup (B \setminus A)$ disjointly, so $P(B \setminus A) = P(B) - P(A \cap B)$. Combine.

The generalisation to $n$ events is the inclusion–exclusion principle:

$$P\!\left(\bigcup_{k=1}^n A_k\right) = \sum_k P(A_k) - \sum_{k < \ell} P(A_k \cap A_\ell) + \sum_{k < \ell < m} P(A_k \cap A_\ell \cap A_m) - \cdots + (-1)^{n+1} P(A_1 \cap \cdots \cap A_n).$$
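As a sanity check on the general formula, the following Python sketch compares the alternating sum with a direct computation of the union's probability. It assumes a uniform measure on the hypothetical sample space $\{1, \ldots, 30\}$ and takes the events "divisible by 2", "divisible by 3" and "divisible by 5"; the function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations


def uniform_prob(event, omega):
    """P(A) = |A| / |Omega| under the uniform measure on a finite space."""
    return Fraction(len(event), len(omega))


def inclusion_exclusion(events, omega):
    """Evaluate the alternating sum over all non-empty sub-families of events."""
    total = Fraction(0)
    for r in range(1, len(events) + 1):
        sign = (-1) ** (r + 1)  # + for single events, - for pairs, + for triples, ...
        for group in combinations(events, r):
            total += sign * uniform_prob(frozenset.intersection(*group), omega)
    return total


omega = frozenset(range(1, 31))
A = [frozenset(w for w in omega if w % d == 0) for d in (2, 3, 5)]

direct = uniform_prob(frozenset.union(*A), omega)
print(direct, inclusion_exclusion(A, omega))  # 11/15 11/15
```

Here the single-event terms contribute $15 + 10 + 6$, the pairwise terms $5 + 3 + 2$, and the triple term $1$, so both computations give $22/30 = 11/15$.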

Union bound (Boole’s inequality)

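A simpler upper bound drops the intersection terms:

$$P\!\left(\bigcup_{n=1}^{\infty} A_n\right) \leq \sum_{n=1}^{\infty} P(A_n).$$

To prove it, make the events disjoint: set $B_1 = A_1$ and $B_n = A_n \setminus \bigcup_{k<n} A_k$ for $n \geq 2$. The $B_n$ are pairwise disjoint, $\bigcup_n B_n = \bigcup_n A_n$, and $B_n \subseteq A_n$, so $\sigma$-additivity and monotonicity give $P(\bigcup_n A_n) = \sum_n P(B_n) \leq \sum_n P(A_n)$. For two events this is just the two-event inclusion–exclusion formula with the non-negative term $P(A \cap B)$ dropped. The bound is invaluable in asymptotic arguments: if for each $m$ you have a family of events $A^{(m)}_1, A^{(m)}_2, \ldots$ with $\sum_n P(A^{(m)}_n) \to 0$ as $m \to \infty$, then $P(\bigcup_n A^{(m)}_n) \to 0$ as well.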
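A classic place to see the union bound at work is the birthday problem. The sketch below is illustrative and rests on stated assumptions: all $365^k$ birthday assignments for $k$ people are equally likely, and leap years are ignored. For each unordered pair $\{i, j\}$, let $E_{ij}$ be the event that persons $i$ and $j$ share a birthday, so $P(E_{ij}) = 1/365$. The union of the $E_{ij}$ is the event that some pair collides, and the union bound gives $P(\text{collision}) \leq \binom{k}{2}/365$. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def exact_collision_prob(k, days=365):
    """Exact P(at least two of k people share a birthday), uniform model."""
    no_collision = Fraction(1)
    for i in range(k):
        # Person i must avoid the i distinct birthdays already taken.
        no_collision *= Fraction(days - i, days)
    return 1 - no_collision


def union_bound(k, days=365):
    """Union bound: one event of probability 1/days per unordered pair."""
    return Fraction(comb(k, 2), days)


for k in (2, 10, 23):
    exact = exact_collision_prob(k)
    bound = union_bound(k)
    print(k, float(exact), float(bound), exact <= bound)
```

For $k = 2$ there is a single pair, so the bound is exact. As $k$ grows, the pair events overlap more and the bound becomes looser, though it remains valid.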

Continuity of probability

Probability is continuous with respect to monotone limits of events, just as the Lebesgue integral is continuous with respect to monotone limits of functions.

Continuity from below

If $A_1 \subseteq A_2 \subseteq \cdots$ is an increasing sequence of events with $\bigcup_{n=1}^\infty A_n = A$, then

$$P(A_n) \;\nearrow\; P(A) \quad \text{as } n \to \infty.$$

Proof. Write $A = A_1 \cup (A_2 \setminus A_1) \cup (A_3 \setminus A_2) \cup \cdots$ as a disjoint union of “rings”. By finite additivity the partial sums telescope, $P(A_1) + \sum_{k=1}^{n-1} P(A_{k+1} \setminus A_k) = P(A_n)$, so by $\sigma$-additivity:

$$P(A) = P(A_1) + \sum_{k=1}^\infty P(A_{k+1} \setminus A_k) = \lim_{n\to\infty} P(A_n).$$

Continuity from above

If $B_1 \supseteq B_2 \supseteq \cdots$ is a decreasing sequence with $\bigcap_{n=1}^\infty B_n = B$, then

$$P(B_n) \;\searrow\; P(B) \quad \text{as } n \to \infty,$$

provided $P(B_1) < \infty$ (which here is automatic since $P(B_1) \leq 1$). The proof applies continuity from below to the complements: $B_n^c$ increases to $B^c$, so $P(B_n) = 1 - P(B_n^c) \searrow 1 - P(B^c) = P(B)$.
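Both continuity properties can be watched numerically. The sketch below assumes a hypothetical probability measure on $\Omega = \{1, 2, 3, \ldots\}$ with point masses $P(\{k\}) = 2^{-k}$. The initial segments $A_n = \{1, \ldots, n\}$ increase to $\Omega$, and the tails $B_n = \{n, n+1, \ldots\}$ decrease to $\emptyset$. The function names are illustrative.

```python
from fractions import Fraction


def point_mass(k):
    """P({k}) = 2^(-k) for k = 1, 2, 3, ..."""
    return Fraction(1, 2 ** k)


def prob_initial_segment(n):
    """P(A_n) for the increasing events A_n = {1, ..., n}."""
    return sum((point_mass(k) for k in range(1, n + 1)), Fraction(0))


def prob_tail(n):
    """P(B_n) for the decreasing events B_n = {n, n+1, ...}.

    B_n is the complement of A_{n-1}, so the complement rule applies.
    """
    return 1 - prob_initial_segment(n - 1)


for n in (1, 2, 5, 10):
    print(n, prob_initial_segment(n), prob_tail(n))
```

In this model $P(A_n) = 1 - 2^{-n}$ increases to $1 = P(\Omega)$ and $P(B_n) = 2^{-(n-1)}$ decreases to $0 = P(\emptyset)$, matching continuity from below and from above respectively.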

These continuity properties are essential for proving theorems about limits: the Borel–Cantelli lemmas, the strong law of large numbers, and many results in stochastic processes all rely on them.

Probability is measure theory

A probability measure is a measure with total mass $1$. This means the entire machinery of the Lebesgue integral (linearity, monotone convergence, dominated convergence, Fatou’s lemma, Fubini’s theorem) carries over intact. When you write $E[X] = \int_\Omega X \, dP$, the integral is the Lebesgue integral you already know, applied to the measure $P$.

The only genuinely new feature is the normalisation $P(\Omega) = 1$, which enables probabilistic interpretations: $P(A)$ is the long-run fraction of experiments in which event $A$ occurs (frequentist), or your degree of belief that $A$ will occur (Bayesian). The mathematics is the same in either case.

Summary

  • A probability space $(\Omega, \mathcal{F}, P)$ consists of a sample space, a $\sigma$-algebra, and a probability measure satisfying the three Kolmogorov axioms: non-negativity, normalisation ($P(\Omega) = 1$), and $\sigma$-additivity.
  • Immediate consequences: $P(\emptyset) = 0$; $P(A^c) = 1 - P(A)$; monotonicity ($A \subseteq B \Rightarrow P(A) \leq P(B)$); inclusion–exclusion; the union bound.
  • Continuity from below and above: $P$ is continuous with respect to monotone limits of events, which is $\sigma$-additivity applied to nested sequences.
  • A probability measure is just a measure with total mass $1$: all Lebesgue integration theorems hold for expectations out of the box.