Expectation

Tags: Probability, Expectation

The expectation (or expected value, or mean) of a random variable is its probability-weighted average. For a die roll you might compute $\frac{1}{6}(1 + 2 + 3 + 4 + 5 + 6) = 3.5$; that is the expectation of a uniform discrete variable. The measure-theoretic definition unifies this with the continuous case and gives a framework in which the familiar algebraic rules are provable theorems.

Definition

Let $(\Omega, \mathcal{F}, P)$ be a probability space and $X : \Omega \to \mathbb{R}$ a random variable. The expectation of $X$ is its Lebesgue integral against $P$:

$$E[X] \coloneqq \int_\Omega X(\omega) \, dP(\omega),$$

provided the integral exists (i.e. $\int_\Omega |X| \, dP < \infty$; then $X$ is called integrable). By the change-of-variables formula for push-forward measures, this equals the integral against the distribution $P_X$:

$$E[X] = \int_{\mathbb{R}} x \, dP_X(x) = \int_{\mathbb{R}} x \, dF_X(x),$$

where $F_X$ is the CDF of $X$ and the last integral is a Lebesgue–Stieltjes integral.
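As a small illustration of the change of variables, the sketch below takes a finite sample space (two fair coin flips, an assumption chosen only for this example) and computes the expectation of the number of heads twice: once as a sum over outcomes in the sample space, and once as a sum over values against the push-forward distribution. The names `omega`, `prob`, `X`, and `p_x` are illustrative.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Sample space: two fair coin flips, each outcome with probability 1/4.
omega = list(product("HT", repeat=2))
prob = {w: Fraction(1, 4) for w in omega}

def X(w):
    """Random variable: number of heads in the outcome w."""
    return w.count("H")

# Integral over the sample space: sum of X(w) * P({w}).
e_omega = sum(X(w) * prob[w] for w in omega)

# Push-forward distribution on the real line: P_X({x}) = P(X = x).
p_x = defaultdict(Fraction)
for w in omega:
    p_x[X(w)] += prob[w]

# Integral against the distribution: sum of x * P_X({x}).
e_dist = sum(x * p for x, p in p_x.items())

print(e_omega, e_dist)  # both equal 1
```

Grouping outcomes by the value of the random variable is the finite analogue of passing from the integral over the sample space to the integral against the distribution.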

Discrete case

If $X$ is discrete with support $\{x_1, x_2, \ldots\}$ and $P(X = x_k) = p_k$, then

$$E[X] = \sum_{k} x_k \, p_k,$$

provided $\sum_k |x_k| p_k < \infty$.
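For a finite support the sum can be evaluated exactly. The following is a minimal sketch using Python's `fractions` module; the helper name `discrete_expectation` and the dictionary representation of the probability mass function are assumptions made for this example.

```python
from fractions import Fraction

def discrete_expectation(pmf):
    """Return the sum of x * p over a finite pmf given as {value: probability}."""
    assert sum(pmf.values()) == 1, "probabilities must sum to 1"
    return sum(x * p for x, p in pmf.items())

# Fair six-sided die: each face has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}
print(discrete_expectation(die))  # 7/2
```

Exact rational arithmetic avoids floating-point rounding, so the result matches the hand computation for the die.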

Absolutely continuous case

If $X$ has probability density function $f_X$, then

$$E[X] = \int_{-\infty}^{+\infty} x \, f_X(x) \, dx,$$

provided $\int_{-\infty}^{+\infty} |x| f_X(x) \, dx < \infty$.

Both formulas are specialisations of the single Lebesgue integral $\int x \, dP_X(x)$: the discrete case integrates against a sum of point masses, the continuous case against a measure that is absolutely continuous with respect to Lebesgue measure.
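The density formula can be checked numerically. The sketch below approximates the integral with a simple midpoint rule for an exponential density; the choice of distribution, the rate value, the truncation point, and the helper `midpoint_integral` are all assumptions made for illustration.

```python
import math

def midpoint_integral(func, lower, upper, steps):
    """Approximate the integral of func over [lower, upper] by the midpoint rule."""
    width = (upper - lower) / steps
    return width * sum(func(lower + (i + 0.5) * width) for i in range(steps))

rate = 2.0  # assumed rate parameter of the example density

def density(x):
    """Density of Exponential(rate) on [0, infinity)."""
    return rate * math.exp(-rate * x)

# The integrand is negligible beyond x = 20, so truncating there is harmless here.
mean = midpoint_integral(lambda x: x * density(x), 0.0, 20.0, 100_000)
print(mean)  # close to 1 / rate = 0.5
```

The exact mean of an exponential distribution is the reciprocal of its rate, which gives a reference value for the approximation.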

Linearity of expectation

Theorem. For any integrable random variables $X, Y$ and constants $a, b \in \mathbb{R}$:

$$E[aX + bY] = a \, E[X] + b \, E[Y]. \tag{1}$$

Proof. Linearity of the Lebesgue integral: $\int (aX + bY) \, dP = a \int X \, dP + b \int Y \, dP$.

Linearity holds without any assumption of independence. This is one of the most powerful tools in probability: you can always decompose a complicated random variable into simpler parts and add expectations.

Example. For $S_n = X_1 + X_2 + \cdots + X_n$ where each $X_i$ has the same mean $\mu$:

$$E[S_n] = E[X_1] + E[X_2] + \cdots + E[X_n] = n\mu,$$

regardless of whether $X_1, \ldots, X_n$ are independent or correlated, and even if they are not identically distributed.
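To see linearity at work on dependent variables, the sketch below counts fixed points of a uniformly random permutation of four items by exhaustive enumeration. The fixed-point example and the names `indicator`, `e_sum`, `e_parts`, and `e_product` are assumptions chosen for illustration; the indicator variables are dependent, yet their expectations still add.

```python
from fractions import Fraction
from itertools import permutations

n = 4
perms = list(permutations(range(n)))
p = Fraction(1, len(perms))  # uniform probability of each permutation

def indicator(i, perm):
    """1 if position i is a fixed point of perm, else 0."""
    return 1 if perm[i] == i else 0

# Expectation of the sum S = I_0 + ... + I_{n-1}, computed directly.
e_sum = sum(p * sum(indicator(i, perm) for i in range(n)) for perm in perms)

# Individual expectations E[I_i]; each equals 1/n.
e_parts = [sum(p * indicator(i, perm) for perm in perms) for i in range(n)]

# The indicators are dependent: E[I_0 * I_1] differs from E[I_0] * E[I_1].
e_product = sum(p * indicator(0, perm) * indicator(1, perm) for perm in perms)

print(e_sum, sum(e_parts))                  # 1 1
print(e_product, e_parts[0] * e_parts[1])   # 1/12 1/16
```

The second line of output confirms the dependence, while the first shows that the expected number of fixed points is still the sum of the individual expectations.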

Law of the unconscious statistician (LOTUS)

Computing $E[g(X)]$ for a function $g : \mathbb{R} \to \mathbb{R}$ does not require knowing the distribution of $g(X)$ explicitly.

Theorem (LOTUS). If $X$ is a random variable with distribution $P_X$ and $g : \mathbb{R} \to \mathbb{R}$ is measurable, then

$$E[g(X)] = \int_{\mathbb{R}} g(x) \, dP_X(x),$$

provided $\int_{\mathbb{R}} |g(x)| \, dP_X(x) < \infty$.

In the discrete case this is $\sum_k g(x_k) p_k$, and in the absolutely continuous case $\int_{-\infty}^{+\infty} g(x) f_X(x) \, dx$.

Proof sketch. $g(X)$ is the composition of measurable maps $\Omega \xrightarrow{X} \mathbb{R} \xrightarrow{g} \mathbb{R}$, so it is a random variable. Its expectation is $\int_\Omega g(X(\omega)) \, dP(\omega)$. Applying the push-forward change-of-variables formula converts the integral over $\Omega$ to an integral over $\mathbb{R}$ against $P_X$.

Example. For $X \sim \operatorname{Uniform}(0, 1)$ with density $f_X(x) = 1$ on $(0, 1)$:

$$E[X^2] = \int_0^1 x^2 \cdot 1 \, dx = \frac{1}{3}.$$
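A discrete analogue makes the point of LOTUS concrete. In the sketch below, the variable is uniform on five integers and the function is squaring (both assumptions for this example). The expectation is computed once by integrating the function against the distribution of the original variable, and once by first constructing the distribution of the transformed variable.

```python
from collections import defaultdict
from fractions import Fraction

# X is uniform on {-2, -1, 0, 1, 2}; g(x) = x^2 is not injective on this support.
pmf_x = {x: Fraction(1, 5) for x in (-2, -1, 0, 1, 2)}

def g(x):
    return x * x

# LOTUS: integrate g against the distribution of X.
e_lotus = sum(g(x) * p for x, p in pmf_x.items())

# The long way: first build the distribution of Y = g(X), then average y.
pmf_y = defaultdict(Fraction)
for x, p in pmf_x.items():
    pmf_y[g(x)] += p
e_direct = sum(y * p for y, p in pmf_y.items())

print(e_lotus, e_direct)  # 2 2
```

Both routes agree, but the first never needs the distribution of the transformed variable, which is the practical content of the theorem.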

Expectation of non-negative random variables

For a non-negative random variable $X \geq 0$, the expectation always exists (possibly as $+\infty$) and satisfies

$$E[X] = \int_0^{\infty} P(X > t) \, dt. \tag{2}$$

This layer-cake formula converts the integral $\int_{\mathbb{R}} x \, dP_X(x)$ against the distribution into an ordinary integral over $[0, \infty)$ of survival probabilities. It is especially useful when $P(X > t)$ has a simple form.

Proof sketch. By Tonelli’s theorem, the order of integration can be exchanged for the non-negative integrand:

$$\begin{aligned}
\int_0^\infty P(X > t) \, dt &= \int_0^\infty \int_\Omega \mathbf{1}_{X(\omega) > t} \, dP(\omega) \, dt \\
&= \int_\Omega \int_0^\infty \mathbf{1}_{t < X(\omega)} \, dt \, dP(\omega) \\
&= \int_\Omega X(\omega) \, dP(\omega) = E[X].
\end{aligned}$$
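For a non-negative integer-valued variable, the survival function is constant on each interval between consecutive integers, so the layer-cake integral reduces to a finite or countable sum of survival probabilities. The sketch below checks this on a fair die; the helper `survival` and the dictionary representation are assumptions made for this example.

```python
from fractions import Fraction

die = {face: Fraction(1, 6) for face in range(1, 7)}

def survival(pmf, t):
    """P(X > t) for a finite pmf given as {value: probability}."""
    return sum(p for x, p in pmf.items() if x > t)

# P(X > t) is constant on [k, k+1), so the integral over [0, infinity)
# collapses to a sum of P(X > k) for k = 0, 1, ..., max(X) - 1.
layer_cake = sum(survival(die, k) for k in range(max(die)))

# Reference value from the discrete formula.
direct = sum(x * p for x, p in die.items())

print(layer_cake, direct)  # 7/2 7/2
```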

Jensen’s inequality

For a convex function $\varphi : \mathbb{R} \to \mathbb{R}$ and an integrable random variable $X$:

$$\varphi(E[X]) \leq E[\varphi(X)]. \tag{3}$$

For a concave function the inequality reverses. This is one of the most-used inequalities in probability and statistics.

Examples.

  • $\varphi(x) = x^2$ is convex, so $(E[X])^2 \leq E[X^2]$, or equivalently $\operatorname{Var}(X) = E[X^2] - (E[X])^2 \geq 0$.
  • $\varphi(x) = e^x$ is convex, so $e^{E[X]} \leq E[e^X]$.
  • $\varphi(x) = \ln x$ is concave on $(0, \infty)$, so $E[\ln X] \leq \ln E[X]$ for a positive integrable $X$.
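The three examples can be checked numerically on a concrete distribution. The sketch below uses a fair die, chosen here because it is strictly positive and so the logarithm is defined; the helper `expect` and the `checks` dictionary are illustrative names.

```python
import math

# Fair die: strictly positive, so ln X is well defined.
values = range(1, 7)
p = 1 / 6

def expect(func):
    """E[func(X)] for the fair die."""
    return sum(func(x) * p for x in values)

mean = expect(lambda x: x)

# Each entry is (smaller side, larger side) of the corresponding inequality.
checks = {
    "square": (mean ** 2, expect(lambda x: x ** 2)),  # convex: phi(E[X]) <= E[phi(X)]
    "exp": (math.exp(mean), expect(math.exp)),        # convex: phi(E[X]) <= E[phi(X)]
    "log": (expect(math.log), math.log(mean)),        # concave: E[ln X] <= ln E[X]
}
for name, (smaller, larger) in checks.items():
    print(f"{name}: {smaller:.4f} <= {larger:.4f}")
```

A numerical check on one distribution is of course not a proof; it only illustrates the direction of each inequality.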

Summary

  • The expectation $E[X] = \int_\Omega X \, dP = \int_{\mathbb{R}} x \, dF_X(x)$ is the Lebesgue integral of $X$ against the probability measure; it exists when $\int |X| \, dP < \infty$.
  • Discrete: $E[X] = \sum_k x_k p_k$. Absolutely continuous: $E[X] = \int x f_X(x) \, dx$. Both are special cases of the same integral.
  • Linearity: $E[aX + bY] = aE[X] + bE[Y]$ holds without any independence assumption.
  • LOTUS: $E[g(X)] = \int g(x) \, dP_X(x)$; integrate $g$ against the distribution of $X$, not of $g(X)$.
  • Layer-cake formula: for $X \geq 0$, $E[X] = \int_0^\infty P(X > t) \, dt$.
  • Jensen’s inequality: $\varphi(E[X]) \leq E[\varphi(X)]$ for convex $\varphi$.