The law of total probability says $P(A)=\sum_i P(A\mid B_i)\,P(B_i)$ for any partition $\{B_i\}$ of the sample space into events of positive probability. The law of total expectation is the exact analogue for expectations: averaging the conditional expectation over the conditioning variable recovers the unconditional expectation.
Statement and proof
Theorem (Law of total expectation). For any integrable random variable $X$ and any random variable $Y$:
$$E[X]=E[E[X\mid Y]].\tag{1}$$
Proof in the discrete case
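Let $Y$ take values $y_1,y_2,\dots$ with $P(Y=y_i)=p_i>0$, and let $X$ take values $x_1,x_2,\dots$. Since $E[X\mid Y]$ is the random variable that equals $E[X\mid Y=y_i]$ on the event $\{Y=y_i\}$:
$$E[E[X\mid Y]]=\sum_i E[X\mid Y=y_i]\,p_i=\sum_i\sum_k x_k\,P(X=x_k\mid Y=y_i)\,p_i.$$
Using $P(X=x_k\mid Y=y_i)\cdot p_i=P(X=x_k,Y=y_i)$ and exchanging the order of summation (allowed because $X$ is integrable, so the double series converges absolutely):
$$E[E[X\mid Y]]=\sum_k x_k\sum_i P(X=x_k,Y=y_i)=\sum_k x_k\,P(X=x_k)=E[X].\qquad\square$$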
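The two sides of the discrete identity can be compared on a small table. The sketch below is only an illustration: the joint pmf `joint` and the helper names `marginal_y` and `conditional_mean` are hypothetical, and the probabilities are made up. Exact rational arithmetic is used so the comparison is not blurred by rounding.

```python
from fractions import Fraction

# Hypothetical joint pmf P(X = x, Y = y); the numbers are illustrative only.
joint = {
    (0, "a"): Fraction(1, 8),
    (1, "a"): Fraction(3, 8),
    (0, "b"): Fraction(1, 4),
    (2, "b"): Fraction(1, 4),
}

def marginal_y(joint):
    """Return P(Y = y) for every y that appears in the table."""
    p_y = {}
    for (_, y), p in joint.items():
        p_y[y] = p_y.get(y, Fraction(0)) + p
    return p_y

def conditional_mean(joint, y, p_y):
    """Return E[X | Y = y] = sum_k x_k P(X = x_k, Y = y) / P(Y = y)."""
    return sum(x * p for (x, yy), p in joint.items() if yy == y) / p_y[y]

p_y = marginal_y(joint)
# Outer average: sum_i E[X | Y = y_i] * p_i
tower = sum(conditional_mean(joint, y, p_y) * p for y, p in p_y.items())
# Direct computation: sum over the table of x * P(X = x, Y = y)
direct = sum(x * p for (x, _), p in joint.items())

print(tower, direct)  # both are 7/8 for this table
```

Here $E[X\mid Y=a]=3/4$ and $E[X\mid Y=b]=1$, each weighted by $1/2$, which gives $7/8$, the same value as the direct sum.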
Proof in the absolutely continuous case
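If $(X,Y)$ has joint density $f_{X,Y}$ and marginals $f_X$, $f_Y$, then with $f_{X\mid Y}(x\mid y)=f_{X,Y}(x,y)/f_Y(y)$, defined wherever $f_Y(y)>0$:
$$E[E[X\mid Y]]=\int_{-\infty}^{+\infty}E[X\mid Y=y]\,f_Y(y)\,dy=\int_{-\infty}^{+\infty}\left(\int_{-\infty}^{+\infty}x\,f_{X\mid Y}(x\mid y)\,dx\right)f_Y(y)\,dy.$$
Since $f_{X\mid Y}(x\mid y)\,f_Y(y)=f_{X,Y}(x,y)$, Fubini's theorem (applicable because $X$ is integrable) gives:
$$E[E[X\mid Y]]=\int_{-\infty}^{+\infty}x\left(\int_{-\infty}^{+\infty}f_{X,Y}(x,y)\,dy\right)dx=\int_{-\infty}^{+\infty}x\,f_X(x)\,dx=E[X].\qquad\square$$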
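A short worked instance may make the continuous argument concrete. The distributions here are an assumption chosen only because the integrals are easy; they do not come from the text above. Take $Y$ uniform on $(0,1)$ and, given $Y=y$, let $X$ be uniform on $(0,y)$. Conditioning first:

$$E[X\mid Y=y]=\frac{y}{2},\qquad E[E[X\mid Y]]=\int_0^1\frac{y}{2}\,dy=\frac14.$$

Computing directly from the marginal of $X$ instead, with $f_{X,Y}(x,y)=1/y$ on $0<x<y<1$:

$$f_X(x)=\int_x^1\frac{dy}{y}=-\ln x,\qquad E[X]=\int_0^1 -x\ln x\,dx=\frac14.$$

Both routes give $1/4$, but the conditional route avoids finding $f_X$ altogether.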
Conditioning as a computational strategy
The strategic value of (1) is in the choice of Y: pick a conditioning variable that makes E[X∣Y=y] easy to compute, then combine using the distribution of Y.
Example. Items are produced in batches. The batch size $N$ is geometric with parameter $p=0.5$ on $\{1,2,\dots\}$, that is, $P(N=n)=(1-p)^{n-1}p$ (so $E[N]=1/p=2$). Given a batch of size $n$, each item is independently defective with probability $q=0.1$. Let $D$ be the total number of defective items.
Conditioning on $N$: given $N=n$, $D$ is Binomial$(n,q)$ with mean $nq$, so
$$E[D\mid N]=Nq.$$
By the law of total expectation:
$$E[D]=E[E[D\mid N]]=E[Nq]=q\,E[N]=0.1\times 2=0.2.$$
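The value 0.2 can be sanity-checked by simulating the two-stage experiment directly. This is a minimal sketch, not a prescribed method: the function names `sample_batch_size` and `sample_defects`, the seed, and the number of trials are all assumptions made for illustration, and the geometric distribution is taken to start at 1, consistent with E[N]=1/p.

```python
import random

def sample_batch_size(rng, p):
    """Draw N from a geometric distribution on {1, 2, ...} with success probability p."""
    n = 1
    while rng.random() >= p:  # failure with probability 1 - p, so keep counting
        n += 1
    return n

def sample_defects(rng, p, q):
    """Draw a batch size N, then count defects among N items (each defective w.p. q)."""
    n = sample_batch_size(rng, p)
    return sum(1 for _ in range(n) if rng.random() < q)

rng = random.Random(0)  # fixed seed (arbitrary choice) so the run is repeatable
p, q = 0.5, 0.1
trials = 200_000        # arbitrary sample size for the illustration
estimate = sum(sample_defects(rng, p, q) for _ in range(trials)) / trials
exact = q / p           # q * E[N], from the law of total expectation

print(f"simulated E[D] = {estimate:.3f}, exact = {exact:.3f}")
```

The simulated mean should land close to 0.2, with a Monte Carlo error on the order of 0.001 at this sample size. The simulation never needs the marginal distribution of D, which mirrors the point of conditioning on N.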
Law of total variance
A companion identity, valid whenever $X$ has finite variance, decomposes the variance of $X$ into two interpretable parts:
$$\operatorname{Var}(X)=E[\operatorname{Var}(X\mid Y)]+\operatorname{Var}(E[X\mid Y]).\tag{2}$$
The first term, $E[\operatorname{Var}(X\mid Y)]$, is the within-group variance: the average variability of $X$ within each level of $Y$. The second term, $\operatorname{Var}(E[X\mid Y])$, is the between-group variance: how much the conditional mean $E[X\mid Y=y]$ itself varies across levels of $Y$.
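Proof. Use $\operatorname{Var}(Z)=E[Z^2]-(E[Z])^2$, together with its conditional version $E[X^2\mid Y]=\operatorname{Var}(X\mid Y)+(E[X\mid Y])^2$, and apply the law of total expectation twice:
$$E[X^2]=E[E[X^2\mid Y]]=E\left[\operatorname{Var}(X\mid Y)+(E[X\mid Y])^2\right].$$
Also $(E[X])^2=(E[E[X\mid Y]])^2$. Subtracting, and applying the variance identity to $Z=E[X\mid Y]$ in the last step:
$$\operatorname{Var}(X)=E[X^2]-(E[X])^2=E[\operatorname{Var}(X\mid Y)]+E\left[(E[X\mid Y])^2\right]-(E[E[X\mid Y]])^2=E[\operatorname{Var}(X\mid Y)]+\operatorname{Var}(E[X\mid Y]).\qquad\square$$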
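Example (continued). With $D$ and $N$ as above, $\operatorname{Var}(D\mid N=n)=nq(1-q)$, so $\operatorname{Var}(D\mid N)=Nq(1-q)$ and $E[D\mid N]=Nq$.
For geometric($p$) we have $\operatorname{Var}(N)=(1-p)/p^2=0.5/0.25=2$.
- Within-group: $E[\operatorname{Var}(D\mid N)]=E[Nq(1-q)]=q(1-q)\,E[N]=0.1\times 0.9\times 2=0.18$.
- Between-group: $\operatorname{Var}(E[D\mid N])=\operatorname{Var}(Nq)=q^2\operatorname{Var}(N)=0.01\times 2=0.02$.
So $\operatorname{Var}(D)=0.18+0.02=0.20$.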
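The decomposition can also be checked numerically without using it: build the marginal pmf of D from the joint pmf of (N, D), compute the variance from that marginal, and compare with the sum of the two terms. The sketch below assumes the geometric distribution starts at 1 and truncates N at `n_max`; the truncation point and all variable names are illustrative assumptions.

```python
from math import comb

p, q = 0.5, 0.1
n_max = 200  # truncation point (an assumption); the neglected tail mass is (1 - p) ** n_max

# pmf of N on {1, 2, ...}: P(N = n) = (1 - p) ** (n - 1) * p
pmf_n = {n: (1 - p) ** (n - 1) * p for n in range(1, n_max + 1)}

# Marginal pmf of D: sum over n of P(N = n) * P(D = d | N = n), with D | N = n binomial
pmf_d = [0.0] * (n_max + 1)
for n, pn in pmf_n.items():
    for d in range(n + 1):
        pmf_d[d] += pn * comb(n, d) * q**d * (1 - q) ** (n - d)

# Left side of (2): variance computed directly from the marginal of D
mean_d = sum(d * pd for d, pd in enumerate(pmf_d))
var_d = sum((d - mean_d) ** 2 * pd for d, pd in enumerate(pmf_d))

# Right side of (2): within-group and between-group terms
mean_n = sum(n * pn for n, pn in pmf_n.items())
var_n = sum((n - mean_n) ** 2 * pn for n, pn in pmf_n.items())
within = q * (1 - q) * mean_n  # E[Var(D | N)] = q(1 - q) E[N]
between = q**2 * var_n         # Var(E[D | N]) = q^2 Var(N)

print(f"Var(D) = {var_d:.6f}, within + between = {within + between:.6f}")
```

Up to truncation and floating-point rounding, both printed values should agree with the 0.20 obtained above, and `within` and `between` should match 0.18 and 0.02.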
Summary
- Law of total expectation: $E[X]=E[E[X\mid Y]]$. Averaging the conditional expectation over the conditioning variable recovers the unconditional expectation.
- Strategic use: choose $Y$ so that $E[X\mid Y=y]$ has a simple closed form, then combine using the distribution of $Y$.
- Law of total variance: $\operatorname{Var}(X)=E[\operatorname{Var}(X\mid Y)]+\operatorname{Var}(E[X\mid Y])$. Total variance splits into within-group variance and between-group variance.