Law of Total Expectation

Tags: Probability, Conditional Probability, Expectation

Prerequisites

The law of total probability says that, for a partition $\{B_i\}$ of the sample space, $P(A) = \sum_i P(A \mid B_i) P(B_i)$. The law of total expectation is the exact analogue for expectations: averaging the conditional expectation over the conditioning variable recovers the unconditional expectation.

Statement and proof

Theorem (Law of total expectation). For any integrable random variable $X$ and random variable $Y$:

$$E[X] = E\!\left[E[X \mid Y]\right]. \tag{1}$$

Proof in the discrete case

Let $Y$ take values $y_1, y_2, \ldots$ with $P(Y = y_i) = p_i > 0$, and suppose $X$ is also discrete with values $x_1, x_2, \ldots$. Since $E[X \mid Y]$ is the random variable that equals $E[X \mid Y = y_i]$ on the event $\{Y = y_i\}$:

$$E\!\left[E[X \mid Y]\right] = \sum_i E[X \mid Y = y_i] \, p_i = \sum_i \sum_k x_k \, P(X = x_k \mid Y = y_i) \, p_i.$$

Using $P(X = x_k \mid Y = y_i) \cdot p_i = P(X = x_k, Y = y_i)$ and exchanging the order of summation (justified because $X$ is integrable, so the double sum converges absolutely):

$$= \sum_k x_k \sum_i P(X = x_k, Y = y_i) = \sum_k x_k \, P(X = x_k) = E[X]. \qquad \square$$
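The discrete argument can be checked exactly on a small table. The sketch below uses a hypothetical joint pmf (the values are made up purely for illustration) and exact rational arithmetic, so both sides of (1) can be compared without rounding error.

```python
from fractions import Fraction as F

# Hypothetical joint pmf P(X = x, Y = y); the numbers are illustrative only.
joint = {
    (0, "a"): F(1, 8), (1, "a"): F(1, 8), (2, "a"): F(1, 4),
    (0, "b"): F(1, 4), (1, "b"): F(1, 8), (2, "b"): F(1, 8),
}
assert sum(joint.values()) == 1

# Marginal of Y: p_i = P(Y = y_i).
p_y = {}
for (x, y), prob in joint.items():
    p_y[y] = p_y.get(y, 0) + prob

def cond_mean(y):
    """E[X | Y = y] = sum_k x_k * P(X = x_k, Y = y) / P(Y = y)."""
    return sum(x * prob for (x, yy), prob in joint.items() if yy == y) / p_y[y]

# Left side of (1): E[X] computed directly from the joint pmf.
lhs = sum(x * prob for (x, _), prob in joint.items())

# Right side of (1): conditional means averaged with the weights p_i.
rhs = sum(cond_mean(y) * prob for y, prob in p_y.items())

print(lhs, rhs)
```

Both printed values agree for this table, mirroring the exchange of the two sums in the proof.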

Proof in the absolutely continuous case

If $(X, Y)$ has joint density $f_{X,Y}$ and marginals $f_X$, $f_Y$, then with $f_{X \mid Y}(x \mid y) = f_{X,Y}(x,y)/f_Y(y)$ wherever $f_Y(y) > 0$:

$$E\!\left[E[X \mid Y]\right] = \int_{-\infty}^{+\infty} E[X \mid Y = y] \, f_Y(y) \, dy = \int_{-\infty}^{+\infty} \!\!\left(\int_{-\infty}^{+\infty} x \, f_{X \mid Y}(x \mid y) \, dx\right) f_Y(y) \, dy.$$

Since $f_{X \mid Y}(x \mid y) f_Y(y) = f_{X,Y}(x,y)$, Fubini's theorem (applicable because $X$ is integrable) gives:

$$= \int_{-\infty}^{+\infty} x \left(\int_{-\infty}^{+\infty} f_{X,Y}(x,y) \, dy\right) dx = \int_{-\infty}^{+\infty} x \, f_X(x) \, dx = E[X]. \qquad \square$$
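As a rough numerical illustration of the continuous case, consider a hypothetical model that is not part of the text above: $Y \sim \operatorname{Uniform}(0, 1)$ and, given $Y = y$, $X \sim \operatorname{Uniform}(0, y)$. Then $E[X \mid Y] = Y/2$, so (1) predicts $E[X] = E[Y]/2 = 1/4$. The sketch below estimates both sides by simulation; the sample size and seed are arbitrary choices.

```python
import random

def estimate_both_sides(trials=100_000, seed=0):
    """Monte Carlo estimates of E[X] and E[E[X | Y]] for the assumed model."""
    rng = random.Random(seed)
    sum_x = 0.0
    sum_cond_mean = 0.0
    for _ in range(trials):
        y = rng.random()            # Y ~ Uniform(0, 1)
        x = rng.uniform(0.0, y)     # X | Y = y ~ Uniform(0, y)
        sum_x += x
        sum_cond_mean += y / 2      # E[X | Y = y] = y / 2
    return sum_x / trials, sum_cond_mean / trials

mean_x, mean_cond = estimate_both_sides()
print(f"E[X] estimate: {mean_x:.3f}, E[E[X|Y]] estimate: {mean_cond:.3f}")
```

Both estimates should land near $0.25$, up to simulation noise.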

Conditioning as a computational strategy

The strategic value of (1) is in the choice of $Y$: pick a conditioning variable that makes $E[X \mid Y = y]$ easy to compute, then combine using the distribution of $Y$.

Example. Items are produced in batches. The batch size $N$ is geometric on $\{1, 2, \ldots\}$ with parameter $p = 0.5$ (so $E[N] = 1/p = 2$). Given a batch of size $n$, each item is independently defective with probability $q = 0.1$. Let $D$ be the total number of defective items.

Conditioning on $N$: given $N = n$, $D \mid N = n$ is $\operatorname{Binomial}(n, q)$ with mean $nq$, so

$$E[D \mid N] = Nq.$$

By the law of total expectation:

$$E[D] = E[E[D \mid N]] = E[Nq] = q \, E[N] = 0.1 \times 2 = 0.2.$$
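The batch example can be sanity-checked by simulation. The sketch below assumes $N$ counts Bernoulli($p$) trials up to and including the first success, which is the convention under which $E[N] = 1/p$. The function names, sample size, and seed are illustrative choices.

```python
import random

def sample_geometric(rng, p):
    """Number of Bernoulli(p) trials up to and including the first success."""
    n = 1
    while rng.random() >= p:   # failure with probability 1 - p
        n += 1
    return n

def sample_defectives(rng, p=0.5, q=0.1):
    """Draw a batch size N, then count defective items among the N items."""
    n = sample_geometric(rng, p)
    return sum(rng.random() < q for _ in range(n))

def estimate_mean_defectives(trials=200_000, seed=0):
    rng = random.Random(seed)
    return sum(sample_defectives(rng) for _ in range(trials)) / trials

estimate = estimate_mean_defectives()
print(f"simulated E[D]: {estimate:.3f}, closed form q * E[N]: {0.1 * 2:.3f}")
```

The simulated mean should be close to $0.2$. The closed form needed only $E[N]$, not the full distribution of $D$.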

Law of total variance

A companion identity decomposes the variance of $X$ into two interpretable parts:

$$\operatorname{Var}(X) = E[\operatorname{Var}(X \mid Y)] + \operatorname{Var}(E[X \mid Y]). \tag{2}$$

The first term, $E[\operatorname{Var}(X \mid Y)]$, is the within-group variance: the average variability of $X$ within each level of $Y$. The second term, $\operatorname{Var}(E[X \mid Y])$, is the between-group variance: how much the conditional mean $E[X \mid Y = y]$ itself varies across levels of $Y$.

Proof. Use $\operatorname{Var}(Z) = E[Z^2] - (E[Z])^2$, together with its conditional version $E[X^2 \mid Y] = \operatorname{Var}(X \mid Y) + (E[X \mid Y])^2$, and apply the law of total expectation twice, once to $X^2$ and once to $X$:

$$E[X^2] = E\!\left[E[X^2 \mid Y]\right] = E\!\left[\operatorname{Var}(X \mid Y) + (E[X \mid Y])^2\right].$$

Also $(E[X])^2 = (E[E[X \mid Y]])^2$. Subtracting:

$$\begin{aligned}
\operatorname{Var}(X) &= E[X^2] - (E[X])^2 \\
&= E[\operatorname{Var}(X \mid Y)] + E[(E[X \mid Y])^2] - (E[E[X \mid Y]])^2 \\
&= E[\operatorname{Var}(X \mid Y)] + \operatorname{Var}(E[X \mid Y]),
\end{aligned}$$

where the last step is the variance identity applied to $Z = E[X \mid Y]$. $\square$
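The decomposition (2) can also be verified exactly on a small table. The sketch below uses a hypothetical joint pmf (illustrative values only) and rational arithmetic to compute the within-group and between-group terms separately and compare their sum with $\operatorname{Var}(X)$.

```python
from fractions import Fraction as F

# Hypothetical joint pmf P(X = x, Y = y); the numbers are illustrative only.
joint = {
    (0, "a"): F(1, 8), (1, "a"): F(1, 8), (2, "a"): F(1, 4),
    (0, "b"): F(1, 4), (1, "b"): F(1, 8), (2, "b"): F(1, 8),
}

# Marginal of Y.
p_y = {}
for (x, y), prob in joint.items():
    p_y[y] = p_y.get(y, 0) + prob

def cond_moment(y, power):
    """E[X**power | Y = y]."""
    return sum(x**power * prob for (x, yy), prob in joint.items() if yy == y) / p_y[y]

def cond_var(y):
    """Var(X | Y = y) = E[X^2 | Y = y] - (E[X | Y = y])^2."""
    return cond_moment(y, 2) - cond_moment(y, 1) ** 2

# Total variance computed directly from the joint pmf.
mean_x = sum(x * prob for (x, _), prob in joint.items())
total_var = sum(x**2 * prob for (x, _), prob in joint.items()) - mean_x**2

# Within-group term: E[Var(X | Y)].
within = sum(cond_var(y) * prob for y, prob in p_y.items())

# Between-group term: Var(E[X | Y]).
between = sum(cond_moment(y, 1) ** 2 * prob for y, prob in p_y.items()) - mean_x**2

print(total_var, within, between)
```

For this table the within-group and between-group terms add up to the total variance exactly.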

Example (continued). With $D$ and $N$ as above, $\operatorname{Var}(D \mid N = n) = nq(1-q)$, so $\operatorname{Var}(D \mid N) = Nq(1-q)$ and $E[D \mid N] = Nq$.

For geometric$(p)$ we have $\operatorname{Var}(N) = (1-p)/p^2 = 2$.

  • Within-group: $E[\operatorname{Var}(D \mid N)] = E[Nq(1-q)] = q(1-q) \, E[N] = 0.1 \times 0.9 \times 2 = 0.18$.
  • Between-group: $\operatorname{Var}(E[D \mid N]) = \operatorname{Var}(Nq) = q^2 \operatorname{Var}(N) = 0.01 \times 2 = 0.02$.

So $\operatorname{Var}(D) = 0.18 + 0.02 = 0.20$.
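These figures can be cross-checked numerically without simulation. The sketch below truncates the geometric distribution at a hypothetical cutoff `N_MAX = 200` (an assumption; the neglected tail mass is about $0.5^{200}$) and computes the two terms of the decomposition, along with $\operatorname{Var}(D)$ directly from the joint distribution of $(N, D)$.

```python
from math import comb

p, q = 0.5, 0.1
N_MAX = 200  # truncation point for the geometric sum (assumed; tail is negligible)

def pmf_n(n):
    """P(N = n) for a geometric variable on {1, 2, ...}."""
    return (1 - p) ** (n - 1) * p

def binom_pmf(n, d):
    """P(D = d | N = n) for Binomial(n, q)."""
    return comb(n, d) * q**d * (1 - q) ** (n - d)

sizes = range(1, N_MAX + 1)

# Terms of the decomposition, computed through conditioning on N.
mean_n = sum(pmf_n(n) * n for n in sizes)
var_n = sum(pmf_n(n) * n**2 for n in sizes) - mean_n**2
within = q * (1 - q) * mean_n      # E[Var(D | N)] = q(1 - q) E[N]
between = q**2 * var_n             # Var(E[D | N]) = q^2 Var(N)

# Var(D) computed directly from the joint distribution of (N, D).
m1 = sum(pmf_n(n) * binom_pmf(n, d) * d for n in sizes for d in range(n + 1))
m2 = sum(pmf_n(n) * binom_pmf(n, d) * d**2 for n in sizes for d in range(n + 1))
var_direct = m2 - m1**2

print(f"within={within:.4f} between={between:.4f} direct={var_direct:.4f}")
```

The direct computation needs a double sum over batch sizes and defect counts, whereas the decomposition only needs the mean and variance of $N$.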

Summary

  • Law of total expectation: $E[X] = E[E[X \mid Y]]$. Averaging the conditional expectation over the conditioning variable recovers the unconditional expectation.
  • Strategic use: choose $Y$ so that $E[X \mid Y = y]$ has a simple closed form, then combine using the distribution of $Y$.
  • Law of total variance: $\operatorname{Var}(X) = E[\operatorname{Var}(X \mid Y)] + \operatorname{Var}(E[X \mid Y])$. Total variance splits into within-group variance and between-group variance.