Conditional Expectation


The expectation $E[X]$ is the probability-weighted average of $X$ over the whole sample space. Conditional expectation asks the same question but restricts to a sub-population: given that event $B$ occurred, or given that a random variable $Y$ took value $y$, what is the average of $X$?

Conditional expectation given an event

Let $B \in \mathcal{F}$ with $P(B) > 0$ and let $X$ be an integrable random variable. The conditional expectation of $X$ given $B$ is the expectation of $X$ under the conditional probability $P(\cdot \mid B)$:

$$E[X \mid B] \coloneqq \int_\Omega X \, dP(\cdot \mid B).$$

In the discrete case (where $X$ takes values $x_1, x_2, \ldots$):

$$E[X \mid B] = \sum_k x_k \, P(X = x_k \mid B).$$

In the absolutely continuous case, if $(X, \mathbf{1}_B)$ has a well-defined conditional density $f_{X \mid B}$:

$$E[X \mid B] = \int_{-\infty}^{+\infty} x \, f_{X \mid B}(x) \, dx.$$

The result $E[X \mid B]$ is a constant: it is a single number, not a random variable.
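The discrete formula can be sketched in a few lines of Python. This is only an illustration on a finite sample space; the helper name `conditional_expectation_given_event` and the fair-die example are assumptions made here, not part of the text above.

```python
from fractions import Fraction


def conditional_expectation_given_event(pmf, value, event):
    """Compute E[X | B] on a finite sample space.

    pmf:   dict mapping each outcome to its probability
    value: function outcome -> value of X at that outcome
    event: function outcome -> True if the outcome lies in B
    """
    p_b = sum(p for w, p in pmf.items() if event(w))
    if p_b == 0:
        raise ValueError("P(B) must be positive")
    # Sum of X * P restricted to B, renormalised by P(B).
    return sum(value(w) * p for w, p in pmf.items() if event(w)) / p_b


# Hypothetical example: fair six-sided die, X = face value, B = "roll is even".
die = {face: Fraction(1, 6) for face in range(1, 7)}
e_x = sum(face * p for face, p in die.items())
e_x_given_even = conditional_expectation_given_event(
    die, value=lambda w: w, event=lambda w: w % 2 == 0
)
print(e_x, e_x_given_even)  # 7/2 4
```

Both results are plain numbers: restricting to the even faces moves the average from 7/2 to 4.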

Conditional expectation given a random variable

The more general and powerful concept conditions on the value of a random variable $Y$.

Discrete case

If $Y$ takes values $y_1, y_2, \ldots$ and $P(Y = y) > 0$, define for each such $y$:

$$E[X \mid Y = y] \coloneqq \sum_k x_k \, P(X = x_k \mid Y = y).$$
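As a small sketch of this definition, the snippet below computes $E[X \mid Y = y]$ from a joint probability table. The table `joint` and the function name `cond_exp_given_value` are made up for illustration.

```python
from fractions import Fraction


def cond_exp_given_value(joint, y):
    """E[X | Y = y] for a joint pmf given as {(x, y): probability}."""
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)
    if p_y == 0:
        raise ValueError("P(Y = y) must be positive")
    # sum_k x_k * P(X = x_k, Y = y) / P(Y = y)
    return sum(x * p for (x, yy), p in joint.items() if yy == y) / p_y


# Hypothetical joint distribution of (X, Y); the probabilities sum to 1.
joint = {
    (0, 0): Fraction(1, 4),
    (1, 0): Fraction(1, 4),
    (1, 1): Fraction(1, 8),
    (2, 1): Fraction(3, 8),
}

print(cond_exp_given_value(joint, 0), cond_exp_given_value(joint, 1))  # 1/2 7/4
```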

Jointly continuous case

If $(X, Y)$ has joint density $f_{X,Y}$ and marginal $f_Y(y) > 0$, the conditional density of $X$ given $Y = y$ is

$$f_{X \mid Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)},$$

and

$$E[X \mid Y = y] = \int_{-\infty}^{+\infty} x \, f_{X \mid Y}(x \mid y) \, dx.$$
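As a worked illustration, take the joint density $f_{X,Y}(x, y) = x + y$ on the unit square $[0,1]^2$ and zero elsewhere. This density is chosen here for concreteness and does not come from the text above. For $0 \le y \le 1$:

$$
\begin{aligned}
f_Y(y) &= \int_0^1 (x + y) \, dx = y + \tfrac12, \\
f_{X \mid Y}(x \mid y) &= \frac{x + y}{y + \tfrac12}, \qquad 0 \le x \le 1, \\
E[X \mid Y = y] &= \frac{1}{y + \tfrac12} \int_0^1 x (x + y) \, dx
  = \frac{\tfrac13 + \tfrac{y}{2}}{y + \tfrac12}
  = \frac{3y + 2}{3(2y + 1)}.
\end{aligned}
$$

As a sanity check, at $y = 0$ the conditional density is $2x$ on $[0,1]$, whose mean is $2/3$, in agreement with the formula.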

$E[X \mid Y]$ as a random variable

The expression $E[X \mid Y = y]$ is a deterministic function of $y$; call it $g(y)$. Composing with $Y$ gives the conditional expectation

$$E[X \mid Y] \coloneqq g(Y).$$

This is a random variable: a function of the random variable $Y$. Before $Y$ is observed you do not know which value $g(Y)$ will take. Informally, $E[X \mid Y]$ is the best prediction of $X$ from $Y$ in the mean-square sense: when $X$ is square-integrable, among all measurable functions $h$ with $h(Y)$ square-integrable, the one minimising $E[(X - h(Y))^2]$ is $h = g$.
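The mean-square property can be checked numerically on a small example. The sketch below builds $g$ from a made-up joint table and compares its mean-squared error against a grid of perturbed predictors; the table, the helper names, and the perturbation grid are all assumptions for illustration, and a finite grid is a spot check rather than a proof.

```python
from fractions import Fraction
from itertools import product

# Hypothetical joint pmf of (X, Y), given as {(x, y): probability}.
joint = {
    (0, 0): Fraction(1, 4),
    (1, 0): Fraction(1, 4),
    (1, 1): Fraction(1, 8),
    (2, 1): Fraction(3, 8),
}


def cond_mean(joint, y):
    """g(y) = E[X | Y = y]."""
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)
    return sum(x * p for (x, yy), p in joint.items() if yy == y) / p_y


def mse(joint, h):
    """E[(X - h(Y))^2] for a predictor h given as a dict y -> h(y)."""
    return sum(p * (x - h[y]) ** 2 for (x, y), p in joint.items())


ys = sorted({y for (_, y) in joint})
g = {y: cond_mean(joint, y) for y in ys}  # E[X | Y] is the random variable g(Y)
mse_g = mse(joint, g)

# Competing predictors: g shifted by multiples of 1/4 at each value of Y.
shifts = [Fraction(k, 4) for k in range(-4, 5)]
competitors = [
    {y: g[y] + delta[i] for i, y in enumerate(ys)}
    for delta in product(shifts, repeat=len(ys))
]
g_is_best_on_grid = all(mse(joint, h) >= mse_g for h in competitors)

# g(Y) also has the same mean as X.
mean_x = sum(p * x for (x, _), p in joint.items())
mean_g = sum(p * g[y] for (_, y), p in joint.items())
print(g_is_best_on_grid, mean_x == mean_g)  # True True
```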

Key properties

Throughout, $X, X_1, X_2$ are integrable random variables and $Y, Z$ are arbitrary random variables.

Linearity

$$E[a X_1 + b X_2 \mid Y] = a \, E[X_1 \mid Y] + b \, E[X_2 \mid Y].$$

Monotonicity

If $X_1 \leq X_2$ almost surely, then $E[X_1 \mid Y] \leq E[X_2 \mid Y]$ almost surely.

Taking out what is known

If $h$ is a measurable function such that $h(Y) X$ is integrable:

$$E[h(Y) \, X \mid Y] = h(Y) \, E[X \mid Y]. \tag{1}$$

Once you know $Y$, the factor $h(Y)$ is a constant from the perspective of $P(\cdot \mid Y = y)$ and factors out of the expectation.

Example. If $Y$ and $X$ are independent and $Y$ is integrable, then $E[X \mid Y] = E[X]$ (a constant function), and identity (1) with $h(y) = y$ gives $E[Y \cdot X \mid Y] = Y \cdot E[X]$. Taking expectations of both sides, and using $E[E[W \mid Y]] = E[W]$, yields $E[YX] = E[Y] \, E[X]$, which recovers the standard independence formula.
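Identity (1) can also be checked value by value in the discrete setting. In the sketch below the joint table and the function $h(y) = y^2 + 3$ are hypothetical choices; for each $y$, both sides are computed independently and compared.

```python
from fractions import Fraction

# Hypothetical joint pmf of (X, Y), given as {(x, y): probability}.
joint = {
    (0, 0): Fraction(1, 4),
    (1, 0): Fraction(1, 4),
    (1, 1): Fraction(1, 8),
    (2, 1): Fraction(3, 8),
}


def cond_exp(joint, func, y):
    """E[func(X, Y) | Y = y] for a finite joint pmf."""
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)
    return sum(func(x, yy) * p for (x, yy), p in joint.items() if yy == y) / p_y


def h(y):
    """An arbitrary (hypothetical) function of Y."""
    return y * y + 3


lhs_vals = {}  # E[h(Y) X | Y = y]
rhs_vals = {}  # h(y) * E[X | Y = y]
for y in (0, 1):
    lhs_vals[y] = cond_exp(joint, lambda x, yy: h(yy) * x, y)
    rhs_vals[y] = h(y) * cond_exp(joint, lambda x, yy: x, y)

print(lhs_vals == rhs_vals)  # True
```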

Iterated conditioning

If $Y = f(Z)$ for some measurable function $f$ (so $Y$ is “coarser” than $Z$):

$$E\!\left[E[X \mid Z] \mid Y\right] = E[X \mid Y]. \tag{2}$$

Conditioning on $Y$ after having conditioned on the finer $Z$ washes out the extra precision: you end up with just the $Y$-level information.
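A finite example makes the "washing out" concrete. The sketch below assumes a made-up pmf over pairs $(x, z)$ and the coarsening $f(z) = \lfloor z/2 \rfloor$; the helper `cond_exp` is illustrative only. It first averages $X$ within each $Z$-cell, then averages those cell means within each $Y$-cell, and compares the result with conditioning on $Y$ directly.

```python
from collections import defaultdict
from fractions import Fraction


def cond_exp(pmf, value, label):
    """Return {l: E[value | label = l]} for a finite pmf over outcomes."""
    mass = defaultdict(Fraction)
    weighted = defaultdict(Fraction)
    for w, p in pmf.items():
        mass[label(w)] += p
        weighted[label(w)] += value(w) * p
    return {l: weighted[l] / mass[l] for l in mass if mass[l] > 0}


# Hypothetical pmf over outcomes w = (x, z); the probabilities sum to 1.
pmf = {
    (0, 0): Fraction(1, 8),
    (2, 0): Fraction(1, 8),
    (4, 1): Fraction(1, 4),
    (3, 2): Fraction(1, 8),
    (5, 2): Fraction(1, 8),
    (0, 3): Fraction(1, 4),
}


def f(z):
    """Coarsening map: Y = f(Z) merges z in {0, 1} and z in {2, 3}."""
    return z // 2


# Inner step: E[X | Z = z] for each z.
inner = cond_exp(pmf, value=lambda w: w[0], label=lambda w: w[1])

# Left side of (2): condition the random variable E[X | Z] on Y = f(Z).
lhs = cond_exp(pmf, value=lambda w: inner[w[1]], label=lambda w: f(w[1]))

# Right side of (2): condition X on Y directly.
rhs = cond_exp(pmf, value=lambda w: w[0], label=lambda w: f(w[1]))

print(lhs == rhs)  # True
```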

Summary

  • $E[X \mid B]$: the expectation of $X$ under $P(\cdot \mid B)$; a constant when $B$ is a fixed event with $P(B) > 0$.
  • $E[X \mid Y = y]$: the conditional mean of $X$ when $Y$ is known to equal $y$; a deterministic function of $y$.
  • $E[X \mid Y]$: the random variable $g(Y)$ where $g(y) = E[X \mid Y = y]$; it is the best mean-square predictor of $X$ from $Y$.
  • Key properties: linearity, monotonicity, taking out what is known ($E[h(Y) X \mid Y] = h(Y) E[X \mid Y]$), and iterated conditioning ($E[E[X \mid Z] \mid Y] = E[X \mid Y]$ when $Y$ is coarser than $Z$).