Variance

Tags: Probability, Expectation

Expectation tells you where the distribution is centred. Variance tells you how spread out it is around that centre — how much a typical observation deviates from the mean. It is the simplest second-order property of a distribution, and it underlies everything from the standard error of an estimator to the volatility of a financial instrument.

Definition

Let $X$ be a random variable with finite expectation $\mu \coloneqq E[X]$. The variance of $X$ is

$$\operatorname{Var}(X) \coloneqq E\bigl[(X - \mu)^2\bigr].$$

It is the expected squared deviation of $X$ from its mean. Since $(X - \mu)^2 \geq 0$, the variance is always non-negative: $\operatorname{Var}(X) \geq 0$.

The standard deviation is

$$\sigma_X \coloneqq \sqrt{\operatorname{Var}(X)},$$

which has the same physical units as $X$ itself. Variance is the more algebraically convenient quantity, but standard deviation is what you report in practice.
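As a rough illustration of the definition, the sketch below computes the variance and standard deviation of a fair six-sided die directly from $E[(X - \mu)^2]$. The die and the helper name `variance` are assumptions chosen for the example, not part of the text above; exact fractions are used so that no rounding enters.

```python
from fractions import Fraction
from math import sqrt

def variance(pmf):
    """Variance of a discrete distribution given as {value: probability}."""
    mu = sum(x * p for x, p in pmf.items())                 # E[X]
    return sum((x - mu) ** 2 * p for x, p in pmf.items())   # E[(X - mu)^2]

# Illustrative assumption: a fair die, each face with probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

var = variance(die)
sd = sqrt(var)  # standard deviation, in the same units as the die faces

print(var)  # 35/12
print(sd)
```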

The computational formula

Expanding the square and applying linearity of expectation gives a formula that avoids computing $\mu$ first:

$$\operatorname{Var}(X) = E[X^2] - (E[X])^2. \tag{1}$$

Proof.

$$\operatorname{Var}(X) = E\bigl[(X - \mu)^2\bigr] = E[X^2 - 2\mu X + \mu^2] = E[X^2] - 2\mu E[X] + \mu^2 = E[X^2] - 2\mu^2 + \mu^2 = E[X^2] - \mu^2.$$

Formula $(1)$ is the standard computational shortcut: you only need to compute $E[X]$ and $E[X^2]$.
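A quick way to see that the shortcut agrees with the definition is to compute both on a small example. The sketch below again assumes a fair die, and the helper `expect` is a hypothetical name introduced only for this illustration.

```python
from fractions import Fraction

# Illustrative assumption: a fair six-sided die.
die = {face: Fraction(1, 6) for face in range(1, 7)}

def expect(pmf, g=lambda x: x):
    """E[g(X)] for a discrete distribution given as {value: probability}."""
    return sum(g(x) * p for x, p in pmf.items())

mu = expect(die)

# Definition: expected squared deviation from the mean.
var_definition = expect(die, lambda x: (x - mu) ** 2)

# Shortcut: second moment minus squared first moment.
var_shortcut = expect(die, lambda x: x * x) - mu ** 2

print(var_definition, var_shortcut)  # 35/12 35/12
```

With exact fractions the two agree identically. In floating point the shortcut can lose precision when the mean is large relative to the spread, because it subtracts two nearly equal numbers.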

Example. For $X \sim \operatorname{Uniform}(a, b)$ with density $f(x) = \frac{1}{b-a}$ on $[a,b]$:

$$E[X] = \frac{a+b}{2}, \qquad E[X^2] = \frac{a^2 + ab + b^2}{3},$$

so

$$\operatorname{Var}(X) = \frac{a^2 + ab + b^2}{3} - \frac{(a+b)^2}{4} = \frac{(b-a)^2}{12}.$$
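The second moment and the final simplification can be filled in step by step. The following is a worked sketch that uses only the density stated in the example:

```latex
\begin{aligned}
E[X^2] &= \int_a^b \frac{x^2}{b-a}\,dx
        = \frac{b^3 - a^3}{3(b-a)}
        = \frac{a^2 + ab + b^2}{3}, \\
\operatorname{Var}(X) &= \frac{4(a^2 + ab + b^2) - 3(a+b)^2}{12}
        = \frac{a^2 - 2ab + b^2}{12}
        = \frac{(b-a)^2}{12}.
\end{aligned}
```

The first line uses the factorisation $b^3 - a^3 = (b-a)(a^2 + ab + b^2)$; the second puts both terms over the common denominator $12$.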

Scaling and shift rules

Theorem. For constants $a, b \in \mathbb{R}$:

$$\operatorname{Var}(aX + b) = a^2 \operatorname{Var}(X). \tag{2}$$

Proof. Let $Y = aX + b$. Then $E[Y] = aE[X] + b$, so $Y - E[Y] = a(X - E[X])$. Therefore:

$$\operatorname{Var}(Y) = E\bigl[(Y - E[Y])^2\bigr] = E\bigl[a^2 (X - E[X])^2\bigr] = a^2 \operatorname{Var}(X).$$

Two observations:

  • Shifts don’t affect variance. Adding a constant $b$ moves the distribution but doesn’t change its spread.
  • Scaling squares. Multiplying $X$ by $a$ multiplies the standard deviation by $|a|$ and the variance by $a^2$.

Equivalently, $\sigma_{aX+b} = |a| \sigma_X$.
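The rule can be checked on a unit conversion. The sketch below uses made-up temperature readings (an assumption for illustration) and treats them as a distribution that puts equal weight on each reading, so `statistics.pvariance` gives its variance. Converting Celsius to Fahrenheit is the affine map with $a = 9/5$ and $b = 32$.

```python
from fractions import Fraction
from statistics import pvariance

# Hypothetical readings in degrees Celsius, kept as exact fractions.
celsius = [Fraction(18), Fraction(21), Fraction(25), Fraction(30)]

# Fahrenheit = a * Celsius + b with a = 9/5 and b = 32.
a, b = Fraction(9, 5), Fraction(32)
fahrenheit = [a * c + b for c in celsius]

var_c = pvariance(celsius)
var_f = pvariance(fahrenheit)

print(var_c)           # 81/4
print(var_f)           # 6561/100
print(a ** 2 * var_c)  # 6561/100, the shift b = 32 has dropped out
```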

Variance is not linear

Unlike expectation, variance is not linear: $\operatorname{Var}(X + Y) \neq \operatorname{Var}(X) + \operatorname{Var}(Y)$ in general. The correct formula involves the covariance:

$$\operatorname{Var}(X + Y) = \operatorname{Var}(X) + 2\operatorname{Cov}(X, Y) + \operatorname{Var}(Y).$$

When $X$ and $Y$ are independent, $\operatorname{Cov}(X, Y) = 0$ (proved in Independence of Random Variables), so independence does give additivity:

$$\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) \quad \text{when } X, Y \text{ independent.}$$

More generally, for $n$ independent variables:

$$\operatorname{Var}(X_1 + \cdots + X_n) = \operatorname{Var}(X_1) + \cdots + \operatorname{Var}(X_n).$$
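To make the role of independence concrete, the sketch below compares two cases built from a fair die (an illustrative assumption): the sum of two independent rolls, and the fully dependent case $Y = X$, where $X + Y = 2X$. The helper `variance` is a hypothetical name for this example.

```python
from fractions import Fraction
from itertools import product

def variance(pmf):
    """Variance of a discrete distribution given as {value: probability}."""
    mu = sum(x * p for x, p in pmf.items())
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

die = {face: Fraction(1, 6) for face in range(1, 7)}

# Independent X and Y: the joint probability is the product of the marginals.
sum_indep = {}
for (x, px), (y, py) in product(die.items(), repeat=2):
    sum_indep[x + y] = sum_indep.get(x + y, 0) + px * py

# Fully dependent case Y = X, so that X + Y = 2X.
sum_dep = {2 * x: p for x, p in die.items()}

print(variance(die))        # 35/12
print(variance(sum_indep))  # 35/6, equal to Var(X) + Var(Y)
print(variance(sum_dep))    # 35/3, equal to 4 Var(X)
```

In the dependent case $\operatorname{Cov}(X, X) = \operatorname{Var}(X)$, so the covariance term contributes the extra $2\operatorname{Var}(X)$.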

Chebyshev’s inequality

Variance bounds the probability of large deviations from the mean. Chebyshev’s inequality states: for any $k > 0$,

$$P\bigl(|X - \mu| \geq k\sigma\bigr) \leq \frac{1}{k^2}. \tag{3}$$

More generally, for any $\varepsilon > 0$:

$$P\bigl(|X - \mu| \geq \varepsilon\bigr) \leq \frac{\operatorname{Var}(X)}{\varepsilon^2}.$$

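Proof. The events $\{|X - \mu| \geq \varepsilon\}$ and $\{(X - \mu)^2 \geq \varepsilon^2\}$ coincide, so by Markov’s inequality applied to the non-negative random variable $(X - \mu)^2$:

$$P\bigl(|X - \mu| \geq \varepsilon\bigr) = P\bigl((X - \mu)^2 \geq \varepsilon^2\bigr) \leq \frac{E[(X-\mu)^2]}{\varepsilon^2} = \frac{\operatorname{Var}(X)}{\varepsilon^2}.$$

Taking $\varepsilon = k\sigma$ recovers $(3)$.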
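To get a feel for how tight the bound is, the sketch below compares the exact tail probability with the Chebyshev bound for a fair die (an illustrative assumption). The names `exact_tail` and `chebyshev_bound` are hypothetical helpers for this example.

```python
from fractions import Fraction

# Illustrative assumption: a fair six-sided die.
die = {face: Fraction(1, 6) for face in range(1, 7)}
mu = sum(x * p for x, p in die.items())
var = sum((x - mu) ** 2 * p for x, p in die.items())

def exact_tail(eps):
    """Exact P(|X - mu| >= eps) for the die."""
    return sum(p for x, p in die.items() if abs(x - mu) >= eps)

def chebyshev_bound(eps):
    """Chebyshev upper bound Var(X) / eps^2."""
    return var / eps ** 2

for eps in (Fraction(3, 2), Fraction(5, 2)):
    print(eps, exact_tail(eps), chebyshev_bound(eps))
# 3/2 2/3 35/27
# 5/2 1/3 7/15
```

For the smaller threshold the bound exceeds 1 and says nothing; for the larger one it is valid but noticeably above the true probability.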

Chebyshev’s inequality is often loose, because it must hold for every distribution with finite variance, but that same generality makes it universally applicable. It is the key tool in proving the weak law of large numbers: if $X_1, X_2, \ldots$ are i.i.d. with mean $\mu$ and finite variance, then $\overline{X}_n = \frac{1}{n}\sum_{k=1}^n X_k$ converges in probability to $\mu$.

Quick proof. $E[\overline{X}_n] = \mu$ and $\operatorname{Var}(\overline{X}_n) = \frac{\operatorname{Var}(X_1)}{n} \to 0$ by independence and the scaling rule. For every fixed $\varepsilon > 0$, Chebyshev gives $P(|\overline{X}_n - \mu| \geq \varepsilon) \leq \frac{\operatorname{Var}(X_1)}{n\varepsilon^2} \to 0$ as $n \to \infty$.
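The $1/n$ decay in the quick proof can be verified exactly on a small case. The sketch below assumes fair dice and builds the exact distribution of the average of $n$ independent rolls by repeated convolution; `convolve` and `mean_pmf` are hypothetical helpers, and the threshold $\varepsilon = 1/2$ is an arbitrary choice for illustration.

```python
from fractions import Fraction

def variance(pmf):
    """Variance of a discrete distribution given as {value: probability}."""
    mu = sum(x * p for x, p in pmf.items())
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

def convolve(p, q):
    """PMF of the sum of two independent variables with PMFs p and q."""
    out = {}
    for x, px in p.items():
        for y, py in q.items():
            out[x + y] = out.get(x + y, 0) + px * py
    return out

die = {face: Fraction(1, 6) for face in range(1, 7)}

def mean_pmf(n):
    """Exact PMF of the average of n independent fair dice."""
    total = die
    for _ in range(n - 1):
        total = convolve(total, die)
    return {Fraction(s, n): p for s, p in total.items()}

mu, eps = Fraction(7, 2), Fraction(1, 2)
for n in (1, 4, 16):
    pmf = mean_pmf(n)
    tail = sum(p for x, p in pmf.items() if abs(x - mu) >= eps)
    bound = variance(pmf) / eps ** 2
    # Columns: n, Var of the average, exact tail, Chebyshev bound.
    print(n, variance(pmf), float(tail), float(bound))
```

The variance column equals $\frac{35}{12n}$ exactly, and the exact tail stays below the Chebyshev bound whenever the bound is informative.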

Why variance rather than mean absolute deviation?

Variance squares the deviation. An alternative spread measure is the mean absolute deviation $E[|X - \mu|]$. Both capture spread, but variance has three practical advantages:

  1. Algebra. The variance of a sum of independent variables is the sum of variances (as above). The analogous result for mean absolute deviation fails.
  2. Smoothness. The function $x \mapsto x^2$ is everywhere differentiable; $x \mapsto |x|$ is not differentiable at $0$. Variance appears naturally in calculus-based derivations (ordinary least squares, Fisher information, etc.).
  3. Completeness. Variance extends to the covariance matrix for multivariate distributions, which mean absolute deviation cannot.

The cost is interpretability: $\sigma^2$ has units of $(\text{units of } X)^2$, which is why the standard deviation $\sigma$ is usually reported alongside, or instead of, the variance.
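The first advantage in the list above can be seen on the smallest possible example. The sketch below assumes two independent fair coins taking values $\pm 1$ (an illustrative choice) and compares how variance and mean absolute deviation behave under addition; `mean`, `variance`, and `mad` are hypothetical helper names.

```python
from fractions import Fraction
from itertools import product

def mean(pmf):
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

def mad(pmf):
    """Mean absolute deviation E[|X - mu|]."""
    mu = mean(pmf)
    return sum(abs(x - mu) * p for x, p in pmf.items())

# Illustrative assumption: a fair coin with values -1 and +1.
coin = {-1: Fraction(1, 2), 1: Fraction(1, 2)}

# Distribution of the sum of two independent coins.
two = {}
for (x, px), (y, py) in product(coin.items(), repeat=2):
    two[x + y] = two.get(x + y, 0) + px * py

print(variance(coin), variance(two))  # 1 2
print(mad(coin), mad(two))            # 1 1
```

The variance doubles, as additivity predicts, while the mean absolute deviation of the sum stays at 1 rather than 2.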

Summary

  • $\operatorname{Var}(X) \coloneqq E[(X - E[X])^2] \geq 0$ measures average squared spread around the mean.
  • Computational formula: $\operatorname{Var}(X) = E[X^2] - (E[X])^2$.
  • Scaling rule: $\operatorname{Var}(aX + b) = a^2 \operatorname{Var}(X)$; shifts do not affect variance.
  • Independence additivity: $\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$ when $X, Y$ are independent.
  • Chebyshev’s inequality: $P(|X - \mu| \geq \varepsilon) \leq \operatorname{Var}(X) / \varepsilon^2$, a universal (though weak) tail bound.
  • Variance is algebraically natural (additive under independence, smooth); standard deviation $\sigma = \sqrt{\operatorname{Var}(X)}$ is what you report because it matches the units of $X$.