
0x040 Probability

This note is a mixture of measure-based and non-measure-based probability.

1. Measure Theory

Details of measure theory are described in the real analysis note.

1.1. Probability Space

Definition (sample space) \(\Omega\) is called the sample space. Intuitively, it is the set of possible outcomes of an experiment.

Unfortunately, we cannot in general assign a measure to every subset of the sample space; therefore, we only consider an event space, which is a collection of subsets of \(\Omega\).

Definition (event space) \(\mathcal{F}\) is called the event space when \(\mathcal{F}\) is a \(\sigma\)-algebra on a set \(\Omega\). An event is an element of \(\mathcal{F}\)

On this proper event space, we can define the probability measure

Definition (probability measure) A probability measure on \((\Omega, \mathcal{F})\) is a measure \(P\) on \((\Omega, \mathcal{F})\) such that \(P(\Omega)=1\). If \(A\) is an event, then \(P(A)\) is called the probability of \(A\)

Recall that a measure has the following properties. Let \(\mu\) be a measure on \((\Omega, \mathcal{F})\):

  • (monotonicity) If \(A \subset B\), then \(\mu(A) \leq \mu(B)\)
  • (subadditivity) If \(A \subset \cup_{m=1}^{\infty} A_m\), then \(\mu(A) \leq \sum_{m=1}^\infty \mu(A_m)\)
  • (continuity from below) If \(A_i \uparrow A\) (i.e. \(A_1 \subset A_2 \subset ...\)), then \(\mu(A_i) \uparrow \mu(A)\)
  • (continuity from above) If \(A_i \downarrow A\) (and \(\mu(A_1) < \infty\)), then \(\mu(A_i) \downarrow \mu(A)\)

With these three components, we can define the probability space

Definition (probability space, Kolmogorov) If \(P\) is a probability measure on \((\Omega, \mathcal{F})\), then the triplet \((\Omega, \mathcal{F}, P)\) is called a probability space.

discrete probability space

Let \(\Omega\) be a countable set, \(\mathcal{F}\) be the set of all subsets of \(\Omega\) and

\[P(A) = \sum_{\omega \in A} p(\omega)\]

where \(\sum_{\omega \in \Omega} p(\omega) = 1\)
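As a concrete illustration (a minimal sketch in Python; the fair-die example and the helper name `P` are mine, not from the note), such a space can be coded directly:

```python
# Discrete probability space: Omega is countable, F is the power set,
# and P(A) sums the point masses p(omega) over omega in A.
p = {omega: 1 / 6 for omega in range(1, 7)}  # hypothetical fair die

def P(A):
    """Probability measure of an event A (any subset of Omega)."""
    return sum(p[omega] for omega in A if omega in p)

assert abs(sum(p.values()) - 1) < 1e-12  # sum of p(omega) over Omega is 1
print(P({2, 4, 6}))  # P(even) = 0.5
```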

Frequentist and Bayesian interpretation

There are two common interpretations of \(P(A)\): long-run frequencies (frequentist) and degrees of belief (Bayesian).

  • The frequentist interpretation says that \(P(A)\) is the long-run proportion of times that \(A\) is true in repeated experiments.
  • The degree of belief interpretation says that \(P(A)\) is the observer's strength of belief that \(A\) is true.

Note that probability in quantum mechanics probably does not belong to either of these interpretations.

1.2. Distribution

random variable is neither random nor a variable

Definition (Random Variable) Suppose \((\Omega, \mathcal{F}, P)\) is a probability space. A random variable on \((\Omega, \mathcal{F})\) is a measurable function from \(\Omega\) to \(R\).

Intuitively, a random variable is a function from the sample space to another sample space (i.e. \(R\)). Note that a random variable can even be defined to map into a measurable space other than \((R, \mathcal{B})\).

trivial random variable

If \(\Omega\) is a discrete probability space, then any function \(X: \Omega \to R\) is a random variable

For example, the indicator function of a set \(A \in \mathcal{F}\) is a random variable:

\[1_A(\omega) = \begin{cases} 1 & \omega \in A \\ 0 & \omega \notin A \end{cases}\]

Definition (distribution) If \(X\) is a random variable, then \(X\) induces a probability measure on \(R\) called its distribution by setting

\[\mu(A) = P(X \in A)\]

Definition (distribution function) The distribution of a random variable \(X\) is usually described by its distribution function

\[F(x) = P(X \leq x)\]

Every distribution function has the following properties; they all follow from simple properties of the measure:

  • \(F\) is nondecreasing
  • \(\lim_{x \to \infty} F(x) = 1, \lim_{x \to -\infty} F(x) = 0\)
  • \(F\) is right-continuous: \(\lim_{y \downarrow x} F(y) = F(x)\)
  • Let \(F(x-) = \lim_{y \uparrow x} F(y)\), then \(F(x-) = P(X < x)\)
  • \(P(X=x) = F(x) - F(x-)\)

Conversely, if a function satisfies the top 3 properties, then it is the distribution function of some random variable.

1.3. Conditional Probability

All probabilities are calculated with respect to a sample space, but in many cases we are in a position to update the sample space with new information. In this case, we use conditional probability.

Definition (Conditional Probability) If \(A,B\) are two events in a sample space and if \(P(B) > 0\) then the conditional probability of \(A\) given \(B\) is

\[P(A|B) = \frac{P(A \cap B)}{P(B)}\]

Note that \(B\) becomes the sample space here. In particular, for any fixed \(B\) such that \(P(B) > 0\), \(P(\cdot | B)\) is a probability measure (satisfying Kolmogorov's axioms).

prosecutor's fallacy: the fallacy of confusing \(P(A|B)\) with \(P(B|A)\)

Lemma : for any pair of events \(A\) and \(B\)

\[P(AB) = P(A|B)P(B) = P(B|A)P(A)\]

Theorem (The Law of Total Probability) Let \(A_1, ..., A_k\) be a partition of \(\Omega\). Then for any event \(B\)

\[P(B) = \sum_i P(B|A_i)P(A_i)\]

Theorem (Bayes' Theorem)

\[ P(A_i | B) = \frac{P(B|A_i)P(A_i)}{\sum_j P(B|A_j)P(A_j)}\]
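As a numeric sketch of the two theorems above (the prevalence and error rates below are made-up numbers, not from the note):

```python
# Bayes' theorem via the law of total probability.
# Partition: A_1 = "disease", A_2 = "no disease"; B = "test positive".
priors = {"D": 0.01, "no D": 0.99}       # P(A_i), hypothetical 1% prevalence
likelihood = {"D": 0.95, "no D": 0.05}   # P(B | A_i), hypothetical test rates

# Law of total probability: P(B) = sum_i P(B | A_i) P(A_i)
P_B = sum(likelihood[a] * priors[a] for a in priors)

# Bayes' theorem: P(A_i | B) = P(B | A_i) P(A_i) / P(B)
posterior = {a: likelihood[a] * priors[a] / P_B for a in priors}
print(posterior)  # P(D | positive) is only about 0.16
```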

Definition (independence) Two events \(A\) and \(B\) are independent if

\[P(AB) = P(A)P(B)\]

Independence can arise in two distinct ways

  • explicitly assume independence
  • derive independence by verifying the previous definition

Note that disjoint events with positive probability are not independent.

Mutual independence is a much stronger assumption. Pairwise independence for all pairs does not imply mutual independence.

Definition (mutual independence) A collection of events \(A_1, ..., A_n\) are mutually independent iff for any subcollection \(A_{i_1}, ..., A_{i_k}\)

\[P( \cap_{j=1}^{k} A_{i_{j}} ) = \prod_{j} P(A_{i_j})\]

1.4. Random variable

Recall from 1.2 that a random variable on \((\Omega, \mathcal{F})\) is a measurable function from \(\Omega\) to \(R\). A random variable can also be defined to map into a measurable space other than \((R, \mathcal{B})\):

Definition ((more general) random variable) Let \((E, \mathcal{E})\) be a measurable space. A mapping \(X: \Omega \to E\) is called a random variable if \(X\) is a measurable function with respect to \(\mathcal{F}\) and \(\mathcal{E}\), which means

\[\forall(A \in \mathcal{E}) X^{-1} (A) \in \mathcal{F}\]

Definition (induced probability function) The probability function induced by \(X\) from the original probability measure is defined as

\[ P_X(\{ x_i \}) = P_X(X = x_i) = P( \{ \omega \in \Omega \mid X(\omega) = x_i \} ) \]

Note that this is a formal probability distribution, which means it satisfies Kolmogorov's axioms

Note that \(X\) is a discrete random variable if its range is countable

\[R_X = \{ x_1, x_2, ... \}\]

2. Law of Large Numbers

2.1. Independence

Measure theory ends and probability begins with the definition of independence.

Definition (independence)

  • Two events \(A\), \(B\) are independent if \(P(A \cap B) = P(A) P(B)\)
  • Two random variables \(X,Y\) are independent if for all Borel sets \(C, D \subset R\), \(P(X \in C, Y \in D) = P(X \in C)P(Y \in D)\)
  • Two \(\sigma\)-fields \(\mathcal{F}, \mathcal{G}\) are independent if all events \(A \in \mathcal{F}, B \in \mathcal{G}\) are independent

2.2. Weak Law of Large Numbers

2.3. Borel-Cantelli Lemmas

2.4. Strong Law of Large Numbers

3. Central Limit Theorems

4. Univariate models

4.1. Transformation

Definition (transformation) If \(X\) is a random variable and \(g\) is a Borel measurable function, then \(Y = g(X)\) is also a random variable, and the probability distribution of \(Y\) is defined by

\[P(Y \in A) = P(X \in g^{-1}(A))\]

Corollary (transformation of support) It is important to keep track of the sample spaces of \(X\) and \(Y\); the support of \(Y\) is

\[\mathcal{Y} = \{ y | (\exists x \in \mathcal{X}) y = g(x) \}\]

Corollary (monotone transformation of cdf) Let \(X\) have cdf \(F_X(x)\) and let \(Y=g(X)\)

if \(g\) is an increasing function, then

\[F_Y(y) = F_X(g^{-1}(y))\]

if \(g\) is a decreasing function (and \(X\) is continuous), then

\[F_Y(y) = 1 - F_X(g^{-1}(y))\]

By taking derivative of both sides, we obtain the transformation rules of pdf for monotone functions.

Note this is a variant of the integration by substitution (derived from the fundamental theorem of calculus) where \(g^{-1} = \varphi\)

\[\int_{\varphi(a)}^{\varphi(b)} f(u) du =\int_a^b f(\varphi(x))\varphi'(x)dx\]

Theorem (monotone transformation of pdf) Let \(X\) have pdf \(f_X(x)\) and \(Y=g(X)\), where \(g\) is a monotone function. Suppose \(f_X(x)\) is continuous and \(g^{-1}(y)\) has a continuous derivative on \(\mathcal{Y}\), then

\[f_Y(y) = \begin{cases} f_X(g^{-1}(y))|\frac{d}{dy}g^{-1}(y)| & y \in \mathcal{Y} \\ 0 & \text{ otherwise}\end{cases}\]

Intuitively, the discussion above is simply

\[dF_X = dF_Y \implies f_X(x) dx = f_Y(y) dy\]

therefore, we get

\[f_Y(y) = f_X(x) \left|\frac{dx}{dy}\right|\]

Note

this only applies to monotone functions. For functions that are not monotone (e.g. \(Y=X^2\)), we partition the support of \(X\) into sets on each of which \(g\) is monotone with inverse \(g_i^{-1}\), then sum the contributions: \(f_Y(y) = \sum_i f_X(g_i^{-1}(y)) \left|\frac{dg_i^{-1}(y)}{dy}\right|\).
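As a numeric check of the monotone case (a sketch; the lognormal example is mine): with \(X \sim N(0,1)\) and the increasing map \(Y = e^X\), the formula gives \(f_Y(y) = f_X(\ln y)/y\), which should match the standard lognormal density.

```python
import numpy as np
from scipy import stats

y = np.array([0.5, 1.0, 2.0, 5.0])
# f_Y(y) = f_X(g^{-1}(y)) |d g^{-1}(y)/dy| with g^{-1}(y) = ln(y)
f_formula = stats.norm.pdf(np.log(y)) / y
f_reference = stats.lognorm.pdf(y, s=1)  # scipy's lognormal with sigma = 1
assert np.allclose(f_formula, f_reference)
print(f_formula)
```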

4.2. Expectation and Variance

Definition (expectation) Formally, suppose \((\Omega, \mathcal{F}, P)\) is a probability space. If \(X \in \mathcal{L}^1 (P)\), then the expectation of the random variable \(X\) is denoted \(EX\) and defined by

\[EX = \int_{\Omega} X dP\]

When \(X\) is a discrete random variable with range \(R_X = \{ x_1, x_2, ... \}\) (finite or countably infinite), the expected value of \(X\), denoted by \(EX\), is defined as

\[ EX = \sum_{x_k \in R_X} x_k P(X=x_k) \]

Note that the expectation does not exist for every distribution; for example, the Cauchy distribution does not have an expectation

\[f_X(x) = \frac{1}{\pi} \frac{1}{1+x^2}\]

Theorem (linearity of expectation)

\[E(aX + b) = aE(X) + b\]
\[E(X_1 + X_2 + ... + X_n) = EX_1 + EX_2 + ... + EX_n\]

Theorem (expectation of transformation) There are two ways to compute \(E[g(X)]\). One is to derive the pmf/pdf of \(Y = g(X)\); the other (often easier) is the following

\[E(g(X)) = \int g(x) f_X(x) dx\]
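A quick sketch of both routes for \(E[g(X)]\) (my own example: \(g(x)=x^2\) with \(X \sim N(1, 2^2)\), so the exact answer is \(\mu^2 + \sigma^2 = 5\)):

```python
import numpy as np
from scipy import stats, integrate

g = lambda x: x ** 2

# Route 1: the integral E[g(X)] = ∫ g(x) f_X(x) dx
val, _ = integrate.quad(lambda x: g(x) * stats.norm.pdf(x, loc=1, scale=2),
                        -np.inf, np.inf)

# Route 2: Monte Carlo over samples of X
rng = np.random.default_rng(0)
mc = g(rng.normal(1, 2, size=1_000_000)).mean()

print(val, mc)  # both approximately 5
```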

Theorem (Jensen's inequality) When \(f\) is a convex function, we have the following inequality

\[f(E[X]) \leq E[f(X)]\]

gap in Jensen's inequality

with a Taylor approximation, the gap between \(f(E[X])\) and \(E[f(X)]\) can be interpreted as being governed by the variance of \(X\) and the convexity of \(f\) (its second derivative): roughly \(E[f(X)] - f(E[X]) \approx \frac{1}{2} f''(E[X]) Var(X)\)

See the related Math StackExchange post.
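A small numeric sketch of the inequality and the Taylor estimate of the gap (the choice \(f = \exp\), \(X \sim N(1, 0.3^2)\) is mine):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=0.3, size=1_000_000)  # X ~ N(1, 0.3^2)
f = np.exp                                          # convex, f'' = exp

lhs, rhs = f(x.mean()), f(x).mean()
print(lhs <= rhs)                  # True: f(E X) <= E f(X)
print(rhs - lhs)                   # observed gap, about 0.125
print(0.5 * np.exp(1.0) * 0.3**2)  # (1/2) f''(mu) Var(X), about 0.122
```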

4.3. Moments

Moments reflect characteristics of a distribution; however, even the full (infinite) sequence of moments is not always enough to characterize it: two distinct random variables may have the same moment sequence. When the random variables have bounded support, equal moment sequences do imply equal distributions.

Definition (moment, central moment) For each integer \(n\), the \(n\)-th moment of \(X\), \(\mu'_n\), is

\[\mu^{'}_n = EX^n\]

The \(n\)th central moment of \(X\), \(\mu_n\) is

\[\mu_n = E(X-\mu)^n\]

The 2nd central moment is the variance defined as follows

Definition (Variance) The variance of a random variable \(X\), with mean \(EX = \mu_X\) is defined as

\[ Var(X) = E[ (X-\mu_X)^2 ] \]

The standard deviation of a random variable \(X\) is defined as

\[SD(X) = \sigma_X = \sqrt{Var(X)}\]

A simple way to compute variance is as follows

\[Var(X) = E[X^2] - (EX)^2 \]

Lemma (relationship between moments) The previous \(Var(X)\) can be written as

\[\mu_2 = \mu'_2 - \mu^2\]

The 3rd and 4th central moments have similar relationships

\[\mu_3 = \mu'_3 - 3\mu'_2\mu + 2\mu^3\]
\[\mu_4 = \mu'_4 - 4\mu'_3\mu + 6\mu'_2\mu^2 - 3\mu^4\]

Proposition (algebra of variance)

\[Var(aX + b) = a^2Var(X)\]
\[Var(X) = Var(X_1) + Var(X_2) + ... + Var(X_n) \text{ if } X = X_1 + X_2 + ... + X_n \text{ with independent } X_i\]
\[Var(X+Y) = Var(X) + Var(Y) + 2Cov(X,Y)\]

If \(X, Y\) are independent

\[Var(XY) = E(X^2Y^2) - (EXY)^2 = Var(X)Var(Y) + Var(X)E(Y)^2 + Var(Y)E(X)^2\]

Definition (Standardized moment) The standardized moment is the normalized central moment defined as

\[\hat{\mu}_k = \frac{\mu_k}{\sigma^k}\]

The 3rd standardized moment is called the skewness, which measures the lack of symmetry

The 4th standardized moment is called the kurtosis, which measures the peakedness of the pdf

While moments alone may not suffice to characterize a distribution, the following moment generating function does characterize the distribution whenever it exists

Definition (moment generating function, mgf) The moment generating function of \(X\), denoted by \(M_X(t)\), is the following, provided that the expectation exists for \(t\) in some neighborhood of 0.

\[M_X(t) = Ee^{tX}\]

Note: \(M_X(t)\) is the Laplace transform of \(f_X(x)\)

\[M_X(t) = \int_{-\infty}^{\infty} e^{tx} f_X(x) dx\]

Lemma (algebra over mgf)

\[M_{aX+b}(t) = e^{bt} M_X(at)\]
\[M_{X+Y}(t) = M_X(t)M_Y(t) \text{ if } X, Y \text{ are independent}\]

The moment generating function is so called because it can be used to generate moments by differentiation.

Theorem (moment generation) If \(X\) has mgf \(M_X(t)\), it can generate moments as follows

\[EX^n = M_X^{(n)} (0)\]
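A symbolic sketch of moment generation (assuming \(X \sim \text{Exponential}(\lambda)\) with \(M_X(t) = \lambda/(\lambda - t)\) for \(t < \lambda\); the expected output is \(EX^n = n!/\lambda^n\)):

```python
import sympy as sp

t, lam = sp.symbols("t lam", positive=True)
M = lam / (lam - t)  # mgf of Exponential(lam), valid for t < lam

for n in range(1, 5):
    moment = sp.simplify(sp.diff(M, t, n).subs(t, 0))  # E X^n = M^{(n)}(0)
    print(n, moment)  # 1/lam, 2/lam**2, 6/lam**3, 24/lam**4
```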

Note that the main use of the mgf is not to generate moments but to characterize distributions. An mgf pins down the full set of moments, and an infinite set of moments alone is not enough to determine a distribution uniquely: two different distributions might have the same moments. Existence of the mgf in a neighborhood of 0 is the stronger condition that yields uniqueness.

Theorem (determination of distribution) Let \(F_X(x), F_Y(y)\) be two cdfs all of whose moments exist. If the moment generating functions exist and \(M_X(t)=M_Y(t)\) for all \(t\) in some neighborhood of 0, then

\[F_X(u) = F_Y(u)\]

Theorem (convergence of mgfs) Convergence for \(|t| < h\) of mgfs to an mgf implies convergence of cdfs (i.e. convergence in distribution)

While moment generating functions might not always exist, the characteristic function always exists and also characterizes the distribution

Definition (characteristic function) The characteristic function for a random variable is defined as

\[\varphi_X(t) = E(e^{itX})\]

5. Multivariate Models

The probability models that involve more than one random variable are called multivariate models.

Definition (n-dimensional random vector) An n-dimensional random vector is a function from a sample space \(S\) into \(R^n\), n-dimensional Euclidean space.

5.1. Joint and Marginal Distributions

The random vector is called a discrete random vector when it has only a countable number of possible values, otherwise it is called a continuous random vector.

Definition (joint PMF) Let \((X,Y)\) be a discrete bivariate random vector. Then the function \(f(x,y): R^2 \to R\) defined by \(f(x,y) = P(X=x, Y=y)\) is called the joint probability mass function or joint pmf of \((X,Y)\).

The joint pmf can be used to compute the probability of any event.

\[P((X,Y) \in A) = \sum_{(x,y) \in A} f(x,y)\]

Definition (marginal PMF) Let \((X,Y)\) be a discrete bivariate random vector with joint pmf \(f_{X,Y}(x,y)\). Then the marginal pmf of \(X\), \(f_X(x) = P(X = x)\), is given by

\[f_X(x) = \sum_{y \in R} f_{X,Y}(x,y)\]
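A tiny sketch of marginalization (the 2×3 joint pmf below is made up): marginals are just row and column sums of the joint table.

```python
import numpy as np

# f[i, j] = P(X = x_i, Y = y_j), a hypothetical joint pmf
f = np.array([[0.10, 0.20, 0.10],
              [0.30, 0.15, 0.15]])
assert abs(f.sum() - 1) < 1e-12

f_X = f.sum(axis=1)  # marginal pmf of X: sum over y
f_Y = f.sum(axis=0)  # marginal pmf of Y: sum over x
print(f_X)  # [0.4 0.6]
print(f_Y)  # [0.4  0.35 0.25]
```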

Definition (joint PDF) A function \(f(x,y): R^2 \to R\) is called a joint probability density function or joint pdf of the continuous bivariate random vector \((X,Y)\) if for every event \(A \subset R^2\),

\[P((X,Y) \in A) = \iint_A f_{XY}(x,y) dxdy\]

Definition (marginal PDF) The marginal probability density functions of \(X\) and \(Y\) are defined as in the discrete case, with integrals replacing sums.

\[f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x,y) dy\]

Definition (joint CDF) The joint distribution of \((X,Y)\) can also be completely described with the joint cdf

\[F_{XY}(x,y) = P(X \leq x, Y \leq y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f_{XY}(u, v) du dv\]

5.2. Conditioning and Independence

Oftentimes when two random variables \((X,Y)\) are observed, the values of the two variables are related. Knowledge about the value of \(X\) gives us some information about the value of \(Y\).

Definition (conditional pmf, pdf) Let \((X,Y)\) be a discrete/continuous bivariate random vector with joint pmf/pdf \(f(x,y)\) and marginal pmfs/pdfs \(f_X(x), f_Y(y)\), the conditional pmf/pdf of \(Y\) given that \(X=x\) is the function of \(y\) denoted by \(f(y|x)\)

\[ f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_X(x)}\]

Note that this is a valid probability with respect to \(y\).

Since \(Y|X=x\) is a valid random variable, we can compute the expectation of any function \(g(Y)\)

Definition (Conditional expectation) The conditional expected value of \(g(Y)\) given that \(X=x\) is denoted by \(E(g(Y)|x)\)

\[E(g(Y)|x) = \int g(y) f(y|x) dy\]

Note that this is a function of \(x\)

Similarly, we can compute the conditional variance of \(Y|x\).

\[Var(Y|x) = E(Y^2|x) - (E(Y|x))^2\]

The conditional distribution of \(Y\) given \(X=x\) is possibly a different probability distribution for each \(x\); therefore we have a family of probability distributions for \(Y\), one for each \(x\). When we wish to describe this entire family, we use the phrase "the distribution of \(Y|X\)".

In some situations, the knowledge that \(X=x\) does not give us any information about \(Y\), this relationship is called independence.

Definition (independence) Let \((X,Y)\) be a bivariate random vector with joint pdf or pmf \(f(x,y)\) and marginal pdfs or pmfs \(f_X(x), f_Y(y)\). Then \(X,Y\) are called independent random variables if for every \(x, y \in R\)

\[f(x, y) = f_X(x) f_Y(y)\]

If they are independent

\[f(y|x) = f_Y(y)\]

To check that two random variables are independent, one way is to check all \(x, y \in R\) combinations. This requires knowledge of \(f_X(x), f_Y(y)\), which is sometimes difficult to obtain.

Criterion (joint pdf factorization) Another good criterion is to check whether the joint distribution \(f(x,y)\) can be factored into two components as follows (the factorization must hold for all \(x, y\), which forces the support to be a product set)

\[f(x,y) = g(x) h(y)\]

Independence can simplify computations as follows

Theorem (independent computing) Suppose that \(X,Y\) are independent random variables; then events defined from them are also independent, which means

\[P(X \in A, Y \in B) = P(X \in A) P(Y \in B)\]

The expectation can also be factorized into respective components

\[E(g(X)h(Y)) = E(g(X)) E(h(Y))\]

Additionally, functions of independent random variables are independent:

Theorem Let \(X,Y\) be independent random variables. Let \(g(x)\) be a function only of \(x\) and \(h(y)\) be a function only of \(y\). Then the random variables \(U,V\) are independent, where

\[U=g(X), V=h(Y)\]

Proposition (law of total probability)

\[P(A) = \int_{-\infty}^{\infty} P(A|X=x) f_X(x) dx\]

Proposition (two continuous random variables)

\[E[g(X,Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x,y) f_{XY}(x,y) dx dy\]

5.3. Bivariate Transformation

The bivariate transformation is a generalization of the previous single-variable transformation. It is also a variant of multivariate integration by substitution (change of variables), for example,

\[\iint f(x,y)\, dx dy = \iint f(g(u, v))\, |J|\, du dv\]

Theorem (the method of transformations) Let \(X, Y\) be two jointly continuous random variables. Let \((Z,W) = g(X,Y) = (g_1(X,Y), g_2(X,Y))\) where \(g: R^2 \to R^2\) is a continuous invertible function with continuous partial derivatives. Let \(h = g^{-1}\), i.e., \((X,Y) = h(Z,W) = (h_1(Z,W), h_2(Z,W))\). Then \(Z,W\) are jointly continuous and their joint PDF \(f_{ZW}(z,w)\) is given by

\[f_{ZW}(z,w) = f_{XY}(h_1(z,w), h_2(z,w)) |J|\]

where \(J\) is the Jacobian determinant of \(h\)

This can be used to compute the distributions of functions of several variables, such as \(X+Y, XY\)

X+Y

Let \(X,Y\) be random variables having joint density \(f(x,y)\); then the density function of \(U=X+Y\) is

\[g(u) = \int_{-\infty}^{\infty} f(v, u-v)dv\]

by using the linear transformation \(U=X+Y, V=X\)

When \(X,Y\) are independent variables with density functions \(f_1, f_2\) respectively, it becomes

\[g(u) = \int_{-\infty}^{\infty} f_1(v)f_2(u-v)dv = (f_1 * f_2)(u)\]

which is the convolution of \(f_1, f_2\)
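A numeric sketch of this convolution (my example: \(X, Y \sim \text{Uniform}(0,1)\) independent, whose sum has the triangular density \(g(u) = u\) on \([0,1]\) and \(2-u\) on \([1,2]\)):

```python
import numpy as np

dx = 0.001
grid = np.arange(0, 1, dx)
f1 = np.ones_like(grid)  # Uniform(0,1) density on its support
f2 = np.ones_like(grid)

g = np.convolve(f1, f2) * dx       # discretized ∫ f1(v) f2(u-v) dv
u = np.arange(len(g)) * dx
print(g[np.searchsorted(u, 0.5)])  # ≈ 0.5 = g(0.5)
print(g[np.searchsorted(u, 1.0)])  # ≈ 1.0 = g(1.0)
```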

Proposition (mgf of independent random variables) Let \(X,Y\) be independent random variables with moment generating functions \(M_X(t), M_Y(t)\). Then the moment generating function of the random variable \(Z = X+Y\) is given by

\[M_Z(t) = M_X(t) M_Y(t)\]

Laplace transform

Recall that the moment generating function is a kind of Laplace transform, and the Laplace transform converts convolution into multiplication

\[\mathcal{L}(f(t) * g(t)) = F(s)G(s)\]

In probability, the density of \(Z=X+Y\) is a convolution of densities, so the multiplication of moment generating functions makes sense.

5.4. Hierarchical/Mixture Models

The advantage of hierarchical models is that a complicated process may be modeled by a sequence of relatively simple models.

Definition (mixture model) A random variable \(X\) is said to have a mixture distribution if the distribution of \(X\) depends on a quantity that also has a distribution.

Recall \(E(X|y)\) is a function of \(y\) and \(E(X|Y)\) is a random variable whose value depends on \(Y\) (this is similar to the single variable transformation such as \(Y \to Y^2\))

Proposition (law of total expectation) If \(X,Y\) are any two random variables

\[ E[Y] = \int_{-\infty}^{\infty} E[Y|X=x] f_X(x) dx \]
\[ E[Y] = E[E[Y|X]] \]

application of the law of total expectation

Suppose we have two random variables \(X,Y\) where

\[X|Y \sim binomial(Y,p)\]
\[Y \sim Poisson(\lambda)\]

We can compute EX as follows

\[EX = E[E[X|Y]] = E[pY] = p\lambda\]
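A simulation sketch of this hierarchical example (the parameter values \(\lambda = 4, p = 0.3\) are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, p, reps = 4.0, 0.3, 1_000_000

Y = rng.poisson(lam, size=reps)  # Y ~ Poisson(lambda)
X = rng.binomial(Y, p)           # X | Y ~ Binomial(Y, p), elementwise

print(X.mean(), p * lam)         # both ≈ 1.2, matching E X = p * lambda
```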

Similarly we can expand the variance with respect to the other random variable.

Proposition (law of total variance)

\[Var(Y) = E[Var(Y|X)] + Var(E[Y|X])\]

proof Let \(V=Var(Y|X), Z=E(Y|X)\); then \(V=E(Y^2|X)-Z^2\). Taking expectations on both sides and using the law of total expectation, \(EV = EY^2-EZ^2\). Also, \(Var(Z) = EZ^2 - (EZ)^2 = EZ^2 - (EY)^2\). Adding the two equations gives \(EV + Var(Z) = EY^2 - (EY)^2 = Var(Y)\), the target formula.

There is an interesting interpretation in the Bayesian statistics, when \(Y=\theta\)

\[Var(\theta) = E[Var(\theta|X)] + Var(E(\theta|X))\]

This implies

\[ E[Var(\theta|X)] \leq Var(\theta)\]

which means that, on average, the posterior variance of \(\theta\) given the dataset \(X\) is smaller than the prior variance.

law of total variance

Consider the following discrete joint distribution

\[\begin{array}{c|cc} & Y=0 & Y=1 \\ \hline X=0 & 1/5 & 2/5 \\ X=1 & 2/5 & 0 \end{array}\]

we can easily find that \(Var(E(X|Y)) = 8/75, E(Var(X|Y)) = 2/15, Var(X) = 6/25\) which satisfies the law of total variance.
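The arithmetic can be verified exactly (a sketch with exact fractions, encoding the table above):

```python
from fractions import Fraction as F

# joint[x][y] = P(X = x, Y = y), from the table above
joint = {0: {0: F(1, 5), 1: F(2, 5)},
         1: {0: F(2, 5), 1: F(0)}}

pY = {y: sum(joint[x][y] for x in joint) for y in (0, 1)}
EX = sum(x * joint[x][y] for x in joint for y in (0, 1))

E_X_given_Y = {y: sum(x * joint[x][y] for x in joint) / pY[y] for y in (0, 1)}
Var_X_given_Y = {y: sum(x * x * joint[x][y] for x in joint) / pY[y]
                    - E_X_given_Y[y] ** 2 for y in (0, 1)}

var_of_mean = sum(pY[y] * E_X_given_Y[y] ** 2 for y in (0, 1)) - EX ** 2
mean_of_var = sum(pY[y] * Var_X_given_Y[y] for y in (0, 1))

print(var_of_mean)                # 8/75 = Var(E(X|Y))
print(mean_of_var)                # 2/15 = E(Var(X|Y))
print(var_of_mean + mean_of_var)  # 6/25 = Var(X)
```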

5.5. Covariance and Correlation

The covariance and correlation measure the strength of a kind of linear relationship.

Definition (covariance) The covariance between \(X,Y\) is defined as

\[\text{Cov}(X,Y) = E[(X - EX)(Y-EY)]\]

It can be simplified by

\[ \text{Cov}(X,Y) = E[XY] - (EX)(EY)\]

Definition (correlation coefficient) The correlation of \(X,Y\) is the number defined by

\[\rho_{XY} = \frac{Cov(X,Y)}{\sigma_X \sigma_Y}\]

If we define \(U,V\) as

\[U = \frac{X - EX}{\sigma_X}, V = \frac{Y-EY}{\sigma_Y}\]

then

\[\rho_{XY} = Cov(U,V)\]

Lemma (properties of covariance)

  • \(Cov(X,X) = Var(X)\)
  • If \(X, Y\) are independent then \(Cov(X,Y)=0\)
  • \(Cov(X,Y) = Cov(Y,X)\)
  • \(Cov(aX,Y) = aCov(X,Y)\)
  • \(Cov(X+c, Y)=Cov(X,Y)\)
  • \(Cov(X+Y, Z)=Cov(X,Z)+Cov(Y,Z)\)

We can summarize them into the bilinearity property:

\[Cov(\sum_i a_iX_i, \sum_j b_j Y_j) = \sum_i \sum_j a_i b_j Cov(X_i, Y_j)\]

Proposition (independence and covariance) If \(X,Y\) are independent random variables, then \(\text{Cov}(X,Y) = 0\)

Proof When \(X,Y\) are independent \(\text{Cov}(X,Y) = EXY - EXEY = EXEY - EXEY = 0\)

However, the converse is not always true, i.e. zero covariance does not necessarily mean independence. In some special cases it is true, though (see C&B Lemma 5.3.3).

Covariance and correlation measure only a particular kind of linear relationship. To measure the general independence relation, use mutual information instead.

\[I(X;Y) = KL[p(x,y) \| p(x)p(y)]\]

If \(I(X;Y) = 0\), then \(X,Y\) are independent.

discrete \((X,Y)\) has covariance 0 but dependent

Consider a random variable \(X\) taking \(\pm 1\) with probability 0.5 each, and let \(Y = 0\) when \(X=-1\) and \(Y = \pm 1\) (with probability 0.5 each) when \(X=1\).

It is clearly dependent but \(Cov(X,Y) = 0\)

continuous \((X,Y)\) has covariance 0 but dependent

Consider random variables \(X \sim N(0, 1), Y=X^2\); they are obviously dependent, but

\[Cov(X,Y) = EXY - EXEY = EX^3 = 0\]

However, there are some limited cases where zero covariance implies independence

Proposition Let \(X_j \sim N(\mu_j, \sigma^2_j)\) be independent. For constants \(a_j, b_j\), define

\[U = \sum_j a_j X_j\]
\[V = \sum_j b_j X_j\]

The random variables \(U, V\) are independent iff \(Cov(U, V)=0\)

Proposition (variance of a sum)

\[Var(aX+bY) = a^2Var(X) + b^2Var(Y) + 2abCov(X,Y)\]

If \(X,Y\) are independent (or uncorrelated)

\[Var(aX+bY) = a^2Var(X)+b^2Var(Y)\]

Proposition (properties of correlation coefficient)

\[ -1 \leq \rho(X,Y) \leq 1\]
\[\rho(aX+b, cY+d) = \rho(X,Y) \quad (a, c > 0)\]

5.6. Multivariable Models

Definition (multivariable random vector) The random vector \(X=(X_1, ..., X_n)\) has a sample space that is a subset of \(R^n\). If \((X_1, ..., X_n)\) is a discrete random vector (sample space is countable), the joint pmf of \((X_1, X_2, ..., X_n)\) is defined by

\[f(x) = f(x_1, ..., x_n) = P(X_1=x_1, ..., X_n=x_n)\]

then for any \(A \subset R^n\)

\[P(X \in A) = \sum_{x \in A} f(x)\]

If \((X_1, ..., X_n)\) is a continuous random vector,

\[P(X \in A) = \int_A f(x_1, ..., x_n) dx_1 ... dx_n\]

Definition (expected value) Let \(g(x) = g(x_1, ..., x_n)\) be a real-valued function defined on the sample space of \(X\). Then \(g(X)\) is a random variable and the expected value of \(g(X)\) is

\[Eg(X) = \int g(x) f(x) dx\]

Definition (marginal pdf)

\[f(x_1, ..., x_k) = \int f(x_1, ..., x_n) dx_{k+1} ... dx_n\]

Definition (conditional pdf)

\[f(x_{k+1}, ..., x_n | x_1, ..., x_k) = \frac{f(x_1, ..., x_n)}{f(x_1, ..., x_k)}\]

Definition (mutually independent random vectors) Let \(X_1, ..., X_n\) be random vectors with joint pdf/pmf \(f(x_1, ..., x_n)\). Let \(f_{X_i}(x_i)\) denote the marginal pdf/pmf of \(X_i\); then \(X_1, ..., X_n\) are called mutually independent random vectors if for every \(x_1, ..., x_n\)

\[f(x_1, ..., x_n) = f_{X_1}(x_1) \cdot ... \cdot f_{X_n}(x_n) = \prod_{i=1}^n f_{X_i}(x_i)\]

Definition (variance-covariance matrix) The variance-covariance matrix is represented as

\[\mathcal{\Sigma} = Cov(\mathbf{X}) = E\mathbf{(X-\mu)(X-\mu)^T}\]

It is a symmetric matrix.

If we partition \(\mathbf{X}\) into two groups \(\mathbf{X}^{(1)}, \mathbf{X}^{(2)}\), then the variance-covariance matrix can also be partitioned into components

\[\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22} \\ \end{bmatrix}\]

where \(\Sigma_{12} = Cov(\mathbf{X}^{(1)}, \mathbf{X}^{(2)})\)

Lemma (linear combinations) Suppose \(\mathbf{Z = CX}\) (e.g: \(Z_1 = c_{1,1}X_1 + ... + c_{1, p}X_p\))

then

\[E[\mathbf{Z}] = E[\mathbf{CX}] = CE[\mathbf{X}]\]
\[\mathbf{\Sigma_Z} = Cov(\mathbf{Z}) = \mathbf{C\Sigma_X C^T}\]
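An empirical sketch of these identities (the matrices \(C\) and \(\Sigma\) below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
C = np.array([[1.0, 1.0],
              [1.0, -1.0]])

X = rng.multivariate_normal(mean=[0, 0], cov=Sigma, size=1_000_000)
Z = X @ C.T             # each row is z = C x

print(np.cov(Z.T))      # empirical Cov(Z)
print(C @ Sigma @ C.T)  # theoretical C Sigma C^T, should match closely
```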

Definition (identifiability) The parameterization \(\theta \in \Theta\) is identifiable if \(Y_1 \sim P_{\theta_1}, Y_2 \sim P_{\theta_2}\) and \(Y_1 \sim Y_2\) imply that \(\theta_1 = \theta_2\)

6. Asymptotics

Convergence concepts are useful for approximating finite-sample behavior because expressions simplify when taking limits. The relations between the convergence modes are as follows:

\[\text{almost sure convergence} \subset \text{convergence in probability} \subset \text{convergence in distribution}\]

6.1. Almost Sure Convergence

Almost sure convergence is similar to the pointwise convergence \(\lim X_n = X\), except that convergence may fail on a set of measure 0

\[P(\lim X_n = X) = 1\]

Let the sample space \(S\) have elements denoted by \(s\); then \(X_n(s)\) and \(X(s)\) are functions defined on \(S\). This convergence says that \(X_n\) converges to \(X\) almost surely if the functions \(X_n(s)\) converge to \(X(s)\) for all \(s \in S\) except on a set of measure 0.

Definition (almost sure convergence) A sequence of random variables, \(X_1,X_2, ...\) converges almost surely to a random variable \(X\) if, for every \(\epsilon > 0\)

\[P(\lim_{n \to \infty} |X_n - X| \leq \epsilon) = 1\]

Formally, the almost sure convergence is defined as follows:

There is a set \(\Omega_0\) of probability mass \(1\) (\(P(\Omega_0)=1\)) such that for any \(\omega \in \Omega_0\) and for any \(\epsilon > 0\), there exists an \(N(\omega, \epsilon)\) such that when \(n > N\)

\[|X_n(\omega) - X(\omega)| \leq \epsilon\]

Theorem (Strong Law of Large Numbers) Let \(X_1, X_2, ...\) be iid random variables with \(EX_i = \mu\) and \(Var(X_i) = \sigma^2 < \infty\), then for every \(\epsilon > 0\)

\[P(\lim_{n \to \infty} |\bar{X}_n - \mu| < \epsilon) = 1\]

6.2. Convergence in Probability

Definition (convergence in probability) A sequence of random variables \(X_1, X_2, ...\) converges in probability to a random variable \(X\) if for every \(\epsilon > 0\)

\[\lim_{n \to \infty} P(|X_n - X| \geq \epsilon) = 0\]

Note that the \(X_n\) are usually not iid random variables, and \(X\) is commonly a fixed number. The most famous result is the following one:

Theorem (Weak Law of Large Numbers) Let \(X_1, X_2, ...\) be iid random variables with \(EX_i=\mu, Var(X_i) = \sigma^2 < \infty\). Then for every \(\epsilon > 0\)

\[\lim_{n \to \infty} P(|\bar{X}_n - \mu|<\epsilon) = 1\]

that is, \(\bar{X}_n\) converges in probability to \(\mu\)

This theorem can be proved using Chebyshev's inequality

\[P(|\bar{X}_n - \mu| \geq \epsilon) \leq \frac{Var(\bar{X})}{\epsilon^2} = \frac{\sigma^2}{n \epsilon^2} \to 0\]
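A simulation sketch of the WLLN and the Chebyshev bound (my choices: \(X_i \sim \text{Exp}(1)\), so \(\mu = \sigma = 1\), with \(\epsilon = 0.1\)):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, eps, reps = 1.0, 1.0, 0.1, 20_000

for n in [10, 100, 1000]:
    xbar = rng.exponential(mu, size=(reps, n)).mean(axis=1)
    p_hat = np.mean(np.abs(xbar - mu) >= eps)  # estimate of P(|xbar - mu| >= eps)
    bound = sigma**2 / (n * eps**2)            # Chebyshev bound
    print(n, p_hat, bound)                     # p_hat shrinks toward 0
```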

Convergence in probability is highly related to the concept of consistency in statistics. Suppose we have an estimator \(\hat{\theta}\) for some quantity \(\theta\); \(\hat{\theta}\) is said to be consistent if it converges to \(\theta\) in probability.

One related useful theorem about consistency is the following one

Theorem (consistency preserved by continuous function) Suppose \(X_1, X_2, ...\) converges in probability to a random variable \(X\) and that \(h\) is a continuous function. Then \(h(X_1), h(X_2), ...\) converges in probability to \(h(X)\)

Among those convergence concepts, convergence in distribution is the weakest form.

6.3. Convergence in Quadratic Mean

Definition (convergence in quadratic mean) Sometimes showing a stronger form of convergence is useful for proving convergence in probability. \(X_n\) converges to \(X\) in quadratic mean if \(\lim_{n \to \infty} E(X_n - X)^2 = 0\)

Convergence in quadratic mean implies convergence in probability because

\[P(|X_n - X| \geq \epsilon) \leq \frac{E(X_n - X)^2}{\epsilon^2} \to 0\]

Intuitively, quadratic mean convergence penalizes deviations by their square while convergence in probability penalizes deviations by the absolute form; therefore quadratic mean is a stronger form.

p convergence does not imply qm convergence

Consider the random variable \(X_n = \sqrt{n} \mathbf{1}_{[0, 1/n]}(U)\) where \(U\) is uniform on \([0,1]\); \(X_n\) converges to 0 in probability

\[P(|X_n| \geq \epsilon) \leq 1/n \to 0\]

However, the quadratic mean is not zero

\[E(X_n-X)^2 = EX_n^2 = nP(U \in [0, 1/n]) = 1\]

6.4. Convergence in Distribution

Definition (convergence in distribution) A sequence of random variables \(X_1, X_2, ...\) converges in distribution to a random variable \(X\) if

\[\lim_{n \to \infty} F_{X_n}(x) = F_X(x)\]

at all points where \(F_X(x)\) is continuous

6.5. Central Limit Theorem

Theorem (Central Limit Theorem, classical CLT) Let \(X_1, X_2, ...\) be an iid sample whose mgfs exist in a neighborhood of \(0\). Let \(EX_i = \mu, Var(X_i) = \sigma^2\) and

\[G_n = \frac{\sqrt{n}(\bar{X}-\mu)}{\sigma}\]

then \(G_n\) converges in distribution to the standard normal distribution

Instead of the true variance \(\sigma^2\), we can use the estimated variance \(S^2\); the CLT still holds, by Slutsky's theorem and the continuous mapping theorem

\[\frac{\sqrt{n}(\bar{X} - \mu)}{S} \to N(0, 1)\]
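A simulation sketch of the CLT (my choice: \(X_i \sim \text{Exp}(1)\), so \(\mu = \sigma = 1\)); the empirical cdf of \(G_n\) should be close to \(\Phi\):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, reps = 200, 100_000
mu = sigma = 1.0                      # Exp(1) has mean = sd = 1

xbar = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
G = np.sqrt(n) * (xbar - mu) / sigma  # G_n from the theorem

for x in [-1.0, 0.0, 1.0]:
    print(x, np.mean(G <= x), stats.norm.cdf(x))  # empirical vs Phi(x)
```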

When the \(X_i\) are independent but not identically distributed, if the moments of some order \(2+\delta\) are bounded, then we still have a CLT

Theorem (Lyapunov CLT) Suppose \(X_1, ..., X_n\) is a sequence of independent random variables with mean \(\mu_i\) and variance \(\sigma_i^2\). Let \(s^2_n = \sum_i \sigma^2_i\), if for some \(\delta > 0\), the Lyapunov condition is satisfied:

\[\lim_n \frac{1}{s_n^{2+\delta}} \sum_i E|X_i - \mu_i|^{2+\delta} = 0\]

then we have the CLT

\[\frac{\sum (X_i - \mu_i)}{s_n} \to N(0, 1)\]

Note there is a related, weaker condition (the Lindeberg condition) that also yields a CLT

Theorem (Multivariate CLT) If \(X_1, ..., X_n\) are iid random vectors with mean \(\mu\) and covariance matrix \(\Sigma\), and \(\hat{\mu}\) is the sample mean, then

\[\sqrt{n}(\hat{\mu} - \mu) \to N(0, \Sigma)\]

The rate of CLT convergence is roughly at \(1/\sqrt{n}\)

Theorem (Berry-Esseen) Suppose \(X_1, ..., X_n\) are iid with mean \(\mu\), variance \(\sigma^2\), and third absolute central moment \(\mu_3 = E|X - \mu|^3\). Let

\[F_n(x) = P(\frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} < x)\]

then

\[\sup_x | F_n(x) - \Phi(x) | \leq \frac{9\mu_3}{\sigma^3\sqrt{n}}\]

6.6. Delta Method

Suppose the standardized mean of a sequence of random variables converges to a normal distribution; we can then characterize the limiting distribution of \(g(\bar{X})\), where \(g\) is a smooth function

Theorem (delta method) Suppose

\[\frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \to N(0, 1)\]

Then

\[\frac{\sqrt{n}(g(\bar{X}) - g(\mu))}{\sigma} \to N(0, [g'(\mu)]^2)\]
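A simulation sketch of the delta method (my choices: \(X_i \sim \text{Exp}(1)\) so \(\mu = \sigma = 1\), and \(g(x) = x^2\) so \([g'(\mu)]^2 = 4\)):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 500, 100_000

xbar = rng.exponential(1.0, size=(reps, n)).mean(axis=1)
T = np.sqrt(n) * (xbar**2 - 1.0)  # sqrt(n) (g(xbar) - g(mu)) with g(x) = x^2

print(T.var())                    # ≈ 4 = [g'(1)]^2, the delta-method variance
```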

7. Reference

  • [1] Pishro-Nik, Hossein. "Introduction to probability, statistics, and random processes." (2016).
  • [2] Casella, George, and Roger L. Berger. Statistical inference. Vol. 2. Pacific Grove, CA: Duxbury, 2002.
  • [3] Axler, Sheldon. Measure, Integration & Real Analysis. Springer Nature, 2020.
  • [4] Çınlar, Erhan. Probability and stochastics. Vol. 261. Springer Science & Business Media, 2011.