# 0x040 Probability

- 1. Measure Theory
- 2. Law of Large Numbers
- 3. Central Limit Theorems
- 4. Univariate models
- 5. Multivariate Models
- 6. Asymptotics
- 7. Reference

This note is a mixture of measure-based and non-measure-based probability.

## 1. Measure Theory

Details of measure theory are described in the real analysis note.

### 1.1. Probability Space

**Definition (sample space)** \(\Omega\) is called the sample space. Intuitively, it is the set of possible outcomes of an experiment.

Unfortunately, we cannot in general assign a measure to every subset of the sample space. We therefore only consider the event space, which is a collection of subsets of \(\Omega\).

**Definition (event space)** \(\mathcal{F}\) is called the event space when \(\mathcal{F}\) is a \(\sigma\)-algebra on a set \(\Omega\). An event is an element of \(\mathcal{F}\)

On this proper event space, we can define the probability measure

**Definition (probability measure)** A probability measure on \((\Omega, \mathcal{F})\) is a measure \(P\) on \((\Omega, \mathcal{F})\) such that \(P(\Omega)=1\). If \(A\) is an event, then \(P(A)\) is called the probability of \(A\)

Recall the measure has the following properties, let \(\mu\) be a measure on \((\Omega, \mathcal{F})\)

- (monotonicity) If \(A \subset B\), then \(\mu(A) \leq \mu(B)\)
- (subadditivity) If \(A \subset \cup_{m=1}^{\infty} A_m\), then \(\mu(A) \leq \sum_{m=1}^\infty \mu(A_m)\)
- (continuity from below) If \(A_i \uparrow A\) (i.e. \(A_1 \subset A_2 \subset \dots\) and \(\cup_i A_i = A\)), then \(\mu(A_i ) \uparrow \mu(A)\)
- (continuity from above) If \(A_i \downarrow A\) (i.e. \(A_1 \supset A_2 \supset \dots\) and \(\cap_i A_i = A\)) with \(\mu(A_1) < \infty\), then \(\mu(A_i) \downarrow \mu(A)\)

With these three components, we can define the probability space

**Definition (probability space, Kolmogorov)** If \(P\) is a probability measure on \((\Omega, \mathcal{F})\), then the triplet \((\Omega, \mathcal{F}, P)\) is called a probability space.

discrete probability space

Let \(\Omega\) be a countable set, \(\mathcal{F}\) be the set of all subsets of \(\Omega\) and

$$P(A) = \sum_{\omega \in A} p(\omega)$$

where \(p(\omega) \geq 0\) and \(\sum_{\omega \in \Omega} p(\omega) = 1\)
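As a minimal sketch of a discrete probability space, the snippet below models a fair six-sided die. The die, the names `omega`, `p` and `prob`, and the chosen event are illustrative assumptions, not part of the note.

```python
from fractions import Fraction

# Hypothetical example: a fair six-sided die as a discrete probability space.
omega = [1, 2, 3, 4, 5, 6]                  # sample space
p = {w: Fraction(1, 6) for w in omega}      # point masses p(w), summing to 1

def prob(event):
    """P(A) = sum of p(w) over the outcomes w in the event A (a subset of omega)."""
    return sum((p[w] for w in event), Fraction(0))

even = {2, 4, 6}
print(prob(omega))  # total mass of the sample space
print(prob(even))   # probability of rolling an even number
```

Exact fractions are used so that \(P(\Omega) = 1\) holds without floating-point error.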

Frequentist and Bayesian interpretation

There are two common interpretations of \(P(A)\): frequencies and degrees of belief (Bayesian).

- The frequentist interpretation says that \(P(A)\) is the long-run proportion of times that \(A\) is true in repetitions.
- The degree of belief interpretation says that \(P(A)\) is the observer's strength of belief that \(A\) is true.

Note that probability in quantum mechanics probably does not belong to either of these interpretations.

### 1.2. Distribution

random variable is neither random nor a variable

**Definition (Random Variable)** Suppose \((\Omega, \mathcal{F}, P)\) is a probability space. A random variable on \((\Omega, \mathcal{F})\) is a measurable function from \(\Omega\) to \(R\).

Intuitively, a random variable is a function from the sample space to another sample space (e.g. \(R\)). Note that a random variable can also be defined to map into a measurable space other than \((R, B)\).

trivial random variable

If \(\Omega\) is a discrete probability space, then any function \(X: \Omega \to R\) is a random variable

Another simple random variable is the indicator function of a set \(A \in \mathcal{F}\):

$$1_A(\omega) = \begin{cases} 1 & \omega \in A \\ 0 & \omega \notin A \end{cases}$$

**Definition (distribution)** If \(X\) is a random variable, then \(X\) induces a probability measure \(\mu\) on \(R\), called its distribution, by setting, for Borel sets \(A\),

$$\mu(A) = P(X \in A)$$

**Definition (distribution function)** The distribution of a random variable \(X\) is usually described by its distribution function

$$F(x) = P(X \leq x)$$

Every distribution function has the following properties, all of which follow from simple properties of measures.

- \(F\) is nondecreasing
- \(\lim_{x \to \infty} F(x) = 1, \lim_{x \to -\infty} F(x) = 0\)
- \(F\) is right-continuous: \(\lim_{y \downarrow x} F(y) = F(x)\)
- Let \(F(x-) = \lim_{y \uparrow x} F(y)\), then \(F(x-) = P(X < x)\)
- \(P(X=x) = F(x) - F(x-)\)

Conversely, if a function satisfies the first three properties, then it is the distribution function of some random variable.

### 1.3. Conditional Probability

All probabilities are calculated with respect to a sample space, but in many cases we are in a position to update the sample space with new information. In this case, we use conditional probability.

**Definition (Conditional Probability)** If \(A,B\) are two events in a sample space and if \(P(B) > 0\), then the conditional probability of \(A\) given \(B\) is

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

Note that \(B\) becomes the sample space here. In particular, for any fixed \(B\) such that \(P(B) > 0\), \(P( \cdot | B)\) is a probability measure (it satisfies Kolmogorov's three axioms of probability).

**prosecutor's fallacy**: the fallacy of confusing \(P(A|B)\) with \(P(B|A)\); in general \(P(A|B) \neq P(B|A)\)

**Lemma**: for any pair of events \(A\) and \(B\) with positive probability

$$P(A \cap B) = P(A|B)P(B) = P(B|A)P(A)$$

**Theorem (The Law of Total Probability)** Let \(A_1, ..., A_k\) be a partition of \(\Omega\) with \(P(A_i) > 0\) for each \(i\). Then for any event \(B\)

$$P(B) = \sum_{i=1}^{k} P(B|A_i)P(A_i)$$

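**Theorem (Bayes' Theorem)** Let \(A_1, ..., A_k\) be a partition of \(\Omega\) such that \(P(A_i) > 0\) for each \(i\). If \(P(B) > 0\), then for each \(i\)

$$P(A_i|B) = \frac{P(B|A_i)P(A_i)}{\sum_{j=1}^{k} P(B|A_j)P(A_j)}$$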
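The law of total probability and Bayes' theorem can be checked numerically. The sketch below uses made-up numbers for a diagnostic test (1% prevalence, 95% sensitivity, 5% false-positive rate); these values and the names `prior`, `likelihood`, `posterior` are assumptions chosen only for illustration.

```python
from fractions import Fraction

# Hypothetical partition {A_1, A_2} = {disease, healthy} with prior P(A_i).
prior = {"disease": Fraction(1, 100), "healthy": Fraction(99, 100)}
# Hypothetical P(B | A_i), where B is the event "the test is positive".
likelihood = {"disease": Fraction(95, 100), "healthy": Fraction(5, 100)}

# Law of total probability: P(B) = sum_i P(B | A_i) P(A_i)
p_b = sum(likelihood[a] * prior[a] for a in prior)

# Bayes' theorem: P(A_i | B) = P(B | A_i) P(A_i) / P(B)
posterior = {a: likelihood[a] * prior[a] / p_b for a in prior}

print(p_b)                   # P(test positive)
print(posterior["disease"])  # P(disease | test positive)
```

With these assumed numbers \(P(B|A) = 0.95\) while \(P(A|B) = 19/118 \approx 0.16\), which illustrates why confusing the two conditional probabilities (the prosecutor's fallacy) can be badly misleading.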

Independent Events

**Definition (Independence)** Two events \(A\) and \(B\) are independent if

$$P(A \cap B) = P(A)P(B)$$

Independence can arise in two distinct ways

- explicitly assume independence
- derive independence by verifying the previous definition

Note that disjoint events with positive probability are not independent.

Mutual independence is a much stronger assumption. Pairwise independence for all pairs does not imply mutual independence.

**Definition (mutual independence)** A collection of events \(A_1, ..., A_n\) is mutually independent iff for any subcollection \(A_{i_1}, ..., A_{i_k}\)

$$P\left(\bigcap_{j=1}^{k} A_{i_j}\right) = \prod_{j=1}^{k} P(A_{i_j})$$
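A small enumeration shows that pairwise independence does not imply mutual independence. The example is the classic two-coin construction; the events `A`, `B`, `C` and the helper names are assumptions for this sketch.

```python
from fractions import Fraction
from itertools import product

# Two fair coin flips: four equally likely outcomes such as ("H", "T").
omega = list(product("HT", repeat=2))

def prob(event):
    """Probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] == "H"    # first flip is heads
B = lambda w: w[1] == "H"    # second flip is heads
C = lambda w: w[0] != w[1]   # exactly one head

def both(*events):
    """Intersection of events."""
    return lambda w: all(e(w) for e in events)

# Every pair satisfies P(E1 and E2) = P(E1) P(E2) ...
pairwise = all(
    prob(both(e1, e2)) == prob(e1) * prob(e2)
    for e1, e2 in [(A, B), (A, C), (B, C)]
)
# ... but the triple does not factorize, since A, B and C cannot all occur.
mutual = prob(both(A, B, C)) == prob(A) * prob(B) * prob(C)

print(pairwise, mutual)
```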

### 1.4. Random variable

Note that a random variable can also be defined to map into a measurable space other than \((R, B)\).

**Definition ((more general) random variable)** Let \((E, \mathcal{E})\) be a measurable space. A mapping \(X: \Omega \to E\) is called a random variable if \(X\) is a measurable function with respect to \(\mathcal{F}\) and \(\mathcal{E}\), which means

$$X^{-1}(A) = \{\omega \in \Omega : X(\omega) \in A\} \in \mathcal{F} \quad \text{for every } A \in \mathcal{E}$$

**Definition (induced probability function)** The induced probability function of \(X\) with respect to the original probability function \(P\) is defined as

$$P_X(X \in A) = P(\{\omega \in \Omega : X(\omega) \in A\})$$

Note that this is a formal probability distribution, which means it satisfies Kolmogorov's axioms

Note that \(X\) is a discrete random variable if its range is countable

## 2. Law of Large Numbers

### 2.1. Independence

Measure theory ends and probability begins with the definition of independence.

**Definition (independence)**

- Two events \(A\), \(B\) are independent if \(P(A \cap B) = P(A) P(B)\)
- Two random variables \(X,Y\) are independent if for all Borel sets \(C, D \subset R\), \(P(X \in C, Y \in D) = P(X \in C)P(Y \in D)\)
- Two \(\sigma\)-fields \(\mathcal{F}, \mathcal{G}\) are independent if for all \(A \in \mathcal{F}, B \in \mathcal{G}\) the events \(A\) and \(B\) are independent

### 2.2. Weak Law of Large Numbers

### 2.3. Borel-Cantelli Lemmas

### 2.4. Strong Law of Large Numbers

## 3. Central Limit Theorems

## 4. Univariate models

### 4.1. Transformation

**Definition (transformation)** If \(X\) is a random variable and \(g\) is a Borel measurable function, then \(Y = g(X)\) is also a random variable, and the probability distribution of \(Y\) is defined by

$$P(Y \in A) = P(g(X) \in A) = P(X \in g^{-1}(A))$$

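**Corollary (transformation of support)** It is important to keep track of the sample spaces of \(X\) and \(Y\). If \(\mathcal{X} = \{x : f_X(x) > 0\}\) is the support of \(X\), the support of \(Y\) is

$$\mathcal{Y} = \{y : y = g(x) \text{ for some } x \in \mathcal{X}\}$$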

**Corollary (monotone transformation of cdf)** Let \(X\) have cdf \(F_X(x)\) and let \(Y=g(X)\)

if \(g\) is an increasing function, then

$$F_Y(y) = F_X(g^{-1}(y))$$

if \(g\) is a decreasing function and \(X\) is a continuous random variable, then

$$F_Y(y) = 1 - F_X(g^{-1}(y))$$

By taking the derivative of both sides, we obtain the transformation rules of the pdf for monotone functions.

Note that this is a variant of integration by substitution (derived from the fundamental theorem of calculus), where \(g^{-1} = \varphi\):

$$\int_{\varphi(a)}^{\varphi(b)} f(x)\,dx = \int_a^b f(\varphi(t))\varphi'(t)\,dt$$

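**Theorem (monotone transformation of pdf)** Let \(X\) have pdf \(f_X(x)\) and \(Y=g(X)\), where \(g\) is a monotone function. Suppose \(f_X(x)\) is continuous and \(g^{-1}(y)\) has a continuous derivative on \(\mathcal{Y}\), then

$$f_Y(y) = \begin{cases} f_X(g^{-1}(y)) \left| \frac{d}{dy} g^{-1}(y) \right| & y \in \mathcal{Y} \\ 0 & \text{otherwise} \end{cases}$$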

Intuitively, the discussion above is simply

$$f_Y(y)\,|dy| = f_X(x)\,|dx|$$

therefore, we get

$$f_Y(y) = f_X(x) \left| \frac{dx}{dy} \right|$$

Note that this only applies to monotone functions. For functions that are not monotone (e.g. \(Y=X^2\)), we need to partition the support of \(X\) into pieces \(X_i\) on each of which \(g\) is monotone, then sum the inverse densities \(f_X(g_i^{-1}(y))\), each with its weight \(\left|\frac{d g_i^{-1}(y)}{dy}\right|\).
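As a sketch of the non-monotone case, take \(X \sim N(0,1)\) and \(Y = X^2\). The two monotone branches are \(x = \sqrt{y}\) and \(x = -\sqrt{y}\), each with weight \(1/(2\sqrt{y})\). The code below (function names are assumptions for this example) sums the two branches and compares the result with the known chi-square density with one degree of freedom.

```python
import math

def normal_pdf(x):
    """Standard normal density f_X(x)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def pdf_of_square(y):
    """Density of Y = X^2 for X ~ N(0, 1), summing over the two monotone branches."""
    if y <= 0:
        return 0.0
    root = math.sqrt(y)
    weight = 1 / (2 * root)  # |d/dy g^{-1}(y)| for both x = +sqrt(y) and x = -sqrt(y)
    return (normal_pdf(root) + normal_pdf(-root)) * weight

def chi2_1_pdf(y):
    """Chi-square density with 1 degree of freedom, for comparison."""
    return math.exp(-y / 2) / math.sqrt(2 * math.pi * y)

for y in (0.1, 1.0, 2.5):
    print(y, pdf_of_square(y), chi2_1_pdf(y))
```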

### 4.2. Expectation and Variance

**Definition (expectation)** Formally, suppose \((\Omega, \mathcal{F}, P)\) is a probability space. If \(X \in \mathcal{L}^1 (P)\), then the expectation of the random variable \(X\) is denoted \(EX\) and defined by

$$EX = \int_\Omega X \, dP$$

When \(X\) is a discrete random variable with range \(R_X = \{ x_1, x_2, ... \}\) (finite or countably infinite), the expected value of \(X\), denoted by \(EX\), is defined as

$$EX = \sum_{x_k \in R_X} x_k P(X = x_k)$$

Note that the expectation does not exist for every distribution; for example, the Cauchy distribution does not have an expectation.

**Theorem (linearity of expectation)**
$$ E(aX + b) = aE(X) + b$$

**Theorem (expectation of transformation)** There are two ways to compute \(E[g(X)]\). One way is to compute the PMF of \(Y = g(X)\) first; the other (easier) way is to use

$$E[g(X)] = \sum_{x_k \in R_X} g(x_k) P(X = x_k)$$
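The two ways of computing \(E[g(X)]\) can be compared on a small example. The sketch below assumes \(X\) is uniform on \(\{-2,-1,0,1,2\}\) and \(g(x) = x^2\); both choices and all variable names are illustrative.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical example: X uniform on {-2, -1, 0, 1, 2}, g(x) = x^2.
pmf_x = {x: Fraction(1, 5) for x in (-2, -1, 0, 1, 2)}
g = lambda x: x * x

# Way 1: build the PMF of Y = g(X), then take the expectation of Y.
pmf_y = defaultdict(Fraction)
for x, px in pmf_x.items():
    pmf_y[g(x)] += px          # several x values can map to the same y
e_via_pmf_y = sum(y * py for y, py in pmf_y.items())

# Way 2: sum g(x) P(X = x) directly, without ever computing the PMF of Y.
e_direct = sum(g(x) * px for x, px in pmf_x.items())

print(e_via_pmf_y, e_direct)
```

The second way avoids grouping the outcomes of \(X\) by their image under \(g\), which is why it is usually easier.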

### 4.3. Moments

Moments reflect characteristics of a distribution. However, the infinite set of moments is in general not enough to characterize the distribution: two distinct random variables might have the same set of moments. For the moments to characterize the distribution, it suffices that both random variables have bounded support.

**Definition (moment, central moment)** For each integer \(n\), the \(n\)-th moment of \(X\), \(\mu'_n\), is

$$\mu'_n = EX^n$$

The \(n\)-th central moment of \(X\), \(\mu_n\), is

$$\mu_n = E(X - \mu)^n$$

where \(\mu = \mu'_1 = EX\).

The 2nd central moment is the variance, defined as follows

**Definition (Variance)** The variance of a random variable \(X\), with mean \(EX = \mu_X\), is defined as

$$Var(X) = E(X - \mu_X)^2$$

The standard deviation of a random variable \(X\) is defined as

$$SD(X) = \sigma_X = \sqrt{Var(X)}$$

A simple way to compute variance is as follows

**Lemma (relationship between moments)** The previous \(Var(X)\) can be written as

$$Var(X) = EX^2 - (EX)^2$$

The 3rd and 4th central moments have similar relationships (with \(\mu = EX\))

$$E(X-\mu)^3 = EX^3 - 3\mu EX^2 + 2\mu^3$$

$$E(X-\mu)^4 = EX^4 - 4\mu EX^3 + 6\mu^2 EX^2 - 3\mu^4$$

**Proposition (algebra of variance)** For constants \(a, b\)

$$Var(aX + b) = a^2 Var(X)$$

If \(X, Y\) are independent

$$Var(X + Y) = Var(X) + Var(Y)$$

**Definition (Standardized moment)** The standardized moment is the normalized central moment, defined as

$$\tilde{\mu}_n = \frac{E(X - \mu)^n}{\sigma^n}$$

The 3rd standardized moment is called the skewness, which measures the lack of symmetry

$$\tilde{\mu}_3 = \frac{E(X - \mu)^3}{\sigma^3}$$

The 4th standardized moment is called the kurtosis, which measures the peakedness of the pdf

$$\tilde{\mu}_4 = \frac{E(X - \mu)^4}{\sigma^4}$$

While the moments might not be sufficient to characterize a distribution, the following moment generating function does characterize the distribution if it exists

**Definition (moment generating function, mgf)** The moment generating function of \(X\), denoted by \(M_X(t)\), is

$$M_X(t) = Ee^{tX}$$

provided that the expectation exists for \(t\) in some neighborhood of 0.

Note: \(M_X(t)\) is the (two-sided) Laplace transform of \(f_X(x)\) up to the sign of \(t\), i.e. \(M_X(-t) = \int e^{-tx} f_X(x)\,dx\)

**Lemma (algebra over mgf)** For constants \(a, b\)

$$M_{aX+b}(t) = e^{bt}M_X(at)$$

The moment generating function is so called because it can be used to generate moments by differentiation.

**Theorem (moment generation)** If \(X\) has mgf \(M_X(t)\), it can generate moments as follows

$$EX^n = M_X^{(n)}(0) = \left.\frac{d^n}{dt^n}M_X(t)\right|_{t=0}$$
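Moment generation by differentiation can be checked numerically. The sketch below assumes \(X \sim \text{Poisson}(\lambda)\) with \(\lambda = 3\), whose mgf is \(\exp(\lambda(e^t - 1))\), and approximates the derivatives at 0 with central finite differences; the step size `h` and the rate are arbitrary choices for illustration.

```python
import math

lam = 3.0  # hypothetical Poisson rate

def mgf(t):
    """mgf of a Poisson(lam) random variable: exp(lam * (e^t - 1))."""
    return math.exp(lam * (math.exp(t) - 1))

h = 1e-4  # finite-difference step (an arbitrary small value)

# Central differences approximate M'(0) = EX and M''(0) = EX^2.
first_moment = (mgf(h) - mgf(-h)) / (2 * h)
second_moment = (mgf(h) - 2 * mgf(0.0) + mgf(-h)) / (h * h)
variance = second_moment - first_moment ** 2

print(first_moment)   # close to lam
print(second_moment)  # close to lam + lam**2
print(variance)       # close to lam
```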

Note that the main use of the mgf is not to generate moments but to characterize distributions. Knowing the infinite set of moments is, in general, not enough to determine a distribution uniquely: two different distributions might have the same moments.

To determine a distribution uniquely from its moments, we require bounded support.

**Theorem (determination of distribution)** Let \(F_X(x), F_Y(y)\) be two cdfs all of whose moments exist. If the moment generating functions exist and \(M_X(t)=M_Y(t)\) for all \(t\) in some neighborhood of 0, then

$$F_X(u) = F_Y(u) \quad \text{for all } u$$

**Theorem (convergence of mgfs)** Convergence of mgfs to an mgf for \(|t| < h\) implies convergence of the corresponding cdfs (at all continuity points of the limiting cdf)

While the moment generating function might not always exist, the characteristic function always exists and also characterizes the distribution of the random variable

**Definition (characteristic function)** The characteristic function for a random variable \(X\) is defined as

$$\varphi_X(t) = Ee^{itX}$$

## 5. Multivariate Models

The probability models that involve more than one random variable are called *multivariate models*.

**Definition (n-dimensional random vector)** An n-dimensional random vector is a function from a sample space \(S\) into \(R^n\), n-dimensional Euclidean space.

### 5.1. Joint and Marginal Distributions

The random vector is called a *discrete random vector* when it has only a countable number of possible values, otherwise it is called a *continuous random vector*.

**Definition (joint PMF)** Let \((X,Y)\) be a discrete bivariate random vector. Then the function \(f(x,y): R^2 \to R\) defined by \(f(x,y) = P(X=x, Y=y)\) is called the joint probability mass function or joint pmf of \((X,Y)\).

The joint pmf can be used to compute the probability of any event \(A \subset R^2\)

$$P((X,Y) \in A) = \sum_{(x,y) \in A} f(x,y)$$

**Definition (marginal PMF)** Let \((X,Y)\) be a discrete bivariate random vector with joint pmf \(f_{X,Y}(x,y)\). Then the marginal pmf of \(X\), \(f_X(x) = P(X = x)\), is given by

$$f_X(x) = \sum_{y} f_{X,Y}(x,y)$$

**Definition (joint PDF)** A function \(f(x,y): R^2 \to R\) is called a joint probability density function or joint pdf of the continuous bivariate random vector \((X,Y)\) if for every \(A \subset R^2\),

$$P((X,Y) \in A) = \iint_A f(x,y)\,dx\,dy$$

**Definition (marginal PDF)** The marginal probability density functions of \(X\) and \(Y\) are defined as in the discrete case, with integrals replacing sums

$$f_X(x) = \int_{-\infty}^{\infty} f(x,y)\,dy, \quad f_Y(y) = \int_{-\infty}^{\infty} f(x,y)\,dx$$

**Definition (joint CDF)** The joint distribution of \((X,Y)\) can also be completely described with the joint cdf

$$F(x,y) = P(X \leq x, Y \leq y)$$

### 5.2. Conditioning and Independence

Oftentimes when two random variables \((X,Y)\) are observed, the values of the two variables are related. Knowledge about the value of \(X\) gives us some information about the value of \(Y\).

**Definition (conditional pmf, pdf)** Let \((X,Y)\) be a discrete/continuous bivariate random vector with joint pmf/pdf \(f(x,y)\) and marginal pmfs/pdfs \(f_X(x), f_Y(y)\). For any \(x\) with \(f_X(x) > 0\), the conditional pmf/pdf of \(Y\) given that \(X=x\) is the function of \(y\) denoted by \(f(y|x)\)

$$f(y|x) = \frac{f(x,y)}{f_X(x)}$$

Note that this is a valid probability distribution with respect to \(y\).

Since \(Y|X=x\) is a valid random variable, we can compute the expectation of any function \(g(Y)\) of \(Y\)

**Definition (Conditional expectation)** The conditional expected value of \(g(Y)\) given that \(X=x\) is denoted by \(E(g(Y)|x)\); in the discrete and continuous cases respectively

$$E(g(Y)|x) = \sum_y g(y) f(y|x), \qquad E(g(Y)|x) = \int_{-\infty}^{\infty} g(y) f(y|x)\,dy$$

Note that this is a function of \(x\)

Similarly, we can compute the conditional variance of \(Y|x\).

$$Var(Y|x) = E(Y^2|x) - (E(Y|x))^2$$

The conditional distribution of \(Y\) given \(X=x\) is possibly a different probability distribution for each \(x\), so we have a family of probability distributions for \(Y\), one for each \(x\). When we wish to describe this entire family, we use the phrase "the distribution of \(Y|X\)".

In some situations, the knowledge that \(X=x\) does not give us any information about \(Y\), this relationship is called independence.

**Definition (independence)** Let \((X,Y)\) be a bivariate random vector with joint pdf or pmf \(f(x,y)\) and marginal pdfs or pmfs \(f_X(x), f_Y(y)\). Then \(X,Y\) are called independent random variables if for every \(x, y \in R\)

$$f(x,y) = f_X(x)f_Y(y)$$

If they are independent, the conditional distribution does not depend on \(x\)

$$f(y|x) = \frac{f(x,y)}{f_X(x)} = f_Y(y)$$

To check that two random variables are independent, one way is to check the definition for **all** \(x, y \in R\). This requires knowledge of \(f_X(x), f_Y(y)\), which is sometimes difficult to obtain.

**Criterion (joint pdf factorization)** Another good criterion is to check whether the joint distribution \(f(x,y)\) can be factorized into two components as follows

$$f(x,y) = g(x)h(y) \quad \text{for all } x, y \in R$$

Independence can simplify computations as follows

**Theorem (independent computing)** Suppose that \(X,Y\) are independent random variables, then events defined in terms of them are also independent, which means

$$P(X \in A, Y \in B) = P(X \in A)P(Y \in B)$$

The expectation can also be factorized into the respective components

$$E(g(X)h(Y)) = E(g(X))E(h(Y))$$

**Theorem** Let \(X,Y\) be independent random variables. Let \(g(x)\) be a function only of \(x\) and \(h(y)\) be a function only of \(y\). Then the random variables \(U=g(X)\) and \(V=h(Y)\) are independent

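**Proposition (law of total probability)** For any event \(A\) and random variable \(X\), in the discrete and continuous cases respectively

$$P(A) = \sum_x P(A|X=x)P(X=x), \qquad P(A) = \int_{-\infty}^{\infty} P(A|X=x)f_X(x)\,dx$$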

**Proposition (two continuous random variables)**

### 5.3. Bivariate Transformation

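The bivariate transformation is a generalization of the previous single-variable transformation. It is also a variant of multivariate integration by substitution (change of variables), for example,

$$\iint_{D} f(x,y)\,dx\,dy = \iint_{D'} f(x(u,v), y(u,v)) \left| \frac{\partial(x,y)}{\partial(u,v)} \right| du\,dv$$

where the map \((u,v) \mapsto (x,y)\) carries \(D'\) one-to-one onto \(D\).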

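**Theorem (the method of transformations)** Let \(X, Y\) be two jointly continuous random variables. Let \((Z,W) = g(X,Y) = (g_1(X,Y), g_2(X,Y))\) where \(g: R^2 \to R^2\) is a continuous invertible function with continuous partial derivatives. Let \(h = g^{-1}\), i.e., \((X,Y) = h(Z,W) = (h_1(Z,W), h_2(Z,W))\). Then \(Z,W\) are jointly continuous and their joint PDF \(f_{ZW}(z,w)\) is given by

$$f_{ZW}(z,w) = f_{XY}(h_1(z,w), h_2(z,w))\,|J|$$

where \(J\) is the Jacobian of \(h\)

$$J = \det \begin{bmatrix} \frac{\partial h_1}{\partial z} & \frac{\partial h_1}{\partial w} \\ \frac{\partial h_2}{\partial z} & \frac{\partial h_2}{\partial w} \end{bmatrix}$$

This can be used to compute the distribution of functions of several random variables, such as \(X+Y\) and \(XY\)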

X+Y

Let \(X,Y\) be random variables having joint density \(f(x,y)\), then the density function of \(U=X+Y\) is

$$f_U(u) = \int_{-\infty}^{\infty} f(v, u-v)\,dv$$

by using the linear transformation \(U=X+Y, V=X\)

When \(X,Y\) are independent random variables with density functions \(f_1, f_2\) respectively, it becomes

$$f_U(u) = \int_{-\infty}^{\infty} f_1(v) f_2(u-v)\,dv$$

which is the convolution of \(f_1, f_2\)
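The same convolution structure holds for discrete random variables, with the integral replaced by a sum. As a sketch of that discrete analogue (not the continuous formula itself), the code below convolves the pmfs of two independent fair dice; the dice and the function name `convolve` are assumptions for this example.

```python
from fractions import Fraction

# Hypothetical example: pmf of one fair six-sided die.
die = {k: Fraction(1, 6) for k in range(1, 7)}

def convolve(f1, f2):
    """pmf of X + Y for independent X ~ f1 and Y ~ f2 (discrete convolution)."""
    out = {}
    for x, px in f1.items():
        for y, py in f2.items():
            out[x + y] = out.get(x + y, Fraction(0)) + px * py
    return out

total = convolve(die, die)
for s in sorted(total):
    print(s, total[s])
```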

**Proposition (mgf of independent random variables)** Let \(X,Y\) be independent random variables with moment generating functions \(M_X(t), M_Y(t)\). Then the moment generating function of the random variable \(Z = X+Y\) is given by

$$M_Z(t) = M_X(t)M_Y(t)$$

Laplace transform

Recall that the moment generating function is a kind of Laplace transform, and the Laplace transform converts convolution into multiplication

$$\mathcal{L}\{f * g\} = \mathcal{L}\{f\} \cdot \mathcal{L}\{g\}$$

In probability, the density of \(Z=X+Y\) (for independent \(X, Y\)) is the convolution of the two densities, so the multiplication of moment generating functions makes sense.

### 5.4. Hierarchical/Mixture Models

The advantage of hierarchical models is that a complicated process can be modeled by a sequence of relatively simple models.

**Definition (mixture model)** A random variable \(X\) is said to have a mixture distribution if the distribution of \(X\) depends on a quantity that also has a distribution.

Recall \(E(X|y)\) is a function of \(y\) and \(E(X|Y)\) is a random variable whose value depends on \(Y\) (this is similar to the single variable transformation such as \(Y \to Y^2\))

**Proposition (law of total expectation)** If \(X,Y\) are any two random variables, then (provided the expectations exist)

$$EX = E(E(X|Y))$$

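application of the law of total expectation

Suppose, for example, we have two random variables \(X,Y\) where

$$X|Y \sim \text{binomial}(Y, p), \quad Y \sim \text{Poisson}(\lambda)$$

We can compute EX as follows

$$EX = E(E(X|Y)) = E(pY) = p\lambda$$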

Similarly we can expand the variance with respect to the other random variable.

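**Proposition (law of total variance)**

$$Var(X) = E(Var(X|Y)) + Var(E(X|Y))$$

*proof* Let \(V=Var(X|Y), Z=E(X|Y)\), then \(V=E(X^2|Y)-Z^2\). Taking expectations on both sides, the law of total expectation gives \(EV = EX^2-EZ^2\). Since \(EZ = EX\), we have \(Var(Z) = EZ^2 - (EZ)^2 = EZ^2 - (EX)^2\). Adding the two identities gives \(EV + Var(Z) = EX^2 - (EX)^2 = Var(X)\).

There is an interesting interpretation in Bayesian statistics, where the parameter \(\theta\) is the variable of interest and the dataset \(X\) is what we condition on

$$Var(\theta) = E(Var(\theta|X)) + Var(E(\theta|X))$$

This implies

$$Var(\theta) \geq E(Var(\theta|X))$$

which means that, on average, the posterior variance of \(\theta\) given the dataset \(X\) is smaller than the prior variance.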

law of total variance

Consider the following discrete joint distribution

| | \(Y=0\) | \(Y=1\) |
|---|---|---|
| \(X=0\) | \(1/5\) | \(2/5\) |
| \(X=1\) | \(2/5\) | \(0\) |

We can easily find that \(Var(E(X|Y)) = 8/75, E(Var(X|Y)) = 2/15, Var(X) = 6/25\), which satisfies the law of total variance.
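The three quantities in this example can be recomputed exactly. The sketch below encodes the joint table as a dictionary keyed by \((x, y)\); the variable and function names are assumptions for illustration.

```python
from fractions import Fraction as F

# Joint pmf from the table above, keyed by (x, y).
joint = {(0, 0): F(1, 5), (0, 1): F(2, 5), (1, 0): F(2, 5), (1, 1): F(0)}

# Marginal pmf of Y.
p_y = {y: sum(p for (x, yy), p in joint.items() if yy == y) for y in (0, 1)}

def cond_moment(y, k):
    """E(X^k | Y = y) computed from the joint pmf."""
    return sum(x ** k * p for (x, yy), p in joint.items() if yy == y) / p_y[y]

cond_mean = {y: cond_moment(y, 1) for y in p_y}
cond_var = {y: cond_moment(y, 2) - cond_mean[y] ** 2 for y in p_y}

e_x = sum(x * p for (x, _), p in joint.items())
var_x = sum(x * x * p for (x, _), p in joint.items()) - e_x ** 2

e_cond_var = sum(cond_var[y] * p_y[y] for y in p_y)                    # E(Var(X|Y))
var_cond_mean = sum(cond_mean[y] ** 2 * p_y[y] for y in p_y) - e_x ** 2  # Var(E(X|Y))

print(var_cond_mean, e_cond_var, var_x)
```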

### 5.5. Covariance and Correlation

The covariance and correlation measure the strength of a kind of linear relationship.

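**Definition (covariance)** The covariance between \(X,Y\), with means \(\mu_X = EX, \mu_Y = EY\), is defined as

$$Cov(X,Y) = E((X-\mu_X)(Y-\mu_Y))$$

It can be simplified to

$$Cov(X,Y) = EXY - \mu_X\mu_Y$$

**Definition (correlation coefficient)** The correlation of \(X,Y\), with standard deviations \(\sigma_X, \sigma_Y\), is the number defined by

$$\rho_{XY} = \frac{Cov(X,Y)}{\sigma_X\sigma_Y}$$

If we define \(U,V\) as the standardized variables

$$U = \frac{X - \mu_X}{\sigma_X}, \quad V = \frac{Y - \mu_Y}{\sigma_Y}$$

then

$$\rho_{XY} = Cov(U,V)$$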

**Lemma (properties of covariance)**

- \(Cov(X,X) = Var(X)\)
- If \(X, Y\) are independent then \(Cov(X,Y)=0\)
- \(Cov(X,Y) = Cov(Y,X)\)
- \(Cov(aX,Y) = aCov(X,Y)\)
- \(Cov(X+c, Y)=Cov(X,Y)\)
- \(Cov(X+Y, Z)=Cov(X,Z)+Cov(Y,Z)\)

We can summarize them into the linear property:

$$Cov\left(\sum_i a_i X_i, \sum_j b_j Y_j\right) = \sum_i \sum_j a_i b_j Cov(X_i, Y_j)$$

**Proposition (independence and covariance)** If \(X,Y\) are independent random variables, then \(\text{Cov}(X,Y) = 0\)

*Proof* When \(X,Y\) are independent \(\text{Cov}(X,Y) = EXY - EXEY = EXEY - EXEY = 0\)

However, the converse is not always true, i.e. zero covariance does not necessarily mean independence. In some special cases it is true, though (see C&B Lemma 5.3.3).

Covariance and correlation measure only a particular kind of *linear* relationship. To measure general dependence, use mutual information instead.

\(X,Y\) are independent if and only if \(I(X; Y)=0\).

discrete \((X,Y)\) has covariance 0 but dependent

Consider a random variable \(X\) that takes the values \(\pm 1\) with probability 0.5 each, and let \(Y=0\) when \(X=-1\) and \(Y=\pm 1\) (with probability 0.5 each) when \(X=1\).

They are clearly dependent, but \(Cov(X,Y) = EXY - EXEY = 0 - 0 \cdot 0 = 0\)
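A discrete example of this kind can be verified exactly. The sketch below assumes one concrete reading of the example: \(X = \pm 1\) with probability 1/2 each, \(Y = 0\) when \(X = -1\), and \(Y = \pm 1\) with probability 1/2 each when \(X = 1\). The joint table and helper names are assumptions for illustration.

```python
from fractions import Fraction as F

# Assumed joint pmf keyed by (x, y): X = -1 forces Y = 0; X = 1 gives Y = -1 or 1.
joint = {(-1, 0): F(1, 2), (1, -1): F(1, 4), (1, 1): F(1, 4)}

def expect(fn):
    """E[fn(X, Y)] under the joint pmf."""
    return sum(fn(x, y) * p for (x, y), p in joint.items())

cov = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)

# Independence would require P(X=1, Y=0) = P(X=1) P(Y=0); check that one cell.
p_x1 = sum(p for (x, _), p in joint.items() if x == 1)
p_y0 = sum(p for (_, y), p in joint.items() if y == 0)
p_x1_y0 = joint.get((1, 0), F(0))
factorizes = p_x1_y0 == p_x1 * p_y0

print(cov, factorizes)
```

A single cell where the joint pmf fails to factorize is enough to rule out independence, even though the covariance is exactly zero.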

continuous \((X,Y)\) has covariance 0 but dependent

Consider random variables \(X \sim N(0, 1), Y=X^2\). They are obviously dependent, but

$$Cov(X,Y) = EX^3 - EX \cdot EX^2 = 0 - 0 \cdot 1 = 0$$

However, there are some limited cases in which zero covariance implies independence

**Proposition** Let \(X_j \sim n(\mu_j, \sigma^2_j), j = 1, ..., n\) be independent. For constants \(a_{i,j}, b_{r,j}\), define

$$U_i = \sum_{j=1}^{n} a_{i,j}X_j, \quad V_r = \sum_{j=1}^{n} b_{r,j}X_j$$

The random variables \(U_i, V_r\) are independent iff \(Cov(U_i, V_r)=0\)

**Proposition (variance of a sum)**

$$Var(X+Y) = Var(X) + Var(Y) + 2Cov(X,Y)$$

If \(X,Y\) are independent (or uncorrelated)

$$Var(X+Y) = Var(X) + Var(Y)$$

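**Proposition (properties of correlation coefficient)**

- \(-1 \leq \rho_{XY} \leq 1\)
- \(|\rho_{XY}| = 1\) iff there exist constants \(a \neq 0\) and \(b\) such that \(P(Y = aX + b) = 1\); in that case \(\rho_{XY} = 1\) if \(a > 0\) and \(\rho_{XY} = -1\) if \(a < 0\)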

### 5.6. Multivariable Models

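**Definition (multivariable random vector)** The random vector \(X=(X_1, ..., X_n)\) has a sample space that is a subset of \(R^n\). If \((X_1, ..., X_n)\) is a discrete random vector (sample space is countable), the joint pmf of \((X_1, X_2, ..., X_n)\) is defined by

$$f(\mathbf{x}) = f(x_1, ..., x_n) = P(X_1 = x_1, ..., X_n = x_n)$$

then for any \(A \subset R^n\)

$$P(X \in A) = \sum_{\mathbf{x} \in A} f(\mathbf{x})$$

If \((X_1, ..., X_n)\) is a continuous random vector, the joint pdf \(f(x_1, ..., x_n)\) satisfies

$$P(X \in A) = \int \cdots \int_A f(x_1, ..., x_n)\,dx_1 \cdots dx_n$$

**Definition (expected value)** Let \(g(x) = g(x_1, ..., x_n)\) be a real-valued function defined on the sample space of \(X\). Then \(g(X)\) is a random variable and the expected value of \(g(X)\) is, in the continuous and discrete cases respectively,

$$Eg(X) = \int \cdots \int g(\mathbf{x}) f(\mathbf{x})\,d\mathbf{x}, \qquad Eg(X) = \sum_{\mathbf{x}} g(\mathbf{x}) f(\mathbf{x})$$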

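**Definition (marginal pdf)** The marginal pdf or pmf of the first \(k\) coordinates \((X_1, ..., X_k)\) is obtained by integrating or summing the joint pdf or pmf over the remaining coordinates

$$f(x_1, ..., x_k) = \int \cdots \int f(x_1, ..., x_n)\,dx_{k+1} \cdots dx_n$$

**Definition (conditional pdf)** If \(f(x_1, ..., x_k) > 0\), the conditional pdf or pmf of \((X_{k+1}, ..., X_n)\) given \(X_1 = x_1, ..., X_k = x_k\) is

$$f(x_{k+1}, ..., x_n | x_1, ..., x_k) = \frac{f(x_1, ..., x_n)}{f(x_1, ..., x_k)}$$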

**Definition (mutually independent random vectors)** Let \(X_1, ..., X_n\) be random vectors with joint pdf/pmf \(f(x_1, ..., x_n)\). Let \(f_{X_i}(x_i)\) denote the marginal pdf/pmf of \(X_i\), then \(X_1, ..., X_n\) are called *mutually independent random vectors* if for every \(x_1, ..., x_n\)

$$f(x_1, ..., x_n) = \prod_{i=1}^{n} f_{X_i}(x_i)$$

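**Definition (variance-covariance matrix)** For a random vector \(\mathbf{X}\) with mean vector \(\mu = E\mathbf{X}\), the variance-covariance matrix is represented as

$$\Sigma = Var(\mathbf{X}) = E[(\mathbf{X} - \mu)(\mathbf{X} - \mu)^T]$$

whose \((i,j)\) entry is \(Cov(X_i, X_j)\). It is a symmetric matrix.

If we partition \(\mathbf{X}\) into two groups: \(\mathbf{X}^{(1)}, \mathbf{X}^{(2)}\), then the variance-covariance matrix can also be partitioned into components

$$\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}$$

where \(\Sigma_{12} = Cov(\mathbf{X}^{(1)}, \mathbf{X}^{(2)})\)

**Lemma (linear combinations)** Suppose \(\mathbf{Z = CX}\) (e.g: \(Z_1 = c_{1,1}X_1 + ... + c_{1, p}X_p\))

then

$$E\mathbf{Z} = \mathbf{C}\mu, \quad Var(\mathbf{Z}) = \mathbf{C}\Sigma\mathbf{C}^T$$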

**Definition (identifiability)** The parameterization \(\theta \in \Theta\) is identifiable if \(Y_1 \sim P_{\theta_1}, Y_2 \sim P_{\theta_2}\) and \(Y_1 \sim Y_2\) imply that \(\theta_1 = \theta_2\)

## 6. Asymptotics

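Convergence concepts are useful for approximating finite-sample behaviour because expressions often simplify when taking limits. The relations between the modes of convergence are as follows:

$$\text{almost surely} \Rightarrow \text{in probability} \Rightarrow \text{in distribution}, \qquad \text{in quadratic mean} \Rightarrow \text{in probability}$$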

### 6.1. Almost Sure Convergence

Almost sure convergence is similar to pointwise convergence \(\lim X_n = X\), except that the convergence is allowed to fail on a set with measure 0.

Let the sample space \(S\) have elements denoted by \(s\); then \(X_n(s)\) and \(X(s)\) are functions defined on \(S\). \(X_n\) converges to \(X\) almost surely if the functions \(X_n(s)\) converge to \(X(s)\) for all \(s \in S\) except on a set with measure 0.

**Definition (almost sure convergence)** A sequence of random variables \(X_1,X_2, ...\) converges almost surely to a random variable \(X\) if, for every \(\epsilon > 0\)

$$P\left(\lim_{n \to \infty} |X_n - X| < \epsilon\right) = 1$$

Formally, the almost sure convergence is defined as follows:

There exists a set \(\Omega_0 \subset \Omega\) of probability mass \(1\) (\(P(\Omega_0)=1\)) such that for any \(\omega \in \Omega_0\) and for any \(\epsilon > 0\), there exists an \(N(\omega, \epsilon)\) such that when \(n > N\)

$$|X_n(\omega) - X(\omega)| < \epsilon$$

**Theorem (Strong Law of Large Numbers)** Let \(X_1, X_2, ...\) be iid random variables with \(EX_i = \mu\) and \(Var(X_i) = \sigma^2 < \infty\), and let \(\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i\), then for every \(\epsilon > 0\)

$$P\left(\lim_{n \to \infty} |\bar{X}_n - \mu| < \epsilon\right) = 1$$

### 6.2. Convergence in Probability

**Definition (convergence in probability)** A sequence of random variables \(X_1, X_2, ...\) converges in probability to a random variable \(X\) if for every \(\epsilon > 0\)

$$\lim_{n \to \infty} P(|X_n - X| \geq \epsilon) = 0$$

Note that the \(X_n\) are usually not iid random variables, and \(X\) is commonly a fixed number. The most famous result is the following one:

**Theorem (Weak Law of Large Numbers)** Let \(X_1, X_2, ...\) be iid random variables with \(EX_i=\mu, Var(X_i) = \sigma^2 < \infty\), and let \(\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i\). Then for every \(\epsilon > 0\)

$$\lim_{n \to \infty} P(|\bar{X}_n - \mu| < \epsilon) = 1$$

that is, \(\bar{X}_n\) converges in probability to \(\mu\)

This theorem can be proved by using Chebychev's inequality

$$P(|\bar{X}_n - \mu| \geq \epsilon) \leq \frac{Var(\bar{X}_n)}{\epsilon^2} = \frac{\sigma^2}{n\epsilon^2} \to 0$$

Convergence in probability is highly related to the consistency concept in statistics. Suppose we have an estimator \(\hat{\theta}\) for some quantity \(\theta\), \(\hat{\theta}\) is said to be consistent if it converges to \(\theta\) in probability.

One related useful theorem about consistency is the following one

**Theorem (consistency preserved by continuous function)** Suppose \(X_1, X_2, ...\) converges in probability to a random variable \(X\) and that \(h\) is a continuous function. Then \(h(X_1), h(X_2), ...\) converges in probability to \(h(X)\)

Among those convergence concepts, convergence in distribution is the weakest form.

### 6.3. Convergence in Quadratic Mean

**Definition (convergence in quadratic mean)** Sometimes it is useful to show a stronger form of convergence in order to prove convergence in probability. A sequence \(X_n\) converges to \(X\) in quadratic mean if

$$\lim_{n \to \infty} E(X_n - X)^2 = 0$$

Convergence in quadratic mean implies convergence in probability because, by Chebychev's inequality,

$$P(|X_n - X| \geq \epsilon) \leq \frac{E(X_n - X)^2}{\epsilon^2} \to 0$$

Intuitively, quadratic mean convergence penalizes deviations through their square, while convergence in probability penalizes deviations through their absolute value; therefore convergence in quadratic mean is the stronger form.

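convergence in probability does not imply convergence in quadratic mean

Consider the random variable \(X_n = \sqrt{n} \mathbf{1}_{[0, 1/n]}(U)\) where \(U\) is uniform on \([0,1]\). \(X_n\) converges to 0 in probability, since for any \(\epsilon > 0\)

$$P(|X_n| \geq \epsilon) \leq P(U \leq 1/n) = \frac{1}{n} \to 0$$

However, the quadratic mean does not converge to zero

$$E(X_n - 0)^2 = n \cdot P(U \leq 1/n) = 1$$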
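A rough Monte Carlo check of this counterexample is sketched below, assuming \(U\) is uniform on \([0,1]\). The seed, the number of trials, the threshold `eps` and the chosen values of \(n\) are arbitrary illustrative choices, so the printed estimates are only approximate.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def sample_xn(n):
    """One draw of X_n = sqrt(n) * 1{U <= 1/n} with U ~ Uniform(0, 1)."""
    u = random.random()
    return n ** 0.5 if u <= 1 / n else 0.0

def estimate(n, trials=200_000, eps=0.5):
    """Monte Carlo estimates of P(|X_n| > eps) and E[X_n^2]."""
    draws = [sample_xn(n) for _ in range(trials)]
    tail = sum(abs(x) > eps for x in draws) / trials
    second_moment = sum(x * x for x in draws) / trials
    return tail, second_moment

results = {n: estimate(n) for n in (10, 100, 400)}
for n, (tail, m2) in results.items():
    print(n, tail, m2)
```

The tail probability shrinks roughly like \(1/n\), while the estimated second moment stays near 1 for every \(n\).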

### 6.4. Convergence in Distribution

**Definition (convergence in distribution)** A sequence of random variables \(X_1, X_2, ...\) converges in distribution to a random variable \(X\) if

$$\lim_{n \to \infty} F_{X_n}(x) = F_X(x)$$

at all points \(x\) where \(F_X(x)\) is continuous

### 6.5. Central Limit Theorem

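**Theorem (Central Limit Theorem, classical CLT)** Let \(X_1, X_2, ...\) be a sequence of iid random variables whose mgfs exist in a neighborhood of \(0\). Let \(EX_i = \mu, Var(X_i) = \sigma^2 > 0\) and let \(G_n(x)\) denote the cdf of

$$\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma}$$

then \(G_n(x)\) converges to the standard normal cdf, i.e. for any \(x\)

$$\lim_{n \to \infty} G_n(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-y^2/2}\,dy$$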
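The classical CLT can be illustrated by simulation. The sketch below assumes iid Uniform(0, 1) draws (mean 1/2, variance 1/12) and compares the empirical cdf of the standardized sample mean with the standard normal cdf; the sample size 30, the 5000 replications and the seed are arbitrary choices.

```python
import random
from statistics import NormalDist

random.seed(0)  # fixed seed so the sketch is reproducible

def standardized_mean(n):
    """sqrt(n) * (sample mean - mu) / sigma for n iid Uniform(0, 1) draws."""
    mu, sigma = 0.5, (1 / 12) ** 0.5
    xbar = sum(random.random() for _ in range(n)) / n
    return n ** 0.5 * (xbar - mu) / sigma

reps = [standardized_mean(30) for _ in range(5000)]

def empirical_cdf(x):
    """Fraction of replications at or below x, an estimate of G_n(x)."""
    return sum(z <= x for z in reps) / len(reps)

phi = NormalDist().cdf  # standard normal cdf
for x in (-1.0, 0.0, 1.0):
    print(x, round(empirical_cdf(x), 3), round(phi(x), 3))
```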

Instead of the true variance \(\sigma^2\), we can use the estimated variance \(S^2\); the CLT still holds by Slutsky's theorem and the continuous mapping theorem

$$\frac{\sqrt{n}(\bar{X}_n - \mu)}{S} \xrightarrow{d} N(0, 1)$$

When the \(X_i\) are independent but not identically distributed, the CLT still holds if the moments of some order \(2+\delta\) are suitably bounded

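**Theorem (Lyapunov CLT)** Suppose \(X_1, ..., X_n\) is a sequence of independent random variables with mean \(\mu_i\) and variance \(\sigma_i^2\). Let \(s^2_n = \sum_{i=1}^{n} \sigma^2_i\). If for some \(\delta > 0\) the Lyapunov condition is satisfied:

$$\lim_{n \to \infty} \frac{1}{s_n^{2+\delta}} \sum_{i=1}^{n} E|X_i - \mu_i|^{2+\delta} = 0$$

then we have the CLT

$$\frac{1}{s_n} \sum_{i=1}^{n} (X_i - \mu_i) \xrightarrow{d} N(0, 1)$$

Note that there is another related (weaker) condition, the Lindeberg condition, which gives the Lindeberg CLT

**Theorem (Multivariate CLT)** If \(X_1, ..., X_n\) are iid random vectors with mean \(\mu\) and covariance matrix \(\Sigma\), then

$$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} N(0, \Sigma)$$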

The rate of CLT convergence is roughly at \(1/\sqrt{n}\)

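**Theorem (Berry-Esseen)** Suppose \(X_1, ..., X_n\) are iid with mean \(\mu\), variance \(\sigma^2\) and finite third absolute central moment \(\mu_3 = E|X_1 - \mu|^3\). Let

$$Z_n = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma}$$

then, for an absolute constant \(C\) and with \(\Phi\) the standard normal cdf,

$$\sup_z |P(Z_n \leq z) - \Phi(z)| \leq \frac{C\mu_3}{\sigma^3\sqrt{n}}$$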

### 6.6. Delta Method

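Suppose we have a sequence of random variables \(X_n\) converging to a normal distribution; we can also characterize the limiting distribution of \(g(X_n)\), where \(g\) is a smooth function

**Theorem (delta method)** Suppose

$$\sqrt{n}(X_n - \theta) \xrightarrow{d} N(0, \sigma^2)$$

and \(g\) is differentiable at \(\theta\) with \(g'(\theta) \neq 0\). Then

$$\sqrt{n}(g(X_n) - g(\theta)) \xrightarrow{d} N(0, \sigma^2 [g'(\theta)]^2)$$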
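The delta method can be sanity-checked by simulation. The sketch below assumes \(X_n\) is the sample mean of \(n\) iid Uniform(0, 1) draws (so \(\theta = 1/2\), \(\sigma^2 = 1/12\) by the CLT) and \(g(x) = x^2\), so that \(g'(\theta) = 1\). The sample size, number of replications and seed are arbitrary.

```python
import random
from statistics import variance

random.seed(0)  # fixed seed so the sketch is reproducible

theta, sigma2 = 0.5, 1 / 12   # mean and variance of Uniform(0, 1)
g = lambda x: x * x           # smooth function applied to the sample mean
g_prime_theta = 2 * theta     # g'(theta) = 1

def scaled_error(n):
    """sqrt(n) * (g(sample mean) - g(theta)) for n iid Uniform(0, 1) draws."""
    xbar = sum(random.random() for _ in range(n)) / n
    return n ** 0.5 * (g(xbar) - g(theta))

n = 200
draws = [scaled_error(n) for _ in range(4000)]

empirical_var = variance(draws)            # simulated variance of the scaled error
delta_var = sigma2 * g_prime_theta ** 2    # variance predicted by the delta method

print(empirical_var, delta_var)
```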

## 7. Reference

- [1] Pishro-Nik, Hossein. "Introduction to probability, statistics, and random processes." (2016).
- [2] Casella, George, and Roger L. Berger. Statistical inference. Vol. 2. Pacific Grove, CA: Duxbury, 2002.
- [3] Axler, Sheldon. Measure, Integration & Real Analysis. Springer Nature, 2020.
- [4] Çınlar, Erhan. Probability and stochastics. Vol. 261. Springer Science & Business Media, 2011.