0x040 Probability
- 1. Measure Theory
- 2. Law of Large Numbers
- 3. Central Limit Theorems
- 4. Univariate models
- 5. Multivariate Models
- 6. Asymptotics
- 7. Reference
This note is a mixture of measure-based and non-measure-based probability.
1. Measure Theory
Details of measure theory will be described in the real analysis note
1.1. Probability Space
Definition (sample space) \(\Omega\) is called the sample space. Intuitively, it is the set of possible outcomes of an experiment.
Unfortunately, we cannot in general assign a measure to every subset of the sample space; therefore, we only consider an event space, which is a collection of subsets.
Definition (event space) \(\mathcal{F}\) is called the event space when \(\mathcal{F}\) is a \(\sigma\)-algebra on a set \(\Omega\). An event is an element of \(\mathcal{F}\)
On this proper event space, we can define the probability measure
Definition (probability measure) A probability measure on \((\Omega, \mathcal{F})\) is a measure \(P\) on \((\Omega, \mathcal{F})\) such that \(P(\Omega)=1\). If \(A\) is an event, then \(P(A)\) is called the probability of \(A\)
Recall the measure has the following properties, let \(\mu\) be a measure on \((\Omega, \mathcal{F})\)
- (monotonicity) If \(A \subset B\), then \(\mu(A) \leq \mu(B)\)
- (subadditivity) If \(A \subset \cup_{m=1}^{\infty} A_m\), then \(\mu(A) \leq \sum_{m=1}^\infty \mu(A_m)\)
- (continuity from below) If \(A_i \uparrow A\) (i.e. $A_1 \subset A_2 \subset ... $), then \(\mu(A_i ) \uparrow \mu(A)\)
- (continuity from above) If \(A_i \downarrow A\) with \(\mu(A_1) < \infty\), then \(\mu(A_i) \downarrow \mu(A)\)
With these three components, we can define the probability space
Definition (probability space, Kolmogorov) If \(P\) is a probability measure on \((\Omega, \mathcal{F})\), then the triplet \((\Omega, \mathcal{F}, P)\) is called a probability space.
discrete probability space
Let \(\Omega\) be a countable set, \(\mathcal{F}\) be the set of all subsets of \(\Omega\), and
\[ P(A) = \sum_{\omega \in A} p(\omega) \]
where \(\sum_{\omega \in \Omega} p(\omega) = 1\)
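As a quick illustration (a minimal Python sketch, with a fair die as a made-up example), a discrete probability space maps directly onto code: represent the pmf \(p(\omega)\) as a dictionary and compute \(P(A)\) by summation.

```python
# Sketch of a discrete probability space: a fair six-sided die.
# Outcomes and their masses p(w); the masses must sum to 1.
p = {w: 1 / 6 for w in range(1, 7)}

def prob(event):
    """P(A) = sum of p(w) over outcomes w in the event A."""
    return sum(p[w] for w in event)

assert abs(prob(p.keys()) - 1.0) < 1e-12  # P(Omega) = 1
print(prob({2, 4, 6}))                    # P(even) = 0.5
```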
Frequentist and Bayesian interpretation
There are two common interpretations of \(P(A)\): long-run frequencies (frequentist) and degrees of belief (Bayesian).
- Frequentist says that \(P(A)\) is the long run proportion of times that \(A\) is true in repetition.
- The degree of belief interpretation says that \(P(A)\) is the observer's strength of belief that \(A\) is true.
Note that probability in quantum mechanics probably does not belong to either of these interpretations.
1.2. Distribution
random variable is neither random nor a variable
Definition (Random Variable) Suppose \((\Omega, \mathcal{F}, P)\) is a probability space. A random variable on \((\Omega, \mathcal{F})\) is a measurable function from \(\Omega\) to \(R\).
Intuitively, a random variable is a function from the sample space to another sample space (e.g. \(R\)). Note that a random variable can even be defined to map into a measurable space other than \((R, \mathcal{B})\).
trivial random variable
If \(\Omega\) is a discrete probability space, then any function \(X: \Omega \to R\) is a random variable
The indicator function \(1_A\) of a set \(A \subset \Omega\) is a random variable iff \(A \in \mathcal{F}\)
Definition (distribution) If \(X\) is a random variable, then \(X\) induces a probability measure on \(R\) called its distribution by setting
\[ \mu(A) = P(X \in A) \quad \text{for Borel sets } A \]
Definition (distribution function) The distribution of a random variable \(X\) is usually described by its distribution function
\[ F(x) = P(X \leq x) \]
Every distribution function has the following properties; they all follow from simple properties of measures
- \(F\) is nondecreasing
- \(\lim_{x \to \infty} F(x) = 1, \lim_{x \to -\infty} F(x) = 0\)
- \(F\) is right-continuous: \(\lim_{y \downarrow x} F(y) = F(x)\)
- Let \(F(x-) = \lim_{y \uparrow x} F(y)\), then \(F(x-) = P(X < x)\)
- \(P(X=x) = F(x) - F(x-)\)
Conversely, if a function satisfies the top 3 properties, then it is the distribution function of some random variable
1.3. Conditional Probability
Conditional Probability All probabilities are calculated with respect to a sample space, but in many cases, we are in a position to update the sample space with new information. In this case, we use conditional probability.
Definition (Conditional Probability) If \(A, B\) are two events in a sample space and \(P(B) > 0\), then the conditional probability of \(A\) given \(B\) is
\[ P(A|B) = \frac{P(A \cap B)}{P(B)} \]
Note that \(B\) becomes the sample space here. In particular, for any fixed \(B\) such that \(P(B) > 0\), \(P(\cdot | B)\) is itself a probability measure (it satisfies Kolmogorov's axioms)
prosecutor's fallacy: the fallacy of confusing \(P(A|B)\) with \(P(B|A)\)
Lemma (multiplication rule) For any pair of events \(A\) and \(B\) with positive probability
\[ P(A \cap B) = P(A|B) P(B) = P(B|A) P(A) \]
Theorem (The Law of Total Probability) Let \(A_1, ..., A_k\) be a partition of \(\Omega\). Then for any event \(B\)
\[ P(B) = \sum_{i=1}^{k} P(B|A_i) P(A_i) \]
Theorem (Bayes' Theorem) With \(A_1, ..., A_k\) a partition of \(\Omega\) as above,
\[ P(A_i|B) = \frac{P(B|A_i) P(A_i)}{\sum_{j=1}^{k} P(B|A_j) P(A_j)} \]
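As a worked example of the last two theorems (a minimal Python sketch; the test-accuracy and prevalence numbers are hypothetical):

```python
# Hypothetical numbers: a diagnostic test with 99% sensitivity and
# 95% specificity, applied to a population with 1% prevalence.
prior = [0.01, 0.99]        # P(disease), P(no disease): a partition of Omega
likelihood = [0.99, 0.05]   # P(positive | disease), P(positive | no disease)

# Law of total probability: P(positive) = sum_i P(positive | A_i) P(A_i)
p_positive = sum(l * p for l, p in zip(likelihood, prior))

# Bayes' theorem: P(disease | positive)
posterior = likelihood[0] * prior[0] / p_positive
print(posterior)  # ~0.166: a positive test is far from conclusive
```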
Independent Events Definition (Independence) Two events \(A\) and \(B\) are independent if
\[ P(A \cap B) = P(A) P(B) \]
Independence can arise in two distinct ways
- explicitly assume independence
- derive independence by verifying the previous definition
Note that disjoint events with positive probability are never independent.
Mutual independence is a much stronger assumption. Pairwise independence for all pairs does not imply mutual independence.
Definition (mutual independence) A collection of events \(A_1, ..., A_n\) are mutually independent iff for any subcollection \(A_{i_1}, ..., A_{i_k}\)
\[ P(A_{i_1} \cap \cdots \cap A_{i_k}) = \prod_{j=1}^{k} P(A_{i_j}) \]
1.4. Random variable
Recall from Section 1.2 that a random variable on \((\Omega, \mathcal{F})\) is a measurable function from \(\Omega\) to \(R\). A random variable can also be defined to map into a measurable space other than \((R, \mathcal{B})\).
Definition ((more general) random variable) Let \((E, \mathcal{E})\) be a measurable space. A mapping \(X: \Omega \to E\) is called a random variable if \(X\) is a measurable function with respect to \(\mathcal{F}\) and \(\mathcal{E}\), which means
\[ X^{-1}(B) = \{ \omega : X(\omega) \in B \} \in \mathcal{F} \quad \text{for all } B \in \mathcal{E} \]
Definition (induced probability function) The induced probability function with respect to the original measure is defined as
\[ P_X(A) = P(X^{-1}(A)) = P(\{ \omega : X(\omega) \in A \}) \]
Note that this is a formal probability distribution, which means it satisfies Kolmogorov's axioms
Note that \(X\) is a discrete random variable if its range is countable
2. Law of Large Numbers
2.1. Independence
Measure theory ends and probability begins with the definition of independence.
Definition (independence)
- Two events \(A\), \(B\) are independent if \(P(A \cap B) = P(A) P(B)\)
- Two random variables \(X, Y\) are independent if for all Borel sets \(C, D \subset R\), \(P(X \in C, Y \in D) = P(X \in C) P(Y \in D)\)
- Two \(\sigma\)-fields \(\mathcal{F}, \mathcal{G}\) are independent if for all \(A \in \mathcal{F}, B \in \mathcal{G}\), the events \(A\) and \(B\) are independent
2.2. Weak Law of Large Numbers
2.3. Borel-Cantelli Lemmas
2.4. Strong Law of Large Numbers
3. Central Limit Theorems
4. Univariate models
4.1. Transformation
Definition (transformation) If \(X\) is a random variable and \(g\) is a Borel measurable function, then \(Y = g(X)\) is also a random variable, and the probability distribution of \(Y\) is defined by
\[ P(Y \in A) = P(g(X) \in A) = P(X \in g^{-1}(A)) \]
Corollary (transformation of support) It is important to keep track of the sample spaces of \(X\) and \(Y\); the support of \(Y\) is
\[ \mathcal{Y} = \{ y : y = g(x) \text{ for some } x \in \mathcal{X} \} \]
Corollary (monotone transformation of cdf) If \(X\) has cdf \(F_X(x)\), let \(Y = g(X)\):
if \(g\) is an increasing function, then
\[ F_Y(y) = F_X(g^{-1}(y)) \]
if \(g\) is a decreasing function (and \(X\) is continuous), then
\[ F_Y(y) = 1 - F_X(g^{-1}(y)) \]
By taking derivative of both sides, we obtain the transformation rules of pdf for monotone functions.
Note this is a variant of the integration by substitution (derived from the fundamental theorem of calculus) where \(g^{-1} = \varphi\)
Theorem (monotone transformation of pdf) Let \(X\) have pdf \(f_X(x)\) and \(Y = g(X)\), where \(g\) is a monotone function. Suppose \(f_X(x)\) is continuous and \(g^{-1}(y)\) has a continuous derivative on \(\mathcal{Y}\); then
\[ f_Y(y) = f_X(g^{-1}(y)) \left| \frac{d g^{-1}(y)}{dy} \right| \quad y \in \mathcal{Y} \]
Intuitively, the discussion above is simply the conservation of probability mass
\[ |f_Y(y) \, dy| = |f_X(x) \, dx| \]
therefore, we get
\[ f_Y(y) = f_X(x) \left| \frac{dx}{dy} \right| = f_X(g^{-1}(y)) \left| \frac{d g^{-1}(y)}{dy} \right| \]
Note
this only applies to monotone functions. For functions that are not monotone (e.g. \(Y = X^2\)), we partition \(\mathcal{X}\) into intervals \(A_1, A_2, \ldots\) on which \(g\) is monotone, then sum the contribution of each inverse branch \(g_i^{-1}\):
\[ f_Y(y) = \sum_i f_X(g_i^{-1}(y)) \left| \frac{d g_i^{-1}(y)}{dy} \right| \]
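To sanity-check the transformation formula numerically, here is a minimal Python sketch (my own example: \(Y = e^X\) with \(X \sim N(0,1)\), which should recover the lognormal density):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = np.exp(x)                        # monotone transformation Y = g(X) = e^X

# From the theorem: f_Y(y) = f_X(g^{-1}(y)) |d g^{-1}(y)/dy|
# with g^{-1}(y) = log(y) and derivative 1/y (the lognormal density).
def f_Y(y):
    return np.exp(-np.log(y) ** 2 / 2) / (np.sqrt(2 * np.pi) * y)

# Compare a histogram density estimate against the formula.
counts, edges = np.histogram(y, bins=200, range=(0.05, 5))
width = edges[1] - edges[0]
dens = counts / (len(y) * width)
mids = (edges[:-1] + edges[1:]) / 2
print(np.max(np.abs(dens - f_Y(mids))))  # small (Monte Carlo error)
```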
4.2. Expectation and Variance
Definition (expectation) Formally, suppose \((\Omega, \mathcal{F}, P)\) is a probability space. If \(X \in \mathcal{L}^1 (P)\), then the expectation of the random variable \(X\) is denoted \(EX\) and defined by
\[ EX = \int_\Omega X \, dP \]
When \(X\) is a discrete random variable with range \(R_X = \{x_1, x_2, \ldots\}\) (finite or countably infinite), the expected value of \(X\), denoted by \(EX\), is defined as
\[ EX = \sum_{x_k \in R_X} x_k P(X = x_k) \]
Note that the expectation does not exist for every distribution; for example, the Cauchy distribution does not have an expectation
Theorem (linearity of expectation)
\[ E(aX + bY + c) = aEX + bEY + c \]
Theorem (expectation of transformation) There are two ways to compute \(E[g(X)]\). One way is to compute the pmf/pdf of \(Y = g(X)\); the other (usually easier) way is
\[ E[g(X)] = \sum_x g(x) f_X(x) \quad \text{or} \quad E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x) \, dx \]
Theorem (Jensen's inequality) When \(f\) is a convex function, we have the following inequality
\[ f(E[X]) \leq E[f(X)] \]
gap in Jensen's inequality
with a Taylor approximation, the gap between \(E[f(X)]\) and \(f(E[X])\) can be interpreted as being governed by the variance of \(X\) and the convexity of \(f\) (e.g. its second derivative): \(E[f(X)] - f(E[X]) \approx \frac{1}{2} f''(E[X]) \, Var(X)\)
4.3. Moments
Moments reflect characteristics of distributions; however, even the full infinite set of moments is not always enough to characterize a distribution, and two distinct random variables may share the same set of moments. One sufficient condition for the moments to characterize the distribution is that the random variables have bounded support.
Definition (moment, central moment) For each integer \(n\), the \(n\)-th moment of \(X\), \(\mu_n'\), is
\[ \mu_n' = EX^n \]
The \(n\)-th central moment of \(X\), \(\mu_n\), is
\[ \mu_n = E(X - \mu)^n \quad \text{where } \mu = EX \]
The 2nd central moment is the variance defined as follows
Definition (Variance) The variance of a random variable \(X\) with mean \(EX = \mu_X\) is defined as
\[ Var(X) = E(X - \mu_X)^2 \]
The standard deviation of a random variable \(X\) is defined as
\[ \sigma_X = \sqrt{Var(X)} \]
A simple way to compute variance is as follows
Lemma (relationship between moments) The previous \(Var(X)\) can be written as
\[ Var(X) = EX^2 - (EX)^2 \]
The 3rd and 4th central moments have similar expansions in terms of raw moments: \(E(X-\mu)^3 = EX^3 - 3\mu EX^2 + 2\mu^3\) and \(E(X-\mu)^4 = EX^4 - 4\mu EX^3 + 6\mu^2 EX^2 - 3\mu^4\)
Proposition (algebra of variance)
\[ Var(aX + b) = a^2 Var(X) \]
If \(X, Y\) are independent
\[ Var(X + Y) = Var(X) + Var(Y) \]
Definition (Standardized moment) The standardized moment is the normalized central moment defined as
\[ \tilde{\mu}_n = \frac{\mu_n}{\sigma^n} \]
The 3rd standardized moment is called the skewness, which measures the lack of symmetry
The 4th standardized moment is called the kurtosis, which measures the peakedness of the pdf
While the moments may not suffice to characterize a distribution, the following moment generating function does characterize the distribution when it exists
Definition (moment generating function, mgf) The moment generating function of \(X\), denoted by \(M_X(t)\), is the following, provided the expectation exists for \(t\) in some neighborhood of 0:
\[ M_X(t) = E e^{tX} \]
Note: \(M_X(t)\) is the two-sided Laplace transform of \(f_X(x)\), evaluated at \(-t\)
Lemma (algebra over mgf)
\[ M_{aX + b}(t) = e^{bt} M_X(at) \]
The moment generating function is so called because it can be used to generate moments by differentiation.
Theorem (moment generation) If \(X\) has mgf \(M_X(t)\), it generates moments as follows
\[ EX^n = M_X^{(n)}(0) = \left. \frac{d^n}{dt^n} M_X(t) \right|_{t=0} \]
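Moment generation is easy to watch symbolically (a sketch assuming the sympy library is available; the normal mgf \(M_X(t) = e^{\mu t + \sigma^2 t^2/2}\) is standard):

```python
import sympy as sp

t, mu, sigma = sp.symbols('t mu sigma', positive=True)
M = sp.exp(mu * t + sigma**2 * t**2 / 2)   # mgf of N(mu, sigma^2)

# E X^n is the n-th derivative of the mgf evaluated at t = 0
for n in range(1, 4):
    print(n, sp.simplify(sp.diff(M, t, n).subs(t, 0)))
# n=1: mu;  n=2: mu**2 + sigma**2;  n=3: mu**3 + 3*mu*sigma**2
```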
Note that the main use of the mgf is not to generate moments but to characterize a distribution: knowing the mgf pins down the entire (infinite) set of moments. However, an infinite set of moments alone is not in general enough to determine a distribution uniquely; two different distributions can share all of their moments.
For the moments to uniquely determine the distribution, we require an additional condition such as bounded support, or existence of the mgf near 0 as in the following theorem.
Theorem (determination of distributions) Let \(F_X(x), F_Y(y)\) be two cdfs all of whose moments exist. If the moment generating functions exist and \(M_X(t) = M_Y(t)\) for all \(t\) in some neighborhood of 0, then
\[ F_X(u) = F_Y(u) \quad \text{for all } u \]
Theorem (convergence of mgfs) Convergence for \(|t| < h\) of mgfs to an mgf implies convergence of the corresponding cdfs (i.e. convergence in distribution)
While moment generating functions might not always exist, the characteristic function always exists and also characterizes the distribution
Definition (characteristic function) The characteristic function of a random variable is defined as
\[ \varphi_X(t) = E e^{itX} \]
5. Multivariate Models
The probability models that involve more than one random variable are called multivariate models.
Definition (n-dimensional random vector) An n-dimensional random vector is a function from a sample space \(S\) into \(R^n\), \(n\)-dimensional Euclidean space.
5.1. Joint and Marginal Distributions
The random vector is called a discrete random vector when it has only a countable number of possible values, otherwise it is called a continuous random vector.
Definition (joint PMF) Let \((X,Y)\) be a discrete bivariate random vector. Then the function \(f(x,y): R^2 \to R\) defined by \(f(x,y) = P(X=x, Y=y)\) is called the joint probability mass function or joint pmf of \((X,Y)\).
The joint pmf can be used to compute the probability of any event.
Definition (marginal PMF) Let \((X,Y)\) be a discrete bivariate random vector with joint pmf \(f_{X,Y}(x,y)\). Then the marginal pmf of \(X\), \(f_X(x) = P(X = x)\), is given by
\[ f_X(x) = \sum_{y} f_{X,Y}(x, y) \]
Definition (joint PDF) A function \(f(x,y): R^2 \to R\) is called a joint probability density function or joint pdf of the continuous bivariate random vector \((X,Y)\) if for every event \(A \subset R^2\),
\[ P((X,Y) \in A) = \iint_A f(x,y) \, dx \, dy \]
Definition (marginal PDF) The marginal probability density function of \(X,Y\) are also defined as in the discrete case with integrals replacing sums.
Definition (joint CDF) The joint distribution of \((X,Y)\) can also be completely described with the joint cdf
\[ F(x, y) = P(X \leq x, Y \leq y) \]
5.2. Conditioning and Independence
Oftentimes when two random variables \((X,Y)\) are observed, the values of the two variables are related. Knowledge about the value of \(X\) gives us some information about the value of \(Y\).
Definition (conditional pmf, pdf) Let \((X,Y)\) be a discrete/continuous bivariate random vector with joint pmf/pdf \(f(x,y)\) and marginal pmfs/pdfs \(f_X(x), f_Y(y)\). For any \(x\) such that \(f_X(x) > 0\), the conditional pmf/pdf of \(Y\) given that \(X = x\) is the function of \(y\) denoted by \(f(y|x)\)
\[ f(y|x) = \frac{f(x,y)}{f_X(x)} \]
Note that this is a valid probability with respect to \(y\).
Since \(Y|X=x\) is a valid random variable, we can compute the expectation of any function \(g(Y)\)
Definition (Conditional expectation) The conditional expected value of \(g(Y)\) given that \(X=x\) is denoted by \(E(g(Y)|x)\)
\[ E(g(Y)|x) = \sum_y g(y) f(y|x) \quad \text{or} \quad E(g(Y)|x) = \int_{-\infty}^{\infty} g(y) f(y|x) \, dy \]
Note that this is a function of \(x\)
Similarly, we can compute the conditional variance of \(Y|x\).
The conditional distribution of \(Y\) given \(X=x\) is possibly a different probability distribution for each \(x\); we therefore have a family of probability distributions for \(Y\), one for each \(x\). When we wish to describe this entire family, we use the phrase "the distribution of \(Y|X\)"
In some situations, the knowledge that \(X=x\) does not give us any information about \(Y\), this relationship is called independence.
Definition (independence) Let \((X,Y)\) be a bivariate random vector with joint pdf or pmf \(f(x,y)\) and marginal pdfs or pmfs \(f_X(x), f_Y(y)\). Then \(X, Y\) are called independent random variables if for every \(x, y \in R\)
\[ f(x, y) = f_X(x) f_Y(y) \]
If they are independent, the conditional distribution reduces to the marginal:
\[ f(y|x) = f_Y(y) \]
To check that two random variables are independent, one way is to check all \(x, y \in R\) combinations. This requires knowledge of \(f_X(x), f_Y(y)\), which is sometimes difficult to obtain.
Criterion (joint pdf factorization) Another good criterion is to check whether the joint distribution \(f(x,y)\) can be factorized into two components, one depending only on \(x\) and one only on \(y\):
\[ f(x, y) = g(x) h(y) \]
Independence simplifies computations as follows
Theorem (independent computing) Suppose that \(X, Y\) are independent random variables; then their events are also independent, which means for any sets \(A, B\)
\[ P(X \in A, Y \in B) = P(X \in A) P(Y \in B) \]
The expectation can also be factorized into respective components
\[ E[g(X) h(Y)] = E[g(X)] \, E[h(Y)] \]
Theorem Let \(X, Y\) be independent random variables, let \(g(x)\) be a function only of \(x\), and let \(h(y)\) be a function only of \(y\). Then the random variables \(U = g(X)\) and \(V = h(Y)\) are also independent
Proposition (law of total probability)
\[ f_X(x) = \int_{-\infty}^{\infty} f(x|y) f_Y(y) \, dy \]
Proposition (two continuous random variables)
5.3. Bivariate Transformation
The bivariate transformation is a generalization of the previous single-variable transformation. It is also a variant of multivariate integration by substitution (change of variables).
Theorem (the method of transformations) Let \(X, Y\) be two jointly continuous random variables. Let \((Z,W) = g(X,Y) = (g_1(X,Y), g_2(X,Y))\) where \(g: R^2 \to R^2\) is a continuous invertible function with continuous partial derivatives. Let \(h = g^{-1}\), i.e., \((X,Y) = h(Z,W) = (h_1(Z,W), h_2(Z,W))\). Then \(Z, W\) are jointly continuous and their joint pdf \(f_{ZW}(z,w)\) is given by
\[ f_{ZW}(z, w) = f_{XY}(h_1(z, w), h_2(z, w)) \, |J| \]
where \(J\) is the Jacobian determinant of \(h\)
\[ J = \det \begin{pmatrix} \frac{\partial h_1}{\partial z} & \frac{\partial h_1}{\partial w} \\ \frac{\partial h_2}{\partial z} & \frac{\partial h_2}{\partial w} \end{pmatrix} \]
This can be used to compute the distributions of functions of several variables, such as \(X+Y\) and \(XY\)
X+Y
Let \(X,Y\) be random variables having joint density \(f(x,y)\); then the density function of \(U = X + Y\) is
\[ f_U(u) = \int_{-\infty}^{\infty} f(v, u - v) \, dv \]
by using the linear transformation \(U = X + Y, V = X\) (whose Jacobian determinant is 1)
When \(X, Y\) are independent with density functions \(f_1, f_2\) respectively, it becomes
\[ f_U(u) = \int_{-\infty}^{\infty} f_1(v) f_2(u - v) \, dv \]
which is the convolution of \(f_1\) and \(f_2\)
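As a numerical check of the convolution formula (my own example: the sum of two independent Uniform(0,1) variables has the triangular density \(f_U(u) = u\) on \([0,1]\) and \(2 - u\) on \([1,2]\)):

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=1_000_000) + rng.uniform(size=1_000_000)  # U = X + Y

# Convolution of two Uniform(0,1) densities: the triangular density
def f_U(x):
    return np.where(x < 1, x, 2 - x)

hist, edges = np.histogram(u, bins=100, range=(0, 2), density=True)
mids = (edges[:-1] + edges[1:]) / 2
print(np.max(np.abs(hist - f_U(mids))))  # small (Monte Carlo error)
```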
Proposition (mgf of independent random variables) Let \(X, Y\) be independent random variables with moment generating functions \(M_X(t), M_Y(t)\). Then the moment generating function of the random variable \(Z = X + Y\) is given by
\[ M_Z(t) = M_X(t) M_Y(t) \]
Laplace transform
Recall that the moment generating function is a kind of Laplace transform, and the Laplace transform converts convolution into multiplication
In probability, the density of \(Z = X + Y\) is the convolution of the densities, so the multiplication of moment generating functions makes perfect sense.
5.4. Hierarchical/Mixture Models
The advantage of hierarchical models is that a complicated process may be modeled by a sequence of relatively simple models.
Definition (mixture model) A random variable \(X\) is said to have a mixture distribution if the distribution of \(X\) depends on a quantity that also has a distribution.
Recall \(E(X|y)\) is a function of \(y\) and \(E(X|Y)\) is a random variable whose value depends on \(Y\) (this is similar to the single variable transformation such as \(Y \to Y^2\))
Proposition (law of total expectation) If \(X, Y\) are any two random variables
\[ EX = E[E(X|Y)] \]
application of the law of total expectation
Suppose we have two random variables \(X,Y\) where
We can compute EX as follows
Similarly we can expand the variance with respect to the other random variable.
Proposition (law of total variance)
\[ Var(X) = E[Var(X|Y)] + Var(E(X|Y)) \]
proof Let \(V = Var(X|Y)\) and \(Z = E(X|Y)\); then \(V = E(X^2|Y) - Z^2\). Taking expectations of both sides and applying the law of total expectation gives \(EV = EX^2 - EZ^2\). Notice that \(Var(Z) = EZ^2 - (EZ)^2 = EZ^2 - (EX)^2\). Adding the two identities yields \(EV + Var(Z) = EX^2 - (EX)^2 = Var(X)\), the target formula.
There is an interesting interpretation in Bayesian statistics, when \(Y = \theta\):
\[ Var(\theta) = E[Var(\theta|X)] + Var(E(\theta|X)) \]
This implies
\[ E[Var(\theta|X)] \leq Var(\theta) \]
which means that on average, the posterior variance of \(\theta\) given the dataset \(X\) is smaller than the prior variance.
law of total variance
Consider the following discrete joint distribution
|  | \(Y=0\) | \(Y=1\) |
|---|---|---|
| \(X=0\) | \(1/5\) | \(2/5\) |
| \(X=1\) | \(2/5\) | \(0\) |
we can easily find that \(Var(E(X|Y)) = 8/75, E(Var(X|Y)) = 2/15, Var(X) = 6/25\) which satisfies the law of total variance.
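The arithmetic above can be verified mechanically (a minimal Python sketch of the same joint table):

```python
# Joint pmf from the table above
p = {(0, 0): 1/5, (0, 1): 2/5, (1, 0): 2/5, (1, 1): 0.0}

p_y = {y: p[(0, y)] + p[(1, y)] for y in (0, 1)}
e_x_y = {y: sum(x * p[(x, y)] for x in (0, 1)) / p_y[y] for y in (0, 1)}
e_x2_y = {y: sum(x**2 * p[(x, y)] for x in (0, 1)) / p_y[y] for y in (0, 1)}
var_x_y = {y: e_x2_y[y] - e_x_y[y]**2 for y in (0, 1)}   # Var(X|Y=y)

ex = sum(x * q for (x, y), q in p.items())
var_x = sum(x**2 * q for (x, y), q in p.items()) - ex**2

e_var = sum(var_x_y[y] * p_y[y] for y in (0, 1))            # E[Var(X|Y)]
var_e = sum(e_x_y[y]**2 * p_y[y] for y in (0, 1)) - ex**2   # Var(E(X|Y))
print(e_var, var_e, var_x)  # 2/15, 8/75 and 6/25 = 2/15 + 8/75
```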
5.5. Covariance and Correlation
The covariance and correlation measure the strength of a kind of linear relationship.
Definition (covariance) The covariance between \(X, Y\) is defined as
\[ Cov(X, Y) = E[(X - EX)(Y - EY)] \]
It can be simplified to
\[ Cov(X, Y) = EXY - EX \, EY \]
Definition (correlation coefficient) The correlation of \(X, Y\) is the number defined by
\[ \rho_{XY} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y} \]
If we define the standardized variables \(U, V\) as
\[ U = \frac{X - EX}{\sigma_X}, \quad V = \frac{Y - EY}{\sigma_Y} \]
then
\[ \rho_{XY} = Cov(U, V) = E[UV] \]
Lemma (properties of covariance)
- \(Cov(X,X) = Var(X)\)
- If \(X, Y\) are independent then \(Cov(X,Y)=0\)
- \(Cov(X,Y) = Cov(Y,X)\)
- \(Cov(aX,Y) = aCov(X,Y)\)
- \(Cov(X+c, Y) = Cov(X,Y)\)
- \(Cov(X+Y, Z)=Cov(X,Z)+Cov(Y,Z)\)
We can summarize them into the bilinearity property:
\[ Cov\left( \sum_i a_i X_i, \sum_j b_j Y_j \right) = \sum_i \sum_j a_i b_j \, Cov(X_i, Y_j) \]
Proposition (independence and covariance) If \(X,Y\) are independent random variables, then \(\text{Cov}(X,Y) = 0\)
Proof When \(X,Y\) are independent \(\text{Cov}(X,Y) = EXY - EXEY = EXEY - EXEY = 0\)
However, the converse is not always true: zero covariance does not necessarily mean independence. In some special cases it does hold (see C&B Lemma 5.3.3)
Covariance and correlation measure only a particular kind of linear relationship. To measure the general independence relation, use mutual information instead.
\(X, Y\) are independent iff \(I(X;Y) = 0\).
discrete \((X,Y)\) has covariance 0 but dependent
Consider a random variable \(X\) that takes values 0 and 1 with probability 0.5 each; let \(Y = 0\) when \(X = 0\), and let \(Y = \pm 1\) with probability 0.5 each when \(X = 1\).
They are clearly dependent, but \(EY = 0\) and \(EXY = P(X=1) \, E[Y|X=1] = 0\), so \(Cov(X,Y) = 0\)
continuous \((X,Y)\) has covariance 0 but dependent
Consider the random variables \(X \sim N(0, 1)\) and \(Y = X^2\); they are obviously dependent, but
\[ Cov(X, Y) = EX^3 - EX \, EX^2 = 0 \]
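A quick Monte Carlo confirmation of this example (a minimal sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = x ** 2                    # fully determined by X, hence dependent

print(np.cov(x, y)[0, 1])     # ~0, since Cov(X, X^2) = EX^3 = 0 for N(0,1)
```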
However, there are some limited cases in which zero covariance does imply independence
Proposition Let \(X_j \sim n(\mu_j, \sigma^2_j)\) be independent. For constants \(a_j, b_j\), define
\[ U = \sum_j a_j X_j, \quad V = \sum_j b_j X_j \]
The random variables \(U, V\) are independent iff \(Cov(U, V) = 0\)
Proposition (variance of a sum)
\[ Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y) \]
If \(X, Y\) are independent (or merely uncorrelated)
\[ Var(X \pm Y) = Var(X) + Var(Y) \]
Proposition (properties of correlation coefficient)
\[ -1 \leq \rho_{XY} \leq 1 \]
and \(|\rho_{XY}| = 1\) iff \(Y\) is almost surely a linear function of \(X\), i.e. \(Y = aX + b\) with \(a \neq 0\)
5.6. Multivariable Models
Definition (multivariable random vector) The random vector \(X = (X_1, ..., X_n)\) has a sample space that is a subset of \(R^n\). If \((X_1, ..., X_n)\) is a discrete random vector (the sample space is countable), the joint pmf of \((X_1, X_2, ..., X_n)\) is defined by
\[ f(x_1, ..., x_n) = P(X_1 = x_1, ..., X_n = x_n) \]
then for any \(A \subset R^n\)
\[ P(X \in A) = \sum_{x \in A} f(x) \]
If \((X_1, ..., X_n)\) is a continuous random vector,
\[ P(X \in A) = \int \cdots \int_A f(x_1, ..., x_n) \, dx_1 \cdots dx_n \]
Definition (expected value) Let \(g(x) = g(x_1, ..., x_n)\) be a real-valued function defined on the sample space of \(X\). Then \(g(X)\) is a random variable and the expected value of \(g(X)\) is
\[ E g(X) = \sum_x g(x) f(x) \quad \text{or} \quad E g(X) = \int_{R^n} g(x) f(x) \, dx \]
Definition (marginal pdf) The marginal pdf/pmf of any subset of the coordinates is obtained by integrating (or summing) out the remaining coordinates, e.g.
\[ f(x_1, ..., x_k) = \int_{R^{n-k}} f(x_1, ..., x_n) \, dx_{k+1} \cdots dx_n \]
Definition (conditional pdf) The conditional pdf/pmf of the remaining coordinates given the first \(k\) is
\[ f(x_{k+1}, ..., x_n | x_1, ..., x_k) = \frac{f(x_1, ..., x_n)}{f(x_1, ..., x_k)} \]
whenever \(f(x_1, ..., x_k) > 0\)
Definition (mutually independent random vectors) Let \(X_1, ..., X_n\) be random vectors with joint pdf/pmf \(f(x_1, ..., x_n)\). Let \(f_{X_i}(x_i)\) denote the marginal pdf/pmf of \(X_i\). Then \(X_1, ..., X_n\) are called mutually independent random vectors if for every \(x_1, ..., x_n\)
\[ f(x_1, ..., x_n) = \prod_{i=1}^{n} f_{X_i}(x_i) \]
Definition (variance-covariance matrix) The variance-covariance matrix of a random vector \(\mathbf{X}\) with mean \(\mu\) is
\[ \Sigma = E[(\mathbf{X} - \mu)(\mathbf{X} - \mu)^T], \quad \Sigma_{ij} = Cov(X_i, X_j) \]
It is a symmetric, positive semidefinite matrix.
If we partition \(\mathbf{X}\) into two groups \(\mathbf{X}^{(1)}, \mathbf{X}^{(2)}\), then the variance-covariance matrix can also be partitioned into components
\[ \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \]
where \(\Sigma_{12} = Cov(\mathbf{X}^{(1)}, \mathbf{X}^{(2)})\)
Lemma (linear combinations) Suppose \(\mathbf{Z = CX}\) (e.g. \(Z_1 = c_{1,1}X_1 + ... + c_{1,p}X_p\)),
then
\[ E\mathbf{Z} = \mathbf{C} E\mathbf{X}, \quad Var(\mathbf{Z}) = \mathbf{C} \, Var(\mathbf{X}) \, \mathbf{C}^T \]
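A quick numerical check of this lemma (a minimal Python sketch; the \(\Sigma\) and \(\mathbf{C}\) below are made up):

```python
import numpy as np

Sigma = np.array([[2.0, 1.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 1.0]])   # a made-up valid covariance matrix
C = np.array([[1.0, -1.0, 0.0],
              [0.5,  0.5, 1.0]])      # Z = C X

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=np.zeros(3), cov=Sigma, size=500_000)
Z = X @ C.T                           # rows are samples of Z

print(C @ Sigma @ C.T)   # theoretical Var(Z) = C Sigma C^T
print(np.cov(Z.T))       # sample estimate, close to the above
```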
Definition (identifiability) The parameterization \(\theta \in \Theta\) is identifiable if \(Y_1 \sim P_{\theta_1}\), \(Y_2 \sim P_{\theta_2}\), and \(Y_1, Y_2\) having the same distribution together imply that \(\theta_1 = \theta_2\)
6. Asymptotics
Convergence concepts are useful for approximating the behavior of finite-size samples, because expressions often simplify in the limit. The relations between the modes of convergence are: almost sure convergence implies convergence in probability, which implies convergence in distribution; convergence in quadratic mean also implies convergence in probability.
6.1. Almost Sure Convergence
Almost sure convergence is similar to pointwise convergence \(\lim X_n = X\), except that the convergence may fail on a set of measure 0.
Let the sample space \(S\) have elements denoted by \(s\); then \(X_n(s)\) and \(X(s)\) are functions defined on \(S\). Almost sure convergence says \(X_n\) converges to \(X\) almost surely if \(X_n(s)\) converges to \(X(s)\) for all \(s \in S\) except on a set of measure 0.
Definition (almost sure convergence) A sequence of random variables \(X_1, X_2, ...\) converges almost surely to a random variable \(X\) if, for every \(\epsilon > 0\)
\[ P(\lim_{n \to \infty} |X_n - X| < \epsilon) = 1 \]
Formally, the almost sure convergence is defined as follows:
Let \(\Omega\) be a set of probability mass \(1\) (\(P(\Omega) = 1\)); then for any \(\omega \in \Omega\) and any \(\epsilon > 0\), there exists an \(N(\omega, \epsilon)\) such that for all \(n > N\)
\[ |X_n(\omega) - X(\omega)| < \epsilon \]
Theorem (Strong Law of Large Numbers) Let \(X_1, X_2, ...\) be iid random variables with \(EX_i = \mu\) and \(Var(X_i) = \sigma^2 < \infty\). Then for every \(\epsilon > 0\)
\[ P(\lim_{n \to \infty} |\bar{X}_n - \mu| < \epsilon) = 1 \]
that is, \(\bar{X}_n\) converges to \(\mu\) almost surely.
6.2. Convergence in Probability
Definition (convergence in probability) A sequence of random variables \(X_1, X_2, ...\) converges in probability to a random variable \(X\) if for every \(\epsilon > 0\)
\[ \lim_{n \to \infty} P(|X_n - X| \geq \epsilon) = 0 \]
Note that the \(X_n\) here are usually not iid random variables, and it is common for the limit \(X\) to be a fixed constant. The most famous result is the following one:
Theorem (Weak Law of Large Numbers) Let \(X_1, X_2, ...\) be iid random variables with \(EX_i = \mu\) and \(Var(X_i) = \sigma^2 < \infty\). Then for every \(\epsilon > 0\)
\[ \lim_{n \to \infty} P(|\bar{X}_n - \mu| \geq \epsilon) = 0 \]
that is, \(\bar{X}_n\) converges in probability to \(\mu\)
This theorem can be proved using Chebyshev's inequality:
\[ P(|\bar{X}_n - \mu| \geq \epsilon) \leq \frac{Var(\bar{X}_n)}{\epsilon^2} = \frac{\sigma^2}{n \epsilon^2} \to 0 \]
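The weak law is easy to observe by simulation (a minimal Python sketch with Uniform(0,1) samples, so \(\mu = 1/2\)):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, eps = 0.5, 0.05            # mean of Uniform(0,1) and a tolerance
for n in (10, 100, 1_000):
    xbar = rng.uniform(size=(10_000, n)).mean(axis=1)  # 10k replicated means
    print(n, np.mean(np.abs(xbar - mu) >= eps))        # P(|Xbar - mu| >= eps)
# the exceedance probability shrinks toward 0 as n grows
```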
Convergence in probability is highly related to the consistency concept in statistics. Suppose we have an estimator \(\hat{\theta}\) for some quantity \(\theta\), \(\hat{\theta}\) is said to be consistent if it converges to \(\theta\) in probability.
One related useful theorem about consistency is the following one
Theorem (consistency preserved by continuous function) Suppose \(X_1, X_2, ...\) converges in probability to a random variable \(X\) and that \(h\) is a continuous function. Then $h(X_1), h(X_2), ... $ converges in probability to \(h(X)\)
Among those convergence concepts, convergence in distribution is the weakest form.
6.3. Convergence in Quadratic Mean
Definition (convergence in quadratic mean) Sometimes it is useful to show a stronger form of convergence in order to prove convergence in probability. The following is known as convergence in quadratic mean:
\[ \lim_{n \to \infty} E(X_n - X)^2 = 0 \]
Convergence in quadratic mean implies convergence in probability because, by Chebyshev's (Markov's) inequality,
\[ P(|X_n - X| \geq \epsilon) \leq \frac{E(X_n - X)^2}{\epsilon^2} \to 0 \]
Intuitively, quadratic mean convergence penalizes the deviation by the square form while the probability convergence penalize deviation by the absolute form, therefore quadratic mean is a stronger form.
p convergence does not imply qm convergence
Consider the random variable \(X_n = \sqrt{n} \mathbf{1}_{[0, 1/n]}(U)\) where \(U\) is uniform on \([0, 1]\). \(X_n\) converges to 0 in probability, since \(P(|X_n| > \epsilon) \leq 1/n \to 0\).
However, the quadratic mean does not vanish:
\[ E(X_n - 0)^2 = n \, P(U \leq 1/n) = 1 \quad \text{for every } n \]
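Simulating this example makes the gap concrete (a minimal Python sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.uniform(size=1_000_000)
for n in (10, 100, 10_000):
    x_n = np.sqrt(n) * (u <= 1 / n)    # X_n = sqrt(n) 1_{[0,1/n]}(U)
    print(n, np.mean(x_n > 0.1), np.mean(x_n ** 2))
# P(|X_n| > eps) = 1/n -> 0, yet E X_n^2 = n * (1/n) = 1 for every n
```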
6.4. Convergence in Distribution
Definition (convergence in distribution) A sequence of random variables \(X_1, X_2, ...\) converges in distribution to a random variable \(X\) if
\[ \lim_{n \to \infty} F_{X_n}(x) = F_X(x) \]
at all points where \(F_X(x)\) is continuous
6.5. Central Limit Theorem
Theorem (Central Limit Theorem, classical CLT) Let \(X_1, X_2, ...\) be an iid sample whose mgfs exist in a neighborhood of \(0\). Let \(EX_i = \mu\), \(Var(X_i) = \sigma^2\), and
\[ G_n(x) = P\left( \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \leq x \right) \]
then \(G_n(x)\) converges to the standard normal cdf: \(\lim_{n \to \infty} G_n(x) = \Phi(x)\) for all \(x\)
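A simulation makes the theorem tangible (a minimal Python sketch using Exponential(1) samples, for which \(\mu = \sigma = 1\)):

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
n, reps = 500, 20_000
x = rng.exponential(size=(reps, n))       # a skewed distribution, mu = sigma = 1
z = np.sqrt(n) * (x.mean(axis=1) - 1.0)   # standardized sample means

Phi = lambda t: 0.5 * (1 + erf(t / sqrt(2)))   # standard normal cdf
for t in (-1.0, 0.0, 1.5):
    print(t, np.mean(z <= t), Phi(t))          # empirical G_n(t) vs Phi(t)
```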
Instead of the true variance \(\sigma^2\), we can use the estimated variance \(S^2\); the CLT still holds, by Slutsky's theorem and the continuous mapping theorem
When the \(X_i\) are independent but not identically distributed, the CLT still holds provided moments of some order \(2 + \delta\) are suitably bounded
Theorem (Lyapunov CLT) Suppose \(X_1, ..., X_n\) is a sequence of independent random variables with means \(\mu_i\) and variances \(\sigma_i^2\). Let \(s_n^2 = \sum_i \sigma_i^2\). If for some \(\delta > 0\) the Lyapunov condition is satisfied:
\[ \lim_{n \to \infty} \frac{1}{s_n^{2+\delta}} \sum_{i=1}^{n} E|X_i - \mu_i|^{2+\delta} = 0 \]
then we have the CLT
\[ \frac{1}{s_n} \sum_{i=1}^{n} (X_i - \mu_i) \xrightarrow{d} N(0, 1) \]
Note there is a related (weaker) condition, the Lindeberg condition
Theorem (Multivariate CLT) If \(X_1, ..., X_n\) are iid random vectors with mean \(\mu\) and covariance matrix \(\Sigma\), then
\[ \sqrt{n} (\bar{X}_n - \mu) \xrightarrow{d} N(0, \Sigma) \]
The rate of CLT convergence is roughly \(1/\sqrt{n}\)
Theorem (Berry-Esseen) Suppose \(X_1, ..., X_n\) are iid with mean \(\mu\), variance \(\sigma^2\), and third absolute central moment \(\mu_3 = E|X - \mu|^3 < \infty\). Let
\[ Z_n = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \]
then
\[ \sup_x |P(Z_n \leq x) - \Phi(x)| \leq \frac{C \mu_3}{\sigma^3 \sqrt{n}} \]
for a universal constant \(C\)
6.6. Delta Method
Suppose we have a sequence of random variables \(X_n\) that converges (after centering and scaling) to a normal distribution; we can also characterize the limiting distribution of \(g(X_n)\) where \(g\) is a smooth function
Theorem (delta method) Suppose
\[ \sqrt{n}(X_n - \theta) \xrightarrow{d} N(0, \sigma^2) \]
and \(g'(\theta)\) exists and is nonzero. Then
\[ \sqrt{n}(g(X_n) - g(\theta)) \xrightarrow{d} N(0, \sigma^2 g'(\theta)^2) \]
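A quick simulation check of the delta method (a minimal sketch with made-up choices \(g(x) = x^2\), \(\mu = 2\), \(\sigma = 1\), so the predicted limiting variance is \(\sigma^2 g'(\mu)^2 = 16\)):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps, mu, sigma = 2_000, 100_000, 2.0, 1.0

# The sample mean of n iid N(mu, sigma^2) draws is exactly N(mu, sigma^2/n)
xbar = rng.normal(mu, sigma / np.sqrt(n), size=reps)

g = lambda x: x ** 2                 # smooth g with g'(mu) = 2*mu != 0
z = np.sqrt(n) * (g(xbar) - g(mu))

print(np.var(z))                     # close to sigma^2 * (2*mu)^2 = 16
```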
7. Reference
- [1] Pishro-Nik, Hossein. "Introduction to probability, statistics, and random processes." (2016).
- [2] Casella, George, and Roger L. Berger. Statistical inference. Vol. 2. Pacific Grove, CA: Duxbury, 2002.
- [3] Axler, Sheldon. Measure, Integration & Real Analysis. Springer Nature, 2020.
- [4] Çınlar, Erhan. Probability and stochastics. Vol. 261. Springer Science & Business Media, 2011.