0x040 Probability

Contents Index

Probability Theory

Basic Concepts

Suppose $\mathcal{F}$ is a $\sigma$-algebra on a set $\Omega$.

Definition (probability measure) A probability measure on $(\Omega, \mathcal{F})$ is a measure $P$ on $(\Omega, \mathcal{F})$ such that $P(\Omega)=1$

Definition (sample space) $\Omega$ is called the sample space. Intuitively, it is the set of possible outcomes of an experiment.

Definition (event) An event is an element of $\mathcal{F}$.

Definition (probability) If $A$ is an event, then $P(A)$ is called the probability of $A$

Definition (probability space) If $P$ is a probability measure on $(\Omega, \mathcal{F})$, then the triplet $(\Omega, \mathcal{F}, P)$ is called a probability space.

Definition (partition): We say that $A_1, A_2, …$ are disjoint or are mutually exclusive if $A_i \cap A_j = \emptyset$ whenever $i \neq j$. A partition of $\Omega$ is a sequence of disjoint set $A_1, A_2, …$ such that $\cup_i A_i = \Omega$

Axiom (Axiom of probability): A function $\mathbb{P}$ that assigns a real number $\mathbb{P}(A)$ to each event $A$ is a probability distribution or a probability measure if it satisfies the following three axioms.

• $\mathbb{P}(A) \geq 0$ for every A
• $\mathbb{P}(\Omega)=1$
• $\mathbb{P}(\cup_i A_i) = \sum_i \mathbb{P}(A_i)$ if A are disjoint

There are two interpretations $\mathbb{P}(A)$. The two common interpretations are frequencies and degrees of beliefs. Frequentist says that $P(A)$ is the long run proportion of times that $A$ is true in reptitions. The degree of belief interpretation says that $P(A)$ is the observer’s strength of belief that $A$ is true.

Lemma (Inclusion Exclusion Principle): $\mathbb{P}(A \cup B) = \mathbb{P}(A) + \mathbb{P}(B) – \mathbb{P}(AB)$

Theorem (Continuity of Probability) : If $A_n \rightarrow A$ then $\mathbb{P}(A_n) \rightarrow \mathbb{P}(A)$.

Discrete/Continuous Models

Definition (Discrete Probability Model) Consider a sample space $S$. If $S$ is a countable set, this refers to a discrete probability model.

In this case, since $S$ is countable, we can list all elements in $S$, $S = \{ s_1, s_2, … \}$, If $A \subset S$ is an event, then $A$ is also countable, and by the third axiom of probability, we can write

$$P(A) = P(\cup_{s_j \in A}) = \sum_{s_j \in A} P(s_j)$$

In a finite sample space $S$, where all outcomes are equally likely, the probability of any event $A$ can be found by $P(A) = \frac{|A|}{|S|}$

Conditional Probability

Conditional Probability

All probabilities are calculated with respect to a sample space, but in many cases, we are in a position to update the sample space with new information. In this case, we use conditional probability.

Definition (Conditional Probability) If $A,B$ are two events in a sample space and if $\mathbb{P}(B) > 0$ then the conditional probability of $A$ given $B$ is

$$\mathbb{P}(A|B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}$$

Note that $B$ becomes that sample space here. In particular $P(\dot | B)$ is a probability (satisfying Kolmogorov’s axioms)

For any fixed $B$ such that $\mathbb{P}(B) > 0$, $\mathbb{P}( \cdot | B)$ is a probability measure (satisfying three axioms of probability)

prosecutor’s fallacy: fallacy from misunderstanding of $\mathbb{P}(A|B) \neq \mathbb{P}(B|A)$

Lemma : for any pair of events $A$ and $B$

$$\mathbb{P}(AB) = \mathbb{P}(A|B)\mathbb{P}(B) = \mathbb{P}(B|A)\mathbb{P}(A)$$

Theorem (The Law of Total Probability) Let $A_1, …, A_k$ be a partition of $\Sigma$, Then for any event $B$

$$\mathbb{P}(B) = \sum_i \mathbb{P}(B|A_i)\mathbb{P}(A_i)$$

Theorem (Bayes’ Theorem)

$$\mathbb{P}(A_i | B) = \frac{\mathbb{P}(B|A_i)\mathbb{P}(A_i)}{\sum_j \mathbb{P}(B|A_j)\mathbb{P}(A_j)}$$

Independent Events

Definition (Independence) Two events $A$ and $B$ are independent if $\mathbb{P}(AB) = \mathbb{P}(A)\mathbb{P}(B)$

Independence can arise in two distinct ways

• 1: explicitly assume independence
• 2: derive independence by verifying the previous definition

Note that disjoint events with positive probability is not independent !!

Definition (mutual independence) A collection of events $A_1, …, A_n$ are mutually independent iff for any subcollection $A_{i_1}, …, A_{i_k}$

$$P( \cap_{j=1}^{k} A_{i_{j}} ) = \prod_{j} P(A_{i_j})$$

Random variable

random variable is neither random nor a variable

Definition (Random Variable) Suppose $(\Omega, \mathcal{F}, P)$ is a probability space. A random variable on $(\Omega, \mathcal{F})$ is a measurable function from $\Omega$ to $R$. Intuitively, a random variable is a function from the sample space to another sample space (i.e. R)

Note that random variable can even be defined to project to measurable space other than $(R, B)$.

Definition ((more general) random variable) Let $(E, \mathcal{E})$ be a measurable space. A mapping $X: \Omega \to E$ is called a random variable if $X$ is a measurable function with respect to $\mathcal{F}$ and $\mathcal{E}$, which means

$$\forall(A \in \mathcal{E}) X^{-1} (A) \in \mathcal{F}$$

Definition (induced probability function) The induced probability function with respect to the original function is defined as

$$P_X(\{ x_i \}) = P_X(X = x_i) = P( \{ s_j \in S | X(s_j) = x_i \} )$$

Note that this is a formal probability distribution, which means it satisfies Kolmogorov’s axioms

Note that $X$ is a discrete random variable if its range is countable

$$R_X = \{ x_1, x_2, … \}$$

Distribution

Definition (cumulative density function) The cumulative density function of a random variable $X$, denoted by $F_X(x)$ is defined by

$$F_X(x) = P_X(X \leq x)$$

Proposition (characterization of cdf) The function $F(x)$ is a cdf iff

• $\lim_{x \to -\infty} F(x) = 0$
• $\lim_{x \to \infty} F(x) = 1$
• $F(x)$ is nondecreasing function and right-continuous

Definition (Probability Mass Function) Let $X$ be a discrete random variable with range $R_x = \{ x_1, x_2, x_3, … \}$ the following function is called the probability mass function of $X$

$$P_X(x_k) = P(X=x_k) = P( \{ x \in S | X(s) = x_k \})$$

$$P(X \in A) = \sum_{x \in A} P_X(x)$$

By definition, PMF is a probability measure. In particular, we have

$$0 \leq P_X(x_k) \leq 1$$

$$\sum_{x \in R_X} P_X(x) =1$$

Transformations

Univariate models

How to compute distribution of the random variable $Y$ given $Y=g(X)$ and pdf of $X$ ?

Solution: compute a partition of $X$ into where each $X_i$ is monotone over $g(X)$, then sum the inver
se density $f_X(g^{-1}(y))$ with its weight $\frac{dg^{-1}(y)}{dy}$.

Expectation and Variance

Definition (expectation) Formally, Suppose $(\Omega, \mathcal{F}, P)$ is a probability space, If $X \in \mathcal{L}^1 (P)$, then the expectation of the random variable $X$ is devoted $EX$ and defined by

$$EX = \int_{\Omega} X dP$$

Notet that Intuitively, Let $X$ be a discrete random variable with range $R_X = \{ x_1, x_2, … \}$ (finite or countably infinite). The expected value of $X$, denoted by $EX$ is defined as

$$EX = \sum_{x_k \in R_X} x_k P(X=x_k)$$

Theorem (linearity of expectation)

$$E[aX + b] = aE[X] + b$$

$$E[X_1 + X_2 + … + X_n ] = EX_1 + EX_2 + … + EX_n$$

Theorem There are two ways to compute $E[g(X)]$. One way is to compute PMF of $Y = g(X)$, the other one is using follows (easier)

$$E[g(X)] = \sum_{x_k \in R_X} g(x_k) P_X(x_k)$$

Moments

Moments reflects characteristics of distributions, however, the set of infinite moments is not enough to character the distribution. Two distinct random variables might have same moments set. To characterize distribution, both random variables have to have bounded support.

Definition (moment, central moment) For each integer $n$, the $n$-th moment of $X$, $mu_n$ is

$$\mu^{‘}_n = EX^n$$

The $n$th central moment of $X$, $\mu_n$ is

$$\mu_n = E(X-\mu)^n$$

The 2nd central moment is the variance defined as follows

Definition (Variance) The variance of a random variable $X$, with mean $EX = \mu_X$ is defined as

$$Var(X) = E[ (X-\mu_X)^2 ]$$

The standard deviation of a random variable $X$ is defined as

$$SD(X) = \sigma_X = \sqrt{Var(X)}$$

A simple way to compute variance is as follows

$$Var(X) = E[X^2] – (EX)^2$$

Proposition (Properties of variance)

$$Var(aX + b) = a^2Var(X)$$

$$Var(X) = Var(X_1) + Var(X_2) + … Var(X_n) \text{ if } X = X_1 + X_2 + … + X_n$$

$$Var(X+Y) = Var(X) + Var(Y) + 2Cov(X,Y)$$

If $X, Y$ are independent

$$Var(XY) = E(X^2Y^2) – (EXY)^2 = Var(X)Var(Y) + Var(X)E(Y)^2 + Var(Y)E(X)^2$$

Definition (Standardized moment) The standardized moment is the normalized central moment defined as

$$\hat{\mu}_k = \frac{\mu_k}{\sigma^k}$$

The 3rd standardized moment is called the skewness, which measures the lack of symmetry

The 4th standardized moment is called the kurtosis, which measures the peakedness of the pdf

While the moments might not be efficient to characterize distributions, the following moment generating function can characterize distributions if it exists

Definition (moment generating function) The moment generating function of $X$, denoted by $M_X(t)$ is following, provided that expectation exist for $t$ in some neighborhood of 0.

$$M_X(t) = Ee^{tX}$$

Note: $M_X(t)$ is the Laplace transform of $f_X(x)$

$$M_X(t) = \int_{-\infty}^{\infty} e^{tx} f_X(x) dx$$

The moment generation function is called as it is because it can be used to generate moments by differentiation.

Theorem (moment generation) If $X$ has mgf $M_X(t)$, it can generate moments as follows

$$EX^n == M_X^{(n)} (0)$$

Note that the main use of the mgf is not to generate moments, but to characterize distributions.

Theorem (convergence of mgfs) Convergence for $|t| < h$ of mgfs to an mgf implies convergence of pdfs

While moment generating functions might not always exist, the characteristic function always exist and also characterize the random variable

Definition (characteristic function) The characteristic function for a random variable is defined as

$$\varphi_X(t) = E(e^{itX})$$

Bivariable Models

Proposition (law of total probability)

$$P(A) = \int_{-\infty}^{\infty} P(A|X=x) f_X(x) dx$$

Proposition (law of total expectation)

$$E[Y] = \int_{-\infty}^{\infty} E[Y|X=x] f_X(x) dx$$

$$E[Y] = E[E[Y|X]]$$

proposition (law of total variance)

$$Var(Y) = E[Var(Y|X)] + Var(E[Y|X])$$

Conditioning and Independence

Definition (conditional PMF, conditional CDF)

$$P_{X | A}(x_i) = P(X=x_i|A) = \frac{P(X=x_i \cap A)}{P(A)}$$

$$P_{X|Y}(x_i|y_j) = \frac{P_{XY}(x_i, y_j)}{P_Y(y_j)}$$

$$F_{X|A}(x) = P(X \leq x|A)$$

Definition (Independence) Two discrete random variables $X,Y$ are independent iff

$$(\forall x, y) P_{XY}(x, y) = P_X(x)P_Y(y)$$

Definition (Conditional expectation)

$$E[X|A] = \sum_{x \in R_X} x_i P_{X|A}(x_i)$$

$$E[X|Y=y_j] = \sum_{x \in R_X} x_i P_{X|Y}(x_i|y_j)$$

Proposition (Law of Total Probability, Low of Total Expectation)

$$P(X \in A) = \sum_{y_j \in R_Y} P(X \in A|Y=y_j)P_Y(y_j)$$

$$EX = \sum_{y_j \in R_Y} E[X|Y=y_j]P_Y(y_j)$$

Proposition (two discrete random variables)

$$E[g(X,Y)] = \sum_{(x_i, y_j) \in R_{XY}} g(x_i, y_j) P_{XY}(x_i, y_j)$$

Proposition (Law of Total Expectation, Law of Iterated Expectations) This is just a different way of writing the law of total expectation

$$E[X] = E[E[X|Y]]$$

Theorem (Law of Total Variance)

$$Var(X) = E[(Var(X|Y)] + Var(E[X|Y])$$

Two Continuous Random Variables

Definition (joint PDF) joint probability density function of two random variables $X,Y$ is

$$P((X,Y) \in A) = \iint_A f_{XY}(x,y) dxdy$$

Definition (marginal PDF)

$$f_X(x) = \int_{-\infty}^{\infty} f_{XY}(x,y) dy$$

$$f_Y(y) = \int_{-\infty}^{\infty} f_{XY}(x,y) dx$$

Definition (joint CDF)

$$F_{XY}(x,y) = P(X \leq x, Y \leq y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f_{XY}(u, v) du dv$$

Conditioning and Independence

Definition (conditional probability)

$$f_{X|Y}(x|y) = \frac{f_{XY}(x,y) }{f_Y(y)}$$

$$F_{X|Y}(x|y) = P( X \leq x | Y =y) = \int_{-\infty}{x} f_{X|Y}(x|y) dx$$

Definition (independence) Let $(X,Y)$ be a bivariate random vector with joint pdf or pmf $f(x,y)$ and marginal pdfs or pmfs $f_X(x), f_Y(y)$. Then $X,Y$ are called independent random variables if

$$\forall(x \in R, y \in R) f(x, y) = f_X(x) f_Y(y)$$

To check that two random variables are independent, one way is to check all $x, y \in R$ combinations, which is sometimes difficult.

Criterion (joint pdf factorization) Another good criterion is to check whether the joint distribution $f(x,y)$ can be factorized into two components as follows

$$f(x,y) = g(x) h(y)$$

Those independence can simplifying computations as follows

Suppose that $X,Y$ are independent random variables, then their events are also independent which means

$$P(X \in A, Y \in B) = P(X \in A) P(Y \in B)$$

The expectation can also be factorized into respective components

$$E(g(X)h(Y) = E(g(X)) E(h(Y))$$

Additionally, let $g(x)$ be a function only of $x$ and $h(y)$ be a function only of $y$, then the random variable $U=g(x), V=h(y)$ are independent

Proposition (mgf of independent random variables) Let $X,Y$ be independent random variables with moments generating functions $M_X(t), M_Y(t)$. Then the moment generating function of the random variable $Z = X+Y$ is given by

$$M_Z(t) = M_X(t) M_Y(t)$$

Proposition (law of total probability)

$$P(A) = \int_{-\infty}^{\infty} P(A|X=x) f_X(x) dx$$

proposition (two continuous random variables)

$$E[g(X,Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x,y) f_{XY}(x,y) dx dy$$

Theorem (the method of transformations)

Let $X, Y$ be two jointly continuous random variables. Let $(Z,W) = g(X,Y) = (g_1(X,Y), g_2(X,Y))$ where $g: R^2 \to R^2$ is a continuous invertible function with continuous partial derivatives. Let $h = g^{-1}, ie.e., (X,Y) = h(Z,W) = (h_1(Z,W), h_2(Z,W))$ Then $Z,W$ are jointly continuous and their joint PDF $f_{ZW}(z,w)$ is given by

$$f_{ZW}(z,w) = f_{XY}(h_1(z,w), h_2(z,w)) |J|$$

where $J$ is the Jacobian of $h$

Hierarchical Models

The advantage of the hierarchical models is that complicated process might be modeled by a sequence of relatively simple models.

Definition (mixture model) A random variable $X$ is said to have a mixture distribution if the distribution of $X$ depends on a quantity that also has a distribution.

Proposition (law of total expectation)

$$E[Y] = \int_{-\infty}^{\infty} E[Y|X=x] f_X(x) dx$$

$$E[Y] = E[E[Y|X]]$$

proposition (law of total variance)

$$Var(Y) = E[Var(Y|X)] + Var(E[Y|X])$$

Covariance and Correlation

The covariance and correlation measure the strength of a kind of linear relationship.

Definition (covariance) The covariance between $X,Y$ is defined as

$$\text{Cov}(X,Y) = E[(X – EX)(Y-EY)]$$

It can be simplified by

$$\text{Cov}(X,Y) = E[XY] – (EX)(EY)$$

Lemma (properties of covariance)

• $Cov(X,X) = Var(X)$
• If $X, Y$ are independent then $Cov(X,Y)=0$
• $Cov(X,Y) = Cov(Y,X)$
• $Cov(aX,Y) = aCov(X,Y)$
• $Cov(X+c, y)=Cov(X,Y)$
• $Cov(X+Y, Z)=Cov(X,Z)+Cov(Y,Z)$

$$Cov(\sum_i a_iX_i, \sum_j b_j Y_j) = \sum_i \sum_j a_i b_j Cov(X_i, Y_j)$$

Proposition (independence and covariance) If $X,Y$ are independent random variables, then $\text{Cov}(X,Y) = 0$

However, the opposite is not always true, i.e. covariance does not necessarily means independance.

Proposition (variance of a sum)

$$Var(aX+bY) = a^2Var(X) + b^2Var(Y) + 2abCov(X,Y)$$

Definition (correlation coefficient)

$$U = \frac{X – EX}{\rho_X}, V = \frac{Y-EY}{\rho_Y}$$

$$\rho_{XY} = Cov(U, V) = \frac{Cov(X,Y)}{\sigma_X \sigma_Y}$$

Proposition (properties of correlation coefficient)

• $-1 \leq \rho(X,Y) \leq 1$
• $\rho(aX+b, cY+d) = \rho(X,Y)$
• If $X,Y$ are uncorrelated, then $Var(X+Y)=Var(X)+Var(Y)$

Multivariable Models

Definition (n-dimensional random vector) An n-dimensional random vector is a function from a sample space $S$ in to $R^n$, n-dimensional Euclidean space.

Convergence

Tail Bound

The following inequalities are mostly about the bound of distribution tail.

Inequality (Markov) Let $X$ be a nonnegative random variable (i.e. $X \geq 0$). Then

$$P(X \geq t) \leq \frac{EX}{t} = O(\frac{1}{t})$$

Chebyshev can be obtained from Markov inequality by taking the absolute value. (to maintain the nonnegativity)

It has a better bound $O(\frac{1}{t^2})$ than Markov.

Inequality (Chebyshev) Let $X$ be a random variable and $EX = \mu, \mathrm{Var} X = \sigma^2$, then for any $t > 0$

$$P( | X – \mu | \geq t \sigma) \leq \frac{1}{t^2} = O(\frac{1}{t^2})$$

Chernoff bound can also be obtained from Markov inequality by taking exp.

It has an even better exponential bound than Chebyshev

Inequality (Chernoff)

$$P((X – \mu) \geq u) = P(\exp(t(X-\mu)) \geq \exp(tu)) \leq \frac{E[\exp(t(X-\mu)]}{\exp(tu)}$$

Suppose mgf exist in a neighbourhood of $b$, then

$$P((X-\mu) \geq u) \leq \inf_{0 \leq t \leq b} \exp(-t(u+\mu)) E[\exp(tX)]$$

Definition (sub-Gaussian random variable) A sub-Gaussian random variable whose tails decay faster or equal than a Gaussian. Formally, a random variable $X$ with mean $\mu$ is sub-Gaussian if there exists a positive number $\sigma$ such that

$$E[\exp(t(X-\mu))] \leq \exp(\sigma^2 t^2/2)$$

Inequality (Hoeffding) Suppose a random variable is bound between $[a,b]$, then

$$P(|\bar{X} – \mu| < t) \leq 2 \exp (-\frac{2nt^2}{(b-a)^2})$$

Equality and Inequalities

Inequality (Hölder) Let $X,Y$ be any two random variables and $1/p + 1/q = 1$, then

$$|EXY| \leq E |XY| \leq (E|X|^p)^{1/p} (E|Y|^q)^{1/q}$$

When $p=q=2$, it is the Cauchy-Schwarz inequality

Inequality (Minkowski) Let $X,Y$ be any two random variables and $1 \leq p \leq \infty$

$$(E|X+Y|^p)^{1/p} \leq (E|X|^p)^{1/p} + (E|Y|^{p})^{1/p}$$

Convergence Concepts

Convergence concepts are useful in approximating finite size sample because their expression can be simplified when taking limits. The relation between convergences are as follows:

almost sure convergence $\subset$ convergence in probability $\subset$ convergence in distribution

Definition (almost sure convergence) A sequence of random variables, $X_1,X_2, …$ converges almost surely to a random variable $X$ if, for every $\epsilon > 0$

$$P(\lim_{n \to \infty} |X_n – X| \leq \epsilon) = 1$$

Formally, the almost sure convergence is defined as follows:

Let $\Omega$ be a set of probability mass $1$ ($P(\Omega)=1$), then for any $\omega \in Omega$ and for any $\epsilon > 0$, there exists a $N(\omega, \epsilon)$ such that when $n > N$

$$|X_n(\omega) – X(\omega)| \leq \epsilon$$

Theorem (Strong Law of Large Numbers) Let $X_1, X_2, …$ be iid random variables with $EX_i = \mu$ and $Var(X_i) = \sigma^2 < \infty$, then for every $\epsilon > 0$

$$P(\lim_{n \to \infty} |\bar{X}_n – \mu| < \epsilon) = 1$$

Definition (convergence in probability) A sequence of random variables $X_1, X_2, …,$ converges in probability to a random variable $X$ if for every $\epsilon > 0$

$$\lim_{n \to \infty} P(|X_n – X| \geq \epsilon) = 0$$

Note that $X_n$ are usually not IID random variable, and $X$ is common to be a fixed number. The most famous result is the following one:

Theorem (Weak Law of Large Numbers) Let $X_1, X_2, …$ be iid random variables with $EX_i=\mu, Var(X_i) = \sigma^2 < \infty$. Then for very $\epsilon > 0$

$$\lim_{n \to \infty} P(|\bar{X}_n – \mu|<\epsilon) = 1$$

that is, $\bar{X}_n$ converges in probability to $\mu$

This theorem can be proved by using the Chebychev’s inequality

$$P(|\bar{X}_n – \mu| \geq \epsilon) = \frac{Var(\bar{X}}}{\epsilon^2} = \frac{\sigma^2}{n \epsilon^2} \to 0$$

Definition (convergence in quadratic mean) Sometimes to show a stronger form of convergence is useful to prove convergence in probability. The following one is known as the convergence in quadratic mean

$$\lim_n E(X_n – X)^2 \to 0$$

Convergence in quadratic mean implies convergence in probability because

$$P(|X_n – X| \leq \epsilon) \leq \frac{E(X_n – X)^2}{\epsilon^2} \to 0$$

Intuitively, quadratic mean convergence penalizes the deviation by the square form while the probability convergence penalize deviation by the absolute form, therefore quadratic mean is a stronger form.

Convergence in probability is highly related to the consistency concept in statistics. Suppose we have an estimator $\hat{\theta}$ for some quantity $\theta$, $\hat{\theta}$ is said to be consistent if it converges to $\theta$ in probability.

One related useful theorem about consistency is the following one

Theorem (consistency preserved by continuous function) Suppose $X_1, X_2, …$ converges in probability to a random variable $X$ and that $h$ is a continuous function. Then $h(X_1), h(X_2), …$ converges in probability to $h(X)$

Among those convergence concepts, convergence in distribution is the weakest form.

Definition (convergence in distribution) A sequence of random variables $X_1, X_2, …$ converges in distribution to a random variable $X$ if

$$\lim_{n \to \infty} F_{X_n}(x) = F_X(x)$$

at all points where $F_X(x)$ is continuous

Reference

[1] Pishro-Nik, Hossein. “Introduction to probability, statistics, and random processes.” (2016).

[2] Casella, George, and Roger L. Berger. Statistical inference. Vol. 2. Pacific Grove, CA: Duxbury, 2002.

[3] Axler, Sheldon. Measure, Integration & Real Analysis. Springer Nature, 2020.APA

[4] Çınlar, Erhan. Probability and stochastics. Vol. 261. Springer Science & Business Media, 2011.