
0x400 Foundation

1. Sampling Distribution

1.1. Chi-Square

Distribution (\(\chi^2\) distribution) The \(\chi^2\) distribution is a special case of the gamma distribution. The chi-square distribution with \(p\) degrees of freedom has the pdf of Gamma\((p/2, 2)\)

\[f(x|p) = \frac{1}{\Gamma(p/2)2^{p/2}} x^{p/2 - 1} e^{-x/2}\]

Note that when \(p=1\), the distribution can be obtained by applying the transformation \(Y=X^2\) to a standard normal random variable. (Casella Example 2.1.9)

\[f_X(x) = \frac{1}{\sqrt{2\pi x}} e^{-x/2}\]

Properties (\(\chi^2\) distribution) By applying gamma distribution results, we obtain the following

\[EX = p\]
\[Var(X) = 2p\]
\[EX^k = \frac{2^k\Gamma(\frac{p}{2}+k)}{\Gamma(\frac{p}{2})}\]
\[M_X(t) = (\frac{1}{1-2t})^{p/2}\]

Lemma If \(Z\) is a n(0,1) random variable, then \(Z^2 \sim \chi_1^2\)

Lemma If \(X_1, ..., X_n\) are independent and \(X_i \sim \chi_{p_i}^2\), then \(X_1 + ... +X_n \sim \chi^2_{p_1+...+p_n}\)
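
A minimal simulation sketch of these two lemmas (assuming numpy and scipy are available; they are not required by these notes):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Z^2 of a standard normal should follow a chi-square with 1 degree of freedom
z = rng.standard_normal(100_000)
print(stats.kstest(z**2, stats.chi2(df=1).cdf))  # large p-value expected

# Sum of independent chi-squares adds the degrees of freedom: 3 + 5 -> 8
x = stats.chi2(df=3).rvs(100_000, random_state=rng)
y = stats.chi2(df=5).rvs(100_000, random_state=rng)
print(stats.kstest(x + y, stats.chi2(df=8).cdf))
```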

Distribution (noncentral \(\chi^2\) distribution) Let \(X_1, X_2, ..., X_n\) be independent random variables where \(X_i \sim N(\mu_i, \sigma^2)\); then the random variable \(Y\)

\[Y = \frac{1}{\sigma^2} \sum_i X_i^2\]

is said to have a noncentral \(\chi^2\) distribution

\[Y \sim \chi^2(n, \frac{1}{\sigma^2}\sum_i \mu_i^2)\]

where the first parameter is again the degrees of freedom, and the second parameter is called the noncentrality parameter

1.2. T distribution

In most practical cases, the variance \(\sigma^2\) is unknown. Thus, to get any idea of the variability of \(\bar{X}\), we need to estimate this variance first.

Distribution (Student's t, Gosset) Let \(X_1, ..., X_n\) be a random sample from a \(n(\mu, \sigma^2)\) distribution. The quantity \((\bar{X}-\mu)/(S/\sqrt{n})\) has Student's t distribution with \(n-1\) degrees of freedom. It has pdf

\[f_T(t) = \frac{\Gamma((p+1)/2)}{\Gamma(p/2)} \frac{1}{(p\pi)^{1/2} (1+t^2/p)^{(p+1)/2}}\]

The distribution is the distribution of \(U/\sqrt{V/p}\) where \(U \sim n(0,1), V \sim \chi^2_p\) are independent

\[EX = 0 \quad (p > 1)\]
\[Var(X) = \frac{p}{p-2} \quad (p > 2)\]

Student's t has no mgf because it does not have moments of all orders. If there are \(p\) degrees of freedom, then there are only \(p-1\) moments.

The t distribution has an important property, robustness, which means it is much less sensitive to outliers than the Gaussian distribution.
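
A small sketch (numpy/scipy assumed) checking the defining construction above: studentized means \((\bar{X}-\mu)/(S/\sqrt{n})\) of normal samples follow \(t_{n-1}\):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, mu, sigma = 10, 2.0, 3.0

# 50,000 studentized means from N(mu, sigma^2) samples of size n
samples = rng.normal(mu, sigma, size=(50_000, n))
t_stats = (samples.mean(axis=1) - mu) / (samples.std(axis=1, ddof=1) / np.sqrt(n))

# Compare with the t distribution with n-1 degrees of freedom
print(stats.kstest(t_stats, stats.t(df=n - 1).cdf))
```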

Distribution (non-central t distribution) Let \(X \sim N(\mu, 1)\) and \(Y \sim \chi^2(n)\) be independent, then

\[W = \frac{X}{\sqrt{Y/n}} \]

has a noncentral t distribution with noncentrality parameter \(\mu\)

1.3. F distribution

Distribution (Snedecor's F, Fisher) Let \(X_1, ..., X_n\) be a random sample from a \(n(\mu_X, \sigma^2_X)\) population and \(Y_1, ..., Y_m\) an independent random sample from a \(n(\mu_Y, \sigma^2_Y)\) population. The random variable

\[F = (S_X^2/\sigma^2_X)/(S_Y^2/\sigma^2_Y)\]

has a Snedecor's F distribution with \(n-1, m-1\) degrees of freedom.

\[EX = \frac{m-1}{m-3}\]

Lemma

If \(X \sim F_{(p,q)}\), then \(1/X \sim F_{(q,p)}\)

If \(X \sim t_q\), then \(X^2 \sim F_{(1,q)}\)
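
A quick numeric check of both lemmas (numpy/scipy assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
p, q = 4, 7

# If X ~ t_q then X^2 ~ F(1, q)
x = stats.t(df=q).rvs(100_000, random_state=rng)
print(stats.kstest(x**2, stats.f(dfn=1, dfd=q).cdf))

# If X ~ F(p, q) then 1/X ~ F(q, p)
f = stats.f(dfn=p, dfd=q).rvs(100_000, random_state=rng)
print(stats.kstest(1 / f, stats.f(dfn=q, dfd=p).cdf))
```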

2. Sampling Statistics

A sample is a sequence of iid random variables; a statistic is a function of the sample.

Definition (random sample) The collection of random variables \(X_1, X_2, ... X_n\) is said to be a random sample of size \(n\) if they are independent and identically distributed (i.i.d.), i.e.

  • \(X_1, ..., X_n\) are independent random variables
  • They have the same distribution
\[F_{X_1}(x) = ... = F_{X_n}(x)\]

\(x_1, x_2, ..., x_n\) are said to be their realizations. The joint pdf or pmf of \(X_1, ..., X_n\) is given by

\[f(x_1, ..., x_n) = f(x_1) f(x_2) ... f(x_n) = \prod_{i=1}^n f(x_i)\]

This random sampling model is called sampling from an infinite population. Its assumption is that, after observing \(X_1 = x_1\), the probability distribution of \(X_2\) is unaffected.

In the case of sampling from a finite population, there are two cases

  • sampling with replacement: this meets the previous definition
  • sampling without replacement: the random variables are no longer independent (but still identically distributed). When the population is large enough, this can still be approximated by the previous condition.

Definition (statistic, sampling distribution) Given a random sample \(X_1, ..., X_n\) and a real-valued function \(T\), the random variable \(Y\) defined below is called a statistic,

\[Y=T(X_1, ..., X_n)\]

The statistic, however, cannot depend on any unknown parameter \(\theta\)!

The distribution of a statistic \(Y\) is called the sampling distribution of \(Y\). Once the sample is drawn, the value \(t\) is called the realization of \(T\), where

\[t = T(x_1, x_2, ..., x_n)\]

a statistic is a random variable

I always get confused, but be careful that a statistic is a random variable, not a scalar (because it is a transformation (function) of random variables)

  • variance is a scalar
  • sample variance is a random variable

Lemma (expectation of random sample) Let \(X_1, ..., X_n\) be a random sample and \(g(x)\) be a function such that \(Eg(X_1)\) and \(Var(g(X_1))\) exist, then

\[E(\sum_{i=1}^n g(X_i)) = n E(g(X_1)) \]
\[\mathrm{Var}(\sum_{i=1}^{n} g(X_i)) = n (\mathrm{Var}(g(X_1)))\]

Note: the second one is proved using the zero covariance between independent variables

Definition (statistical model) A statistical model is a set of distributions.

Definition (parametric, non-parametric) A parametric model is a set of distributions that can be parameterized by a finite number of parameters (e.g.: the normal distributions). A nonparametric model is a set that cannot be.

2.1. Sample Mean

Definition (sample mean) The sample mean is the statistic defined by

\[\bar{X} = \frac{X_1 + ... + X_n}{n}\]

Lemma (mean, variance of sample mean) Let \(X_1, ..., X_n\) be a random sample with mean \(\mu\) and variance \(\sigma^2 < \infty\), then

\[E\bar{X} = \mu\]
\[\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}\]

Therefore, \(\bar{X}\) is an unbiased estimator of \(\mu\).

The multivariate version: Let \(\mathbf{X}_1, ..., \mathbf{X}_n\) be a random sample from a joint distribution with mean vector \(\mathbf{\mu}\), and covariance matrix \(\Sigma\)

\[E\bar{\mathbf{X}} = \mathbf{\mu}\]
\[\mathrm{Cov}(\bar{\mathbf{X}}) = \frac{1}{n}\mathbf{\Sigma}\]

anti-pattern in Cauchy distribution

Unlike the typical case above, where the variance of the sample mean is reduced by the sample size, the dispersion of the sample mean \(\bar{X}\) of a Cauchy distribution \(Cauchy(\mu, \sigma)\), measured by the scale \(\sigma\), is invariant to the sample size: in fact \(\bar{X} \sim Cauchy(\mu, \sigma)\). The Cauchy distribution has neither mean nor variance.

variance of sample mean in the case of sampling without replacement

As mentioned previously, the observations in sampling without replacement are no longer independent. The variance of the sample mean is

\[\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}\frac{N-n}{N-1}\]

Notice it is smaller than the previous infinite case.

Lemma (pdf of sample mean) By applying a transformation to the pdf of the sum \(X_1 + ... + X_n\), we can obtain

\[f_{\bar{X}}(x) = nf_{X_1 + ... + X_n}(nx)\]

Lemma (mgf of sample mean) Let \(X_1, ..., X_n\) be a random sample from a population with mgf \(M_X(t)\). Then the mgf of the sample mean is

\[M_{\bar{X}}(t) = [ M_X(t/n) ]^n\]

sample mean of Gamma distribution

Suppose \(X_i \sim gamma(\alpha, \beta)\), then the mgf of the sample mean is

\[M_{\bar{X}} (t) = (\frac{1}{1-(\beta/n)t})^{n\alpha}\]

which shows that \(\bar{X} \sim gamma(n\alpha, \beta/n)\)
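
A simulation sketch of this fact (numpy/scipy assumed; scipy's `scale` parameter plays the role of \(\beta\) here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, alpha, beta = 5, 2.0, 3.0

# Sample means of n iid gamma(alpha, beta) variables (beta used as scale)
x = stats.gamma(a=alpha, scale=beta).rvs((200_000, n), random_state=rng)
xbar = x.mean(axis=1)

# Should agree with gamma(n*alpha, beta/n)
print(stats.kstest(xbar, stats.gamma(a=n * alpha, scale=beta / n).cdf))
```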

Weak Law of Large Number

Let \(X_1, ..., X_n\) be iid random variables with mean \(\mu\) and finite variance \(\sigma^2\), then for every \(\epsilon > 0\)

\[\lim_{n \to \infty} P(|\bar{X}_n - \mu| < \epsilon) = 1\]

central limit theorem

Let \(X_1, ..., X_n\) be iid random variables with mean \(\mu\) and finite variance \(\sigma^2 > 0\), and let \(G_n(x)\) denote the cdf of \(\sqrt{n}(\bar{X}_n - \mu)/\sigma\), then

\[\lim_{n \to \infty} G_n(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-y^2/2} dy\]

that is, \(\sqrt{n}(\bar{X}_n - \mu)/\sigma\) has a limiting standard normal distribution.
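
An illustrative sketch (numpy/scipy assumed): standardized means of a skewed exponential population approach the standard normal as \(n\) grows, measured by the Kolmogorov-Smirnov distance:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
mu, sigma = 1.0, 1.0          # mean and sd of the Exponential(1) population

for n in (5, 30, 200):
    x = rng.exponential(scale=1.0, size=(100_000, n))
    z = np.sqrt(n) * (x.mean(axis=1) - mu) / sigma
    # distance to the standard normal shrinks as n grows
    d, _ = stats.kstest(z, stats.norm.cdf)
    print(n, round(d, 4))
```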

2.2. Sample Variance

Definition (sample variance) The sample variance is the statistic defined by

\[S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2\]

By the way, this can be computed with statistics.variance in Python (see the snippet below).

A useful transformation to compute \((n-1)S^2\) is

\[\sum_i (X_i - \bar{X})^2 = \sum_i X_i^2 - n\bar{X}^2\]

Notice the analogy between this and variance computing.

\[Var(X) = EX^2 - (EX)^2\]
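
For concreteness, a standard-library sketch showing that the definition, the shortcut above, and `statistics.variance` agree on toy data (the data values are arbitrary):

```python
import statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(x)
xbar = sum(x) / n

# Definition: sum of squared deviations divided by n - 1
s2_def = sum((xi - xbar) ** 2 for xi in x) / (n - 1)

# Shortcut: sum of squares minus n * xbar^2, divided by n - 1
s2_shortcut = (sum(xi**2 for xi in x) - n * xbar**2) / (n - 1)

print(s2_def, s2_shortcut, statistics.variance(x))  # all equal
```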

Lemma (mean, variance of sample variance)

\[ES^2 = \sigma^2\]
\[\text{Var}(S^2) = \frac{1}{n} (\theta_4 - \frac{n-3}{n-1}\theta^2_2)\]

where \(\theta_i\) is the i-th central moment. The first part is easy, the second part is crazy, see the solution of Casella & Berger Exercise 5.8.

Therefore, \(S^2\) is an unbiased estimator of \(\sigma^2\)

The multivariate version

\[E\mathbf{S^2} = \mathbf{\Sigma}\]

sample variance of normal distribution

In the case of normal distribution, the sample mean and variance have several good properties as follows:

Let \(X_1, ..., X_n\) be a random sample from a \(n(\mu, \sigma^2)\) distribution. Then

\[\frac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1)\]

Taking the variance on both sides, we obtain

\[Var(S^2) = \frac{2\sigma^4}{n-1}\]

Note this can also be obtained by plugging the normal central moments \(\theta_4 = 3\sigma^4\) and \(\theta_2 = \sigma^2\) into the general variance formula.

independence of \(\bar{X}, S^2\) in normal distribution

\(\bar{X}, S^2\) are independent in normal distribution

Note that in general, \(\bar{X}, S^2\) are not independent; a zero 3rd central moment makes them uncorrelated, and full independence in fact characterizes the normal distribution.

2.3. Sample Order

Definition (order statistics) The order statistics of a random sample \(X_1, ..., X_n\) are sample values placed in ascending order. They are denoted by \(X_{(1)}, ..., X_{(n)}\).

Definition (sample range) The sample range is the distance between the smallest and largest observations.

\[R=X_{(n)}-X_{(1)}\]

Definition (sample median) The sample median \(M\) is defined as

\[M = \begin{cases} X_{((n+1)/2)} & \text{ if n is odd} \\ (X_{(n/2)} + X_{(n/2+1)}) /2 & \text{ if n is even} \\ \end{cases}\]

One advantage of sample median over the sample mean is that it is less affected by extreme observations.

Other related statistics are lower quartile and upper quartile.

Proposition (discrete order statistics) Let \(X_1, ..., X_n\) be a random sample from a discrete distribution with pmf \(f_X(x_i) = p_i\), where \(x_1 < x_2 < ...\) are the possible values of \(X\) in ascending order. Let \(P_i = \sum_{k=1}^i p_k\); then the cdf of the \(j\)-th order statistic is

\[P(X_{(j)} \leq x_i) = \sum_{k=j}^n \binom{n}{k} P_i^k (1-P_i)^{(n-k)}\]

and the pmf is

\[P(X_{(j)} = x_i) = \sum_{k=j}^n \binom{n}{k} [ P_i^k (1-P_i)^{(n-k)} - P_{i-1}^k (1-P_{i-1})^{(n-k)}]\]

Proposition (continuous order statistics) Consider the continuous population with cdf \(F_X(x)\) and pdf \(f_X(x)\), then the cdf is

\[F_{X_{(j)}} (x) = \sum_{k=j}^n \binom{n}{k} [F_X(x)]^k [1-F_X(x)]^{n-k}\]

By differentiating this one, we can get

\[f_{X_{(j)}}(x) = \frac{n!}{(j-1)!(n-j)!} f_X(x) [F_X(x)]^{j-1} [1-F_X(x)]^{n-j}\]

Proposition (joint pdf) The joint pdf of all the order statistics is given by

\[f_{X_{(1)},..., X_{(n)}}(x_1, ..., x_n) = n! \prod_i f_X(x_i)\]

if \(x_1 < ... < x_n\), otherwise 0.
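
As a concrete check of the marginal pdf formula above: for a Uniform(0,1) sample, \(f_{X_{(j)}}\) reduces to a Beta\((j, n-j+1)\) density, which a simulation confirms (numpy/scipy assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, j = 10, 3

# j-th order statistic of a Uniform(0,1) sample of size n
u = rng.uniform(size=(100_000, n))
x_j = np.sort(u, axis=1)[:, j - 1]

# The general formula for f_{X_(j)} reduces to a Beta(j, n-j+1) density here
print(stats.kstest(x_j, stats.beta(a=j, b=n - j + 1).cdf))
```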

3. Data Reduction Principles

Data reduction in terms of a particular statistic \(T\) can be thought of as a partition of the sample space into \(A_t = \{ x | T(x)=t \}\). The partition should not discard important information about the unknown parameter \(\theta\). Instead of reporting \(x\), we only report its partition \(T(x)\) (whose size is much smaller than \(x\))

Definition (experiment) Formally an experiment \(E\) is defined to be

\[E = (X, \theta, \{ f(x|\theta) \})\]

Definition (evidence) An experimenter, knowing what experiment \(E\) is performed and having observed a sample \(x\), can make some inference or draw some conclusion about \(\theta\), which is denoted by

\[Ev(E, x)\]

which stands for the evidence about \(\theta\) arising from \(E, x\)

3.1. The sufficiency principle

The sufficiency principle says that a sufficient statistic summarizes all the information about \(\theta\) in the sample. (You can keep only \(T(x)\) and discard \(x\) to save memory.)

Principle (sufficiency) Consider an experiment \(E=(X, \theta, \{ f(x|\theta) \})\) and suppose \(T(X)\) is a sufficient statistic for \(\theta\). Then

\[T(x) = T(y) \implies Ev(E,x) = Ev(E,y)\]

Intuitively, this means: If \(\mathbf{x}, \mathbf{y}\) are two sample points such that \(T(\mathbf{x}) = T(\mathbf{y})\), then the inference about \(\theta\) should be the same whether \(\mathbf{X}=\mathbf{x}\) or \(\mathbf{X}=\mathbf{y}\) is observed

The sufficient statistic is defined as follows

Definition (sufficient statistic) A statistic \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\) if the conditional distribution of the sample \(\mathbf{X}\) given the value of \(T(\mathbf{X})\) does not depend on \(\theta\)

Criterion (condition of sufficient statistics) If the following ratio is constant as a function of \(\theta\), then \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\)

\[\frac{p(\mathbf{x}|\theta)}{q(T(\mathbf{x})|\theta)}\]

Bernoulli sufficient statistic

Let \(X_1, ..., X_n\) be Bernoulli random variables with parameter \(\theta\). Then one sufficient statistic for \(\theta\) is

\[T(X) = X_1 + ... + X_n\]

This can be proved by considering \(p(x|\theta)\) and \(q(T(x)|\theta)\)

\[p(x|\theta) = \prod_i \theta^{x_i} (1-\theta)^{1-x_i} = \theta^t (1-\theta)^{n-t}\]
\[q(t|\theta) = \binom{n}{t} \theta^{t} (1-\theta)^{n-t}\]

Their ratio is constant with respect to \(\theta\)
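
A tiny standard-library sketch making the same point numerically: the conditional probability of a particular Bernoulli sample given \(T=t\) equals \(1/\binom{n}{t}\) for every \(\theta\):

```python
from math import comb

def cond_prob(x, theta):
    """P(X = x | T = sum(x)) under iid Bernoulli(theta)."""
    n, t = len(x), sum(x)
    p_x = theta**t * (1 - theta) ** (n - t)                # joint pmf of the sample
    q_t = comb(n, t) * theta**t * (1 - theta) ** (n - t)   # pmf of T at t
    return p_x / q_t

x = (1, 0, 1, 1, 0)
# The ratio is 1 / C(5, 3) regardless of theta
print([round(cond_prob(x, th), 6) for th in (0.2, 0.5, 0.9)])
print(1 / comb(5, 3))
```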

Criterion (Fisher-Neyman factorization) Let \(f(\mathbf{x}|\theta)\) denote the joint pdf of a sample \(\mathbf{X}\). A statistic \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\) iff

\[f(\mathbf{x}|\theta) = g(T(\mathbf{x}) | \theta) h(\mathbf{x})\]

Note that \(g(T(x)|\theta)\) is not necessarily a distribution.

sufficient statistic of an exponential family

Let \(X_1,..., X_n\) be a random sample from an exponential family

\[f(x|\theta)= h(x)c(\theta) \exp (\sum^k_{i=1} w_i(\theta) t_i(x))\]

then the following is a sufficient statistic for \(\theta\)

\[T(X) = ( \sum_j t_1(X_j), ..., \sum_j t_k(X_j))\]

It turns out that outside of the exponential family of distributions, it is rare to have a sufficient statistic of smaller dimension than the size of the sample, so in many cases the order statistics are the best that we can do.

Sufficient statistic can be interpreted with the concept of sufficient partition

Definition (sufficient partition) A partition \(B_1, ..., B_k\) is called a sufficient partition if \(f(x|X \in B_i)\) does not depend on \(\theta\).

A statistic \(T\) induces a partition. \(T\) is sufficient iff its partition is sufficient. Any refinement of a sufficient partition is also a sufficient partition (and its statistic is sufficient).

3.1.1. Minimal Sufficient statistics

If there is more than one sufficient statistic, the sufficient statistic which achieves the most data reduction is the minimal sufficient statistic.

Its partition corresponds to the coarsest sufficient partition. Any coarser partition would break sufficiency (it would depend on \(\theta\)).

Definition (minimal sufficient statistic) A sufficient statistic \(T(\mathbf{X})\) is called a minimal sufficient statistic if, for any other sufficient statistic \(T'(\mathbf{X})\), \(T(x)\) is a function of \(T'(x)\)

Note that the minimal sufficient statistic is also not unique, but its partition is unique

Criterion (condition of minimality) \(T(\mathbf{X})\) is a minimal sufficient statistic for \(\theta\) if the ratio \(f(x|\theta)/f(y|\theta)\) is constant as a function of \(\theta\) iff \(T(x)=T(y)\)

Poisson minimal sufficient statistic

Recall that the joint pmf of a Poisson(\(\theta\)) sample is

\[P(x_1, ..., x_n) = \frac{e^{-n\theta} \theta^{\sum_i x_i}}{\prod_i x_i!}\]

The ratio is

\[\frac{p(x_1, ..., x_n)}{p(y_1, ..., y_n)} = \frac{\prod_i y_i! \theta^{\sum_i x_i - \sum_i y_i}}{\prod_i x_i!}\]

This ratio is independent of \(\theta\) iff \(\sum_i x_i = \sum_i y_i\), therefore the following \(T\) is a minimal sufficient statistic

\[T(X_1,..., X_n) = \sum_i X_i\]

3.1.2. Complete Statistics

Ancillary statistics are complementary to sufficient statistics

Definition (ancillary statistics) A statistic \(S(X)\) whose distribution does not depend on the parameter \(\theta\) is called an ancillary statistic

Location family and scale family

Let \(X_1,..., X_n\) be a sample from a location parameter family with cdf \(F(x-\theta)\); then

\[R = X_{(n)} - X_{(1)}\]

is an ancillary statistic

If it is from a scale parameter family with cdf \(F(x/\sigma)\), then any statistic whose functional form depends only on \(X_1 / X_n, ..., X_{n-1}/X_n\) is an ancillary statistic, for example

\[\frac{\sum_i X_i}{X_n}\]

Definition (complete statistic) Let \(f(t|\theta)\) be a family of pdfs or pmfs for a statistic \(T(X)\). The family of probability distributions is called complete if

\[E_{\theta}(g(T)) = 0 \text{ for all } \theta \implies P_{\theta}(g(T)=0) = 1 \text{ for all } \theta\]

Theorem (Basu) If \(T(X)\) is a complete and minimal sufficient statistic, then \(T(X)\) is independent of every ancillary statistic.

3.1.3. Sufficiency and Risk Reduction

Suppose we observe \(X_1, ..., X_n \sim p(X; \theta)\) and we would like to estimate \(\theta\) with \(\hat{\theta}(X_1, ..., X_n)\). To evaluate the estimator, we use the (squared error) risk

\[R(\hat{\theta}, \theta) = E(\hat{\theta}-\theta)^2\]

An estimator that is not a function of a sufficient statistic can be improved. Intuitively, the improvement removes the variance within each sufficient partition by averaging it out.

Theorem (Rao-Blackwell) Let us define a new estimator using a sufficient statistic \(T\)

\[\tilde{\theta} =E(\hat{\theta} | T)\]

then it never increases the risk

\[R(\tilde{\theta}, \theta) \leq R(\hat{\theta}, \theta)\]

Proof By applying Jensen and the law of total expectation,

\[R(\tilde{\theta}, \theta) = E((E(\hat{\theta}|T)- \theta)^2) = E((E(\hat{\theta} - \theta|T))^2) \leq EE((\hat{\theta}-\theta)^2|T) = R(\hat{\theta}, \theta)\]

Bernoulli estimator

Let \(X_1, ..., X_n \sim Ber(\theta)\). We can improve the estimator \(\hat{\theta} = X_1\) using the sufficient statistic \(T = \sum_i X_i\); then

\[\tilde{\theta} = E(X_1 | \sum_i X_i=k) = \frac{k}{n}\]

Both estimators are unbiased, but the new estimator reduces the variance from \(\theta(1-\theta)\) to \(\theta(1-\theta)/n\)
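
A simulation sketch of this Rao-Blackwellization (numpy assumed): both estimators are unbiased, but conditioning on \(T\) shrinks the variance from \(\theta(1-\theta)\) to \(\theta(1-\theta)/n\):

```python
import numpy as np

rng = np.random.default_rng(6)
theta, n, reps = 0.3, 20, 200_000

x = rng.binomial(1, theta, size=(reps, n))

theta_hat = x[:, 0]            # crude unbiased estimator: just X_1
theta_tilde = x.mean(axis=1)   # Rao-Blackwellized: E(X_1 | sum X_i) = Xbar

print(theta_hat.mean(), theta_tilde.mean())          # both near theta
print(theta_hat.var(), theta * (1 - theta))          # ~ theta(1-theta)
print(theta_tilde.var(), theta * (1 - theta) / n)    # ~ theta(1-theta)/n
```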

3.2. The Likelihood Principle

The likelihood principle says that the likelihood function summarizes all the information about \(\theta\) from the sample. (You can keep only \(L(\theta|x)\) and discard \(x\) to save memory.)

Principle (likelihood) Suppose we have two experiments \(E_1 = (X_1, \theta, \{ f_1(x_1|\theta) \}), E_2= (X_2, \theta, \{ f_2(x_2|\theta) \})\) where \(\theta\) is the same in both experiments. Suppose \(x^*_1, x^*_2\) are two sample points from \(E_1, E_2\) such that their likelihood functions are proportional,

\[L(\theta | x^*_2) = CL(\theta | x^*_1)\]

then

\[Ev(E_1, x^*_1) = Ev(E_2, x^*_2)\]

Definition (likelihood function) Let \(f(x|\theta)\) denote the joint pdf or pmf of the sample \(X\). Given that \(X=x\) is observed, the function of \(\theta\) defined as follows is called the likelihood function

\[L(\theta|x) = f(x|\theta)\]

The likelihood principle can be derived from the sufficiency principle and the conditionality principle; the latter says that if one of two experiments is randomly chosen and performed, the information about \(\theta\) depends only on the experiment performed.

relationship to sufficiency

Using the factorization theorem, we can see that for any sufficient statistic, we can write

\[L(\theta) = g(T(x_1, ..., x_n); \theta)h(x_1, ..., x_n)\]

where \(h\) does not depend on \(\theta\) and can be ignored, therefore

\[L(\theta) \propto g(T(x_1, ..., x_n); \theta)\]

Thus, once we have any sufficient statistic, we have everything we need to compute the likelihood function and we lose nothing

4. Measure Concentration

4.1. Inequalities and Tail Bound

The following inequalities are mostly bounds on the tail of a distribution.

Inequality (Markov) Let \(X\) be a nonnegative random variable (i.e. \(X \geq 0\)). Then for any \(t > 0\)

\[P(X \geq t) \leq \frac{EX}{t} = O(\frac{1}{t})\]

Chebyshev's inequality can be obtained from Markov's inequality by applying it to \((X-\mu)^2\) (which preserves nonnegativity).

It has a better bound \(O(\frac{1}{t^2})\) than Markov's \(O(\frac{1}{t})\).

Inequality (Chebyshev) Let \(X\) be a random variable and \(EX = \mu, \mathrm{Var} X = \sigma^2\), then for any \(t > 0\)

\[P( | X - \mu | \geq t \sigma) \leq \frac{1}{t^2} = O(\frac{1}{t^2})\]

The Chernoff bound can also be obtained from Markov's inequality by applying it to the exponential.

Chebyshev's inequality is conservative; the Chernoff bound is exponentially tighter.

Inequality (Chernoff)

\[P((X - \mu) \geq u) = P(\exp(t(X-\mu)) \geq \exp(tu)) \leq \frac{E[\exp(t(X-\mu)]}{\exp(tu)}\]

Suppose the mgf exists for \(0 \leq t \leq b\), then

\[P((X-\mu) \geq u) \leq \inf_{0 \leq t \leq b} \exp(-t(u+\mu)) E[\exp(tX)]\]

Gaussian tail bound

By applying the Chernoff method to a Gaussian \(N(\mu, \sigma^2)\), we obtain

\[P(X-\mu > u) \leq \inf_t \exp(-t(\mu+u))M_X(t) = \exp(-\frac{u^2}{2\sigma^2})\]

By considering symmetry, we get

\[P(|X-\mu| > u) \leq 2\exp(-\frac{u^2}{2\sigma^2})\]

Definition (sub-Gaussian random variable) A sub-Gaussian random variable is one whose tails decay at least as fast as those of a Gaussian. Formally, a random variable \(X\) with mean \(\mu\) is sub-Gaussian if there exists a positive number \(\sigma\) such that, for all \(t\),

\[E[\exp(t(X-\mu))] \leq \exp(\sigma^2 t^2/2)\]

It is easy to see that a sub-Gaussian random variable has the two-sided exponential tail bound

\[P(|X-\mu| > u) \leq 2\exp(-\frac{u^2}{2\sigma^2})\]

iid sample mean of sub-Gaussian

Let \(X_1, ..., X_n\) be iid \(\sigma\)-sub-Gaussian random variables; then it can be shown that

\[E\exp(t(\bar{X}-\mu)) \leq \exp(t^2\sigma^2/(2n))\]

which says the sample mean is \(\sigma/\sqrt{n}\)-sub-Gaussian

general sample mean of sub-Gaussian

Let \(X_1, ..., X_n\) be independent \(\sigma_1, ..., \sigma_n\)-sub-Gaussian random variables; then the sample mean is \(\sigma\)-sub-Gaussian with

\[\sigma = \frac{1}{n}\sqrt{\sum_i \sigma^2_i}\]

It can be shown (using a symmetrization argument with Rademacher random variables) that a random variable bounded in \([a,b]\) is \((b-a)\)-sub-Gaussian; Hoeffding's lemma sharpens this to \((b-a)/2\). The sample mean is then \(\frac{b-a}{2\sqrt{n}}\)-sub-Gaussian, which gives the Hoeffding bound

Inequality (Hoeffding) Suppose the random variables are bounded in \([a,b]\), then

\[P(|\bar{X} - \mu| \geq t) \leq 2 \exp (-\frac{2nt^2}{(b-a)^2})\]
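
A numeric sanity check (numpy assumed) that empirical tail frequencies of a bounded sample mean stay below the Hoeffding bound:

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 50, 200_000
a, b = 0.0, 1.0

# Uniform(0,1) variables are bounded in [a, b] = [0, 1] with mean 1/2
xbar = rng.uniform(a, b, size=(reps, n)).mean(axis=1)

for t in (0.05, 0.1, 0.15):
    empirical = np.mean(np.abs(xbar - 0.5) >= t)
    hoeffding = 2 * np.exp(-2 * n * t**2 / (b - a) ** 2)
    print(t, empirical, hoeffding)   # empirical frequency <= bound
```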

Inequality (Hölder) Let \(X,Y\) be any two random variables and \(1/p + 1/q = 1\), then

\[|EXY| \leq E |XY| \leq (E|X|^p)^{1/p} (E|Y|^q)^{1/q}\]

Inequality (Cauchy-Schwarz) When \(p=q=2\), it is the Cauchy-Schwarz inequality

\[|EXY| \leq E|XY| \leq (E|X|^2)^{1/2} (E|Y|^2)^{1/2}\]

Inequality (Minkowski) Let \(X,Y\) be any two random variables and \(1 \leq p < \infty\), then

\[(E|X+Y|^p)^{1/p} \leq (E|X|^p)^{1/p} + (E|Y|^{p})^{1/p}\]

Inequality (Jensen) For any random variable \(X\), if \(g(x)\) is a convex function, then

\[Eg(X) \geq g(EX)\]

4.2. Empirical Process

First we consider the problem of estimating a CDF from a random sample. Suppose \(X_1, ..., X_n \sim F_X\); the empirical CDF is a natural estimator

\[\hat{F}_n(x) = \frac{1}{n} \sum_i \mathbf{1}(X_i \leq x)\]

We can see that the empirical cdf converges to the true cdf pointwise

Lemma (consistency of empirical CDF) For a fixed \(x\), it is easy to see

\[E\hat{F}_n(x) = \frac{1}{n} \sum_i E\mathbf{1}(X_i \leq x) = P(X \leq x) = F_X(x)\]

By the Strong Law of Large Numbers, \(\hat{F}_n(x)\) converges to \(F_X(x)\) almost surely for every fixed \(x\) (pointwise convergence), therefore it is consistent

\[\hat{F}_n(x) \to F(x)\]

Note that the SLLN condition here is \(E|\mathbf{1}(X_i \leq x)| < \infty\), which obviously holds

A stronger result is that this convergence happens uniformly

Theorem (Glivenko-Cantelli)

\[||\hat{F}_n - F||_{\infty} = \sup_x |\hat{F}_n(x) - F(x)| \xrightarrow{a.s} 0\]

Note that this makes no assumptions about the CDF.
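
A short sketch (numpy/scipy assumed) of the sup-distance \(||\hat{F}_n - F||_{\infty}\) shrinking as \(n\) grows, for a standard normal population:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)

def sup_distance(x):
    """sup_x |F_n(x) - F(x)| for a standard normal sample, evaluated at the jumps."""
    x = np.sort(x)
    n = len(x)
    F = stats.norm.cdf(x)
    # The supremum is attained just before or at a jump of the empirical CDF
    return max(np.max(np.arange(1, n + 1) / n - F),
               np.max(F - np.arange(0, n) / n))

for n in (100, 1_000, 10_000):
    print(n, round(sup_distance(rng.standard_normal(n)), 4))
```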

Applications of Glivenko-Cantelli

We can construct estimators of various parameters of interest \(\theta(F)\) by replacing \(F\) with \(\hat{F}_n\); these are called plug-in estimators. For example, the median

\[\theta(F) = \inf \{x | F(x) \geq 1/2 \}\]

becomes

\[\theta(\hat{F}) = \inf \{ x| \frac{1}{n}\sum_i \mathbf{1}[X_i \leq x] \geq 1/2 \}\]

If \(\theta\) is continuous wrt \(||\cdot||_\infty\), then \(\theta(\hat{F}) \to \theta(F)\) almost surely

Rewriting the previous conclusion in terms of sets, we have

\[\Delta(\mathcal{A}) = \sup_{A \in \mathcal{A}} |P_n(A) - P(A)|\]

where \(\mathcal{A} = \{ (-\infty, x] | x \in R \}\)

We are interested in collections of sets \(\mathcal{A}\) for which we have the uniform convergence

\[\Delta(\mathcal{A}) \xrightarrow{P} 0\]

A class satisfying this property is called a Glivenko-Cantelli class; in general, very complex classes of functions or sets fail to be Glivenko-Cantelli classes.

Definition (Glivenko-Cantelli class) \(F\) is a Glivenko-Cantelli class if

\[\sup_{f \in F} |\frac{1}{n} \sum_i f(X_i) - Ef| \xrightarrow{P} 0\]

One example of a GC class is the class of indicators that defines the CDF

\[F = \{ \mathbf{1}_{(-\infty, x]} | x \in R \}\]

counterexample of GC class

Not every class is a GC class: consider the class \(\mathcal{A}\) of all finite subsets of \([0,1]\), with \(P\) a continuous distribution on \([0,1]\). The \(n\) observed points form a set \(A \in \mathcal{A}\) with \(P_n(A) = 1\) and \(P(A) = 0\), hence

\[\sup_{A \in \mathcal{A}} |P_n(A) - P(A)| = 1\]

so \(\mathcal{A}\) fails to be a GC class. Roughly speaking, the class is too large.

There is an even stronger result than Glivenko-Cantelli, which gives a finite-sample tail bound

Theorem (Dvoretzky-Kiefer-Wolfowitz) For any \(\epsilon > 0\) and all \(n\), we have

\[P(||\hat{F}_n - F||_{\infty} > \epsilon) \leq 2 \exp(-2n\epsilon^2)\]

Notice this is similar to the conclusion of the Hoeffding bound, which gives the pointwise version

\[P(|\hat{F}_n(x) - F(x)| > \epsilon) \leq 2 \exp(-2n\epsilon^2)\]

confidence interval using DKW inequality

We can construct a confidence set using the DKW inequality. Let \(\epsilon^2_n = \log(2/\alpha)/2n\), \(L(x) = \max(\hat{F}_n(x) - \epsilon_n, 0)\), \(U(x) = \min(\hat{F}_n(x) + \epsilon_n, 1)\); then for all \(x \in R\) and all \(F\)

\[P(L(x) \leq F(x) \leq U(x)) \geq 1-\alpha\]

where \((L(x), U(x))\) is a nonparametric \(1-\alpha\) confidence band
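
A sketch of this DKW band (numpy assumed), using the definitions of \(L(x)\) and \(U(x)\) above; the grid and the normal sample are arbitrary choices for illustration:

```python
import numpy as np

def dkw_band(sample, grid, alpha=0.05):
    """Nonparametric 1 - alpha confidence band for the CDF via the DKW inequality."""
    n = len(sample)
    eps = np.sqrt(np.log(2 / alpha) / (2 * n))
    # Empirical CDF evaluated on the grid
    F_hat = np.searchsorted(np.sort(sample), grid, side="right") / n
    lower = np.clip(F_hat - eps, 0.0, 1.0)
    upper = np.clip(F_hat + eps, 0.0, 1.0)
    return lower, F_hat, upper

rng = np.random.default_rng(9)
x = rng.standard_normal(200)
grid = np.linspace(-3, 3, 7)
print(dkw_band(x, grid))
```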

5. Resampling

5.1. Bootstrap

Given a sample \(X_1, ..., X_n \sim P\), we can define the empirical distribution

\[P_n(A) = \frac{1}{n}\sum^n_{i=1} \mathbf{1}(X_i \in A)\]

Definition (bootstrap sample) We can draw the bootstrap sample from the empirical distribution uniformly with replacement

\[X_1^*, ..., X_n^* \sim P_n\]

Suppose we have an estimator \(\hat{\theta} = g(X_1, ..., X_n)\) and we want to estimate its variance \(Var(\hat{\theta})\); however, this is difficult as we do not know \(P\), so instead we use the bootstrap variance estimator

Definition (bootstrap variance estimator)

  • draw sample \(X_1^*, ..., X_n^* \sim P_n\). Compute \(\hat{\theta}_n^* = g(X_1^*, ..., X_n^*)\)
  • repeat the process to yield \(\hat{\theta}_{n, 1}^*, ..., \hat{\theta}_{n, B}^*\)
  • compute
\[\hat{s}^2 = \frac{1}{B} \sum_j (\hat{\theta}^*_{n, j} - \bar{\theta})^2\]

where \(\bar{\theta} = \sum_j \hat{\theta}^*_{n, j} / B\)
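
A direct sketch of this recipe (numpy assumed), with the sample median as the statistic \(g\); the exponential data are just for illustration:

```python
import numpy as np

def bootstrap_variance(x, stat=np.median, B=2_000, seed=0):
    """Bootstrap estimate of Var(stat(X_1, ..., X_n))."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # Resample from the empirical distribution: uniformly, with replacement
    idx = rng.integers(0, n, size=(B, n))
    theta_star = np.array([stat(x[i]) for i in idx])
    return theta_star.var(ddof=0)   # (1/B) * sum (theta*_j - mean)^2

rng = np.random.default_rng(10)
x = rng.exponential(scale=2.0, size=100)
print(bootstrap_variance(x))
```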

We want to justify the bootstrap: the quantiles of the bootstrap distribution of our statistic should be close to the quantiles of the actual sampling distribution. Suppose we define

\[\hat{F}_n(t) = P_n(\sqrt{n}(\hat{\theta}^*_n - \hat{\theta}_n) > t)\]

and the true sampling distribution to be

\[F_n(t) = P(\sqrt{n}(\hat{\theta}_n - \theta) > t)\]

and the bootstrap is justified when, for example,

\[\sup_t |\hat{F}_n(t) - F_n(t)| \to 0\]

5.2. Jackknife

The jackknife is a simple method for approximating the bias and variance of an estimator from the subsamples of size \(n-1\) obtained by omitting one observation at a time

Definition (Jackknife bias estimate) The jackknife bias estimate of an estimator \(T_n = T(X_1, ..., X_n)\) is defined by

\[b_{jack} = (n-1)(\bar{T}_n - T_n)\]

where \(\bar{T}_n = \sum_i T_{(-i)}/n\) and \(T_{(-i)}\) denotes the statistic with the \(i\)-th observation removed.

Using this bias estimate, we can correct \(T_n\) to obtain the bias-corrected estimator \(T_{jack}\)

\[T_{jack} = T_n - b_{jack}\]

For many statistics with the bias of order \(O(1/n)\), this correction reduces the bias to \(O(1/n^2)\)

We define the pseudo-values as \(\tilde{T}_i = nT_n - (n-1)T_{(-i)}\), then

\[T_{jack} = \frac{1}{n} \sum \tilde{T}_i\]

and we can estimate the variance of \(T_n\) as well

Definition (Jackknife variance estimate) The variance estimate of \(Var(T_n)\) is

\[v_{jack} = \tilde{s}^2 / n\]

where \(\tilde{s}^2\) is the sample variance of the pseudo-values:

\[\tilde{s}^2 = \frac{\sum (\tilde{T}_i - \frac{1}{n} \sum \tilde{T}_i)^2}{n-1}\]

This is a consistent estimator of \(Var(T_n)\) under certain smoothness conditions on \(T\)
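
A compact sketch of the jackknife recipe (numpy assumed), applied to the plug-in (divide-by-\(n\)) variance estimator, whose \(O(1/n)\) bias the jackknife should pick up:

```python
import numpy as np

def jackknife(x, stat):
    """Jackknife bias and variance estimates for the statistic stat."""
    n = len(x)
    T_n = stat(x)
    # Leave-one-out replicates T_(-i)
    T_loo = np.array([stat(np.delete(x, i)) for i in range(n)])
    bias = (n - 1) * (T_loo.mean() - T_n)
    pseudo = n * T_n - (n - 1) * T_loo     # pseudo-values
    var = pseudo.var(ddof=1) / n           # sample variance of pseudo-values / n
    return bias, var

def plug_in_var(v):
    # biased (divide-by-n) variance estimator
    return np.mean((v - v.mean()) ** 2)

rng = np.random.default_rng(11)
x = rng.standard_normal(50)
print(jackknife(x, plug_in_var))
```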