# 0x400 Foundation

- 1. Sampling Distribution
- 2. Sampling Statistics
- 3. Data Reduction Principles
- 4. Measure Concentration
- 5. Resampling

## 1. Sampling Distribution

### 1.1. Chi-Square

**Distribution (\(\chi^2\) distribution)** The \(\chi^2\) distribution is a special case of the gamma distribution: the chi-square distribution with \(p\) degrees of freedom has the pdf of a Gamma\((p/2, 2)\)

Note that when \(p=1\), the distribution can be obtained by applying the transformation \(Y=X^2\) to a normal distribution. (Casella Example 2.1.9)

**properties (\(\chi^2\) distribution)** By applying properties of the gamma distribution, we obtain the following results

**Lemma** If \(Z\) is a \(n(0,1)\) random variable, then \(Z^2 \sim \chi_1^2\)

**Lemma** If \(X_1, ..., X_n\) are independent and \(X_i \sim \chi_{p_i}^2\), then \(X_1 + ... +X_n \sim \chi^2_{p_1+\cdots+p_n}\)
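Both lemmas can be sanity-checked by simulation. A minimal stdlib sketch (sample size and seed are arbitrary), using the fact that a \(\chi^2_p\) variable has mean \(p\) and variance \(2p\):

```python
# Monte Carlo check: a sum of p squared standard normals behaves like chi^2_p,
# which has mean p and variance 2p.
import random
import statistics

random.seed(0)
n, p = 100_000, 3

# Each sample is Z_1^2 + ... + Z_p^2 with Z_i ~ n(0,1), so it should be chi^2_p.
samples = [sum(random.gauss(0, 1) ** 2 for _ in range(p)) for _ in range(n)]

assert abs(statistics.mean(samples) - p) < 0.05        # E[chi^2_p] = p
assert abs(statistics.variance(samples) - 2 * p) < 0.2  # Var[chi^2_p] = 2p
```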

**Distribution (noncentral \(\chi^2\) distribution)** Let \(X_1, X_2, ..., X_n\) be independent random variables where \(X_i \sim N(\mu_i, \sigma^2)\); then

\[ Y = \sum_{i=1}^n \left( \frac{X_i}{\sigma} \right)^2 \sim \chi^2_n(\lambda), \qquad \lambda = \sum_{i=1}^n \frac{\mu_i^2}{\sigma^2} \]

is said to have a noncentral \(\chi^2\) distribution,

where the first parameter is again the degrees of freedom and the second parameter \(\lambda\) is called the noncentrality parameter

### 1.2. T distribution

In most practical cases, the variance \(\sigma^2\) is unknown. Thus, to get any idea of the variability of \(\bar{X}\), we need to estimate this variance first.

**Distribution (Student's t, Gosset)** Let \(X_1, ..., X_n\) be a random sample from a \(n(\mu, \sigma^2)\) distribution. The quantity \((\bar{X}-\mu)/(S/\sqrt{n})\) has Student's t distribution with \(p = n-1\) degrees of freedom. It has pdf

\[ f_T(t) = \frac{\Gamma\left(\frac{p+1}{2}\right)}{\Gamma\left(\frac{p}{2}\right)} \frac{1}{\sqrt{p\pi}} \left( 1 + \frac{t^2}{p} \right)^{-(p+1)/2} \]

The distribution is that of \(U/\sqrt{V/p}\) where \(U \sim n(0,1)\) and \(V \sim \chi^2_p\) are independent

Student's t has no mgf because it does not have moments of all orders. If there are \(p\) degrees of freedom, then there are only \(p-1\) moments.

The t distribution has an important property, robustness, which means it is much less sensitive to outliers than the Gaussian distribution

**Distribution (non-central t distribution)** Let \(X \sim N(\mu, 1)\) and \(Y \sim \chi^2(n)\) be independent; then

\[ T = \frac{X}{\sqrt{Y/n}} \]

has a noncentral t distribution with noncentrality parameter \(\mu\)

### 1.3. F distribution

**Distribution (Snedecor's F, Fisher)** Let \(X_1, ..., X_n\) be a random sample from a \(n(\mu_X, \sigma^2_X)\) population and \(Y_1, ..., Y_m\) from a \(n(\mu_Y, \sigma^2_Y)\) population. The random variable

\[ F = \frac{S_X^2 / \sigma_X^2}{S_Y^2 / \sigma_Y^2} \]

has a Snedecor's F distribution with \(n-1, m-1\) degrees of freedom.

**Lemma**

If \(X \sim F_{(p,q)}\), then \(1/X \sim F_{(q,p)}\)

If \(X \sim t_q\), then \(X^2 \sim F_{(1,q)}\)

## 2. Sampling Statistics

A sample is a sequence of iid random variables; a statistic is a function of the sample.

**Definition (random sample)** The collection of random variables \(X_1, X_2, ... X_n\) is said to be a random sample of size \(n\) if they are independent and identically distributed (i.i.d.), i.e.

- \(X_1, ..., X_n\) are independent random variables
- They have same distribution

\(x_1, x_2, ..., x_n\) are said to be their realizations. The joint pdf or pmf of \(X_1, ..., X_n\) is given by

\[ f(x_1, ..., x_n) = f(x_1)f(x_2)\cdots f(x_n) = \prod_{i=1}^n f(x_i) \]

This random sampling model is called sampling from an infinite population. Its assumption is that, after observing \(X_1 = x_1\), the probability distribution of \(X_2\) is unaffected.

In the case of sampling from a finite population, there are two cases

- *sampling with replacement*: this meets the previous definition
- *sampling without replacement*: the random variables are no longer independent (but still identically distributed). When the population is large enough, it can still be approximated by the previous condition.
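The two finite-population schemes map directly onto the stdlib; a small sketch (population and sample size are made up for illustration):

```python
# Sampling with replacement (draws may repeat) vs. without replacement
# (draws are distinct) from a finite population.
import random

random.seed(1)
population = [10, 20, 30, 40, 50]

with_replacement = random.choices(population, k=3)    # duplicates possible
without_replacement = random.sample(population, k=3)  # no duplicates

assert len(set(without_replacement)) == 3             # all draws distinct
assert all(x in population for x in with_replacement)
```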

**Definition (statistic, sampling distribution)** Given a random sample \(X_1, ..., X_n\) and a real-valued function \(T\), the random variable \(Y = T(X_1, ..., X_n)\) is called a **statistic**.

A statistic, however, cannot depend on any unknown parameter \(\theta\)!

The distribution of a statistic \(Y\) is called the **sampling distribution** of \(Y\). Once the sample is drawn, \(t = T(x_1, ..., x_n)\) is called the realization of \(T\).

a statistic is a random variable

I always get confused, but be careful: a statistic is a random variable, not a scalar (because it is a transformation (function) of random variables)

- variance is a scalar
- sample variance is a random variable

**Lemma (expectation of random sample)** Let \(X_1, ..., X_n\) be a random sample and \(g(x)\) be a function such that \(Eg(X)\) and \(\mathrm{Var}(g(X))\) exist; then

\[ E\left( \frac{1}{n}\sum_{i=1}^n g(X_i) \right) = Eg(X), \qquad \mathrm{Var}\left( \frac{1}{n}\sum_{i=1}^n g(X_i) \right) = \frac{\mathrm{Var}(g(X))}{n} \]

Note: the second one is proved using the zero covariance between iid variables

**Definition (statistical model)** A statistical model is a set of distributions.

**Definition (parametric, nonparametric)** A parametric model is a statistical model that can be parameterized by a finite number of parameters (e.g. the family of normal distributions). A nonparametric model is one that cannot be.

### 2.1. Sample Mean

**Definition (sample mean)** The sample mean is the statistic defined by

\[ \bar{X} = \frac{X_1 + \cdots + X_n}{n} = \frac{1}{n}\sum_{i=1}^n X_i \]

**Lemma (mean, variance of sample mean)** Let \(X_1, ..., X_n\) be a random sample with mean \(\mu\) and variance \(\sigma^2 < \infty\); then

\[ E\bar{X} = \mu, \qquad \mathrm{Var}\,\bar{X} = \frac{\sigma^2}{n} \]

Therefore, \(\bar{X}\) is an *unbiased estimator* of \(\mu\).

The multivariate version: let \(\mathbf{X}_1, ..., \mathbf{X}_n\) be a random sample from a joint distribution with mean vector \(\mathbf{\mu}\) and covariance matrix \(\Sigma\); then

\[ E\bar{\mathbf{X}} = \mathbf{\mu}, \qquad \mathrm{Cov}(\bar{\mathbf{X}}) = \frac{\Sigma}{n} \]

anti-pattern in the Cauchy distribution

Unlike the usual case, where the variance of the sample mean shrinks with the sample size, the dispersion of the sample mean \(\bar{X}\) of a \(Cauchy(\mu, \sigma)\) distribution, as measured by \(\sigma\), is invariant to the sample size: in fact \(\bar{X} \sim Cauchy(\mu, \sigma)\). The Cauchy distribution does not have a mean or variance.

variance of the sample mean in the case of sampling without replacement

As mentioned previously, a sample drawn without replacement is no longer independent. The variance of the sample mean from a population of size \(N\) is

\[ \mathrm{Var}\,\bar{X} = \frac{\sigma^2}{n}\left( \frac{N-n}{N-1} \right) \]

Notice it is smaller than in the previous infinite case.

**Lemma (pdf of sample mean)** By applying a transformation, we can obtain

\[ f_{\bar{X}}(x) = n f_{X_1 + \cdots + X_n}(nx) \]

**Lemma (mgf of sample mean)** Let \(X_1, ..., X_n\) be a random sample from a population with mgf \(M_X(t)\). Then the mgf of the sample mean is

\[ M_{\bar{X}}(t) = \left[ M_X(t/n) \right]^n \]

sample mean of the Gamma distribution

Suppose \(X_i \sim gamma(\alpha, \beta)\); then the mgf of the sample mean is

\[ M_{\bar{X}}(t) = \left[ \left( \frac{1}{1 - \beta t/n} \right)^{\alpha} \right]^n = \left( \frac{1}{1 - \beta t/n} \right)^{n\alpha} \]

which shows that \(\bar{X} \sim gamma(n\alpha, \beta/n)\)

Weak Law of Large Numbers

Let \(X_1, ..., X_n\) be iid random variables with finite mean \(\mu\) and variance \(\sigma^2\); then for every \(\epsilon > 0\)

\[ \lim_{n \to \infty} P(|\bar{X}_n - \mu| \geq \epsilon) = 0 \]

central limit theorem

Let \(X_1, ..., X_n\) be iid random variables with finite mean \(\mu\) and variance \(\sigma^2 > 0\), and let \(G_n(x)\) denote the cdf of \(\sqrt{n}(\bar{X}_n - \mu)/\sigma\); then

\[ \lim_{n \to \infty} G_n(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-y^2/2}\, dy \]

that is, \(\sqrt{n}(\bar{X}_n - \mu)/\sigma\) has a limiting standard normal distribution.
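A quick Monte Carlo sketch of the CLT with stdlib Python (population, sample size, and seed are arbitrary): the standardized sample mean of exponential(1) draws (mean 1, variance 1) should be close to standard normal.

```python
# CLT check: standardized sample means of a skewed population (exponential)
# should have mean ~0 and variance ~1.
import math
import random
import statistics

random.seed(2)
n, reps = 50, 20_000
zs = []
for _ in range(reps):
    xs = [random.expovariate(1.0) for _ in range(n)]
    xbar = statistics.mean(xs)
    zs.append(math.sqrt(n) * (xbar - 1.0) / 1.0)  # sqrt(n)(Xbar - mu)/sigma

assert abs(statistics.mean(zs)) < 0.03
assert abs(statistics.variance(zs) - 1.0) < 0.05
```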

### 2.2. Sample Variance

**Definition (sample variance)** The sample variance is the statistic defined by

\[ S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2 \]

By the way, this can be computed with `statistics.variance` in Python.

A useful transformation to compute \((n-1)S^2\) is

\[ (n-1)S^2 = \sum_{i=1}^n X_i^2 - n\bar{X}^2 \]

Notice the analogy between this and variance computing.
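The shortcut formula can be checked directly against the stdlib's `statistics.variance` (the data here are toy numbers for illustration):

```python
# Deterministic check of (n-1)S^2 = sum(x_i^2) - n*xbar^2.
import statistics

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(xs)
xbar = statistics.mean(xs)      # sample mean
s2 = statistics.variance(xs)    # sample variance (n-1 divisor)

shortcut = sum(x * x for x in xs) - n * xbar * xbar
assert abs((n - 1) * s2 - shortcut) < 1e-9
```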

**Lemma (mean, variance of sample variance)**

\[ ES^2 = \sigma^2, \qquad \mathrm{Var}\,S^2 = \frac{1}{n}\left( \theta_4 - \frac{n-3}{n-1}\,\theta_2^2 \right) \]

where \(\theta_i\) is the \(i\)-th central moment. The first part is easy; the second part is crazy, see the solution of Casella Berger exercise 5.8.

Therefore, \(S^2\) is an *unbiased estimator* of \(\sigma^2\)

The multivariate version is the sample covariance matrix

\[ \mathbf{S} = \frac{1}{n-1} \sum_{i=1}^n (\mathbf{X}_i - \bar{\mathbf{X}})(\mathbf{X}_i - \bar{\mathbf{X}})^T \]

sample variance of the normal distribution

In the case of the normal distribution, the sample mean and variance have several good properties, as follows.

Let \(X_1, ..., X_n\) be a random sample from a \(n(\mu, \sigma^2)\) distribution. Then

\[ \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1} \]

Taking the variance on both sides, we obtain

\[ \mathrm{Var}\,S^2 = \frac{2\sigma^4}{n-1} \]

Note this can also be obtained by computing \(\theta_4\) in the general variance formula.

independence of \(\bar{X}, S^2\) in normal distribution

\(\bar{X}\) and \(S^2\) are independent when sampling from a normal distribution

Note that in general \(\bar{X}, S^2\) are not always independent; it requires the third central moment to be 0 for them to be uncorrelated.

### 2.3. Sample Order

**Definition (order statistics)** The order statistics of a random sample \(X_1, ..., X_n\) are sample values placed in ascending order. They are denoted by \(X_{(1)}, ..., X_{(n)}\).

**Definition (sample range)** the sample range is the distance between the smallest and largest observations.

**Definition (sample median)** the sample median \(M\) is defined as

\[ M = \begin{cases} X_{((n+1)/2)} & \text{if } n \text{ is odd} \\ \left( X_{(n/2)} + X_{(n/2+1)} \right)/2 & \text{if } n \text{ is even} \end{cases} \]

One advantage of sample median over the sample mean is that it is less affected by extreme observations.

Other related statistics are lower quartile and upper quartile.
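These order-based statistics are one-liners in stdlib Python (the sample below is toy data for illustration):

```python
# Order statistics, sample range, and sample median on a small sample.
import statistics

xs = [3.1, 0.5, 2.2, 4.4, 1.9]
order = sorted(xs)                   # X_(1), ..., X_(n)
sample_range = order[-1] - order[0]  # X_(n) - X_(1)
median = statistics.median(xs)       # middle order statistic (n odd here)

assert order[0] == 0.5 and order[-1] == 4.4
assert abs(sample_range - 3.9) < 1e-12
assert median == 2.2
```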

**Proposition (discrete order statistics)** Let \(X_1, ..., X_n\) be a random sample from a discrete distribution with pmf \(f_X(x_i) = p_i\) where \(x_1 < x_2 < ...\) are the possible values of \(X\) in ascending order. Let \(P_i = \sum_{k=1}^i p_k\); then the cdf of the \(j\)-th order statistic is

\[ P(X_{(j)} \leq x_i) = \sum_{k=j}^n \binom{n}{k} P_i^k (1 - P_i)^{n-k} \]

and the pmf is

\[ P(X_{(j)} = x_i) = P(X_{(j)} \leq x_i) - P(X_{(j)} \leq x_{i-1}) \]

**Proposition (continuous order statistics)** Consider a continuous population with cdf \(F_X(x)\) and pdf \(f_X(x)\); then the cdf of the \(j\)-th order statistic is

\[ F_{X_{(j)}}(x) = \sum_{k=j}^n \binom{n}{k} F_X(x)^k \left( 1 - F_X(x) \right)^{n-k} \]

By differentiating this, we can get

\[ f_{X_{(j)}}(x) = \frac{n!}{(j-1)!(n-j)!} f_X(x) F_X(x)^{j-1} \left( 1 - F_X(x) \right)^{n-j} \]

**Proposition (joint pdf)** The joint pdf of all the order statistics is given by

\[ f_{X_{(1)}, ..., X_{(n)}}(x_1, ..., x_n) = n! \prod_{i=1}^n f_X(x_i) \]

if \(x_1 < ... < x_n\), otherwise 0.

## 3. Data Reduction Principles

Data reduction in terms of a particular statistic \(T\) can be thought of as a partition of the sample space into \(A_t = \{ x | T(x)=t \}\). The partition should not discard important information about the unknown parameter \(\theta\). Instead of reporting \(x\), we only report its partition \(T(x)\) (whose size is much smaller than \(x\))

**Definition (experiment)** Formally, an experiment \(E\) is defined to be the triple

\[ E = (X, \theta, \{ f(x|\theta) \}) \]

**Definition (evidence)** An experimenter, knowing what experiment \(E\) was performed and having observed a sample \(x\), can make some inference or draw some conclusion about \(\theta\), which is denoted by

\[ Ev(E, x) \]

which stands for the evidence about \(\theta\) arising from \(E\) and \(x\)

### 3.1. The sufficiency principle

The sufficiency principle says a sufficient statistic summarizes all the information about \(\theta\) in the sample. (You can keep only \(T(x)\) and discard \(x\) to save your memory)

**Principle (sufficiency)** Consider an experiment \(E=(X, \theta, \{f(x|\theta)\})\) and suppose \(T(X)\) is a sufficient statistic for \(\theta\). If \(\mathbf{x}, \mathbf{y}\) are two sample points such that \(T(\mathbf{x}) = T(\mathbf{y})\), then \(Ev(E, \mathbf{x}) = Ev(E, \mathbf{y})\).

Intuitively, this means: the inference about \(\theta\) should be the same whether \(\mathbf{X}=\mathbf{x}\) or \(\mathbf{X}=\mathbf{y}\) is observed.

The sufficient statistic is defined as follows

**Definition (sufficient statistic)** A statistic \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\) if the conditional distribution of the sample \(\mathbf{X}\) given the value of \(T(\mathbf{X})\) does not depend on \(\theta\)

**Criterion (condition of sufficient statistics)** If the ratio

\[ \frac{p(\mathbf{x}|\theta)}{q(T(\mathbf{x})|\theta)} \]

is constant as a function of \(\theta\), then \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\), where \(p\) is the joint pdf of the sample and \(q\) is the pdf of \(T(\mathbf{X})\)

Bernoulli sufficient statistic

Let \(X_1, ..., X_n\) be Bernoulli random variables with parameter \(\theta\). Then one sufficient statistic for \(\theta\) is

\[ T(\mathbf{X}) = X_1 + \cdots + X_n \]

This can be proved by considering \(p(\mathbf{x}|\theta) = \theta^t(1-\theta)^{n-t}\) and \(q(T(\mathbf{x})|\theta) = \binom{n}{t}\theta^t(1-\theta)^{n-t}\), where \(t = \sum_i x_i\)

Their ratio, \(1/\binom{n}{t}\), is constant with respect to \(\theta\)
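The Bernoulli example can be verified exhaustively for a small \(n\) (the values of \(n\) and \(\theta\) are arbitrary): the conditional probability of each sample point given \(T=t\) equals \(1/\binom{n}{t}\), regardless of \(\theta\).

```python
# Enumerate all Bernoulli samples of size n = 3 and check that
# P(X = x | T = t) = 1 / C(n, t) for every theta.
from itertools import product
from math import comb

n = 3
max_dev = 0.0
for theta in (0.2, 0.7):
    for x in product((0, 1), repeat=n):
        t = sum(x)
        p_x = theta ** t * (1 - theta) ** (n - t)               # P(X = x)
        q_t = comb(n, t) * theta ** t * (1 - theta) ** (n - t)  # P(T = t)
        max_dev = max(max_dev, abs(p_x / q_t - 1 / comb(n, t)))

assert max_dev < 1e-12  # conditional distribution is free of theta
```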

**Criterion (Fisher-Neyman factorization)** Let \(f(\mathbf{x}|\theta)\) denote the joint pdf of a sample \(\mathbf{X}\). A statistic \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\) iff there exist functions \(g(t|\theta)\) and \(h(\mathbf{x})\) such that

\[ f(\mathbf{x}|\theta) = g(T(\mathbf{x})|\theta)\, h(\mathbf{x}) \]

Note that \(g(T(x)|\theta)\) is not necessarily a distribution.

sufficient statistic of the exponential family

Let \(X_1, ..., X_n\) be a sample from an exponential family

\[ f(x|\mathbf{\theta}) = h(x)\, c(\mathbf{\theta}) \exp\left( \sum_{i=1}^k w_i(\mathbf{\theta}) t_i(x) \right) \]

then the following is a sufficient statistic for \(\mathbf{\theta}\)

\[ T(\mathbf{X}) = \left( \sum_{j=1}^n t_1(X_j), ..., \sum_{j=1}^n t_k(X_j) \right) \]

It turns out that outside of the exponential family of distributions, it is rare to have a sufficient statistic of smaller dimension than the size of the sample, so in many cases the order statistics are the best that we can do.

A sufficient statistic can be interpreted via the concept of a sufficient partition

**Definition (sufficient partition)** A partition \(B_1, ..., B_k\) of the sample space is called a sufficient partition if \(f(x|X \in B_i)\) does not depend on \(\theta\).

A statistic \(T\) induces a partition, and \(T\) is sufficient iff its partition is sufficient. Any refinement of a sufficient partition is also a sufficient partition (and its statistic is sufficient).

#### 3.1.1. Minimal Sufficient statistics

If there is more than one sufficient statistic, the sufficient statistic that achieves the most data reduction is the minimal sufficient statistic.

Its partition corresponds to the coarsest sufficient partition. Any coarser partition would lose sufficiency (the conditional distribution would depend on \(\theta\)).

**Definition (minimal sufficient statistic)** A sufficient statistic \(T(\mathbf{X})\) is called a minimal sufficient statistic if, for any other sufficient statistic \(T'(\mathbf{X})\), \(T(\mathbf{x})\) is a function of \(T'(\mathbf{x})\).

Note that the minimal sufficient statistic is also not unique, but its partition is unique.

**Criterion (condition of minimality)** \(T(\mathbf{X})\) is a minimal sufficient statistic for \(\theta\) when: the ratio \(f(\mathbf{x}|\theta)/f(\mathbf{y}|\theta)\) is constant as a function of \(\theta\) iff \(T(\mathbf{x})=T(\mathbf{y})\)

Poisson minimal sufficient statistic

Recall that the joint pdf of a Poisson sample is

\[ f(\mathbf{x}|\theta) = \prod_{i=1}^n \frac{e^{-\theta}\theta^{x_i}}{x_i!} = \frac{e^{-n\theta}\,\theta^{\sum_i x_i}}{\prod_i x_i!} \]

The ratio is

\[ \frac{f(\mathbf{x}|\theta)}{f(\mathbf{y}|\theta)} = \theta^{\sum_i x_i - \sum_i y_i}\, \frac{\prod_i y_i!}{\prod_i x_i!} \]

This ratio is independent of \(\theta\) iff \(\sum_i x_i = \sum_i y_i\); therefore \(T(\mathbf{X}) = \sum_i X_i\) is a minimal sufficient statistic
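The "constant in \(\theta\) iff the sums match" criterion can be checked numerically (the sample points and \(\theta\) grid below are made up for illustration):

```python
# The Poisson likelihood ratio depends on theta only through
# theta^(sum(x) - sum(y)), so it is constant in theta iff the sums agree.
from math import exp, factorial, prod

def joint_pmf(xs, theta):
    return prod(exp(-theta) * theta ** x / factorial(x) for x in xs)

x, y = [2, 0, 3], [1, 1, 3]   # sum(x) == sum(y) == 5
ratios = [joint_pmf(x, th) / joint_pmf(y, th) for th in (0.5, 1.0, 4.0)]
assert max(ratios) - min(ratios) < 1e-12   # constant in theta

x2 = [2, 0, 4]                # sums differ -> ratio varies with theta
ratios2 = [joint_pmf(x2, th) / joint_pmf(y, th) for th in (0.5, 1.0, 4.0)]
assert max(ratios2) - min(ratios2) > 1e-6
```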

#### 3.1.2. Complete Statistics

Ancillary statistics are complementary to sufficient statistics

**Definition (ancillary statistics)** A statistic \(S(X)\) whose distribution does not depend on the parameter \(\theta\) is called an *ancillary statistic*

Location family and scale family

Let \(X_1, ..., X_n\) be a sample from a location parameter family with cdf \(F(x-\theta)\); then the range

\[ R = X_{(n)} - X_{(1)} \]

is an ancillary statistic.

If the sample is from a scale parameter family with cdf \(F(x/\sigma)\), then any statistic whose functional form depends only on \(X_1 / X_n, ..., X_{n-1}/X_n\) is an ancillary statistic, for example \(\bar{X}/S\)

**Definition (complete statistic)** Let \(\{f(t|\theta)\}\) be a family of pdfs or pmfs for a statistic \(T(\mathbf{X})\). The family of probability distributions is called complete if

\[ E_\theta\, g(T) = 0 \text{ for all } \theta \implies P_\theta(g(T) = 0) = 1 \text{ for all } \theta \]

**Theorem (Basu)** If \(T(X)\) is a complete and minimal sufficient statistic, then \(T(X)\) is independent of every ancillary statistic.

#### 3.1.3. Sufficiency and Risk Reduction

Suppose we observe \(X_1, ..., X_n \sim p(X; \theta)\) and we would like to estimate \(\theta\) with \(\hat{\theta}(X_1, ..., X_n)\). In order to evaluate the estimator, we have the risk

\[ R(\theta, \hat{\theta}) = E_\theta\left[ (\hat{\theta} - \theta)^2 \right] \]

An estimator that is not a function of a sufficient statistic can be improved. Intuitively, this removes the variance within each sufficient partition by averaging over it.

**Theorem (Rao-Blackwell)** Let us define a new estimator using a sufficient statistic \(T\)

\[ \tilde{\theta} = E\left[ \hat{\theta} \mid T \right] \]

then it always improves (or at least does not worsen) the risk

\[ R(\theta, \tilde{\theta}) \leq R(\theta, \hat{\theta}) \]

**Proof** By applying Jensen's inequality and the law of total expectation,

\[ R(\theta, \tilde{\theta}) = E\left[ \left( E[\hat{\theta}|T] - \theta \right)^2 \right] \leq E\left[ E\left[ (\hat{\theta} - \theta)^2 \mid T \right] \right] = R(\theta, \hat{\theta}) \]

Bernoulli estimator

Let \(X_1, ..., X_n \sim Ber(\theta)\). We can improve the estimator \(\hat{\theta} = X_1\) using the sufficient statistic \(T = \sum_i X_i\):

\[ \tilde{\theta} = E[X_1 \mid T] = \frac{T}{n} = \bar{X} \]

Both estimators are unbiased, but the new estimator reduces the variance from \(\theta(1-\theta)\) to \(\theta(1-\theta)/n\)
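The Bernoulli example above can be verified exactly by enumerating all sample points (\(n\) and \(\theta\) below are arbitrary):

```python
# Exact check: X_1 and the Rao-Blackwellized T/n are both unbiased,
# but T/n has variance theta(1-theta)/n instead of theta(1-theta).
from itertools import product

n, theta = 4, 0.3

def moments(estimator):
    mean = var = 0.0
    for x in product((0, 1), repeat=n):
        p = theta ** sum(x) * (1 - theta) ** (n - sum(x))
        mean += p * estimator(x)
    for x in product((0, 1), repeat=n):
        p = theta ** sum(x) * (1 - theta) ** (n - sum(x))
        var += p * (estimator(x) - mean) ** 2
    return mean, var

m1, v1 = moments(lambda x: x[0])         # crude estimator X_1
m2, v2 = moments(lambda x: sum(x) / n)   # Rao-Blackwellized T/n

assert abs(m1 - theta) < 1e-12 and abs(m2 - theta) < 1e-12  # both unbiased
assert abs(v1 - theta * (1 - theta)) < 1e-12
assert abs(v2 - theta * (1 - theta) / n) < 1e-12            # n-fold reduction
```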

### 3.2. The Likelihood Principle

The likelihood principle says that the likelihood function summarizes all information about \(\theta\) from the sample. (You can only keep \(L(\theta|x)\) and discard \(x\) to save your memory)

**Principle (likelihood)** Suppose we have two experiments \(E_1 = (X_1, \theta, \{ f_1(x_1|\theta) \})\) and \(E_2= (X_2, \theta, \{ f_2(x_2|\theta) \})\) where \(\theta\) is the same in both experiments. Suppose \(x^*_1, x^*_2\) are two sample points from \(E_1, E_2\) such that their likelihood functions are proportional,

\[ L(\theta | x^*_2) = C\, L(\theta | x^*_1) \]

then

\[ Ev(E_1, x^*_1) = Ev(E_2, x^*_2) \]

**Definition (likelihood function)** Let \(f(x|\theta)\) denote the joint pdf or pmf of the sample \(X\). Given that \(X=x\) is observed, the function of \(\theta\) defined by

\[ L(\theta|x) = f(x|\theta) \]

is called the likelihood function

The likelihood principles can be derived from the sufficiency principles and conditionality principles, the latter says that if one of two experiments is randomly chosen and done, the information about \(\theta\) depends only on the experiment performed.

relationship to sufficiency

Using the factorization theorem, for any sufficient statistic \(T\) we can write

\[ L(\theta|x) = f(x|\theta) = g(T(x)|\theta)\, h(x) \]

where \(h\) does not depend on \(\theta\) and can be ignored, therefore

\[ L(\theta|x) \propto g(T(x)|\theta) \]

Thus, once we have any sufficient statistic, we have everything we need to compute the likelihood function, and we lose nothing

## 4. Measure Concentration

### 4.1. Inequalities and Tail Bound

The following inequalities are mostly about the bound of distribution tail.

**Inequality (Markov)** Let \(X\) be a nonnegative random variable (i.e. \(X \geq 0\)). Then for any \(t > 0\)

\[ P(X \geq t) \leq \frac{EX}{t} \]

Chebyshev's inequality can be obtained from Markov's inequality by taking the absolute value (to maintain nonnegativity). It has a better bound, \(O(\frac{1}{t^2})\), than Markov's.

**Inequality (Chebyshev)** Let \(X\) be a random variable with \(EX = \mu, \mathrm{Var}\, X = \sigma^2\); then for any \(t > 0\)

\[ P(|X - \mu| \geq t) \leq \frac{\sigma^2}{t^2} \]
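Chebyshev's bound can be compared with an exact tail probability; a small deterministic check with a fair six-sided die (the threshold \(t\) is arbitrary):

```python
# Exact Chebyshev check for a fair die: mean 3.5, variance 35/12.
mu = 3.5
var = sum((x - mu) ** 2 for x in range(1, 7)) / 6   # 35/12 ~ 2.917

t = 2.0
# |X - 3.5| >= 2 happens only for X in {1, 6}
exact_tail = sum(1 for x in range(1, 7) if abs(x - mu) >= t) / 6
bound = var / t ** 2

assert abs(exact_tail - 2 / 6) < 1e-12
assert exact_tail <= bound   # Chebyshev holds (and is conservative here)
```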

The Chernoff bound can also be obtained from Markov's inequality, by applying \(\exp\).

Chebyshev's inequality is conservative; Chernoff gives a better, exponential bound.

**Inequality (Chernoff)**

Suppose the mgf \(M_X(t)\) exists in a neighbourhood of 0; then for any \(b\)

\[ P(X \geq b) \leq \inf_{t > 0} e^{-tb} M_X(t) \]

Gaussian tail bound

By applying the Chernoff method to a Gaussian \(X \sim N(\mu, \sigma^2)\), we can obtain

\[ P(X - \mu \geq t) \leq e^{-t^2/(2\sigma^2)} \]

By considering symmetry, we get

\[ P(|X - \mu| \geq t) \leq 2e^{-t^2/(2\sigma^2)} \]

**Definition (sub-Gaussian random variable)** A sub-Gaussian random variable is one whose tails decay at least as fast as a Gaussian's. Formally, a random variable \(X\) with mean \(\mu\) is \(\sigma\)-sub-Gaussian if there exists a positive number \(\sigma\) such that

\[ E\left[ e^{\lambda(X-\mu)} \right] \leq e^{\sigma^2\lambda^2/2} \quad \text{for all } \lambda \in \mathbb{R} \]

It is easy to see a sub-Gaussian variable has the two-sided exponential tail bound

\[ P(|X - \mu| \geq t) \leq 2e^{-t^2/(2\sigma^2)} \]

iid sample mean of sub-Gaussians

Let \(X_1, ..., X_n\) be iid \(\sigma\)-sub-Gaussian random variables; then it can be shown that

\[ E\left[ e^{\lambda(\bar{X}-\mu)} \right] = \prod_{i=1}^n E\left[ e^{(\lambda/n)(X_i-\mu)} \right] \leq e^{(\sigma/\sqrt{n})^2 \lambda^2/2} \]

which says the sample mean is \(\sigma/\sqrt{n}\)-sub-Gaussian

general sample mean of sub-Gaussians

Let \(X_1, ..., X_n\) be independent \(\sigma_1, ..., \sigma_n\)-sub-Gaussian variables; then the sample mean is \(\sigma\)-sub-Gaussian with

\[ \sigma = \frac{1}{n}\sqrt{\sum_{i=1}^n \sigma_i^2} \]

It can be shown (using Rademacher random variables) that an \([a,b]\)-bounded random variable is \((b-a)\)-sub-Gaussian. The sample mean is then \((b-a)/\sqrt{n}\)-sub-Gaussian, and we have the Hoeffding bound.

**Inequality (Hoeffding)** Suppose the random variables are bounded in \([a,b]\); then

\[ P(|\bar{X} - \mu| \geq t) \leq 2\exp\left( -\frac{nt^2}{2(b-a)^2} \right) \]

**Inequality (Hölder)** Let \(X,Y\) be any two random variables and \(1/p + 1/q = 1\) with \(p, q > 1\); then

\[ E|XY| \leq \left( E|X|^p \right)^{1/p} \left( E|Y|^q \right)^{1/q} \]

**Inequality (Cauchy-Schwarz)** When \(p=q=2\), this is the Cauchy-Schwarz inequality

\[ E|XY| \leq \sqrt{EX^2}\sqrt{EY^2} \]

**Inequality (Minkowski)** Let \(X,Y\) be any two random variables and \(1 \leq p < \infty\); then

\[ \left( E|X+Y|^p \right)^{1/p} \leq \left( E|X|^p \right)^{1/p} + \left( E|Y|^p \right)^{1/p} \]

**Inequality (Jensen)** For any random variable \(X\), if \(g(x)\) is a convex function, then

\[ E\, g(X) \geq g(EX) \]

### 4.2. Empirical Process

First we consider the problem of estimating a CDF from a random sample. Suppose \(X_1, ..., X_n \sim F_X\); the empirical CDF seems to be a natural estimator

\[ \hat{F}_n(x) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}(X_i \leq x) \]

We can see that the empirical cdf converges to the true cdf pointwise

**Lemma (consistency of empirical CDF)** For a fixed \(x\), it is easy to see

\[ n\hat{F}_n(x) \sim Binomial(n, F_X(x)) \]

By the Strong Law of Large Numbers, \(\hat{F}_n(x)\) converges to \(F_X(x)\) almost surely for every fixed \(x\), i.e. pointwise; therefore it is consistent

Note that the SLLN's condition here is \(E|\mathbf{1}(X_i \leq x)| < \infty\), which obviously holds

A stronger result is that this convergence happens uniformly

**Theorem (Glivenko-Cantelli)**

\[ \sup_x |\hat{F}_n(x) - F_X(x)| \to 0 \text{ almost surely} \]

Note that this makes no assumption of the CDF.

Applications of Glivenko-Cantelli

We can construct estimators of various parameters of interest \(\theta(F)\) by replacing \(F\) with \(\hat{F}_n\); these are called plug-in estimators. For example, the median

\[ \theta(F) = F^{-1}(1/2) \]

is estimated by \(\theta(\hat{F}_n)\).

If \(\theta\) is continuous wrt \(||\cdot||_\infty\), then \(\theta(\hat{F}_n) \to \theta(F)\) almost surely

Rewriting the previous conclusion in terms of sets, we have

\[ \sup_{A \in \mathcal{A}} |P_n(A) - P(A)| \to 0 \text{ almost surely} \]

where \(\mathcal{A} = \{ (-\infty, x] \mid x \in \mathbb{R} \}\)

We are interested in collections of sets \(\mathcal{A}\) for which we have this uniform convergence.

A class satisfying this property is called a Glivenko-Cantelli class; in general, very complex classes of functions or sets fail to be Glivenko-Cantelli classes.

**Definition (Glivenko-Cantelli class)** \(\mathcal{F}\) is a *Glivenko-Cantelli class* if

\[ \sup_{f \in \mathcal{F}} |P_n f - P f| \to 0 \text{ almost surely} \]

One example of a GC class is the class of half-lines \(\{(-\infty, x]\}\) underlying the CDF

counterexample of a GC class

Not every class is a GC class. Consider a continuous \(P\) on \([0,1]\) and the collection \(\mathcal{A}\) of all finite subsets of \([0,1]\). There always exists a set \(A \in \mathcal{A}\) (the observed sample points themselves) such that \(P_n(A) = 1\) while \(P(A) = 0\),

so \(\mathcal{A}\) fails to be a GC class. This roughly means the class is too large

We actually have an even stronger result than Glivenko-Cantelli: a finite-sample tail bound

**Theorem (Dvoretzky-Kiefer-Wolfowitz)** For any \(\epsilon > 0\) and all \(n\), we have

\[ P\left( \sup_x |\hat{F}_n(x) - F_X(x)| > \epsilon \right) \leq 2e^{-2n\epsilon^2} \]

Notice this is similar to the conclusion of the Hoeffding bound, which gives the corresponding pointwise statement

confidence band using the DKW inequality

We can construct a confidence band using the DKW inequality. Let \(\epsilon^2_n = \log(2/\alpha)/2n\), \(L(x) = \max(\hat{F}_n(x) - \epsilon_n, 0)\), and \(U(x) = \min(\hat{F}_n(x) + \epsilon_n, 1)\); then for all \(F\)

\[ P\left( L(x) \leq F(x) \leq U(x) \text{ for all } x \in \mathbb{R} \right) \geq 1 - \alpha \]

so \((L(x), U(x))\) is a nonparametric \(1-\alpha\) confidence band
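The band construction is a few lines of stdlib Python (the sample and \(\alpha\) are made up for illustration):

```python
# DKW 1-alpha confidence band around an empirical CDF.
import math

xs = sorted([0.1, 0.4, 0.5, 0.8, 0.9, 1.2, 1.5, 2.0])
n, alpha = len(xs), 0.05
eps = math.sqrt(math.log(2 / alpha) / (2 * n))   # eps_n from DKW

def ecdf(x):
    return sum(1 for xi in xs if xi <= x) / n

def band(x):
    f = ecdf(x)
    return max(f - eps, 0.0), min(f + eps, 1.0)  # (L(x), U(x)), clipped to [0,1]

lo, hi = band(0.85)
assert 0.0 <= lo <= ecdf(0.85) <= hi <= 1.0
```

Note that the band has the same half-width \(\epsilon_n\) at every \(x\); only the clipping to \([0,1]\) makes it narrower near the ends.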

## 5. Resampling

### 5.1. Bootstrap

Suppose we are given a sample \(X_1, ..., X_n \sim P\); we can define the empirical distribution

\[ P_n = \frac{1}{n} \sum_{i=1}^n \delta_{X_i} \]

**Definition (bootstrap sample)** We can draw the bootstrap sample from the empirical distribution uniformly *with replacement*

Suppose we have an estimator \(\hat{\theta} = g(X_1, ..., X_n)\) and we want to estimate its variance \(\mathrm{Var}(\hat{\theta})\). However, this is difficult as we do not know \(P\); instead we use the bootstrap variance estimator

**Definition (bootstrap variance estimator)**

- draw sample \(X_1^*, ..., X_n^* \sim P_n\). Compute \(\hat{\theta}_n^* = g(X_1^*, ..., X_n^*)\)
- repeat the process to yield \(\hat{\theta}_{n, 1}^*, ..., \hat{\theta}_{n, B}^*\)
- compute

\[ \hat{v}_{boot} = \frac{1}{B} \sum_{j=1}^B \left( \hat{\theta}^*_{n,j} - \bar{\theta} \right)^2 \]

where \(\bar{\theta} = \sum_j \hat{\theta}^*_{n, j} / B\)
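The three steps above can be sketched in stdlib Python (the statistic, data, \(B\), and seed are made up for illustration; `statistics.pvariance` uses the \(1/B\) divisor matching the formula):

```python
# Bootstrap variance estimate for the sample median.
import random
import statistics

random.seed(3)
data = [1.2, 3.4, 2.2, 5.1, 0.7, 4.4, 2.9, 3.8, 1.9, 2.5]
B = 2000

boot_stats = []
for _ in range(B):
    resample = random.choices(data, k=len(data))  # draw from P_n with replacement
    boot_stats.append(statistics.median(resample))

v_boot = statistics.pvariance(boot_stats)  # (1/B) sum (theta*_j - theta_bar)^2
assert v_boot > 0
```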

We want to justify the bootstrap: the quantiles of the bootstrap distribution of our statistic should be close to the quantiles of its actual sampling distribution. Suppose we define the bootstrap distribution

\[ \hat{G}_n(t) = P\left( \hat{\theta}^*_n \leq t \mid X_1, ..., X_n \right) \]

and the true sampling distribution to be

\[ G_n(t) = P\left( \hat{\theta}_n \leq t \right) \]

The bootstrap works when, for example, \(\sup_t |\hat{G}_n(t) - G_n(t)| \to 0\) in probability.

### 5.2. Jackknife

The jackknife is a simple method for approximating the bias and variance of an estimator from the subsamples of size \(n-1\) obtained by omitting one observation at a time

**Definition (Jackknife bias estimate)** The jackknife bias estimate of an estimator \(T_n = T(X_1, ..., X_n)\) is defined by

\[ b_{jack} = (n-1)\left( \bar{T}_n - T_n \right) \]

where \(\bar{T}_n = \sum_i T_{(-i)}/n\) and \(T_{(-i)}\) denotes the statistic with the \(i\)-th observation removed.

Using this bias estimate, we can correct \(T_n\) to \(T_{jack}\) with

\[ T_{jack} = T_n - b_{jack} = nT_n - (n-1)\bar{T}_n \]

For many statistics with bias of order \(O(1/n)\), this correction reduces the bias to \(O(1/n^2)\)

We define the pseudo-values as \(\tilde{T}_i = nT_n - (n-1)T_{(-i)}\); then

\[ T_{jack} = \frac{1}{n} \sum_{i=1}^n \tilde{T}_i \]

and we can estimate the variance of \(T_n\) as well

**Definition (Jackknife variance estimate)** the variance estimate of \(\mathrm{Var}(T_n)\) is

\[ \hat{v}_{jack} = \frac{\tilde{s}^2}{n} \]

where \(\tilde{s}^2\) is the sample variance of the pseudo-values:

\[ \tilde{s}^2 = \frac{1}{n-1} \sum_{i=1}^n \left( \tilde{T}_i - \frac{1}{n}\sum_{j=1}^n \tilde{T}_j \right)^2 \]

This is a consistent estimator of \(\mathrm{Var}(T_n)\) under certain conditions on \(T\)
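The whole jackknife recipe fits in a short stdlib sketch (toy data). A nice exact check: for the plug-in variance (divisor \(n\), bias \(-\sigma^2/n\)), the jackknife correction recovers the unbiased \(S^2\) exactly.

```python
# Jackknife bias and variance estimates for the plug-in variance.
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)

def plugin_var(xs):                 # divisor n -> biased estimator of sigma^2
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

T_n = plugin_var(data)
T_loo = [plugin_var(data[:i] + data[i + 1:]) for i in range(n)]  # leave-one-out
T_bar = sum(T_loo) / n

b_jack = (n - 1) * (T_bar - T_n)    # jackknife bias estimate
T_jack = T_n - b_jack               # bias-corrected statistic

pseudo = [n * T_n - (n - 1) * t for t in T_loo]   # pseudo-values
v_jack = statistics.variance(pseudo) / n          # jackknife variance estimate

# The correction recovers the unbiased sample variance exactly.
assert abs(T_jack - statistics.variance(data)) < 1e-9
assert v_jack > 0
```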