0x400 Foundation
- 1. Sampling Distribution
- 2. Sampling Statistics
- 3. Data Reduction Principles
- 4. Measure Concentration
- 5. Resampling
1. Sampling Distribution
1.1. Chi-Square
Distribution (\(\chi^2\) distribution) The \(\chi^2\) distribution is a special case of the gamma distribution: the chi-square distribution with \(p\) degrees of freedom has the pdf of Gamma\((p/2, 2)\)
Note that when \(p=1\), the distribution can be obtained by applying the transformation \(Y=X^2\) to a standard normal distribution (Casella Example 2.1.9)
properties (\(\chi^2\) distribution) By applying properties of the gamma distribution, we obtain the following results
Lemma If \(Z\) is a n(0,1) random variable, then \(Z^2 \sim \chi_1^2\)
Lemma If \(X_1, ..., X_n\) are independent and \(X_i \sim \chi_{p_i}^2\), then \(X_1 + ... +X_n \sim \chi^2_{p_1+\cdots+p_n}\)
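A quick simulation check of both lemmas (a minimal sketch, assuming numpy and scipy are available):

```python
# Sketch: squaring a standard normal should give chi^2_1, and summing
# independent chi^2_{p_i} variables should give chi^2_{p_1 + ... + p_n}.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

z = rng.standard_normal(100_000)
# Z^2 vs chi^2 with 1 degree of freedom (large KS p-value expected)
print(stats.kstest(z**2, stats.chi2(df=1).cdf))

x1 = rng.chisquare(df=3, size=100_000)
x2 = rng.chisquare(df=5, size=100_000)
# X1 + X2 vs chi^2 with 3 + 5 = 8 degrees of freedom
print(stats.kstest(x1 + x2, stats.chi2(df=8).cdf))
```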
Distribution (noncentral \(\chi^2\) distribution) Let \(X_1, X_2, ..., X_n\) be independent random variables with \(X_i \sim N(\mu_i, \sigma^2)\). Then the distribution of \(Y\)
is said to be a noncentral \(\chi^2\) distribution,
where the first parameter is again the degrees of freedom and the second parameter is called the noncentrality parameter
1.2. T distribution
In most practical cases, the variance \(\sigma^2\) is unknown. Thus, to get any idea of the variability of \(\bar{X}\), we need to estimate this variance first.
Distribution (Student's t, Gosset) Let \(X_1, ..., X_n\) be a random sample from a \(n(\mu, \sigma^2)\) distribution. The quantity \((\bar{X}-\mu)/(S/\sqrt{n})\) has Student's t distribution with \(n-1\) degrees of freedom. It has pdf
The distribution is that of \(U/\sqrt{V/p}\), where \(U \sim n(0,1)\) and \(V \sim \chi^2_p\) are independent
Student's t has no mgf because it does not have moments of all orders. If there are \(p\) degrees of freedom, then there are only \(p-1\) moments.
The t distribution has an important property, robustness, which means it is much less sensitive to outliers than the Gaussian distribution
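A minimal sketch (assuming numpy/scipy) of the \(U/\sqrt{V/p}\) construction above, compared against scipy's \(t_p\):

```python
# Sketch: build T = U / sqrt(V/p) with U ~ n(0,1) and V ~ chi^2_p independent,
# and compare against the t distribution with p degrees of freedom.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p = 5
u = rng.standard_normal(100_000)
v = rng.chisquare(df=p, size=100_000)
t_samples = u / np.sqrt(v / p)

print(stats.kstest(t_samples, stats.t(df=p).cdf))   # large p-value expected
print(stats.t(df=p).moment(4))                      # 4th moment exists since 4 < p
```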
Distribution (noncentral t distribution) Let \(X \sim N(\mu, 1)\) and \(Y \sim \chi^2(n)\) be independent, then
has a noncentral t distribution with noncentrality parameter \(\mu\)
1.3. F distribution
Distribution (Snedecor's F, Fisher) Let \(X_1, ..., X_n\) be a random sample from a \(n(\mu_X, \sigma^2_X)\) population and \(Y_1, ..., Y_m\) from a \(n(\mu_Y, \sigma^2_Y)\) population. The random variable
has a Snedecor's F distribution with \(n-1, m-1\) degrees of freedom.
Lemma
If \(X \sim F_{(p,q)}\), then \(1/X \sim F_{(q,p)}\)
If \(X \sim t_q\), then \(X^2 \sim F_{(1,q)}\)
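Both parts of the lemma are easy to check by simulation; here is a minimal sketch assuming numpy/scipy:

```python
# Sketch: check 1/X ~ F(q,p) when X ~ F(p,q), and X^2 ~ F(1,q) when X ~ t_q.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p, q = 4, 7

x = stats.f(dfn=p, dfd=q).rvs(size=100_000, random_state=rng)
print(stats.kstest(1 / x, stats.f(dfn=q, dfd=p).cdf))   # large p-value expected

t = stats.t(df=q).rvs(size=100_000, random_state=rng)
print(stats.kstest(t**2, stats.f(dfn=1, dfd=q).cdf))    # large p-value expected
```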
2. Sampling Statistics
A sample is a sequence of iid random variables; a statistic is a function of the sample.
Definition (random sample) The collection of random variables \(X_1, X_2, ... X_n\) is said to be a random sample of size \(n\) if they are independent and identically distributed (i.i.d.), i.e.,
- \(X_1, ..., X_n\) are independent random variables
- They have same distribution
\(x_1, x_2, ..., x_n\) are said to be their realizations. The joint pdf or pmf of \(X_1, ..., X_n\) is given by
This random sampling model is called sampling from an infinite population. Its assumption is that, after observing \(X_1 = x_1\), the probability distribution of \(X_2\) is unaffected.
In the case of sampling from a finite population, there are two cases
- sampling with replacement: this meets the previous definition
- sampling without replacement: the random variables are no longer independent (but still identically distributed). When the population is large enough, this can still be approximated by the previous condition.
Definition (statistic, sampling distribution) Given a random sample \(X_1, ..., X_n\) and a real-valued (or vector-valued) function \(T\), the random variable \(Y = T(X_1, ..., X_n)\) is called a statistic.
The statistic, however, cannot depend on any unknown parameter \(\theta\)!
The distribution of a statistic \(Y\) is called the sampling distribution of \(Y\). Once the sample is drawn, \(t\) is called the realization of \(T\), where
a statistic is a random variable
I always get confused, but be careful that a statistic is a random variable, not a scalar (because it is a transformation (function) of random variables)
- variance is a scalar
- sample variance is a random variable
Lemma (expectation of random sample) Let \(X_1, ..., X_n\) be a random sample and \(g(x)\) be a function such that \(Eg(X)\) and \(\mathrm{Var}(g(X))\) exist. Then
Note: the second one is proved using the zero covariance between iid variables
Definition (statistical model) A statistical model is a set of distributions.
Definition (parametric, non parametric) A parametric model is a set of distributions that can be parameterized by a finite number of parameters (e.g., the normal family). A nonparametric model is a set that cannot be.
2.1. Sample Mean
Definition (sample mean) The sample mean is the statistic defined by
Lemma (mean, variance of sample mean) Let \(X_1, ..., X_n\) be a random sample with mean \(\mu\) and variance \(\sigma^2 < \infty\). Then
Therefore, \(\bar{X}\) is an unbiased estimator of \(\mu\).
The multivariate version: Let \(\mathbf{X}_1, ..., \mathbf{X}_n\) be a random sample from a joint distribution with mean vector \(\mathbf{\mu}\), and covariance matrix \(\Sigma\)
anti-pattern in Cauchy distribution
Unlike the usual case, where the variance of the sample mean shrinks with the sample size, the dispersion of the sample mean \(\bar{X}\) of a \(Cauchy(\mu, \sigma)\) distribution, measured by \(\sigma\), is invariant to the sample size: in fact \(\bar{X} \sim Cauchy(\mu, \sigma)\) for every \(n\). The Cauchy distribution has neither mean nor variance.
variance of sample mean in the case of sampling without replacement
As mentioned previously, in the case of sampling without replacement the observations are no longer independent. The variance of the sample mean is
Notice it is smaller than the previous infinite case.
Lemma (pdf of sample mean) By applying a transformation, we can obtain
Lemma (mgf of sample mean) Let \(X_1, ..., X_n\) be a random sample from a population with mgf \(M_X(t)\). Then the mgf of the sample mean is
sample mean of Gamma distribution
Suppose \(X_i \sim gamma(\alpha, \beta)\), then the mgf of the sample mean is
which shows that \(\bar{X} \sim gamma(n\alpha, \beta/n)\)
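A small simulation check of this fact (a sketch assuming numpy/scipy; note both parameterize the gamma by shape and scale, so gamma\((\alpha, \beta)\) here corresponds to shape \(\alpha\) and scale \(\beta\)):

```python
# Sketch: the sample mean of n iid gamma(alpha, beta) variables should be
# distributed as gamma(n*alpha, beta/n).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, beta, n = 2.0, 3.0, 10

# Each row is one random sample of size n
samples = rng.gamma(shape=alpha, scale=beta, size=(50_000, n))
xbar = samples.mean(axis=1)

print(stats.kstest(xbar, stats.gamma(a=n * alpha, scale=beta / n).cdf))
```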
Weak Law of Large Number
Let \(X_1, ..., X_n\) be iid random variables with mean \(\mu\) and finite variance \(\sigma^2\), then for every \(\epsilon > 0\)
central limit theorem
Let \(X_1, ..., X_n\) be iid random variables with mean \(\mu\) and finite variance \(\sigma^2 > 0\), let \(G_n(x)\) denote the cdf of \(\sqrt{n}(\bar{X}_n - \mu)/\sigma\), then
that is, \(\sqrt{n}(\bar{X}_n - \mu)/\sigma\) has a limiting standard normal distribution.
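A minimal simulation sketch of the CLT (assuming numpy/scipy), using an Exponential(1) population, for which \(\mu = \sigma = 1\):

```python
# Sketch: the standardized sample mean of a skewed population (exponential)
# approaches n(0,1) as n grows; the KS distance to the normal shrinks.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0   # mean and std of Exponential(1)

for n in (5, 50, 500):
    xbar = rng.exponential(scale=1.0, size=(20_000, n)).mean(axis=1)
    z = np.sqrt(n) * (xbar - mu) / sigma
    print(n, stats.kstest(z, stats.norm.cdf).statistic)
```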
2.2. Sample Variance
Definition (sample variance) The sample variance is the statistic defined by
By the way, this can be computed with statistics.variance in Python.
A useful identity for computing \((n-1)S^2\) is
Notice the analogy between this and the computational formula for the variance.
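A small standard-library sketch checking this identity numerically and illustrating the statistics.variance call mentioned above (the tolerance is an arbitrary choice for floating-point comparison):

```python
# Sketch: statistics.variance uses the (n-1) denominator, and
# (n-1) * S^2 equals sum(x_i^2) - n * xbar^2.
import random
import statistics

random.seed(0)
x = [random.gauss(0, 1) for _ in range(100)]
n = len(x)
xbar = statistics.mean(x)

s2 = statistics.variance(x)                  # sample variance S^2
lhs = (n - 1) * s2
rhs = sum(xi**2 for xi in x) - n * xbar**2
print(abs(lhs - rhs) < 1e-9)                 # True, up to floating point
```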
Lemma (mean, variance of sample variance)
where \(\theta_i\) is the i-th central moment. The first part is easy, the second part is crazy, see solution of Casella Berger exercise 5.8.
Therefore, \(S^2\) is an unbiased estimator of \(\sigma^2\)
The multivariate version
sample variance of normal distribution
In the case of normal distribution, the sample mean and variance have several good properties as follows:
Let \(X_1, ..., X_n\) be a random sample from a \(n(\mu, \sigma^2)\) distribution. Then
Taking the variance on both sides, we obtain
Note this can also be obtained by computing \(\theta_4\) from the general variance formula.
independence of \(\bar{X}, S^2\) in normal distribution
\(\bar{X}\) and \(S^2\) are independent when sampling from a normal distribution
Note that in general \(\bar{X}\) and \(S^2\) are not independent; independence requires the 3rd central moment to be 0 (and in fact characterizes the normal distribution).
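A simulation sketch (assuming numpy) illustrating this: for normal samples the empirical correlation between \(\bar{X}\) and \(S^2\) is essentially zero, while for a skewed population (nonzero third central moment) it is not:

```python
# Sketch: correlation between the sample mean and sample variance across many
# repeated samples, for a normal and a skewed (exponential) population.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 20, 50_000

for name, data in [("normal", rng.standard_normal((reps, n))),
                   ("exponential", rng.exponential(size=(reps, n)))]:
    xbar = data.mean(axis=1)
    s2 = data.var(axis=1, ddof=1)
    print(name, np.corrcoef(xbar, s2)[0, 1])   # ~0 for normal, clearly > 0 for exponential
```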
2.3. Sample Order
Definition (order statistics) The order statistics of a random sample \(X_1, ..., X_n\) are sample values placed in ascending order. They are denoted by \(X_{(1)}, ..., X_{(n)}\).
Definition (sample range) The sample range is the distance between the smallest and largest observations.
Definition (sample median) The sample median \(M\) is defined as
One advantage of sample median over the sample mean is that it is less affected by extreme observations.
Other related statistics are lower quartile and upper quartile.
Proposition (discrete order statistics) Let \(X_1, ..., X_n\) be a random sample from a discrete distribution with pmf \(f_X(x_i) = p_i\) where \(x_1 < x_2 < ...\) are the possible values of \(X\) in ascending order. Let \(P_i = \sum_{k=1}^i p_k\), then the cdf of the order statistics is
and the pmf is
Proposition (continuous order statistics) Consider a continuous population with cdf \(F_X(x)\) and pdf \(f_X(x)\), then the cdf of the order statistic is
Differentiating this, we get the pdf
Proposition (joint pdf) The joint pdf of all the order statistics is given by
if \(x_1 < ... < x_n\), otherwise 0.
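As a concrete sanity check of the continuous case, here is a minimal sketch (assuming numpy/scipy): for a Uniform(0,1) sample the maximum \(X_{(n)}\) has cdf \(F_X(x)^n = x^n\), i.e. a Beta\((n,1)\) distribution:

```python
# Sketch: empirical distribution of the maximum order statistic of a
# Uniform(0,1) sample versus Beta(n, 1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 5
x_max = rng.uniform(size=(100_000, n)).max(axis=1)
print(stats.kstest(x_max, stats.beta(a=n, b=1).cdf))   # large p-value expected
```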
3. Data Reduction Principles
Data reduction in terms of a particular statistic \(T\) can be thought of as a partition of the sample space into \(A_t = \{ x | T(x)=t \}\). The partition should not discard important information about the unknown parameter \(\theta\). Instead of reporting \(x\), we only report its partition \(T(x)\) (whose size is much smaller than \(x\))
Definition (experiment) Formally an experiment \(E\) is defined to be
Definition (evidence) An experimenter, knowing what experiment \(E\) is performed and having observed a sample \(x\), can make some inference or draw some conclusion about \(\theta\), which is denoted by
which stands for the evidence about \(\theta\) arising from \(E, x\)
3.1. The sufficiency principle
The sufficiency principle says that a sufficient statistic summarizes all the information about \(\theta\) in the sample. (You can keep only \(T(x)\) and discard \(x\) to save memory.)
Principle (sufficiency) Consider an experiment \(E=(X, \theta, f(x|\theta))\) and suppose \(T(X)\) is a sufficient statistic for \(\theta\).
Intuitively, this means: If \(\mathbf{x}, \mathbf{y}\) are two sample points such that \(T(\mathbf{x}) = T(\mathbf{y})\), then the inference about \(\theta\) should be the same whether \(\mathbf{X}=\mathbf{x}\) or \(\mathbf{X}=\mathbf{y}\) is observed
The sufficient statistic is defined as follows
Definition (sufficient statistic) A statistic \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\) if the conditional distribution of the sample \(\mathbf{X}\) given the value of \(T(\mathbf{X})\) does not depend on \(\theta\)
Criterion (condition for a sufficient statistic) If the following ratio (of the joint pdf/pmf of the sample to the pdf/pmf of \(T(\mathbf{X})\)) is constant as a function of \(\theta\), then \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\)
Bernoulli sufficient statistic
Let \(X_1, ..., X_n\) be Bernoulli random variables with parameter \(\theta\). Then one sufficient statistic for \(\theta\) is
This can be proved by considering \(p(\mathbf{x}|\theta)\) and \(q(T(\mathbf{x})|\theta)\)
Their ratio is constant with respect to \(\theta\)
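A simulation sketch of what sufficiency means here (assuming numpy): conditionally on \(T = t\), the sample carries no further information about \(\theta\), e.g. \(P(X_1 = 1 \mid T = t) = t/n\) for every \(\theta\):

```python
# Sketch: the conditional distribution of the sample given T = sum(X_i) = t
# does not depend on theta; here we check P(X_1 = 1 | T = t) = t/n.
import numpy as np

rng = np.random.default_rng(0)
n, t = 10, 4

for theta in (0.2, 0.5, 0.8):
    x = rng.binomial(1, theta, size=(200_000, n))
    cond = x[x.sum(axis=1) == t]        # keep only samples with T = t
    print(theta, cond[:, 0].mean())     # all close to t/n = 0.4
```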
Criterion (Fisher-Neyman factorization) Let \(f(\mathbf{x}|\theta)\) denote the joint pdf of a sample \(\mathbf{X}\). A statistic \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\) iff
Note that \(g(T(x)|\theta)\) is not necessarily a distribution.
sufficient statistics of exponential distribution
Let \(X_1,..., X_n\) be a random sample from an exponential family
then the following is a sufficient statistic for \(\theta\)
It turns out that outside of the exponential family of distributions, it is rare to have a sufficient statistic of smaller dimension than the size of the sample, so in many cases the order statistics are the best that we can do.
Sufficient statistic can be interpreted with the concept of sufficient partition
Definition (sufficient partition) A partition \(B_1, ..., B_k\) is called a sufficient partition if \(f(x|X \in B_i)\) does not depend on \(\theta\).
A statistic \(T\) induces a partition. \(T\) is sufficient iff its partition is sufficient. Any refinement of a sufficient partition is also a sufficient partition (and the corresponding statistic is sufficient).
3.1.1. Minimal Sufficient statistics
If there is more than one sufficient statistic, the one that achieves the most data reduction is the minimal sufficient statistic.
Its partition corresponds to the coarsest sufficient partition; any coarser partition would destroy sufficiency (it would depend on \(\theta\)).
Definition (minimal sufficient statistic) A sufficient statistic \(T(\mathbf{X})\) is called a minimal sufficient statistic if, for any other sufficient statistic \(T'(\mathbf{X})\), \(T(\mathbf{x})\) is a function of \(T'(\mathbf{x})\)
Note that the minimal sufficient statistic is also not unique, but its partition is unique
Criterion (condition of minimality) \(T(\mathbf{X})\) is a minimal sufficient statistic for \(\theta\) if the following holds: the ratio \(f(\mathbf{x}|\theta)/f(\mathbf{y}|\theta)\) is constant as a function of \(\theta\) iff \(T(\mathbf{x})=T(\mathbf{y})\)
Poisson minimal sufficient statistic
Recall that the joint pmf of a Poisson sample is
The ratio is
This ratio is independent of \(\theta\) iff \(\sum_i x_i = \sum_i y_i\), therefore the following \(T\) is a minimal sufficient statistic
3.1.2. Complete Statistics
Ancillary statistics are complementary to sufficient statistics
Definition (ancillary statistics) A statistic \(S(X)\) whose distribution does not depend on the parameter \(\theta\) is called an ancillary statistic
Location family and scale family
Let \(X_1,..., X_n\) be a sample from a location parameter family with cdf \(F(x-\theta)\), then
is an ancillary statistic
If it is from a scale parameter family with cdf \(F(x/\sigma)\), then any statistic whose functional form depends only on \(X_1 / X_n, ..., X_{n-1}/X_n\) is an ancillary statistic, for example
Definition (complete statistic) Let \(f(t|\theta)\) be a family of pdfs or pmfs for a statistic \(T(X)\), the family of probability distributions is called complete if
Theorem (Basu) If \(T(X)\) is a complete and minimal sufficient statistic, then \(T(X)\) is independent of every ancillary statistic.
3.1.3. Sufficiency and Risk Reduction
Suppose we observe \(X_1, ..., X_n \sim p(X; \theta)\) and we would like to estimate \(\theta\) with \(\hat{\theta}(X_1, ..., X_n)\). In order to evaluate the estimator, we have the risk
An estimator that does not depend on the data only through a sufficient statistic can be improved. Intuitively, the improvement removes the variance within each sufficient partition by averaging over it.
Theorem (Rao-Blackwell) Let us define a new estimator using a sufficient statistic \(T\)
then its risk is never worse
Proof By applying Jensen's inequality and the law of total expectation,
Bernoulli estimator
Let \(X_1, ..., X_n \sim Ber(\theta)\); we can improve the estimator \(\hat{\theta} = X_1\) using the sufficient statistic \(T = \sum_i X_i\), then
Both estimators are unbiased, but the new estimator reduces the variance from \(\theta(1-\theta)\) to \(\theta(1-\theta)/n\)
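A simulation sketch of this variance reduction (assuming numpy; the numbers in the comments are for the chosen \(\theta = 0.3\), \(n = 25\)):

```python
# Sketch: Rao-Blackwellizing theta_hat = X_1 with T = sum(X_i) gives
# E[X_1 | T] = T/n = X_bar; both are unbiased, but the variance drops
# from theta(1-theta) to theta(1-theta)/n.
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 0.3, 25, 100_000

x = rng.binomial(1, theta, size=(reps, n))
crude = x[:, 0]                  # theta_hat = X_1
rb = x.mean(axis=1)              # E[X_1 | T] = X_bar

print(crude.mean(), rb.mean())               # both ~ 0.3
print(crude.var(), theta * (1 - theta))      # ~ 0.21
print(rb.var(), theta * (1 - theta) / n)     # ~ 0.0084
```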
3.2. The Likelihood Principle
The likelihood principle says that the likelihood function summarizes all information about \(\theta\) from the sample. (You can keep only \(L(\theta|x)\) and discard \(x\) to save memory.)
Principle (likelihood) Suppose we have two experiments \(E_1 = (X_1, \theta, \{ f_1(x_1|\theta) \})\) and \(E_2= (X_2, \theta, \{ f_2(x_2|\theta) \})\) where \(\theta\) is the same in both experiments. Suppose \(x^*_1, x^*_2\) are two sample points from \(E_1, E_2\) such that their likelihood functions are proportional,
\(L(\theta | x^*_2) = C L(\theta | x^*_1)\) for some constant \(C\); then
Definition (likelihood function) Let \(f(x|\theta)\) denote the joint pdf or pmf of the sample \(X\). Given that \(X=x\) is observed, the function of \(\theta\) defined as follows is called the likelihood function
The likelihood principles can be derived from the sufficiency principles and conditionality principles, the latter says that if one of two experiments is randomly chosen and done, the information about \(\theta\) depends only on the experiment performed.
relationship to sufficiency
Using the factorization theorem, we see that for any sufficient statistic we can write
where \(h\) does not depend on \(\theta\) and can be ignored; therefore
Thus, once we have any sufficient statistic, we have everything we need to compute the likelihood function and we lose nothing
4. Measure Concentration
4.1. Inequalities and Tail Bound
The following inequalities are mostly about the bound of distribution tail.
Inequality (Markov) Let \(X\) be a nonnegative random variable (i.e. \(X \geq 0\)). Then
Chebyshev's inequality can be obtained from Markov's inequality by applying it to \(|X-\mu|^2\) (to maintain nonnegativity).
It gives a better bound, \(O(\frac{1}{t^2})\), than Markov's \(O(\frac{1}{t})\).
Inequality (Chebyshev) Let \(X\) be a random variable and \(EX = \mu, \mathrm{Var} X = \sigma^2\), then for any \(t > 0\)
The Chernoff bound can also be obtained from Markov's inequality, this time by applying it to \(e^{tX}\).
Chebyshev's inequality is conservative; the Chernoff bound is exponentially tighter.
Inequality (Chernoff)
Suppose the mgf exists in a neighbourhood of \(b\); then
Gaussian tail bound
By applying the Chernoff method to the Gaussian, we obtain
By symmetry, we also get
Definition (sub-Gaussian random variable) A sub-Gaussian random variable is one whose tails decay at least as fast as those of a Gaussian. Formally, a random variable \(X\) with mean \(\mu\) is sub-Gaussian if there exists a positive number \(\sigma\) such that
It is easy to see that a sub-Gaussian random variable has a two-sided exponential tail bound
iid sample mean of sub-Gaussian
Let \(X_1, ..., X_n\) be iid \(\sigma\)-sub-Gaussian random variables; then it can be shown
which says the sample mean is \(\sigma/\sqrt{n}\)-sub-Gaussian
general sample mean of sub-Gaussian
Let \(X_1, ..., X_n\) be independent \(\sigma_1, ..., \sigma_n\)-sub-Gaussian random variables; then the sample mean is \(\sigma\)-sub-Gaussian with
It can be shown (using Rademacher random variables) that a random variable bounded in \([a,b]\) is \((b-a)\)-sub-Gaussian. The sample mean is then \((b-a)/\sqrt{n}\)-sub-Gaussian, which gives the Hoeffding bound.
Inequality (Hoeffding) Suppose the random variables are bounded in \([a,b]\), then
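A simulation sketch (assuming numpy) comparing the empirical tail of a Bernoulli sample mean with the standard two-sided Hoeffding bound \(P(|\bar{X} - \mu| \geq t) \leq 2\exp(-2nt^2/(b-a)^2)\) (constants may differ slightly depending on the sub-Gaussian derivation used above):

```python
# Sketch: empirical tail probability of the sample mean of [0,1]-bounded
# (Bernoulli) variables versus the Hoeffding bound.
import numpy as np

rng = np.random.default_rng(0)
n, reps, mu, t = 100, 200_000, 0.5, 0.1

xbar = rng.binomial(1, mu, size=(reps, n)).mean(axis=1)
empirical = np.mean(np.abs(xbar - mu) >= t)
hoeffding = 2 * np.exp(-2 * n * t**2)        # (b - a) = 1 here
print(empirical, hoeffding)                  # empirical tail <= bound
```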
Inequality (Hölder) Let \(X,Y\) be any two random variables and \(1/p + 1/q = 1\), then
Inequality (Cauchy-Schwarz) When \(p=q=2\), it is the Cauchy-Schwarz inequality
Inequality (Minkowski) Let \(X,Y\) be any two random variables and \(1 \leq p \leq \infty\)
Inequality (Jensen) For any random variable \(X\), if \(g(x)\) is a convex function, then
4.2. Empirical Process
First we consider the problem of estimating a CDF given a random sample. Suppose \(X_1, ..., X_n \sim F_X\); the empirical CDF seems to be a natural estimator
We can see that the empirical CDF converges to the true CDF pointwise
Lemma (consistency of empirical CDF) For a fixed \(x\), it is easy to see
By the Strong Law of Large Numbers, it converges to \(F_X(x)\) almost surely for every fixed \(x\) (pointwise); therefore it is consistent
Note that the SLLN's condition here is \(E|\mathbf{1}(X_i \leq x)| < \infty\), which obviously holds
A stronger result is that this convergence happens uniformly
Theorem (Glivenko-Cantelli)
Note that this makes no assumption of the CDF.
Applications of Glivenko-Cantelli
We can construct estimators of various parameters of interest \(\theta(F)\) by replacing \(F\) with \(F_n\); these are called plug-in estimators. For example, the median
If \(\theta\) is continuous wrt \(||\cdot||_\infty\), then \(\theta(\hat{F}) \to \theta(F)\) almost surely
Rewriting the previous conclusion in terms of sets, we have
where \(\mathcal{A} = \{ (-\infty, x] \mid x \in \mathbb{R} \}\)
We are interested in collections of sets \(\mathcal{A}\) for which we have this uniform convergence.
A class satisfying this property is called a Glivenko-Cantelli class; in general, very complex classes of functions or sets fail to be Glivenko-Cantelli classes.
Definition (Glivenko-Cantelli class) \(F\) is a Glivenko-Cantelli class if
One example of a GC class is the class \(\mathcal{A} = \{ (-\infty, x] \mid x \in \mathbb{R} \}\) underlying the CDF
counterexample of GC class
Not every class is a GC class. Consider the collection \(\mathcal{A}\) of all finite subsets of \([0,1]\), with \(P\) a continuous distribution on \([0,1]\). There always exists a set \(A \in \mathcal{A}\) (for example, the observed sample points themselves) such that \(P_n(A) = 1\) and \(P(A) = 0\),
so \(\mathcal{A}\) fails to be a GC class. Roughly, this means the class is too large.
We actually have an even stronger tail-bound theorem than Glivenko-Cantelli
Theorem (Dvoretzky-Kiefer-Wolfowitz) For any \(\epsilon > 0\) and all \(n\), we have
Notice this is similar to the conclusion of the Hoeffding bound, which relates to pointwise convergence
confidence interval using the DKW inequality
We can construct a confidence band using the DKW inequality: let \(\epsilon^2_n = \log(2/\alpha)/(2n)\), \(L(x) = \max(\hat{F}_n(x) - \epsilon_n, 0)\), \(U(x) = \min(\hat{F}_n(x) + \epsilon_n, 1)\); then for all \(x \in \mathbb{R}\) and all \(F\)
where \((L(x), U(x))\) is a nonparametric \(1-\alpha\) confidence band
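A minimal sketch (assuming numpy) of this DKW confidence band, evaluated at the observed order statistics:

```python
# Sketch: DKW-based 1 - alpha confidence band for the CDF,
# clipping L and U to [0, 1].
import numpy as np

rng = np.random.default_rng(0)
alpha, n = 0.05, 200
x = np.sort(rng.standard_normal(n))

eps = np.sqrt(np.log(2 / alpha) / (2 * n))
F_hat = np.arange(1, n + 1) / n              # empirical CDF at the order statistics
L = np.clip(F_hat - eps, 0.0, 1.0)
U = np.clip(F_hat + eps, 0.0, 1.0)

# (x[i], L[i], U[i]) gives the band at each observed point
print(eps, L[:3], U[:3])
```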
5. Resampling
5.1. Bootstrap
Given a sample \(X_1, ..., X_n \sim P\), we can define the empirical distribution
Definition (bootstrap sample) We can draw the bootstrap sample from the empirical distribution uniformly with replacement
Suppose we have an estimator \(\hat{\theta} = g(X_1, ..., X_n)\) and we want to estimate its variance \(Var(\hat{\theta})\); however, this is difficult since we do not know \(P\). Instead we use the bootstrap variance estimator
Definition (bootstrap variance estimator)
- draw sample \(X_1^*, ..., X_n^* \sim P_n\). Compute \(\hat{\theta}_n^* = g(X_1^*, ..., X_n^*)\)
- repeat the process to yield \(\hat{\theta}_{n, 1}^*, ..., \hat{\theta}_{n, B}^*\)
- compute
where \(\bar{\theta} = \sum_j \hat{\theta}^*_{n, j} / B\)
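A minimal sketch (assuming numpy) of the bootstrap variance estimator; the statistic \(g\) here is the sample median, a hypothetical choice for illustration:

```python
# Sketch: bootstrap variance estimate for the sample median.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100)       # observed sample
B = 2000

theta_star = np.empty(B)
for b in range(B):
    idx = rng.integers(0, len(x), size=len(x))   # draw from P_n with replacement
    theta_star[b] = np.median(x[idx])

v_boot = theta_star.var(ddof=1)    # bootstrap variance estimate of Var(median)
print(v_boot)
```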
We want to justify the bootstrap: the quantiles of the bootstrap distribution of our statistic should be close to the quantiles of the actual sampling distribution. Suppose we define
and the true sampling distribution to be
The bootstrap works if, for example,
5.2. Jackknife
The jackknife is a simple method for approximating the bias and variance of an estimator using the \(n\) subsamples of size \(n-1\) obtained by omitting one observation at a time
Definition (Jackknife bias estimate) The jackknife bias estimate of an estimator \(T_n = T(X_1, ..., X_n)\) is defined by
where \(\bar{T}_n = \sum_i T_{(-i)}/n\) and \(T_{(-i)}\) denote the statistic with the \(i\)-th observation removed.
Using this bias estimate, we can correct \(T_n\), obtaining the bias-corrected estimator \(T_{jack}\) given by
For many statistics with the bias of order \(O(1/n)\), this correction reduces the bias to \(O(1/n^2)\)
We define the pseudo-values as \(\tilde{T}_i = nT_n - (n-1)T_{(-i)}\), then
and we can estimate variance of \(T_n\) as well
Definition (Jackknife variance estimate) the variance estimate of \(Var(T_n)\) is
where \(\tilde{s}^2\) is the sample variance of the pseudo-values:
This is a consistent estimator of \(Var(T_n)\) under certain conditions on \(T\)
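A minimal sketch (assuming numpy) of the jackknife bias and variance estimates; the helper name jackknife and the illustration statistic, the (biased) plug-in variance \(T_n = \frac{1}{n}\sum_i (X_i - \bar{X})^2\), are hypothetical choices:

```python
# Sketch: jackknife bias and variance estimates via leave-one-out statistics
# and pseudo-values.
import numpy as np

def jackknife(x, stat):
    n = len(x)
    T_n = stat(x)
    T_loo = np.array([stat(np.delete(x, i)) for i in range(n)])   # leave-one-out
    bias = (n - 1) * (T_loo.mean() - T_n)         # jackknife bias estimate
    pseudo = n * T_n - (n - 1) * T_loo            # pseudo-values
    var = pseudo.var(ddof=1) / n                  # s_tilde^2 / n
    return T_n - bias, bias, var                  # corrected estimate, bias, variance

rng = np.random.default_rng(0)
x = rng.standard_normal(50)
print(jackknife(x, lambda v: np.mean((v - v.mean()) ** 2)))
```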