0x401 Classical Inference
- 1. Point Estimation
- 2. Hypothesis Testing
- 3. Interval Estimation
- 4. Reference
The basic problem of statistical inference is the inverse of probability: Given the outcomes, what can we say about the process that generated the data?
More formally, statistical inference is the process of using data analysis to infer properties of an underlying probability distribution.
- Frequentist inference: the unknown quantity is assumed to be a fixed quantity
- Bayesian inference: the unknown quantity is assumed to be a random variable; we have some initial guess about its distribution (the prior) and update the distribution after observing the data
1. Point Estimation
Point estimation consists of two parts: how to find point estimators, and how to evaluate them.
Definition (point estimator) A point estimator is any function \(W(X_1, X_2, ..., X_n)\) of a sample; that is, any statistic is a point estimator.
An estimator is a function of the sample; an estimate is the realized value of the estimator. There are often natural candidates for point estimators, but they do not always follow our intuition.
1.1. Methods of Finding Estimators
1.1.1. Method of Moments
This method is perhaps the oldest method of finding point estimators; it is a good starting place when other methods prove intractable. It can also be applied to obtain approximations to the distributions of statistics (e.g., the Satterthwaite approximation).
Definition (method of moments estimator) Equate the first \(k\) sample moments to the corresponding \(k\) population moments and solve the resulting system:

\(m_j = \frac{1}{n}\sum_{i=1}^{n} X_i^j = \mu'_j(\theta_1, ..., \theta_k), \quad j = 1, ..., k\)

The right side is a function of \(\theta_1, ..., \theta_k\) and the left side consists of the sample moment statistics. Solving this system, we obtain

\(\hat{\theta}_i = \hat{\theta}_i(m_1, ..., m_k)\)

which is the estimator for \(\theta_i\)
MoM of Gaussian distribution

Consider \(X_1, ..., X_n \sim N(\mu, \sigma^2)\); applying MoM gives

\(\bar{X} = \mu, \quad \frac{1}{n}\sum_{i=1}^{n} X_i^2 = \mu^2 + \sigma^2\)

By solving this, we get \(\hat{\mu} = \bar{X}\) and \(\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^2\)
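A minimal numpy sketch of this estimator (the true parameters, sample size, and seed below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=10_000)  # sample with mu=2, sigma=3

# Method of moments: match the first two sample moments to mu and mu^2 + sigma^2
mu_hat = x.mean()
sigma2_hat = (x ** 2).mean() - mu_hat ** 2  # equals the biased sample variance

print(mu_hat, np.sqrt(sigma2_hat))  # should be close to 2 and 3
```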
MoM of GMM
Sometimes MoM is easier to solve than the MLE. For example, if a GMM has two Gaussian components and 5 parameters, we can compute the first 5 or 6 moments (using the MGF) and solve with MoM. It is sometimes more stable than the MLE.
1.1.2. Maximum Likelihood Estimators
Definition (maximum likelihood estimator) The maximum likelihood estimator \(\hat{\theta}(\mathbf{x})\) is the parameter value \(\theta\) that maximizes \(L(\theta | \mathbf{x})\)
Theorem (invariance property of MLE) If \(\hat{\theta}\) is the MLE of \(\theta\), then for any function \(\tau(\theta)\), the MLE of \(\tau(\theta)\) is \(\tau(\hat\theta)\)
Proposition (properties of MLE)
- \(\hat{\theta}_{MLE}\) is asymptotically consistent
- \(\hat{\theta}_{MLE}\) is asymptotically unbiased
- \(\hat{\theta}_{MLE}\) is approximately a normal random variable
MLE and KL divergence
One way to interpret MLE is to view it as minimizing the dissimilarity between the model distribution \(P(x | \theta)\) and the true distribution \(P(x | \theta^*)\), where the dissimilarity is the KL divergence

\(KL(P(x|\theta^*) \| P(x|\theta)) = E_{\theta^*}[\log P(x|\theta^*)] - E_{\theta^*}[\log P(x|\theta)]\)

The first term is constant in \(\theta\), and maximizing the second term is asymptotically equivalent to maximizing the likelihood, by the law of large numbers.
Properties (drawbacks of MLE) The drawbacks of MLE are:
- finding and verifying the global maximum is difficult
- numerical sensitivity (prone to overfitting, high variance)
How to solve these two problems?
- To solve the first one, we can apply a numerical approach: the MLE can be maximized numerically if the likelihood can be written down.
- To solve the second, a Bayesian approach may help, or we can try to scale up the dataset.
For example, in the Markov language model, the MLE language model has the zero-count issue, so smoothing is necessary. Smoothing is a Bayesian approach: Laplace smoothing can be interpreted as MAP estimation with a uniform prior.
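As a sketch of this idea, a hypothetical `laplace_unigram` helper implementing add-\(\alpha\) smoothing (the function name and toy data are assumptions, not from any particular library):

```python
from collections import Counter

def laplace_unigram(tokens, vocab, alpha=1.0):
    """Add-alpha (Laplace) smoothed unigram probabilities.

    MAP estimate under a uniform Dirichlet prior: no word gets zero probability.
    """
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

probs = laplace_unigram(["a", "b", "a"], vocab={"a", "b", "c"})
print(probs["c"])  # unseen word still gets positive probability: 1/6
```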
1.1.3. Bayes Estimators
In the classical approach, the parameter \(\theta\) is thought to be an unknown but fixed quantity. In the Bayesian approach, \(\theta\) is considered to be a random variable with a probability distribution \(\pi(\theta)\) (the prior distribution), a subjective distribution reflecting the experimenter's beliefs.

The prior is updated into the posterior distribution \(\pi(\theta|x)\) based on the sample observed:

\(\pi(\theta | x) = \frac{f(x | \theta) \pi(\theta)}{\int f(x | \theta) \pi(\theta) d\theta}\)
Note that there are many ways to obtain Bayes estimators. For example, one way to compute a Bayes point estimator is to take the posterior mean

\(\hat{\theta} = E(\theta | x)\)

Definition (MAP) Another is to take the posterior mode, which is the maximum a posteriori probability (MAP) estimate

\(\hat{\theta} = \mathrm{argmax}_\theta \pi(\theta|x)\)
Theorem (Bernstein-von Mises) Under some regularity conditions, the posterior is close to a Gaussian distribution:

\(\left\| \pi(\cdot | X_1, ..., X_n) - N\left(\hat{\theta}_n, \frac{1}{n I(\hat{\theta}_n)}\right) \right\|_{TV} \to 0\)

where \(\hat{\theta}_n\) is the MLE and the distance measure is the total variation.
Example
Suppose we want to model a binary random variable \(X \in \{ 0, 1 \}\) with a Bernoulli distribution

\(p(x | \mu) = \mu^x (1-\mu)^{1-x}\)

Given the sample \(\mathcal{D}=\{X_1, ..., X_n \}\), the likelihood can be written as

\(p(\mathcal{D} | \mu) = \prod_{i=1}^{n} \mu^{X_i} (1-\mu)^{1-X_i} = \mu^m (1-\mu)^{n-m}\)

The ML estimator is

\(\hat{\mu}_{ML} = \frac{m}{n}\)

where \(m = \sum_{i=1}^{n} X_i\) is the sufficient statistic.
The conjugate prior of the Bernoulli/Binomial distribution is the Beta distribution; recall its pdf is

\(\mathrm{Beta}(\mu | a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \mu^{a-1} (1-\mu)^{b-1}\)

It has the same functional form as \(\mu^X (1-\mu)^{1-X}\), and can therefore serve as both prior and posterior. The posterior has the form

\(\pi(\mu | \mathcal{D}) \propto \mu^{m+a-1} (1-\mu)^{n-m+b-1}\)

that is, \(\mathrm{Beta}(m+a, n-m+b)\). Using the properties of Beta distributions, the MAP estimator is

\(\hat{\mu}_{MAP} = \frac{m+a-1}{n+a+b-2}\)

however, be careful that the posterior mean is different:

\(E(\mu | \mathcal{D}) = \frac{m+a}{n+a+b}\)
If our goal is to predict the outcome of the next trial, we can use the posterior predictive

\(P(X_{n+1}=1 | \mathcal{D}) = \int_0^1 \mu \, \pi(\mu | \mathcal{D}) d\mu = E(\mu | \mathcal{D}) = \frac{m+a}{n+a+b}\)
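A short sketch of the full Beta-Bernoulli update (the \(Beta(2,2)\) prior, the seed, and the true \(\mu = 0.7\) are illustrative choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.binomial(1, 0.7, size=50)   # Bernoulli(0.7) sample
n, m = len(x), x.sum()
a, b = 2.0, 2.0                     # Beta(2, 2) prior

mle = m / n
map_est = (m + a - 1) / (n + a + b - 2)
post_mean = (m + a) / (n + a + b)   # also the predictive P(X_{n+1}=1 | D)

posterior = stats.beta(m + a, n - m + b)  # full posterior Beta(m+a, n-m+b)
print(mle, map_est, post_mean, posterior.interval(0.95))
```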
1.2. Methods of Evaluating Estimators
The general topic of evaluating statistical procedures belongs to the branch of statistics known as decision theory.
1.2.1. Mean Squared Error
Definition (mean squared error) The mean squared error (MSE) of an estimator \(W\) of a parameter \(\theta\) is the function of \(\theta\) defined by \(E_{\theta}(W-\theta)^2\)
Lemma (bias-variance decomposition) The mean squared error has a bias-variance decomposition:

\(E_\theta(W - \theta)^2 = Var_\theta(W) + (Bias_\theta W)^2\)
In ML terms (Andrew Ng's lecture):
- Bias is an error from the algorithm/estimator itself; it corresponds to underfitting (high error on the training set)
- Variance is an error from sensitivity to small fluctuations in the training set, which causes bad performance on the dev set; it corresponds to overfitting (high error on the test set)
Definition (bias) The bias of a point estimator \(W\) of a parameter \(\theta\) is the difference between the expected value of \(W\) and \(\theta\); that is,

\(Bias_\theta W = E_\theta W - \theta\)

Definition (unbiasedness) An estimator whose bias is identically 0 is called unbiased; that is, \(E_\theta W = \theta\) for all \(\theta\)
MSE for normal distribution

Let \(X_1, ..., X_n\) be a sample from an \(n(\mu, \sigma^2)\) distribution. The statistics \(\bar{X}, S^2\) are both unbiased estimators, therefore their MSEs are their variances:

\(E(\bar{X} - \mu)^2 = Var(\bar{X}) = \frac{\sigma^2}{n}, \quad E(S^2 - \sigma^2)^2 = Var(S^2) = \frac{2\sigma^4}{n-1}\)
a small increase in bias can buy a large decrease in variance

Let \(X_1, ..., X_n\) be \(n(\mu, \sigma^2)\); compare the unbiased estimator \(S^2\) and the (biased) MLE

\(\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2 = \frac{n-1}{n}S^2\)

We can evaluate its MSE as

\(E(\hat{\sigma}^2 - \sigma^2)^2 = \frac{2(n-1)}{n^2}\sigma^4 + \frac{\sigma^4}{n^2} = \frac{2n-1}{n^2}\sigma^4\)

This MSE is smaller than the MSE of the unbiased estimator (because the variance is smaller).
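A quick simulation sketch of this comparison (sample size, \(\sigma^2\), and replication count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma2, reps = 10, 4.0, 100_000
x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))

s2 = x.var(axis=1, ddof=1)   # unbiased estimator S^2
mle = x.var(axis=1, ddof=0)  # biased MLE, (n-1)/n * S^2

print(((s2 - sigma2) ** 2).mean())   # ~ 2*sigma^4/(n-1) = 3.56
print(((mle - sigma2) ** 2).mean())  # ~ (2n-1)/n^2 * sigma^4 = 3.04
```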
1.2.2. Best Unbiased Estimator
There is no one "best MSE" estimator as the class of all estimators is too large (e.g: the estimator \(\hat{\theta}=17\) cannot be beaten in MSE at \(\theta=17\), but is terrible otherwise).
One way to make the problem tractable is to limit the class of estimators to unbiased estimators; under this restriction, comparing MSEs reduces to comparing variances.

Definition (best unbiased estimator) An estimator \(W^*\) is a best unbiased estimator of \(\tau(\theta)\) if it satisfies \(E_\theta W^* = \tau(\theta)\) for all \(\theta\) and, for any other estimator \(W\) with \(E_\theta W = \tau(\theta)\),

\(Var_\theta W^* \leq Var_\theta W \text{ for all } \theta\)
\(W^*\) is also called a uniform minimum variance unbiased estimator.
Even within the class of unbiased estimators, candidates of best estimators can be infinitely many, so it might be hard to verify that an estimator is the best one. To find the lower bound, we need to define some related concepts.
Definition (score function) The score function is defined to be

\(S(\theta) = \frac{\partial}{\partial \theta} \log f(X | \theta)\)

For each \(\theta\), the score is a random variable (a function of \(X\)). It can also be defined for a sample \(X = (X_1, ..., X_n)\):

\(S_n(\theta) = \frac{\partial}{\partial \theta} \log f(X_1, ..., X_n | \theta) = \sum_{i=1}^{n} \frac{\partial}{\partial \theta} \log f(X_i | \theta)\)

Lemma (expectation of score function) One property of the score function is that its expectation is 0:

\(E_\theta[S(\theta)] = 0\)

Proof: interchanging differentiation and integration (justified by the Dominated Convergence Theorem), we can derive this as follows:

\(E_\theta[S(\theta)] = \int \frac{\partial}{\partial \theta} \log f(x|\theta) \, f(x|\theta) dx = \int \frac{\partial}{\partial \theta} f(x|\theta) dx = \frac{d}{d\theta} \int f(x|\theta) dx = \frac{d}{d\theta} 1 = 0\)
Lemma (variance of score function) The variance of the score function is called the Fisher information:

\(I(\theta) = Var_\theta(S(\theta))\)

In essence, the Fisher information measures the expected curvature of the log-likelihood function around the point \(\theta\). If the log-likelihood is more curved (larger \(I(\theta)\)), then \(\theta\) is easier to estimate.

Definition (Fisher information) As discussed, the Fisher information is defined as

\(I(\theta) = Var_\theta\left( \frac{\partial}{\partial \theta} \log f(X | \theta) \right)\)
When there are \(d\) parameters, \(\theta = (\theta_1, ..., \theta_d)\), the Fisher information is a \(d \times d\) matrix, called the Fisher Information Matrix (FIM). The matrix is positive semidefinite. If it is positive definite, it defines a Riemannian metric on the \(d\)-dimensional parameter space, which is used in Information Geometry.
Since the score has mean zero, it is easy to see

\(I(\theta) = E_\theta\left[ \left( \frac{\partial}{\partial \theta} \log f(X|\theta) \right)^2 \right]\)

Under some regularity conditions, this can be simplified into

\(I(\theta) = -E_\theta\left[ \frac{\partial^2}{\partial \theta^2} \log f(X|\theta) \right]\)

A larger Fisher information indicates more information about \(\theta\), and therefore a smaller variance for the best unbiased estimator.
Bernoulli distribution
Let \(X \sim Ber(p)\); then the log-likelihood is

\(\log f(X|p) = X \log p + (1-X)\log(1-p)\)

Taking the derivative gives the score function

\(S(p) = \frac{X}{p} - \frac{1-X}{1-p}\)

The Fisher information is

\(I(p) = Var(S(p)) = \frac{1}{p(1-p)}\)
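A simulation sketch checking the two lemmas above on this example (the parameter value and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
p = 0.3
x = rng.binomial(1, p, size=1_000_000)

score = x / p - (1 - x) / (1 - p)  # d/dp log f(X|p)
print(score.mean())  # ~ 0, the expectation of the score
print(score.var())   # ~ 1/(p(1-p)) = 4.76, the Fisher information
```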
Normal distribution
Let \(X \sim N(\mu, \sigma^2)\); then the score for \(\mu\) is

\(S(\mu) = \frac{\partial}{\partial \mu} \log f(X|\mu) = \frac{X - \mu}{\sigma^2}\)

The Fisher information is

\(I(\mu) = Var\left( \frac{X - \mu}{\sigma^2} \right) = \frac{1}{\sigma^2}\)
exponential family
The log-likelihood of an exponential family is as follows (ignoring constants):

\(\log f(x|\theta) = \theta^T T(x) - A(\theta)\)

The Fisher information is simply the Hessian of the log-partition function:

\(I(\theta) = \nabla^2 A(\theta)\)
The Cramer-Rao theorem gives a lower bound on the variance of an unbiased estimator.

Theorem (unbiased scalar Cramer-Rao lower bound) If \(W(X)\) is an unbiased estimator of \(\theta\), then

\(Var_\theta(W(X)) \geq \frac{1}{I_n(\theta)}\)

Theorem (scalar Cramer-Rao lower bound) Let \(X_1, ..., X_n\) be a sample with pdf \(f(x|\theta)\), and let \(W(X)=W(X_1, ..., X_n)\) be any estimator satisfying

\(\frac{d}{d\theta} E_\theta W(X) = \int \frac{\partial}{\partial \theta} [W(x) f(x|\theta)] dx\)

and

\(Var_\theta W(X) < \infty\)

then

\(Var_\theta(W(X)) \geq \frac{\left( \frac{d}{d\theta} E_\theta W(X) \right)^2}{E_\theta\left[ \left( \frac{\partial}{\partial \theta} \log f(X|\theta) \right)^2 \right]}\)

The formula can be cleaned up a bit: if \(W(X)\) is an unbiased estimator of \(\theta\), then the numerator is clearly

\(\left( \frac{d}{d\theta} E_\theta W(X) \right)^2 = 1\)

Under the iid assumption on the sample, the denominator becomes

\(E_\theta\left[ \left( \frac{\partial}{\partial \theta} \log f(X|\theta) \right)^2 \right] = n I(\theta)\)
A shortcoming of the Cramer-Rao bound is that it may be strictly smaller than the variance of every unbiased estimator; that is, the bound may be unattainable.
normal variance bound cannot be attained
Let \(X_1, ..., X_n\) be \(n(\mu, \sigma^2)\) and consider estimation of \(\sigma^2\) where \(\mu\) is unknown.

The Cramer-Rao lower bound for any unbiased estimator \(W\) of \(\sigma^2\) is

\(Var_\theta(W) \geq \frac{2\sigma^4}{n}\)

However, the sample variance has

\(Var(S^2) = \frac{2\sigma^4}{n-1} > \frac{2\sigma^4}{n}\)

When \(\mu\) is unknown, the bound cannot be attained.
1.2.3. Sufficiency and Unbiasedness
Theorem (Rao-Blackwell) Let \(W\) be an unbiased estimator of \(\tau(\theta)\), and \(T\) be a sufficient statistic for \(\theta\). Then \(\phi(T) = E(W | T)\) is a uniformly better unbiased estimator of \(\tau(\theta)\):

\(E_\theta \phi(T) = \tau(\theta), \quad Var_\theta \phi(T) \leq Var_\theta W \text{ for all } \theta\)
1.3. Decision Theory
Mean squared error is the risk under one special loss function; the study of the performance of estimators using general loss functions is a branch of decision theory. A decision-theoretic analysis judges how well an estimator succeeds in simultaneously minimizing bias and variance.
Definition (risk function) The quality of an estimator \(\delta(X)\) is quantified by its risk function

\(R(\theta, \delta) = E_\theta[L(\theta, \delta(X))]\)

MSE is the risk with squared error loss

For squared error loss

\(L(\theta, \delta) = (\theta - \delta)^2\)

the risk function is the mean squared error (MSE):

\(R(\theta, \delta) = E_\theta(\delta(X) - \theta)^2\)

For a fixed \(\theta\), the risk function is the average loss that will be incurred if the estimator \(\delta(X)\) is used. However, since the true \(\theta\) is unknown, we would like an estimator that has a small value of \(R(\theta, \delta)\) for all values of \(\theta\).
As optimality changes with \(\theta\), to compare two estimators we compare their risk functions; however, there may be no clear answer as to which one is better.
there is no uniformly better estimator

Let \(X \sim N(\theta, 1)\) and consider two estimators \(\hat{\theta}_1 = X\) and \(\hat{\theta}_2 = 3\). The risk function of the first is \(R(\theta, \hat{\theta}_1) = Var(X) = 1\); the other is \(R(\theta, \hat{\theta}_2) = (\theta-3)^2\). Neither estimator dominates the other uniformly.
Therefore, we need a one-number summary of the risk function. Two such summaries are the Bayes risk and the maximum risk, and each suggests a method of devising estimators:
- Bayes risk leads to Bayes estimator
- Maximum risk leads to Minimax estimator
1.3.1. Bayes Risk
Definition (Bayes risk) In Bayesian analysis we use the prior distribution \(\pi(\theta)\) to compute an average risk, known as the Bayes risk:

\(B(\pi, \delta) = \int R(\theta, \delta) \pi(\theta) d\theta\)

The Bayes estimator or Bayes decision rule is one which minimizes this expected risk:

\(\delta^* = \mathrm{argmin}_\delta B(\pi, \delta)\)

(An alternative approach is to use the minimax risk, discussed below.)

The Bayes risk can be rewritten using the posterior risk \(r(\delta | x)\), where

\(r(\delta | x) = \int L(\theta, \delta(x)) \pi(\theta | x) d\theta\)

Theorem (Bayes estimator) The Bayes risk satisfies

\(B(\pi, \delta) = \int r(\delta | x) m(x) dx\)

where \(m(x) = \int f(x|\theta)\pi(\theta)d\theta\) is the marginal distribution of \(X\); the Bayes estimator minimizes the posterior risk for each \(x\).
Bayes estimator for mean squared error

For squared error loss, the posterior risk is

\(r(\delta | x) = \int (\theta - \delta(x))^2 \pi(\theta | x) d\theta\)

The minimizer is the Bayes estimator, known as the minimum mean square error (MMSE) estimator: the posterior mean

\(\delta(x) = E(\theta | x)\)

Bayes estimator for other losses

If \(L(\theta, \delta) = | \theta - \delta|\), then the Bayes estimator is the posterior median; if the loss is 0-1 loss, then the Bayes estimator is the posterior mode (the MAP estimator).
1.3.2. Minimax Risk
Definition (maximum risk) The maximum risk of an estimator is

\(\bar{R}(\delta) = \sup_\theta R(\theta, \delta)\)

A minimax rule is one which minimizes the maximum risk:

\(\delta^* = \mathrm{argmin}_\delta \sup_\theta R(\theta, \delta)\)

We can give both upper and lower bounds on the minimax risk:

Lemma (upper bound of minimax risk) Given any estimator \(\hat{\theta}_{up}\), we can use it to upper bound the minimax risk:

\(\inf_\delta \sup_\theta R(\theta, \delta) \leq \sup_\theta R(\theta, \hat{\theta}_{up})\)

Lemma (lower bound of minimax risk) The Bayes risk of the Bayes estimator for any prior \(\pi\) lower bounds the minimax risk:

\(B(\pi, \delta_\pi) \leq \inf_\delta \sup_\theta R(\theta, \delta)\)

Theorem (least favorable prior) Let \(\hat{\theta}\) be the Bayes estimator for some prior \(\pi\); if

\(R(\theta, \hat{\theta}) \leq B(\pi, \hat{\theta}) \text{ for all } \theta\)

then \(\hat{\theta}\) is minimax and \(\pi\) is called a least favorable prior.
Theorem (constant risk) Suppose \(\hat{\theta}\) is the Bayes estimator wrt some prior \(\pi\). If the risk is constant, then \(\hat{\theta}\) is minimax.
1.4. Asymptotic Evaluation
1.4.1. Consistency
The consistency property requires that the estimator converges to the correct value as the sample size becomes infinite.

Definition (consistency) A sequence of estimators \(W_n = W_n(X_1, ..., X_n)\) is a consistent sequence of estimators of the parameter \(\theta\) if for every \(\epsilon > 0\) and every \(\theta \in \Theta\)

\(\lim_{n \to \infty} P_\theta(|W_n - \theta| < \epsilon) = 1\)

Theorem (consistency of MLE) Let \(X_1, ..., X_n\) be iid \(f(x|\theta)\) and \(L(\theta |x)\) be the likelihood function; let \(\hat{\theta}\) denote the MLE of \(\theta\) and \(\tau(\theta)\) be a continuous function of \(\theta\). Under some regularity conditions, for every \(\epsilon > 0\) and \(\theta \in \Theta\)

\(\lim_{n \to \infty} P_\theta(|\tau(\hat{\theta}) - \tau(\theta)| \geq \epsilon) = 0\)

that is, \(\tau(\hat{\theta})\) is a consistent estimator of \(\tau(\theta)\).
1.4.2. Efficiency
Efficiency is concerned with the asymptotic variance of an estimator
Definition (limiting variance) For an estimator \(T_n\), if \(\lim_{n \to \infty} k_n Var(T_n) = \tau^2 < \infty\), where \(k_n\) is a sequence of constants, then \(\tau^2\) is called the limiting variance or limit of the variances.
Definition (asymptotic variance) For an estimator \(T_n\), if \(k_n[T_n - \tau(\theta)] \to n(0, \sigma^2)\) in distribution, then the parameter \(\sigma^2\) is called the asymptotic variance or variance of the limit distribution of \(T_n\).

Definition (asymptotically efficient) A sequence of estimators \(W_n\) is asymptotically efficient for a parameter \(\tau(\theta)\) if \(\sqrt{n}[W_n - \tau(\theta)] \to n(0, \nu(\theta))\) in distribution and

\(\nu(\theta) = \frac{[\tau'(\theta)]^2}{I(\theta)}\)

that is, the asymptotic variance of \(W_n\) achieves the Cramer-Rao lower bound.
Theorem (asymptotic efficiency of MLEs) Let \(\hat{\theta}\) denote the MLE of \(\theta\), and \(\tau(\theta)\) be a continuous function of \(\theta\). Under some regularity conditions, we have

\(\sqrt{n}[\tau(\hat{\theta}) - \tau(\theta)] \to n(0, \nu(\theta))\)

where \(\nu(\theta)\) again is the Cramer-Rao lower bound. Therefore, \(\tau(\hat{\theta})\) is a consistent and asymptotically efficient estimator of \(\tau(\theta)\).

Proof idea: apply a Taylor expansion to \(l'(\theta | x)\), where \(l(\theta | x) = \sum_i \log f(x_i | \theta)\), around the true value \(\theta_0\):

\(0 = l'(\hat{\theta}) \approx l'(\theta_0) + (\hat{\theta} - \theta_0) l''(\theta_0)\)

The left side is 0 because \(\hat{\theta}\) maximizes \(l\); ignoring higher-order terms and rearranging, we obtain

\(\sqrt{n}(\hat{\theta} - \theta_0) \approx \frac{l'(\theta_0)/\sqrt{n}}{-l''(\theta_0)/n}\)

where the numerator converges to \(n(0, I(\theta_0))\) in distribution and the denominator converges to \(I(\theta_0)\) in probability; Slutsky's theorem then proves the statement.
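A simulation sketch of this theorem for the Bernoulli MLE, where \(1/I(p) = p(1-p)\) (the parameter values and replication count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
p, n, reps = 0.3, 500, 20_000
p_hat = rng.binomial(n, p, size=reps) / n  # Bernoulli MLE per replicate

z = np.sqrt(n) * (p_hat - p)
print(z.var())  # ~ p(1-p) = 0.21 = 1/I(p), the Cramer-Rao bound
```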
2. Hypothesis Testing
Like point estimation, hypothesis testing is another statistical inference method.
Definition (hypothesis) A hypothesis is a statement about a population parameter. The complementary hypotheses are called the null hypothesis and the alternative hypothesis, denoted by \(H_0, H_1\) respectively. The general form of the hypotheses about \(\theta\) is \(H_0: \theta \in \Theta_0\) and \(H_1: \theta \in \Theta_0^c\)

If a hypothesis specifies only one possible distribution (the parameter space has only one element, e.g., \(\Theta_0 = \{ 0.5 \}\)), then it is called a simple hypothesis; otherwise it is called a composite hypothesis. It is often the case that the null hypothesis is chosen to be simple.
Definition (hypothesis test) A hypothesis test is a rule that specifies
- For which sample value the decision is made to accept \(H_0\)
- For which sample value \(H_0\) is rejected and \(H_1\) is accepted as true
Be careful: accepting or rejecting a hypothesis is not equivalent to asserting that the hypothesis is true or false.
procedure of hypothesis test
Typically, a hypothesis test is specified in terms of a test statistic. In particular, we can construct a test by:
- choosing a test statistic \(T(X_1, ..., X_n)\)
- choosing a critical value \(t\) and defining the rejection region \(R=\{ (x_1,...,x_n) : T(x_1,...,x_n) \geq t \}\)
- if \(T \geq t\), or equivalently \((X_1, ..., X_n) \in R\), we reject \(H_0\); otherwise we retain \(H_0\)
2.1. Methods of Finding Tests
2.1.1. Neyman-Pearson
Simple vs Simple
Definition (Neyman-Pearson Lemma) Consider testing \(H_0: \theta = \theta_0\) vs \(H_1: \theta = \theta_1\).

Let \(L(\theta) = p(X_1, ..., X_n; \theta)\) and

\(T_n = \frac{L(\theta_1)}{L(\theta_0)}\)

Suppose we reject \(H_0\) if \(T_n > k\), where \(k\) is chosen so that \(P_{\theta_0}(X^n \in R)=\alpha\). This is a UMP level \(\alpha\) test.
2.1.2. Wald Tests
Simple vs Composite
Definition (Wald test) Suppose we are interested in testing the hypothesis \(H_0: \theta = \theta_0\) vs \(H_1: \theta \neq \theta_0\). The Wald test is based on an asymptotically normal estimator: let \(\hat{\theta}\) be the MLE, with

\(T_n = \frac{\sqrt{n}(\hat{\theta} - \theta_0)}{\sigma_0} \to n(0, 1) \text{ in distribution under } H_0\)

where \(\sigma_0^2\) denotes the asymptotic variance of \(\sqrt{n}\,\hat{\theta}\). If \(\sigma_0\) is unknown, we can use an estimator \(S\) instead:

\(T_n = \frac{\sqrt{n}(\hat{\theta} - \theta_0)}{S}\)

Under the null, we simply reject if \(|T_n| \geq z_{\alpha/2}\).

This only controls the Type I error asymptotically, but it is standard in applications.
Bernoulli Wald test
Let \(X_1, ..., X_n \sim Ber(p)\); then the Wald test of \(H_0: p = p_0\) is based on the statistic

\(T_n = \frac{\hat{p} - p_0}{\sqrt{\hat{p}(1-\hat{p})/n}}\)

alternatively, we can use

\(T_n = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}\)
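A sketch of this test as a small helper (the function name `bernoulli_wald` and the toy data are assumptions):

```python
import numpy as np
from scipy import stats

def bernoulli_wald(x, p0, alpha=0.05):
    """Wald test of H0: p = p0 using the plug-in standard error."""
    n, p_hat = len(x), np.mean(x)
    t = (p_hat - p0) / np.sqrt(p_hat * (1 - p_hat) / n)
    p_value = 2 * stats.norm.sf(abs(t))  # two-sided
    return t, p_value, abs(t) >= stats.norm.ppf(1 - alpha / 2)

rng = np.random.default_rng(5)
print(bernoulli_wald(rng.binomial(1, 0.6, size=200), p0=0.5))
```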
2.1.3. Likelihood Ratio Tests
Composite vs Composite
Definition (likelihood ratio test statistic, LRT) The likelihood ratio test statistic for testing \(H_0 : \theta \in \Theta_0\) versus \(H_1: \theta \in \Theta_0^c\) is

\(\lambda(\mathbf{x}) = \frac{\sup_{\Theta_0} L(\theta | \mathbf{x})}{\sup_{\Theta} L(\theta | \mathbf{x})}\)

A likelihood ratio test (LRT) is any test that has a rejection region of the form \(\{ \mathbf{x} : \lambda(\mathbf{x}) \leq c \}\) where \(0 \leq c \leq 1\)
normal LRT
Let \(X_1, ..., X_n\) be a random sample from an \(n(\theta, 1)\) distribution, and consider the test \(H_0: \theta = \theta_0\) vs \(H_1: \theta \neq \theta_0\).

The LRT statistic is (using the unrestricted MLE \(\hat{\theta} = \bar{x}\))

\(\lambda(\mathbf{x}) = \frac{L(\theta_0 | \mathbf{x})}{L(\bar{x} | \mathbf{x})} = \exp\left( -\frac{n(\bar{x} - \theta_0)^2}{2} \right)\)

The rejection region \(\{ \mathbf{x}: \lambda(\mathbf{x}) \leq c \}\) is

\(\left\{ \mathbf{x}: |\bar{x} - \theta_0| \geq \sqrt{\frac{-2\log c}{n}} \right\}\)

Note that the rejection region can be simplified to an expression involving a simpler sufficient statistic.
Theorem (test based on sufficient statistics) If \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\) and \(\lambda^{*}(t), \lambda(\mathbf{x})\) are the LRT statistics based on \(T,\mathbf{X}\), then \(\lambda^{*}(T(\mathbf{x}))=\lambda(\mathbf{x})\) for every \(\mathbf{x}\) in the sample space
LRT of simple vs composite hypothesis has an asymptotic approximation
Theorem (Wilks' phenomenon) Consider testing \(H_0: \theta = \theta_0\) versus \(H_1: \theta \neq \theta_0\). Under \(H_0\), we can approximate

\(-2 \log \lambda(X) \to \chi^2_1 \text{ in distribution}\)

Proof: this can be shown using a Taylor expansion of \(l(\theta_0)\) around the MLE \(\hat{\theta}\) (where \(l'(\hat{\theta}) = 0\)):

\(-2\log\lambda(X) = 2[l(\hat{\theta}) - l(\theta_0)] \approx -l''(\hat{\theta})(\hat{\theta} - \theta_0)^2\)

therefore

\(-2\log\lambda(X) \approx \frac{-l''(\hat{\theta})/n}{I(\theta_0)} \cdot \left[ \sqrt{n I(\theta_0)}\,(\hat{\theta} - \theta_0) \right]^2\)

where the first term converges to 1 by the WLLN and the second term converges to \(N(0,1)^2 = \chi^2_1\) in distribution.
2.1.4. Score Test (Lagrange Multiplier Test)
If we are testing \(H_0: \theta = \theta_0\), consider the score statistic

\(S(\theta) = \frac{\partial}{\partial \theta} \log L(\theta | X)\)

If the null hypothesis is true, we know \(E[S(\theta_0)] = 0\) and \(Var[S(\theta_0)] = I_n(\theta_0)\). The test statistic for the score test is then

\(T = \frac{S(\theta_0)}{\sqrt{I_n(\theta_0)}}\)

If the null hypothesis is true, this statistic has mean 0 and variance 1 (and is asymptotically standard normal).
2.1.5. Bayesian Test
In a Bayesian model we have a prior, so we can combine it with the likelihood to get the posterior \(\pi(\theta | x)\). We can create a test by thresholding the posterior probability

\(P(\theta \in \Theta_0 | x) \geq c\)

or by simply comparing the probabilities directly: reject \(H_0\) if

\(P(\theta \in \Theta_0^c | x) > P(\theta \in \Theta_0 | x)\)
2.1.6. Union Intersection Test
Theorem (union-intersection) The union-intersection method of test construction may be useful when the null hypothesis is conveniently expressed as an intersection:

\(H_0: \theta \in \bigcap_{\gamma \in \Gamma} \Theta_\gamma\)

Suppose the test for each sub-problem \(H_{0\gamma}\) vs \(H_{1\gamma}\) has rejection region \(\{ x: T_{\gamma}(x) \in R_{\gamma} \}\); the rejection region of the union-intersection test is their union

\(\bigcup_{\gamma \in \Gamma} \{ x: T_{\gamma}(x) \in R_{\gamma} \}\)
2.2. Methods of Evaluating Tests
Definition (Type I, II error)
- Type I error is a false positive (i.e., \(\theta \in \Theta_0\) but \(H_0\) is rejected)
- Type II error is a false negative (i.e., \(\theta \in \Theta_0^c\) but \(H_0\) is accepted)
In general, an attempt to decrease one type of error is accompanied by an increase in the other type of error, so a compromise has to be made. The only way to reduce both types of error is to increase the sample size.
Definition (power function) Suppose \(R\) denotes the rejection region of a test. Then the power function of this test is the function of \(\theta\) defined by

\(\beta(\theta) = P_\theta(X \in R)\)

Ideally, we would like \(\beta(\theta)=0\) when \(\theta \in \Theta_0\) (no Type I error / false positive) and \(\beta(\theta)=1\) when \(\theta \in \Theta_0^c\) (no Type II error / false negative).
Typically, the power function of a test will depend on the sample size \(n\), therefore by considering the power function, the experimenter can choose \(n\) to achieve some test goal.
normal power function
Let \(X_1, ..., X_n\) be a random sample from an \(n(\theta, \sigma^2)\) population with \(\sigma^2\) known.

Consider the LRT of \(H_0: \theta \leq \theta_0\) vs \(H_1: \theta > \theta_0\) whose rejection region is

\(\left\{ \mathbf{x}: \frac{\bar{x} - \theta_0}{\sigma/\sqrt{n}} > c \right\}\)

Then the power function of this test is

\(\beta(\theta) = P_\theta\left( \frac{\bar{X} - \theta_0}{\sigma/\sqrt{n}} > c \right) = P\left( Z > c + \frac{\theta_0 - \theta}{\sigma/\sqrt{n}} \right)\)

where \(Z\) is a standard normal random variable.
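A sketch that evaluates this power function numerically (the helper name `power` and the parameter values are assumptions):

```python
import numpy as np
from scipy import stats

def power(theta, theta0, sigma, n, alpha=0.05):
    """Power of the one-sided z-test that rejects when (xbar-theta0)/(sigma/sqrt(n)) > z_alpha."""
    z_alpha = stats.norm.ppf(1 - alpha)
    return stats.norm.sf(z_alpha + (theta0 - theta) * np.sqrt(n) / sigma)

# power equals alpha at theta = theta0 and rises toward 1 as theta or n grows
print(power(theta=np.array([0.0, 0.2, 0.5]), theta0=0.0, sigma=1.0, n=25))
```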
For a fixed sample size, it is impossible to make both errors arbitrarily small, so it is common to restrict consideration to tests that control the Type I Error at a specified level, and within that level, minimize Type II as much as possible.
point estimation and hypothesis testing
Notice the similarity between point estimation and hypothesis testing
- in point estimation, we restrict attention to a class of estimators (unbiased estimators) and then minimize variance within that class
- in hypothesis testing, we restrict attention to level-\(\alpha\) tests and minimize the Type II error (find the most powerful test) within this class
Definition (size \(\alpha\) test, level \(\alpha\) test)
For \(0 \leq \alpha \leq 1\), a test with power function \(\beta(\theta)\) is called size \(\alpha\) test if \(\sup_{\theta \in \Theta_0} \beta(\theta) = \alpha\). It is called level \(\alpha\) test if \(\sup_{\theta \in \Theta_0} \beta(\theta) \leq \alpha\).
\(\alpha\) is called the level of significance
The previous methods only yield test statistics and a general form for rejection regions, but do not lead to one specific test; for example, the LRT does not specify \(c\). The restriction to size \(\alpha\) can determine the choice of \(c\) within the class of tests.
Definition (unbiased test) A test with power function \(\beta(\theta)\) is called unbiased iff \(\beta(\theta') \geq \beta(\theta'')\) for all \(\theta' \in \Theta_0^c, \theta'' \in \Theta_0\)
2.2.1. UMP Test
Definition (uniformly most powerful) Let \(\mathcal{C}\) be a class of tests for testing \(H_0: \theta \in \Theta_0\) versus \(H_1: \theta \in \Theta_0^c\). A test in class \(\mathcal{C}\) with power function \(\beta(\theta)\) is a uniformly most powerful (UMP) class \(\mathcal{C}\) test if \(\beta(\theta) \geq \beta'(\theta)\) for every \(\theta \in \Theta_0^c\) and every power function \(\beta'(\theta)\) of a test in class \(\mathcal{C}\).
Theorem (Neyman-Pearson) Consider testing \(H_0: \theta = \theta_0\) vs \(H_1: \theta = \theta_1\), using a test with rejection region \(R\) that satisfies

\(x \in R \text{ if } f(x|\theta_1) > k f(x|\theta_0), \quad x \in R^c \text{ if } f(x|\theta_1) < k f(x|\theta_0)\)

for some \(k \geq 0\), and

\(\alpha = P_{\theta_0}(X \in R)\)

- (Sufficiency) Any test that satisfies these conditions is a UMP level \(\alpha\) test
- (Necessity) If there exists a test satisfying these conditions with \(k > 0\), then every UMP level \(\alpha\) test is a size \(\alpha\) test
Theorem (Karlin-Rubin) Consider testing \(H_0: \theta \leq \theta_0\) vs \(H_1: \theta > \theta_0\). Suppose that \(T\) is a sufficient statistic for \(\theta\) and the family of pdfs \(\{ g(t|\theta): \theta \in \Theta \}\) of \(T\) has the MLR (monotone likelihood ratio) property. Then for any \(t_0\), the test that rejects \(H_0\) iff \(T > t_0\) is a UMP level \(\alpha\) test, where \(\alpha = P_{\theta_0}(T > t_0)\).
2.2.2. P-value
Reporting "reject \(H_0\)" or "retain \(H_0\)" is not very informative. Instead, we can ask, for every \(\alpha\), whether the test rejects at that level. At \(\alpha=1\) we always reject and at \(\alpha=0\) we always retain; therefore there is a smallest \(\alpha\) at which we reject, and this is the p-value.
Intuitively, p-value is the lowest significance level \(\alpha\) that results in rejecting the null hypothesis.
Definition (formal p-value) A p-value \(p(X)\) is a test statistic satisfying \(0 \leq p(x) \leq 1\) for every sample point \(x\); small values of \(p(x)\) give evidence that \(H_1\) is true. A p-value is valid iff for every \(\theta \in \Theta_0\) and every \(0 \leq \alpha \leq 1\)

\(P_\theta(p(X) \leq \alpha) \leq \alpha\)

The most common way to define a valid p-value is:

Theorem (p-value) Let \(W(X)\) be a test statistic such that large values of \(W\) give evidence that \(H_1\) is true. For each sample point \(x\), define

\(p(x) = \sup_{\theta \in \Theta_0} P_\theta(W(X) \geq W(x))\)

Then \(p(X)\) is a valid p-value.
Intuitively, this means the p-value is the probability, under \(H_0\), of observing a value of the test statistic as extreme as or more extreme than what was actually observed.
2.2.3. Multiple Testing
Definition (family-wise error rate) The FWER is the probability that we falsely reject any null hypothesis among multiple tests.

Suppose we want to control the FWER at \(\alpha\) over \(d\) hypothesis tests; we can use the following corrections:

Theorem (Sidak correction) We reject any test whose p-value is smaller than

\(1 - (1-\alpha)^{1/d}\)

If the p-values are all independent, then \(FWER \leq \alpha\).

Theorem (Bonferroni correction) If we do not know that the p-values are independent, we can use the Bonferroni correction: reject any test whose p-value is smaller than

\(\frac{\alpha}{d}\)

The cost of this protection is a tendency to fail to reject false null hypotheses (hence more Type II errors). The Bonferroni correction can be improved with Holm's procedure, which is uniformly more powerful.
Theorem (Holm's procedure)
- order the p-values \(p_{(1)} \leq p_{(2)} \leq ... \leq p_{(d)}\) with corresponding hypotheses \(H_{(1)}, ..., H_{(d)}\)
- if \(p_{(1)} \leq \alpha/d\), reject \(H_{(1)}\) and move on; otherwise stop and accept all \(H_{(i)}\)
- if \(p_{(2)} \leq \alpha/(d-1)\), reject \(H_{(2)}\) and move on; otherwise stop and accept \(H_{(2)}, ..., H_{(d)}\)
- ...
- if \(p_{(d)} \leq \alpha\), reject \(H_{(d)}\); otherwise accept \(H_{(d)}\)
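A sketch of Holm's procedure (the helper name and the toy p-values are assumptions):

```python
import numpy as np

def holm(p_values, alpha=0.05):
    """Holm's step-down procedure; returns a boolean rejection mask (FWER <= alpha)."""
    p = np.asarray(p_values)
    order = np.argsort(p)
    reject = np.zeros(len(p), dtype=bool)
    for rank, idx in enumerate(order):
        if p[idx] <= alpha / (len(p) - rank):  # thresholds alpha/d, alpha/(d-1), ...
            reject[idx] = True
        else:
            break  # stop at the first failure; accept all remaining hypotheses
    return reject

print(holm([0.001, 0.02, 0.04, 0.2]))  # [True, False, False, False]
```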
Definition (false discovery rate) Denote the number of false rejections by \(V\) and the total number of rejections by \(R\); the false discovery proportion is

\(FDP = \frac{V}{R}\)

when \(R > 0\), and \(FDP = 0\) otherwise. The FDR is

\(FDR = E[FDP]\)

The family-wise error rate is connected to this by

\(FDR \leq P(V \geq 1) = FWER\)

The BH procedure controls the FDR under independence; it turns out to be very challenging to tightly control the FDR under strong dependence.
Theorem (BH procedure)
- sort all p-values \(p_{(1)} \leq p_{(2)} \leq ... \leq p_{(d)}\)
- define the thresholds \(t_i = i\alpha/d\)
- find the largest \(i\) such that

\(p_{(i)} \leq t_i\)

- reject all nulls up to and including \(i_{max}\)
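A sketch of the BH procedure (the helper name and the toy p-values are assumptions):

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """BH step-up procedure; controls FDR at alpha under independence."""
    p = np.asarray(p_values)
    d = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, d + 1) / d  # t_i = i * alpha / d
    below = p[order] <= thresholds
    reject = np.zeros(d, dtype=bool)
    if below.any():
        i_max = np.max(np.nonzero(below)[0])      # largest i with p_(i) <= t_i
        reject[order[: i_max + 1]] = True         # reject all nulls up to i_max
    return reject

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6]))
```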
Lemma (FWER vs FDR) Under the global null, FDR control is equivalent to FWER control, and controlling FWER always controls FDR.
2.3. Specific Tests
The following famous tests can be derived using the previous approaches.
2.3.1. T-tests
Test (two-sample t-test) Let \(X_1, ..., X_n\) be a random sample from \(n(\mu_X, \sigma_X^2)\), and \(Y_1, ..., Y_m\) be an independent sample from \(n(\mu_Y, \sigma_Y^2)\). We are interested in testing

\(H_0: \mu_X = \mu_Y \quad \text{vs} \quad H_1: \mu_X \neq \mu_Y\)

under the assumption that \(\sigma_X^2 = \sigma_Y^2 = \sigma^2\).

The LRT for these hypotheses: under \(H_0\), the statistic

\(T = \frac{\bar{X} - \bar{Y}}{S_p \sqrt{\frac{1}{n} + \frac{1}{m}}} \sim t_{n+m-2}\)

where \(S_p^2\) is the pooled variance

\(S_p^2 = \frac{(n-1)S_X^2 + (m-1)S_Y^2}{n+m-2}\)

Check exercise 8.41 in C&B.
Definition (Behrens–Fisher problem) It is not always reasonable to assume the variances are equal; testing equality of means without that assumption is called the Behrens–Fisher problem.
Test (Welch's t-test) Welch's t-test gives an approximate answer to the Behrens-Fisher problem.

When we do not assume \(\sigma_X^2 = \sigma_Y^2\), the statistic

\(T = \frac{\bar{X} - \bar{Y}}{\sqrt{\frac{S_X^2}{n} + \frac{S_Y^2}{m}}}\)

is approximately t-distributed with degrees of freedom given by the Welch–Satterthwaite equation, where \(S_X^2 = \frac{1}{n-1} \sum_i (X_i - \bar{X})^2, S_Y^2 = \frac{1}{m-1} \sum_i (Y_i - \bar{Y})^2\).

Check exercise 8.42 in C&B.
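In SciPy, Welch's t-test is available via `stats.ttest_ind` with `equal_var=False`; a minimal usage sketch (the toy data are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.normal(0.0, 1.0, size=30)
y = rng.normal(0.5, 2.0, size=40)  # different variance: Behrens-Fisher setting

# equal_var=False selects Welch's t-test (Welch-Satterthwaite degrees of freedom)
t, p = stats.ttest_ind(x, y, equal_var=False)
print(t, p)
```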
2.3.2. F-tests
An F-test is any statistical test in which the test statistic has an F-distribution under the null hypothesis
2.3.3. Goodness of Fit
Test (G-test) The G-test is increasingly used in situations where the \(\chi^2\) test was previously used. The statistic is

\(G = 2 \sum_{i=1}^{k} O_i \log\left( \frac{O_i}{E_i} \right)\)

The \(\chi^2\) statistic is a Taylor approximation of the G statistic; the approximation degrades when the sample size is small.

Test (Pearson \(\chi^2\) test) Consider \(H_0: P=P_0\) vs \(H_1: P \neq P_0\), where \(P_0, P\) are multinomial distributions with \(k\) categories; \(P_0\) has probabilities \((p_1, ..., p_k)\) with \(\sum_i p_i=1\). Given a sample \(X_1, ..., X_n\), reduce it to observed counts \(O_1, ..., O_k\) and expected counts \(E_i = n p_i\); under \(H_0\),

\(\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \to \chi^2_{k-1} \text{ in distribution}\)

Note it can also be used to test independence.

Even if the distribution is not multinomial, we can reduce it to a multinomial distribution by binning and then apply this test.
Test (Kolmogorov–Smirnov test) Test the difference between the true CDF \(F(x)\) and the empirical CDF \(\hat{F}_n(x)\):

\(D_n = \sup_x |\hat{F}_n(x) - F(x)|\)

By the Glivenko-Cantelli theorem, this statistic is small under the null when \(n\) is large enough. It can be used, for example, for normality testing.

Test (Cramer-von Mises) An alternative is the Cramer-von Mises test, whose statistic is

\(W^2 = n \int \left( \hat{F}_n(x) - F(x) \right)^2 dF(x)\)
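A usage sketch of both tests via SciPy (assuming SciPy ≥ 1.6 for `cramervonmises`; the data here are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(size=500)

# One-sample KS test of H0: F = N(0, 1); statistic is D_n = sup_x |F_n(x) - F(x)|
print(stats.kstest(x, "norm"))

# Cramer-von Mises test of the same null
print(stats.cramervonmises(x, "norm"))
```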
2.3.4. Rank Tests
2.3.5. Permutation Test
Test (2-sample permutation test) To decide whether two samples \(X_1, ..., X_n\) and \(Y_1, ..., Y_m\) come from the same distribution, define \(N=m+n\) and consider all \(N!\) permutations of the pooled data. Computing the test statistic \(T\) on each permutation yields values \(T_1, ..., T_{N!}\). Under the null hypothesis these have the same distribution as the observed statistic \(T_{obs}\), so we can evaluate the p-value

\(p = \frac{1}{N!} \sum_{j=1}^{N!} I(T_j \geq T_{obs})\)

In practice, \(N!\) is too large to enumerate, so we randomly permute the data and compute the statistic (a Monte Carlo approximation).
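A Monte Carlo sketch of this test using \(T = |\bar{X} - \bar{Y}|\) (the statistic choice, helper name, and toy data are assumptions):

```python
import numpy as np

def permutation_test(x, y, n_perm=10_000, seed=0):
    """Two-sample permutation test with T = |mean(x) - mean(y)| (Monte Carlo version)."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    t_obs = abs(x.mean() - y.mean())
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling of the pooled sample
        t = abs(pooled[: len(x)].mean() - pooled[len(x):].mean())
        count += t >= t_obs
    return (count + 1) / (n_perm + 1)  # add-one correction keeps the p-value valid

rng = np.random.default_rng(8)
print(permutation_test(rng.normal(0, 1, 50), rng.normal(0.5, 1, 50)))
```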
3. Interval Estimation
Interval estimation, or more generally, set estimation, is an inference which states that \(\theta \in C(\mathbf{x})\) where \(C\) is determined by the observed value \(\mathbf{x}\).
Definition (interval estimate) An interval estimate of a parameter \(\theta\) is any pair of functions \(L(x_1, ..., x_n)\) and \(U(x_1, ..., x_n)\) of a sample that satisfy \(L(\mathbf{x}) \leq U(\mathbf{x})\) for all \(\mathbf{x}\). The random interval \([L(X), U(X)]\) is called an interval estimator.
Definition (coverage probability) For an interval estimator \([L(X), U(X)]\), the coverage probability is the probability that the random interval covers the true parameter \(\theta\):

\(P_\theta(\theta \in [L(X), U(X)])\)

Note that this is actually a statement about the random variable \(X\), not about \(\theta\); it is equivalent to

\(P_\theta(L(X) \leq \theta \text{ and } U(X) \geq \theta)\)

The coverage probability may depend on \(\theta\); the guaranteed confidence is the following:
Definition (confidence coefficient) The confidence coefficient of an interval estimator is

\(\inf_\theta P_\theta(\theta \in [L(X), U(X)])\)
Definition (confidence interval) Interval estimators, together with their confidence coefficient, are known as confidence intervals.
3.1. Methods of Finding Interval Estimators
3.1.1. Inverting a Test Statistic
Theorem (hypothesis tests and confidence sets) For each \(\theta_0 \in \Theta\), let \(A(\theta_0)\) be the acceptance region of a level \(\alpha\) test of \(H_0: \theta = \theta_0\). For each \(x\), define a set \(C(x)\) in the parameter space by

\(C(x) = \{ \theta_0 : x \in A(\theta_0) \}\)

Then the random set \(C(X)\) is a \(1-\alpha\) confidence set. Conversely, let \(C(X)\) be a \(1-\alpha\) confidence set; for any \(\theta_0 \in \Theta\), define

\(A(\theta_0) = \{ x : \theta_0 \in C(x) \}\)

Then \(A(\theta_0)\) is the acceptance region of a level \(\alpha\) test of \(H_0: \theta = \theta_0\).
3.1.2. Pivotal Quantities
Definition (pivotal quantity) Let \((X_i)_{i=1}^{n}\) be a random sample from a distribution with parameter \(\theta\) that is to be estimated. The random variable \(Q\) is said to be a pivotal quantity iff:
1) it is a function of \((X_i)_{i=1}^{n}\) and the unknown parameter \(\theta\), but does not depend on any other parameters
2) the probability distribution of \(Q\) does not depend on \(\theta\) or any other unknown parameters
The steps in the pivotal method for finding confidence interval are:
- find a pivotal quantity \(Q(X_1, ..., X_n, \theta)\)
- Find an interval for \(Q\) such that \(P(q_l \leq Q \leq q_h) = 1 - \alpha\)
- use algebraic manipulation to obtain \(P(\hat{\Theta}_l \leq \theta \leq \hat{\Theta}_h ) = 1-\alpha\)
normal mean interval estimation with known variance
Let \(X_1, ..., X_n\) be a random sample from \(N(\theta, 1)\). We know that

\(\bar{X} \sim N(\theta, 1/n)\)

and

\(Q = \sqrt{n}(\bar{X} - \theta) \sim N(0, 1)\)

therefore \(Q\) is a pivotal quantity and can be used to find a confidence interval:

\(P\left( \bar{X} - \frac{z_{\alpha/2}}{\sqrt{n}} \leq \theta \leq \bar{X} + \frac{z_{\alpha/2}}{\sqrt{n}} \right) = 1 - \alpha\)
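A sketch computing this interval from data (the true \(\theta = 1\), sample size, and seed are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
x = rng.normal(loc=1.0, scale=1.0, size=100)  # N(theta, 1) with theta = 1

alpha = 0.05
z = stats.norm.ppf(1 - alpha / 2)
half = z / np.sqrt(len(x))            # from the pivot Q = sqrt(n)(Xbar - theta) ~ N(0,1)
print(x.mean() - half, x.mean() + half)  # 95% confidence interval for theta
```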
4. Reference
[0] CMU 36705 Intermediate Statistics notes
[1] H. Pishro-Nik, "Introduction to probability, statistics, and random processes", available at https://www.probabilitycourse.com, Kappa Research LLC, 2014.
[2] Wasserman, Larry. All of statistics: a concise course in statistical inference. Springer Science & Business Media, 2013.
[3] Casella, George, and Roger L. Berger. Statistical inference. Vol. 2. Pacific Grove, CA: Duxbury, 2002.
[4] Hogg, Robert V., Joseph McKean, and Allen T. Craig. Introduction to mathematical statistics. Pearson Education, 2005.