
0x401 Classical Inference

The basic problem of statistical inference is the inverse of probability: Given the outcomes, what can we say about the process that generated the data?

More formally, statistical inference is the process of using data analysis to infer properties of an underlying probability distribution.

  • Frequentist Inference: the unknown quantity is assumed to be fixed but unknown

  • Bayesian Inference: the unknown quantity is assumed to be a random variable; we have some initial belief about its distribution, and we update that distribution after observing the data

1. Point Estimation

Point estimation consists of two parts: how to find point estimators, and how to evaluate them.

Definition (point estimator) A point estimator is any function \(W(X_1, X_2, ..., X_n)\) of a sample; that is, any statistic is a point estimator.

An estimator is a function of the sample; an estimate is the realized value of an estimator. There are often natural candidates for point estimators, but they do not always match our intuition.

1.1. Methods of Finding Estimators

1.1.1. Method of Moments

This method is perhaps the oldest method of finding point estimators, and it is a good starting place when other methods prove intractable. It can also be applied to obtain approximations to the distributions of statistics (e.g., the Satterthwaite approximation).

Definition (method of moments estimator) The method of moments estimator is obtained by equating the sample moments to the corresponding population moments and solving the resulting system:

\[\mu_1(\theta_1, ..., \theta_k) = \frac{1}{n} \sum_i X_i\]
\[\mu_2(\theta_1, ..., \theta_k) = \frac{1}{n} \sum_i X^2_i\]
\[...\]
\[\mu_k(\theta_1, ..., \theta_k) = \frac{1}{n} \sum_i X^k_i\]

The left side is a function of \(\theta_1, ..., \theta_k\) and the right side consists of the sample moments. Solving this system we obtain

\[\hat{\theta}_i = f_i(\frac{1}{n}\sum_j X_j, \frac{1}{n}\sum_j X^2_j, ..., \frac{1}{n}\sum_j X^k_j)\]

which is the estimator for \(\theta_i\).

MoM of Gaussian distribution

Consider \(X_1, ..., X_n \sim N(\mu, \sigma^2)\); applying MoM gives

\[\hat{\mu} = \frac{1}{n}\sum_i X_i\]
\[\hat{\mu}^2 + \hat{\sigma}^2 = \frac{1}{n}\sum_i X_i^2\]

Solving this system gives \(\hat{\mu} = \bar{X}\) and \(\hat{\sigma}^2 = \frac{1}{n}\sum_i (X_i - \bar{X})^2\)
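
As a quick sanity check, here is a minimal numpy sketch of these two moment equations (the parameter values and sample size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=10_000)  # sample with mu=2, sigma^2=9

# first and second sample moments
m1 = np.mean(x)
m2 = np.mean(x**2)

# solve mu = m1 and mu^2 + sigma^2 = m2
mu_hat = m1
sigma2_hat = m2 - m1**2  # equals the (biased) 1/n sample variance

print(mu_hat, sigma2_hat)  # close to 2 and 9
```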

MoM of GMM

Sometimes the MoM is easier to solve than the MLE. For example, if a GMM has 2 Gaussians and 5 parameters, we can compute 5 or 6 moments (using the MGF) and solve the MoM system. It is sometimes more stable than the MLE.

1.1.2. Maximum Likelihood Estimators

Definition (maximum likelihood estimator) The maximum likelihood estimator \(\hat{\theta}(\mathbf{x})\) is the parameter value \(\theta\) that maximizes \(L(\theta | \mathbf{x})\)

\[L(\theta | \mathbf{x}) = \prod_i f(x_i | \theta_1, ..., \theta_k)\]

Theorem (invariance property of MLE) If \(\hat{\theta}\) is the MLE of \(\theta\), then for any function \(\tau(\theta)\), the MLE of \(\tau(\theta)\) is \(\tau(\hat\theta)\)

Proposition (properties of MLE)

  • \(\hat{\theta}_{MLE}\) is consistent
  • \(\hat{\theta}_{MLE}\) is asymptotically unbiased
  • \(\hat{\theta}_{MLE}\) is approximately a normal random variable (asymptotically)

MLE and KL divergence

One way to interpret MLE is as minimizing the dissimilarity between the model distribution \(P(x | \theta)\) and the true distribution \(P(x | \theta^*)\), where the dissimilarity is measured by the KL divergence

\[D_{KL}(P(x|\theta^*) \| P(x | \theta)) = E_{x \sim P(x | \theta^*)} [\log P(x | \theta^*)] - E_{x \sim P(x | \theta^*)} [\log P(x | \theta)]\]

where the first term is constant in \(\theta\); by the law of large numbers, maximizing the second term is asymptotically equivalent to maximizing the log-likelihood, so minimizing this KL divergence recovers the MLE

Properties (Drawbacks of MLE) The drawbacks of MLE are:

  • finding and verifying the global maximum is difficult
  • numerical sensitivity (prone to overfitting, high variance)

How to solve these two problems?

  • To solve the first one, we can apply a numerical approach: the MLE can be maximized numerically whenever the likelihood can be written down (see the sketch after this list).

  • To solve the second, a Bayesian approach can help, or we can scale up the dataset.
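
A minimal sketch of the numerical route, assuming an illustrative exponential model with unknown rate: write down the negative log-likelihood and hand it to a generic optimizer.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.exponential(scale=1 / 2.5, size=500)  # iid Exp(rate=2.5) sample

def nll(rate):
    # negative log-likelihood of Exp(rate): -n log(rate) + rate * sum(x)
    return -len(x) * np.log(rate) + rate * x.sum()

res = minimize_scalar(nll, bounds=(1e-6, 100.0), method="bounded")
print(res.x, 1 / x.mean())  # numerical MLE vs closed form n / sum(x)
```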

For example, in a Markov language model, the MLE language model has the zero-count issue, so smoothing is necessary. This is a Bayesian approach: Laplace smoothing can be interpreted as MAP estimation with a uniform prior.

1.1.3. Bayes Estimators

In the classical approach, the parameter \(\theta\) is thought to be an unknown but fixed quantity. In the Bayesian approach, \(\theta\) is considered to be a random variable with a probability distribution \(\pi(\theta)\) (the prior distribution), a subjective distribution reflecting the experimenter's belief.

The prior is updated into the posterior distribution \(\pi(\theta|x)\) based on the observed sample

\[\pi(\theta|x) =\frac{f(x|\theta) \pi(\theta)}{\int f(x|\theta)\pi(\theta) d\theta}\]

There are several ways to obtain a Bayes point estimator. One is to take the posterior mean

\[\hat{\theta} = E[\theta|x]\]

Definition (MAP) Another is to take the posterior mode, which is the maximum a posteriori (MAP) estimate

\[\hat{\theta} = \mathrm{argmax}_\theta \pi(\theta|x)\]

Theorem (Bernstein-von Mises) Under some regularity conditions, the posterior is close to a Gaussian distribution

\[|| \pi(\theta | X_1, ..., X_n) - N(\hat{\theta}_n, I(\hat{\theta}_n)^{-1}/n) ||_{TV} \to 0\]

where \(\hat{\theta}_n\) is MLE and the distance measure is the total variation.

Example

Suppose we want to model a binary random variable \(X \in \{ 0, 1 \}\) with Bernoulli distribution

Given the sample \(\mathcal{D}=\{X_1, ..., X_N \}\), the likelihood can be written as

\[p(\mathcal{D}|\mu) = \prod_{n=1}^N \mu^{X_n}(1-\mu)^{1-X_n}\]

The ML estimator is

\[\hat{\mu}_{ML} = \frac{m}{N}\]

where \(m = \sum_n X_n\) is a sufficient statistic.

The conjugate prior of the Bernoulli/binomial distribution is the Beta distribution; recall its pdf is

\[B(\mu | a,b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)} \mu^{a-1}(1-\mu)^{b-1}\]

It has the same functional form as \(\mu^X (1-\mu)^{1-X}\), and can therefore serve as both prior and posterior.

The posterior has the form

\[\pi(\mu | m, l, a, b) \propto \mu^{m+a-1}(1-\mu)^{l+b-1}\]

where \(l = N - m\) is the number of observed zeros.

Using the properties of beta distributions, we know the MAP estimator is

\[\hat{\mu}_{MAP} = \frac{m+a-1}{N+a+b-2}\]

Be careful, however: the posterior mean is different

\[\bar{\mu} = \frac{m+a}{N+a+b}\]

If our goal is to predict the outcome of the next trial, we can use the posterior predictive

\[p(x=1 | \mathcal{D}) = \int_0^1 p(x=1|\mu)p(\mu|\mathcal{D}) d\mu = E[\mu | \mathcal{D}] = \bar{\mu}\]
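
A minimal sketch of this conjugate update (the prior hyperparameters a, b and the true μ are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.binomial(1, 0.3, size=50)  # Bernoulli(mu=0.3) sample
N, m = len(x), x.sum()

a, b = 2.0, 2.0  # Beta(a, b) prior
mu_mle = m / N
mu_map = (m + a - 1) / (N + a + b - 2)  # posterior mode
mu_bar = (m + a) / (N + a + b)          # posterior mean = p(x=1 | D)

print(mu_mle, mu_map, mu_bar)
```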

1.2. Methods of Evaluating Estimators

The general topic of evaluating statistical procedures is the branch known as decision theory

1.2.1. Mean Squared Error

Definition (mean squared error) The mean squared error (MSE) of an estimator \(W\) of a parameter \(\theta\) is the function of \(\theta\) defined by \(E_{\theta}(W-\theta)^2\)

\[E_{\theta} (W(X_1, ..., X_n) - \theta)^2 = \int ... \int (W(x_1, ..., x_n) - \theta)^2 f(x_1|\theta) ... f(x_n|\theta) dx_1 ... dx_n\]

Lemma (bias-variance decomposition) The mean squared error has a bias-variance decomposition as follows:

\[E_{\theta} (W-\theta)^2 = (E_\theta W - \theta)^2 + Var_\theta(W)\]

In ML terms (Andrew Ng's lecture):

  • Bias is error from the algorithm/estimator itself; it corresponds to underfitting (high error on the training set)
  • Variance is error from sensitivity to small fluctuations in the training set, which causes poor performance on the dev set; it corresponds to overfitting (high error on the test set).

Definition (bias) The bias of a point estimator \(W\) of a parameter \(\theta\) is the difference between the expected value of \(W\) and \(\theta\); that is

\[Bias_{\theta}W = E_{\theta}W - \theta\]

Definition (unbiasedness) An estimator whose bias is identically 0 is called unbiased; that is, for all \(\theta\)

\[E_{\theta} W = \theta\]

MSE for normal distribution

Let \(X_1, ..., X_n\) be a sample from a \(n(\mu, \sigma^2)\) distribution. The statistics \(\bar{X}, S^2\) are both unbiased estimators, so their MSEs are

\[E(\bar{X} - \mu)^2 = Var(\bar{X}) = \frac{\sigma^2}{n}\]
\[E(S^2 - \sigma^2)^2 = Var(S^2) = \frac{2\sigma^4}{n-1}\]

a small increase in bias can buy a large decrease in variance

Let \(X_1, ..., X_n\) be from a \(n(\mu, \sigma^2)\) distribution, and compare the unbiased estimator \(S^2\) with the (biased) MLE

\[\hat{\sigma}^2 = \frac{n-1}{n}S^2\]

We can evaluate its MSE as

\[E(\hat{\sigma}^2 - \sigma^2)^2 = \frac{2n-1}{n^2}\sigma^4\]

This MSE is smaller than the MSE of the unbiased estimator (the reduction in variance outweighs the added bias)
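
A small simulation (parameter values are arbitrary) confirming the two closed-form MSEs above:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma2, n, reps = 0.0, 4.0, 10, 200_000

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
s2 = x.var(axis=1, ddof=1)   # unbiased S^2
mle = x.var(axis=1, ddof=0)  # biased MLE, (n-1)/n * S^2

print(np.mean((s2 - sigma2) ** 2))   # ~ 2 sigma^4 / (n-1) = 3.56
print(np.mean((mle - sigma2) ** 2))  # ~ (2n-1)/n^2 * sigma^4 = 3.04
```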

1.2.2. Best Unbiased Estimator

There is no one "best MSE" estimator because the class of all estimators is too large (e.g., the estimator \(\hat{\theta}=17\) cannot be beaten in MSE at \(\theta=17\), but is terrible otherwise).

One way to make the problem tractable is to limit the class of estimators to unbiased estimators; under this restriction, minimizing the MSE reduces to comparing variances

Definition (best unbiased estimator) An estimator \(W^*\) is a best unbiased estimator of \(\tau(\theta)\) if it satisfies \(E_\theta W^* = \tau(\theta)\) for all \(\theta\), and, for any other estimator \(W\) with \(E_\theta(W) = \tau(\theta)\), we have for all \(\theta\)

\[Var_{\theta} W^* \leq Var_{\theta} W\]

\(W^*\) is also called a uniform minimum variance unbiased estimator.

Even within the class of unbiased estimators, there can be infinitely many candidates, so it can be hard to verify that a given estimator is the best one. One approach is to find a lower bound on the variance; to do so we need some related concepts.

Definition (score function) The score function is defined to be

\[s(X; \theta) = \frac{\partial{\log f(X; \theta)}}{\partial \theta}\]

The score function is a random variable (a function of \(X\)) indexed by the parameter \(\theta\). It can also be defined for a sample \(X = (X_1, ..., X_n)\)

\[s(X; \theta) = \sum_i s(X_i; \theta) = \sum_i \nabla_\theta \log f(X_i | \theta)\]

Lemma (expectation of score function) One property of score function is its expectation is 0

\[E(s(X; \theta)) = 0\]

Proof: interchanging differentiation and integration (justified by the dominated convergence theorem), we can derive this as follows:

\[\int \nabla_\theta\log p(x; \theta) \, p(x; \theta) dx = \int \frac{\nabla_\theta p(x; \theta)}{p(x; \theta)}p(x; \theta) dx = \int\nabla_\theta p(x; \theta) dx = \nabla_\theta \int p(x; \theta) dx = \nabla_\theta 1 = 0\]

Lemma (variance of score function) variance of the score function is called the Fisher information

\[Var(s(X; \theta)) = I(\theta)\]

In essence, the Fisher information measures the expected curvature of the log-likelihood function around the point \(\theta\). If the log-likelihood is more curved (larger \(I(\theta)\)), then \(\theta\) is easier to estimate.

Definition (Fisher information) As discussed, the Fisher information is defined as

\[I_n(\theta) = Var(s(X; \theta)) = E_\theta(s(X; \theta)s(X; \theta)^T)\]

When there are \(d\) parameters, \(\theta = (\theta_1, ..., \theta_d)\), the Fisher information is a \(d \times d\) matrix, called the Fisher Information Matrix (FIM). The matrix is positive semidefinite. If it is positive definite, it defines a Riemannian metric on the \(d\)-dimensional parameter space, which is used in Information Geometry.

For an iid sample of size \(n\), it is easy to see

\[I_n(\theta) = nI(\theta)\]

Under some regularity conditions, this can be simplified into

\[I(\theta) = - E_\theta(\frac{\partial^2}{\partial \theta^2}\log f(X|\theta))\]

Larger Fisher information indicates more information about \(\theta\), and therefore a smaller variance for the best unbiased estimator

Bernoulli distribution

Let \(X \sim Ber(p)\); the log-likelihood is

\[\log f(X; p) = X \log p + (1-X)\log(1-p)\]

Taking the derivative with respect to \(p\) gives the score function

\[s(X; p) = \frac{X-p}{p(1-p)}\]

The Fisher information is

\[I(p) = E(s(X)^2) = \frac{1}{p(1-p)}\]

Normal distribution

Let \(X \sim N(\mu, \sigma^2)\), then the score for \(\mu\) is

\[s(X; \mu) = \frac{X-\mu}{\sigma^2}\]

The Fisher information is

\[I(\mu) = E(s(X; \mu)^2) = \frac{1}{\sigma^2}\]
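
Both facts (zero mean, variance \(1/\sigma^2\)) are easy to check by simulation; the parameter values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=1_000_000)

score = (x - mu) / sigma**2        # s(X; mu) for the normal model
print(score.mean())                # ~ 0
print(score.var(), 1 / sigma**2)   # both ~ I(mu) = 0.25
```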

exponential family

The log-likelihood of an exponential family sample is as follows (ignoring constants)

\[\log f(X; \theta) = \sum_i \theta_i \sum_j T_i(X_j) - nA(\theta)\]

The Fisher information is simply the Hessian of the log-partition function

\[I(\theta) = n \nabla^2_\theta A(\theta)\]

The Cramer-Rao bound gives a lower bound on the variance of an unbiased estimator.

Theorem (unbiased scalar Cramer-Rao Lower bound)

\[Var (\hat{\theta}) \geq \frac{1}{I(\theta)}\]

Theorem (scalar Cramer-Rao Lower bound) Let \(X_1, ..., X_n\) be a sample of pdf \(f(x|\theta)\), and let \(W(X)=W(X_1, ..., X_n)\) be any estimator satisfying

\[ \frac{d}{d\theta} E_\theta W(X) = \int \frac{\partial}{\partial \theta} [W(x)f(x|\theta)] dx\]

and

\[Var_\theta W(X) < \infty\]

then

\[Var_\theta (W(X)) \geq \frac{(\frac{d}{d\theta} E_\theta W(X))^2}{E_\theta((\frac{\partial}{\partial \theta} \log f(X|\theta))^2)}\]

The formula can be cleaned up a bit: if \(W(X)\) is an unbiased estimator of \(\theta\), then the numerator is

\[\frac{d}{d\theta} E_\theta W(X) = 1\]

Under the iid assumption on the sample, the denominator becomes

\[E_\theta((\frac{\partial}{\partial \theta} \log f(X|\theta))^2) = n E_\theta ((\frac{\partial}{\partial \theta} \log f(X_1|\theta))^2)\]

A shortcoming of the Cramer-Rao bound is that it may be strictly smaller than the variance of every unbiased estimator, i.e., unattainable.

normal variance bound cannot be attained

Let \(X_1, ..., X_n\) be \(n(\mu, \sigma^2)\) and consider estimation of \(\sigma^2\) where \(\mu\) is unknown.

The CR lower bound for any unbiased estimator \(W\) of \(\sigma^2\) is

\[Var(W|\mu, \sigma^2) \geq 2\sigma^4/n\]

However, the sample variance has

\[Var(S^2|\mu, \sigma^2) = 2\sigma^4/(n-1)\]

When \(\mu\) is unknown, the bound cannot be attained

1.2.3. Sufficiency and Unbiasedness

Theorem (Rao-Blackwell) Let \(W\) be an unbiased estimator for \(\tau(\theta)\), and \(T\) be a sufficient statistic for \(\theta\). Then the following \(\phi(T)\) is a uniformly better unbiased estimator of \(\tau(\theta)\)

\[\phi(T)=E(W|T)\]

1.3. Decision Theory

Mean squared error is a special case of a loss function; the study of the performance of estimators using loss functions is a branch of decision theory. A decision-theoretic analysis judges how well an estimator succeeds in simultaneously minimizing both bias and variance.

Definition (risk function) The quality of an estimator \(\delta(X)\) is quantified by its risk function

\[R(\theta, \delta) = E_{\theta} L(\theta, \delta(X))\]

MSE is the risk with square error loss

For square error loss,

\[L(\theta, a) = (a - \theta)^2\]

the risk function is mean square error (MSE) where

\[MSE(\theta) = R(\theta, \delta) = E_{\theta}(\delta - \theta)^2 = \mathrm{Var}_\theta \delta(X) + (\mathrm{Bias} \delta(X))^2\]

For a fixed \(\theta\), the risk function is the average loss incurred if the estimator \(\delta(X)\) is used. However, since the true \(\theta\) is unknown, we would like to use an estimator that has a small value of \(R(\theta, \delta)\) for all values of \(\theta\)

Since optimality changes with \(\theta\), to compare two estimators we compare their risk functions. However, there may be no clear answer as to which one is better

there is no uniformly better estimator

Let \(X \sim N(\theta, 1)\) and consider two estimators \(\hat{\theta}_1 = X\) and \(\hat{\theta}_2 = 3\). The risk function of the first is \(R(\theta, \hat{\theta}_1) = Var(X) = 1\); that of the second is \(R(\theta, \hat{\theta}_2) = (\theta-3)^2\). Neither estimator dominates the other uniformly.

Therefore, we need a one-number summary of the risk function. Two such summaries are the Bayes risk and the maximum risk, and each suggests a method of devising estimators

  • Bayes risk leads to Bayes estimator
  • Maximum risk leads to Minimax estimator

1.3.1. Bayes Risk

Definition (Bayes risk) In Bayesian analysis we would use the prior distribution \(\pi(\theta)\) to compute an average risk, known as the Bayes risk

\[R_{B} (\delta) = \int_{\Theta} R(\theta, \delta) \pi(\theta) d\theta\]

The Bayes estimator or Bayes decision rule is one which minimizes the expected risk:

\[\delta_{B} = \text{argmin}_{\delta} (R_B(\delta))\]

An alternative approach is to use minimax risk

The Bayes risk can be rewritten using the posterior risk \(r(\delta | x^n)\), where

\[r(\delta | x^n) = \int L(\theta, \delta) \pi(\theta | x^n) d\theta\]

Theorem (Bayes estimator) The Bayes risk satisfies

\[R_B(\delta) = \int r(\delta | x^n) m(x^n) dx^n\]

where \(m(x^n)\) is the marginal density of the sample; hence the Bayes estimator minimizes the posterior risk for each \(x^n\)

\[\delta_B = \text{argmin}_\delta r(\delta | x^n)\]

Bayes estimator for mean square error

For squared error loss, the posterior risk is

\[r(\delta | x^n) = \int_\Theta (\theta - \delta)^2 \pi(\theta | x^n) d\theta = E((\delta - \theta)^2 | x^n)\]

The minimizer is the Bayes estimator; it is known as the minimum mean square error (MMSE) estimator

\[\delta_B = E[\theta | x]\]

Bayes estimator for other losses

If \(L(\theta, \delta) = | \theta - \delta|\), then the Bayes estimator is the posterior median; if the loss is the 0-1 loss, then the Bayes estimator is the posterior mode

1.3.2. Minimax Risk

Definition (maximum risk) The maximum risk of an estimator is

\[R_{max}(\delta) = \sup_{\theta} R(\theta, \delta)\]

A minimax rule is one which minimizes the maximum risk

\[\delta_{MM} = \text{argmin}_{\delta} R_{max}(\delta)\]

We can give both upper and lower bound of minimax risk:

Lemma (upper bound of minimax risk) Given an estimator \(\hat{\theta}_{up}\), we can use it to upper bound minimax risk

\[\inf_{\tilde{\theta}} \sup_\theta R(\theta, \tilde{\theta}) \leq \sup_\theta R(\theta, \hat{\theta}_{up})\]

Lemma (lower bound of minimax risk) The Bayes risk of the Bayes estimator for any prior \(\pi\) lower-bounds the minimax risk

\[B_{\pi}(\hat{\theta}_{low}) \leq B_{\pi}(\hat{\theta}_{mm}) \leq \sup_\theta R(\theta, \hat{\theta}_{mm}) = \inf_{\tilde{\theta}} \sup_\theta R(\theta, \tilde{\theta})\]

Theorem (least favorable prior) Let \(\hat{\theta}\) be the Bayes estimator for some prior \(\pi\), if

\[(\forall \theta) R(\theta, \hat{\theta}) \leq B_{\pi}(\hat{\theta})\]

then \(\hat{\theta}\) is minimax and \(\pi\) is called a least favorable prior.

Theorem (constant risk) Suppose \(\hat{\theta}\) is the Bayes estimator wrt some prior \(\pi\). If the risk is constant, then \(\hat{\theta}\) is minimax.

1.4. Asymptotic Evaluation

1.4.1. Consistency

The consistency property requires that the estimator converge to the correct value as the sample size becomes infinite

Definition (consistency) A sequence of estimators \(W_n = W_n(X_1, ..., X_n)\) is a consistent sequence of estimators of the parameter \(\theta\) if for every \(\epsilon > 0\) and every \(\theta \in \Theta\)

\[\lim_{n \to \infty} P_{\theta}(|W_n - \theta| < \epsilon ) = 1\]

Theorem (consistency of MLE) Let \(X_1, ..., X_n\) be iid \(f(x|\theta)\), let \(L(\theta |x)\) be the likelihood function, let \(\hat{\theta}\) denote the MLE of \(\theta\), and let \(\tau(\theta)\) be a continuous function of \(\theta\). Under some regularity conditions, for every \(\epsilon > 0\) and \(\theta \in \Theta\)

\[\lim_{n \to \infty} P_{\theta}(|\tau(\hat{\theta}) - \tau(\theta)| \geq \epsilon) = 0\]

that is \(\tau(\hat{\theta})\) is a consistent estimator of \(\tau(\theta)\)

1.4.2. Efficiency

Efficiency is concerned with the asymptotic variance of an estimator

Definition (limiting variance) For an estimator \(T_n\), if \(\lim_{n \to \infty} k_n Var(T_n) = \tau^2 < \infty\), where \(k_n\) is a sequence of constants, then \(\tau^2\) is called the limiting variance or limit of the variances.

Definition (asymptotic variance) For an estimator \(T_n\), if

\[k_n(T_n - \tau(\theta)) \to n(0, \sigma^2)\]

in distribution, then \(\sigma^2\) is called the asymptotic variance or variance of the limit distribution of \(T_n\)

Definition (asymptotically efficient) A sequence of estimators \(W_n\) is asymptotically efficient for a parameter \(\tau(\theta)\) if \(\sqrt{n}[W_n - \tau(\theta)] \to n(0, \nu(\theta))\) in distribution and

\[\nu(\theta) = \frac{[\tau'(\theta)]^2}{E_\theta((\frac{\partial}{\partial \theta} \log f(X|\theta))^2)}\]

that is, the asymptotic variance of \(W_n\) achieves the Cramer-Rao Lower Bound.

Theorem (asymptotic efficiency of MLEs) Let \(\hat{\theta}\) denote the MLE of \(\theta\), and \(\tau(\theta)\) be a continuous function of \(\theta\). Under some regularity conditions, we have

\[\sqrt{n}[\tau(\hat{\theta}) - \tau(\theta)] \to n(0, \nu(\theta))\]

where \(\nu(\theta)\) again is the Cramer-Rao Lower Bound. Therefore, \(\tau(\hat{\theta})\) is a consistent and asymptotically efficient estimator of \(\tau(\theta)\)

Proof idea: apply a Taylor expansion to \(l'(\theta | x)\), where \(l(\theta | x) = \sum_i \log f(x_i | \theta)\):

\[l'(\theta | x) = l'(\theta_0 | x) + (\theta - \theta_0)l''(\theta_0 | x) + ...\]

Evaluating at \(\theta = \hat{\theta}\), the left side is 0 by the definition of the MLE; ignoring higher-order terms and rearranging, we obtain

\[\sqrt{n}(\hat{\theta} - \theta_0) = \frac{\frac{1}{\sqrt{n}} l'(\theta_0 | x)}{-\frac{1}{n}l''(\theta_0 | x)}\]

where the numerator converges to \(n(0, I(\theta_0))\) in distribution and the denominator converges to \(I(\theta_0)\) in probability; Slutsky's theorem then proves the statement.

2. Hypothesis Testing

Like point estimation, hypothesis testing is another statistical inference method.

Definition (hypothesis) A hypothesis is a statement about a population parameter. The complementary hypotheses are called the null hypothesis and the alternative hypothesis, denoted by \(H_0, H_1\) respectively. The general form of hypotheses about \(\theta\) is \(H_0: \theta \in \Theta_0\) and \(H_1: \theta \in \Theta_0^c\)

If a hypothesis specifies only one possible distribution (the parameter space has only one element, e.g., \(\Theta = \{ 0.5 \}\)), it is called a simple hypothesis; otherwise it is called a composite hypothesis. It is often the case that the null hypothesis is chosen to be simple.

Definition (hypothesis test) A hypothesis test is a rule that specifies

  • For which sample values the decision is made to accept \(H_0\)
  • For which sample values \(H_0\) is rejected and \(H_1\) is accepted as true

Be careful that to say accept/reject a hypothesis is not equivalent to saying that the hypothesis is true/false.

procedure of hypothesis test

Typically, a hypothesis test is specified in terms of a test statistic. In particular, we can construct a test by:

  • choose a test statistic \(T(X_1, ..., X_n)\)
  • choose a critical value \(t\) and define the rejection region \(R=\{ (x_1,...,x_n) | T(x_1,...,x_n) \geq t \}\)
  • if \(T \geq t\), or equivalently \((X_1, ..., X_n) \in R\), we reject \(H_0\); otherwise we retain \(H_0\)

2.1. Methods of Finding Tests

2.1.1. Neyman-Pearson

Simple vs Simple

Lemma (Neyman-Pearson) Consider testing \(H_0: \theta = \theta_0\) vs \(H_1: \theta = \theta_1\)

Let \(L(\theta) = p(X_1, ..., X_n; \theta)\) and

\[T_n = \frac{L(\theta_1)}{L(\theta_0)}\]

Suppose we reject \(H_0\) if \(T_n > k\), where \(k\) is chosen so that \(P_{\theta_0}(X^n \in R)=\alpha\). This is a UMP level \(\alpha\) test

2.1.2. Wald Tests

Simple vs Composite

Definition (Wald test) Suppose we are interested in testing the hypothesis \(H_0: \theta = \theta_0\) vs \(H_1: \theta \neq \theta_0\). The Wald test is based on an asymptotically normal estimator, e.g., the MLE \(\hat{\theta}\); under \(H_0\),

\[T_n = \frac{\hat{\theta} - \theta_0}{\sigma_0} \to N(0, 1)\]

where \(\sigma_0\) is the standard error of \(\hat{\theta}\). If \(\sigma_0\) is unknown, we can use an estimate \(S\) instead

\[T_n = \frac{\hat{\theta} - \theta_0}{S} \to N(0, 1)\]

Under the null, we reject \(H_0\) if \(|T_n| \geq z_{\alpha/2}\)

This only controls Type-I error asymptotically, but is standard in applications.

Bernoulli Wald test

Let \(X_1, ..., X_n \sim Ber(p)\); the Wald test is based on the statistic

\[T_n = \frac{\hat{p}-p_0}{\sqrt{p_0(1-p_0)/n}}\]

alternatively, we can use

\[T_n = \frac{\hat{p}-p_0}{\sqrt{\hat{p}(1-\hat{p})/n}}\]
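
A minimal sketch of the plug-in version at level α = 0.05 (the sample and the true p are illustrative):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
x = rng.binomial(1, 0.55, size=200)  # data generated with p = 0.55
p0, alpha = 0.5, 0.05                # H0: p = p0

p_hat = x.mean()
se = np.sqrt(p_hat * (1 - p_hat) / len(x))
T = (p_hat - p0) / se

print(T, abs(T) >= norm.ppf(1 - alpha / 2))  # reject if |T| >= z_{alpha/2}
```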

2.1.3. Likelihood Ratio Tests

Composite vs Composite

Definition (likelihood ratio test statistic, LRT) The likelihood ratio test statistic for testing \(H_0 : \theta \in \Theta_0\) versus \(H_1: \theta \in \Theta_0^c\) is

\[\lambda(x) = \frac{\sup_{\Theta_0} L(\theta | \mathbf{x})}{\sup_{\Theta} L(\theta | \mathbf{x})}\]

A likelihood ratio test (LRT) is any test that has a rejection region of the form \(\{ \mathbf{x} : \lambda(\mathbf{x}) \leq c \}\) where \(0 \leq c \leq 1\)

normal LRT

Let \(X_1, ..., X_n\) be a random sample from a \(n(\theta, 1)\) distribution, and consider testing \(H_0: \theta = \theta_0\) vs \(H_1: \theta \neq \theta_0\)

The LRT statistic is

\[\lambda(x) = \frac{L(\theta_0 | x)}{L(\bar{x}|x)} = \exp( -n (\bar{x} - \theta_0)^2 / 2)\]

The rejection region \(\{ x: \lambda(x) \leq c \}\) is

\[\{ |\bar{x} - \theta_0| \geq \sqrt{-2\log(c)/n} \}\]

Note that the rejection region can be simplified to an expression involving a simpler sufficient statistic.

Theorem (test based on sufficient statistics) If \(T(\mathbf{X})\) is a sufficient statistic for \(\theta\) and \(\lambda^{*}(t), \lambda(\mathbf{x})\) are the LRT statistics based on \(T,\mathbf{X}\), then \(\lambda^{*}(T(\mathbf{x}))=\lambda(\mathbf{x})\) for every \(\mathbf{x}\) in the sample space

LRT of simple vs composite hypothesis has an asymptotic approximation

Theorem (Wilks' phenomenon) Consider testing \(H_0: \theta = \theta_0\) versus \(H_1: \theta \neq \theta_0\). Under \(H_0\), we can approximate

\[-2 \log(\lambda(X)) \sim \chi^2_1\]

Proof This can be shown using a Taylor expansion around the MLE \(\hat{\theta}\) (the first-order term vanishes because \(l'(\hat{\theta}) = 0\)):

\[l(\theta) \approx l(\hat{\theta}) + l''(\hat{\theta})(\theta - \hat{\theta})^2/2\]

therefore

\[-2 \log \lambda = 2l(\hat{\theta}) - 2l(\theta_0) = \frac{-(1/n)l''(\hat{\theta})}{I_1(\theta_0)} (\sqrt{nI_1(\theta_0)}(\hat{\theta}-\theta_0))^2\]

where the first factor converges to 1 in probability by the WLLN and the second factor converges in distribution to \(N(0,1)^2 = \chi^2_1\).
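
For the normal-mean LRT above, \(-2\log\lambda = n(\bar{x} - \theta_0)^2\) exactly, so Wilks' approximation is easy to check by simulation (a sketch with arbitrary n):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)
n, theta0, reps = 30, 0.0, 100_000

xbar = rng.normal(theta0, 1 / np.sqrt(n), size=reps)  # xbar under H0
stat = n * (xbar - theta0) ** 2                       # -2 log lambda

print(np.quantile(stat, [0.5, 0.9, 0.95]))  # empirical quantiles
print(chi2.ppf([0.5, 0.9, 0.95], df=1))     # chi^2_1 quantiles
```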

2.1.4. Score Test (Lagrange Multiplier Test)

If we are testing \(H_0: \theta = \theta_0\), consider the score statistic

\[S(\theta) = \frac{\partial}{\partial \theta }\log f(X | \theta)\]

If the null hypothesis is true, we know \(E[S(\theta_0)] = 0\) and \(Var(S(\theta_0)) = I_n(\theta_0)\). The test statistic for the score test is then

\[Z_S = S(\theta_0)/\sqrt{I_n(\theta_0)}\]

If the null hypothesis is true, this statistic has mean 0 and variance 1 (and is asymptotically standard normal)

2.1.5. Bayesian Test

In a Bayesian model, we have a prior, so we can combine it with the likelihood to get the posterior \(\pi(\theta | x)\); we can create a test by comparing the posterior odds

\[\frac{P(\theta \in \Theta_0 | x)}{P(\theta \in \Theta^c_0 | x)}\]

or simply compare the probabilities directly

\[P(\theta \in \Theta_0 | x) \geq P(\theta \in \Theta^c_0 | x)\]

2.1.6. Union Intersection Test

Theorem (union-intersection) The union-intersection method of test construction is useful when the null hypothesis is conveniently expressed as an intersection.

\[H_0: \theta \in \bigcap_{\gamma \in \Gamma} \Theta_{\gamma}\]

Suppose the rejection region for each test of \(H_{0\gamma}\) vs \(H_{1\gamma}\) is \(\{ x: T_{\gamma}(x) \in R_{\gamma} \}\); the rejection region of the union-intersection test is the union

\[\bigcup_{\gamma \in \Gamma} \{ x: T_{\gamma}(x) \in R_{\gamma} \}\]

2.2. Methods of Evaluating Tests

Definition (Type I, II error)

  • Type I error: rejecting \(H_0\) when it is true (i.e., \(\theta \in \Theta_0\) but \(H_0\) is rejected); a false positive
  • Type II error: accepting \(H_0\) when it is false (i.e., \(\theta \in \Theta_0^c\) but \(H_0\) is accepted); a false negative

In general, an attempt to decrease one type of error is accompanied by an increase in the other type of error, so a compromise has to be made. The only way to reduce both types of error is to increase the sample size.

Definition (power function) Suppose \(R\) denotes the reject region for a test. Then the power function of this test is the function of \(\theta\) defined by

\[\beta(\theta) = P_\theta (X \in R)\]

Ideally, we would like \(\beta(\theta)=0\) for \(\theta \in \Theta_0\) (no Type I error) and \(\beta(\theta)=1\) for \(\theta \in \Theta_0^c\) (no Type II error)

Typically, the power function of a test will depend on the sample size \(n\), therefore by considering the power function, the experimenter can choose \(n\) to achieve some test goal.

normal power function

Let \(X_1, ..., X_n\) be a random sample from \(n(\theta, \sigma^2)\) population with \(\sigma^2\) known.

We consider the LRT of \(H_0: \theta \leq \theta_0\) vs \(H_1: \theta > \theta_0\), a test whose rejection region is

\[R = \{ X_1, ..., X_n | \frac{\bar{X}-\theta_0}{\sigma/\sqrt{n}} > c \}\]

Then, the power function of this test is

\[\beta(\theta) = P_{\theta}(\frac{\bar{X}-\theta_0}{\sigma/\sqrt{n}} > c) = P(Z > c+\frac{\theta_0 - \theta}{\sigma/\sqrt{n}})\]

where \(Z\) is a standard normal random variable.
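
Since \(\beta(\theta)\) has this closed form, it can be evaluated directly; a sketch (with c, n, σ chosen for illustration) that could also be used to pick \(n\) for a target power:

```python
import numpy as np
from scipy.stats import norm

def power(theta, theta0=0.0, sigma=1.0, n=25, c=1.645):
    # beta(theta) = P(Z > c + (theta0 - theta) / (sigma / sqrt(n)))
    return norm.sf(c + (theta0 - theta) * np.sqrt(n) / sigma)

print(power(0.0))  # size at theta = theta0: ~0.05
print(power(0.5))  # power at theta = 0.5: ~0.80
```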

For a fixed sample size, it is impossible to make both errors arbitrarily small, so it is common to restrict consideration to tests that control the Type I error at a specified level and, within that class, minimize the Type II error as much as possible.

point estimation and hypothesis testing

Notice the similarity between point estimation and hypothesis testing

  • in point estimation, we restrict estimators to a restricted class (unbiased estimators) and then minimize variance within that class
  • in hypothesis testing, we restrict tests to level-\(\alpha\) tests and minimize the Type II error (find the most powerful test) within that class.

Definition (size \(\alpha\) test, level \(\alpha\) test)

For \(0 \leq \alpha \leq 1\), a test with power function \(\beta(\theta)\) is called size \(\alpha\) test if \(\sup_{\theta \in \Theta_0} \beta(\theta) = \alpha\). It is called level \(\alpha\) test if \(\sup_{\theta \in \Theta_0} \beta(\theta) \leq \alpha\).

\(\alpha\) is called the level of significance

The previous methods only yield test statistics and general forms for rejection regions, but do not lead to one specific test; for example, the LRT does not specify \(c\). The restriction to size \(\alpha\) then determines the choice of \(c\), selecting one test out of the class.

Definition (unbiased test) A test with power function \(\beta(\theta)\) is called unbiased iff \(\beta(\theta') \geq \beta(\theta'')\) for all \(\theta' \in \Theta_0^c, \theta'' \in \Theta_0\)

2.2.1. UMP Test

Definition (uniformly most powerful) Let \(\mathcal{C}\) be a class of tests for testing \(H_0: \theta \in \Theta_0\) versus \(H_1: \theta \in \Theta_0^c\). A test in class \(\mathcal{C}\), with power function \(\beta(\theta)\), is a uniformly most powerful (UMP) class \(\mathcal{C}\) test if \(\beta(\theta) \geq \beta'(\theta)\) for every \(\theta \in \Theta_0^c\) and every power function \(\beta'(\theta)\) of a test in the class.

Theorem (Neyman-Pearson) Consider testing \(H_0: \theta = \theta_0\) vs \(H_1: \theta = \theta_1\), using a test with rejection region \(R\) that satisfies

\[f(x | \theta_1) > k f(x|\theta_0) \implies x \in R\]
\[f(x | \theta_1) < k f(x|\theta_0) \implies x \in R^c\]

for some \(k \geq 0\) and

\[\alpha = P_{\theta_0}(X \in R)\]
  • (Sufficiency) Any test that satisfies those conditions is a UMP level \(\alpha\) test

  • (Necessity) If there exists a test satisfying those conditions with \(k > 0\), then every UMP level \(\alpha\) test is a size \(\alpha\) test.

Theorem (Karlin-Rubin) Consider testing \(H_0: \theta \leq \theta_0\) vs \(H_1: \theta > \theta_0\). Suppose that \(T\) is a sufficient statistic for \(\theta\) and the family of pdfs \(\{ g(t|\theta): \theta \in \Theta \}\) of \(T\) has the MLR (monotone likelihood ratio) property. Then for any \(t_0\), the test that rejects \(H_0\) iff \(T > t_0\) is a UMP level \(\alpha\) test, where \(\alpha = P_{\theta_0}(T > t_0)\)

2.2.2. P-value

Reporting "rejecting \(H_0\)" or "retain \(H_0\)" is not very informatvie. Instead, we could ask, for every \(\alpha\), whether the test rejects at that level. At \(\alpha=1\) we will always reject and \(\alpha=0\) we will always retain. Therefore, there is a smallest threshold rejection \(\alpha\) and this is the p-value.

\[\text{p-value} = \inf \{ \alpha | T(X) \in R_\alpha \}\]

Intuitively, p-value is the lowest significance level \(\alpha\) that results in rejecting the null hypothesis.

Definition (formal p-value) A p-value \(p(X)\) is a test statistic satisfying \(0 \leq p(x) \leq 1\) for every sample \(x\); small values of \(p(x)\) give evidence that \(H_1\) is true. A p-value is valid iff for every \(\theta \in \Theta_0\) and \(0 \leq \alpha \leq 1\)

\[P_{\theta}(p(X) \leq \alpha) \leq \alpha\]

The most common way to construct a valid p-value is the following:

Theorem (p-value) Let \(W(X)\) be a test statistic such that large values of \(W\) give evidence that \(H_1\) is true. For each sample point \(x\), define

\[p(x) = \sup_{\theta \in \Theta_0} P_\theta (W(X) \geq W(x))\]

Then, \(p(X)\) is a valid p-value

Intuitively, this means the p-value is the probability, under \(H_0\), of observing a value of the test statistic as extreme as or more extreme than what was actually observed

2.2.3. Multiple Testing

Definition (Family-Wise Error Rate) The family-wise error rate (FWER) is the probability of falsely rejecting any true null hypothesis across multiple tests.

Suppose we want to control the FWER at level \(\alpha\) across \(d\) hypothesis tests; we can use the following corrections:

Theorem (Sidak correction) We reject a test if its p-value is smaller than

\[\alpha_t = 1-(1-\alpha)^{1/d}\]

If the p-values are all independent, then \(FWER \leq \alpha\)

Theorem (Bonferroni correction) If independence is not known, we can use the Bonferroni correction: we reject a test if its p-value is smaller than

\[\alpha/d\]

The cost of this protection is that these corrections tend to fail to reject false null hypotheses (hence more Type II errors). The Bonferroni correction can be improved with Holm's procedure, which is uniformly more powerful than the Bonferroni correction

Theorem (Holm's procedure)

  • order the p-values \(p_{(1)} \leq p_{(2)} \leq ... \leq p_{(d)}\), with corresponding hypotheses \(H_{(1)}, ..., H_{(d)}\)
  • if \(p_{(1)} \leq \alpha/d\), reject \(H_{(1)}\) and move on; otherwise stop and accept all \(H_{(i)}\)
  • if \(p_{(2)} \leq \alpha/(d-1)\), reject \(H_{(2)}\) and move on; otherwise stop and accept \(H_{(2)}, ..., H_{(d)}\)
  • if \(p_{(d)} \leq \alpha\), reject \(H_{(d)}\); otherwise accept \(H_{(d)}\)

Definition (false discovery rate) Denote the number of false rejections by \(V\) and the total number of rejections by \(R\); the false discovery proportion is

\[FDP = V/R\]

when \(R > 0\), and \(FDP = 0\) otherwise. The FDR is

\[FDR = E[FDP]\]

The family-wise error rate (FWER) is connected to this by

\[FWER = P(V \geq 1)\]

The BH procedure controls the FDR under independence. It turns out to be very challenging to tightly control the FDR under strong dependence.

Theorem (BH procedure)

  • sort all p-values \(p_{(1)} \leq p_{(2)} \leq ... \leq p_{(d)}\)
  • define the threshold \(t_i = i\alpha/d\)
  • find the largest \(i\) such that \(p_{(i)} \leq t_i\):
\[i_{max} = \max \{ i \,|\, p_{(i)} \leq t_i \}\]

Reject all nulls up to and including \(i_{max}\)
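
A direct numpy sketch of the procedure (the p-values are made up for illustration):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of hypotheses rejected by BH."""
    p = np.asarray(pvals)
    d = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, d + 1) / d  # t_i = i * alpha / d
    below = p[order] <= thresholds
    reject = np.zeros(d, dtype=bool)
    if below.any():
        i_max = np.nonzero(below)[0].max()   # largest i with p_(i) <= t_i
        reject[order[: i_max + 1]] = True    # reject all nulls up to i_max
    return reject

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6]))
```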

Lemma (FWER vs FDR) Under the global null, FDR control is equivalent to FWER control, and controlling the FWER always controls the FDR

\[FDR \leq FWER\]

2.3. Specific Tests

The following famous tests can be derived using the previous approaches.

2.3.1. T-tests

Test (two-sample t-test) Let \(X_1, ..., X_n\) be a random sample from a \(n(\mu_X, \sigma_X^2)\) population, and \(Y_1, ..., Y_m\) an independent sample from a \(n(\mu_Y, \sigma_Y^2)\) population. We are interested in testing

\[H_0: \mu_X = \mu_Y \text{ vs } H_1 :\mu_X \neq \mu_Y\]

with the assumption that \(\sigma_X^2 = \sigma_Y^2 = \sigma^2\)

The LRT for these hypotheses is based on the following statistic: under \(H_0\),

\[T = \frac{\bar{X} - \bar{Y}}{\sqrt{S^2_p(\frac{1}{n} + \frac{1}{m})}} \sim t_{n+m-2}\]

where

\[S_p^2 = \frac{1}{n+m-2} (\sum_i (X_i - \bar{X})^2 + \sum_i (Y_i - \bar{Y})^2)\]

Check exercise 8.41 in C&B

Definition (Behrens–Fisher problem) It is not always reasonable to assume the variances are equal; testing equality of means without this assumption is called the Behrens–Fisher problem.

Test (Welch's t-test) Welch's t-test gives an approximate answer to the Behrens-Fisher problem.

When we do not assume \(\sigma_X^2 = \sigma_Y^2\), the statistic is

\[T = \frac{\bar{X} - \bar{Y}}{\sqrt{(\frac{S_X^2}{n} + \frac{S_Y^2}{m})}} \sim t_{\nu}\]

where \(S_X^2 = \frac{1}{n-1} \sum_i (X_i - \bar{X})^2, S_Y^2 = \frac{1}{m-1} \sum_i (Y_i - \bar{Y})^2\)

\[\nu = \frac{(\frac{S_X^2}{n} + \frac{S_Y^2}{m})^2}{\frac{S_X^4}{n^2(n-1)} + \frac{S_Y^4}{m^2(m-1)}}\]

Check exercise 8.42 in C&B
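
Both the pooled and the Welch statistics are available in scipy; a minimal sketch on simulated unequal-variance samples:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
x = rng.normal(0.0, 1.0, size=30)
y = rng.normal(0.5, 3.0, size=50)  # different mean and variance

print(ttest_ind(x, y, equal_var=True))   # pooled two-sample t-test
print(ttest_ind(x, y, equal_var=False))  # Welch's t-test
```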

2.3.2. F-tests

An F-test is any statistical test in which the test statistic has an F-distribution under the null hypothesis

2.3.3. Goodness of Fit

Test (G-test) The G-test is increasingly used in situations where the \(\chi^2\) test was previously used. The statistic is

\[G = 2 \sum_i O_i \log(\frac{O_i}{E_i})\]

The \(\chi^2\) test statistic is a Taylor approximation of the G statistic; the approximation fails when the sample size is small

Test (Pearson \(\chi^2\) test) Consider \(H_0: P=P_0\) vs \(H_1: P \neq P_0\), where \(P_0, P\) are multinomial distributions with \(k\) categories and \(P_0\) has probabilities \((p_1, ..., p_k)\) with \(\sum_i p_i=1\). Given a sample \(X_1, ..., X_n\), we reduce it to observed counts \(O_1, ..., O_k\) and expected counts \(E_1, ..., E_k\); under \(H_0\),

\[T(X_1, ..., X_n) = \sum^k_{i=1} \frac{(O_i - E_i)^2}{E_i} \sim \chi^2_{k-1}\]

Note it can also be used to test independence.

Even if the distribution is not multinomial, we can reduce it to a multinomial distribution by binning and then apply this test.

Test (Kolmogorov–Smirnov test) Tests the difference between the true (hypothesized) CDF \(F(x)\) and the empirical CDF \(\hat{F}_n(x)\)

\[\hat{T}_{KS} = \sup_x | \hat{F}_n(x) - F(x) |\]

By Glivenko-Cantelli, this statistic is small under the null when \(n\) is large enough. This can be used, for example, for normality testing.
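
For instance, a quick check against a fully specified \(N(0,1)\) with scipy (note the classical KS distribution assumes \(F\) is not estimated from the same data):

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(10)
x = rng.normal(0.0, 1.0, size=500)
print(kstest(x, "norm"))  # KS statistic and p-value against the N(0,1) CDF
```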

Test (Cramer-von Mises) An alternative is the Cramer-von Mises test, whose statistic is

\[ \hat{T}_{CvM} = \int_x (\hat{F}_n(x) - F(x))^2 dF\]

2.3.4. Rank Tests

2.3.5. Permutation Test

Test (2-sample permutation test) To decide whether two samples \(X_1, ..., X_n\) and \(Y_1, ..., Y_m\) are from the same distribution, define \(N=m+n\) and consider all \(N!\) permutations of the pooled data. Computing a test statistic \(T\) on each permutation yields \(T_1, ..., T_{N!}\). Under the null hypothesis, these have the same distribution as the observed statistic \(T_{obs}\), so we can evaluate the p-value as

\[\text{p-value}= \frac{1}{N!} \sum_i 1(T_i > T_{obs})\]

In practice, \(N!\) is too large to enumerate, so we randomly permute the data and compute the statistic on each random permutation.
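
A minimal sketch using the difference of means as the statistic and a two-sided p-value (the number of resamples B is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(0.0, 1.0, size=40)
y = rng.normal(0.7, 1.0, size=40)

pooled = np.concatenate([x, y])
t_obs = x.mean() - y.mean()

B = 10_000
t_perm = np.empty(B)
for b in range(B):
    perm = rng.permutation(pooled)  # relabel the pooled data
    t_perm[b] = perm[:len(x)].mean() - perm[len(x):].mean()

print(np.mean(np.abs(t_perm) >= np.abs(t_obs)))  # approximate p-value
```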

3. Interval Estimation

Interval estimation, or more generally, set estimation, is an inference which states that \(\theta \in C(\mathbf{x})\) where \(C\) is determined by the observed value \(\mathbf{x}\).

Definition (interval estimate) An interval estimate of a parameter \(\theta\) is any pair of functions, \(L(x_1, ..., x_n)\) and \(U(x_1, ..., x_n)\) of a sample that satisfy \(L(\mathbf{x}) \leq U(\mathbf{x})\) for all \(\mathbf{x}\). The following random interval is called an interval estimator

\[[L(\mathbf{X}), U(\mathbf{X})]\]

Definition (coverage probability) For an interval estimator \([L(X), U(X)]\), the coverage probability is the probability that the random interval covers the true parameter \(\theta\)

\[P_\theta( \theta \in [L(X), U(X)])\]

Note that this is actually a probability statement about the random variable \(X\), not about \(\theta\). It is equivalent to

\[P_\theta( L(X) \leq \theta, U(X) \geq \theta)\]

The coverage probability may depend on \(\theta\); the guaranteed confidence is the following:

Definition (confidence coefficient) The confidence coefficient of an interval estimator is

\[\inf_\theta P_\theta (\theta \in [L(X), U(X)])\]

Definition (confidence interval) Interval estimators, together with their confidence coefficient, are known as confidence intervals.

3.1. Methods of Finding Interval Estimators

3.1.1. Inverting a Test Statistic

Theorem (hypothesis test and confidence sets) For each \(\theta_0 \in \Theta\), let \(A(\theta_0)\) be the acceptance region of a level \(\alpha\) test of \(H_0: \theta = \theta_0\). For each \(x\), define a set \(C(x)\) in the parameter space by

\[C(x) = \{ \theta_0 | x \in A(\theta_0) \}\]

Then the random set \(C(X)\) is a \(1-\alpha\) confidence set. Conversely, let \(C(X)\) be a \(1-\alpha\) confidence set. For any \(\theta_0 \in \Theta\), define

\[A(\theta_0) = \{ x | \theta_0 \in C(x) \}\]

Then \(A(\theta_0)\) is the acceptance region of a level \(\alpha\) test of \(H_0: \theta = \theta_0\)

3.1.2. Pivotal Quantities

Definition (pivotal quantity) Let \((X_i)_{i=1}^{n}\) be a random sample from a distribution with parameter \(\theta\) that is to be estimated. The random variable \(Q\) is said to be a pivotal quantity iff:

1) It is a function of \((X_i)_{i=1}^{n}\) and the unknown parameter \(\theta\), but does not depend on any other parameters

\[Q = Q(X_1, ..., X_n, \theta)\]

2) The probability distribution of \(Q\) does not depend on \(\theta\) or any other unknown parameters

The steps in the pivotal method for finding confidence interval are:

  • Find a pivotal quantity \(Q(X_1, ..., X_n, \theta)\)
  • Find an interval for \(Q\) such that \(P(q_l \leq Q \leq q_h) = 1 - \alpha\)
  • Use algebraic manipulation to obtain \(P(\hat{\Theta}_l \leq \theta \leq \hat{\Theta}_h ) = 1-\alpha\)

normal mean interval estimation with known variance

Let \(X_1, ..., X_n\) be a random sample from \(N(\theta, 1)\), we know that

\[\bar{X} \sim N(\theta, 1/n)\]

and

\[Q = \sqrt{n}(\bar{X}-\theta) \sim N(0, 1)\]

therefore \(Q\) is a pivotal quantity and can be used to find a confidence interval; at the 95% level,

\[P(-1.96 \leq \sqrt{n}(\bar{X}-\theta) \leq 1.96) = 0.95\]
\[P(\bar{X} - \frac{1.96}{\sqrt{n}} \leq \theta \leq \bar{X} + \frac{1.96}{\sqrt{n}}) = 0.95\]
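
The same manipulation in code, for a generic level (1.96 generalizes to the normal quantile \(z_{\alpha/2}\)); the true θ below is arbitrary:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(9)
x = rng.normal(1.3, 1.0, size=100)  # sample with true theta = 1.3

alpha = 0.05
z = norm.ppf(1 - alpha / 2)         # ~1.96
half = z / np.sqrt(len(x))          # sigma = 1 is known here
print(x.mean() - half, x.mean() + half)  # 95% CI for theta
```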

4. Reference

[0] CMU 36705 Intermediate Statistics notes

[1] H. Pishro-Nik, "Introduction to probability, statistics, and random processes", available at https://www.probabilitycourse.com, Kappa Research LLC, 2014.

[2] Wasserman, Larry. All of statistics: a concise course in statistical inference. Springer Science & Business Media, 2013.

[3] Casella, George, and Roger L. Berger. Statistical inference. Vol. 2. Pacific Grove, CA: Duxbury, 2002.

[4] Hogg, Robert V., Joseph McKean, and Allen T. Craig. Introduction to mathematical statistics. Pearson Education, 2005.