The basic problem of statistical inference is the inverse of probability: given the outcomes, what can we say about the process that generated the data?

**Frequentist Inference**: the unknown quantity is assumed to be a fixed quantity

**Bayesian Inference**: the unknown quantity is assumed to be a random variable


## Sample, Statistic and Statistical Model

A sample is a sequence of iid random variables; a statistic is a function of the sample.

### Basic Concepts

**Definition (random sample)** The collection of random variables $X_1, X_2, …, X_n$ is said to be a *random sample* of size $n$ if they are iid. The joint pdf or pmf of $X_1, …, X_n$ is given by

$$f(x_1, …, x_n) = f(x_1) f(x_2) … f(x_n) = \prod_{i=1}^n f(x_i)$$

**Definition (statistic, sampling distribution)** The random variable $Y=T(X_1, …, X_n)$ defined over a random sample $X_1, …, X_n$ and a real-valued function $T$ is called a *statistic*; the distribution of a statistic $Y$ is called the *sampling distribution* of $Y$

Note that a statistic is itself a random variable!

**Lemma (linearity of expectation and variance)** Let $X_1, …, X_n$ be a random sample and $g(x)$ be a function such that $E(g(X_1))$ and $\mathrm{Var}(g(X_1))$ exist. Then

$$E(\sum_{i=1}^n g(X_i)) = n E(g(X_1)) $$

$$\mathrm{Var}(\sum_{i=1}^{n} g(X_i)) = n (\mathrm{Var}(g(X_1)))$$

Note: the second identity is proved using the zero covariance between iid variables
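As a quick sanity check (an illustrative sketch, not from the text), the lemma can be verified by Monte Carlo with $g(x) = x^2$ on standard normal samples:

```python
import numpy as np

# Monte Carlo check of the lemma with g(x) = x^2 and X_i ~ N(0, 1), n = 5.
# For a standard normal, E(X^2) = 1 and Var(X^2) = E(X^4) - E(X^2)^2 = 3 - 1 = 2,
# so the sum should have mean n * 1 = 5 and variance n * 2 = 10.
rng = np.random.default_rng(0)
n, reps = 5, 200_000
x = rng.normal(size=(reps, n))
sums = (x ** 2).sum(axis=1)          # sum_i g(X_i), one value per replication

mean_hat = sums.mean()               # should be close to n * E(g(X_1)) = 5
var_hat = sums.var()                 # should be close to n * Var(g(X_1)) = 10
```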

**Definition (statistical model)** A statistical model is a set of distributions.

**Definition (parametric, nonparametric)** A parametric model is a statistical model that can be parameterized by a finite number of parameters (e.g. the family of normal distributions). A nonparametric model is one that cannot be.

### Sample Mean, Variance

**Definition (sample mean)** The *sample mean* is the statistic defined by

$$\bar{X} = \frac{X_1 + … + X_n}{n}$$

**Lemma (mean, variance of sample mean)** Let $X_1, …, X_n$ be a random sample with mean $\mu$ and variance $\sigma^2 < \infty$. Then

$$E\bar{X} = \mu$$

$$\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}$$

Note the contrast with the Cauchy distribution: if $X_1, …, X_n$ are iid $\mathrm{Cauchy}(\mu, \sigma)$, then $\bar{X}$ is again $\mathrm{Cauchy}(\mu, \sigma)$, so its dispersion (measured by the scale $\sigma$) does not shrink as $n$ grows.
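A short simulation (illustrative constants) makes the Cauchy point concrete: the interquartile range of $\bar{X}$ stays the same as that of a single observation, no matter how large $n$ is:

```python
import numpy as np

# The sample mean of n iid standard Cauchy draws is again standard Cauchy,
# so its interquartile range stays at 2 (quartiles at -1 and +1) for any n.
rng = np.random.default_rng(1)
reps, n = 100_000, 50
x = rng.standard_cauchy(size=(reps, n))

q75, q25 = np.percentile(x[:, 0], [75, 25])          # spread of one draw
iqr_single = q75 - q25
q75, q25 = np.percentile(x.mean(axis=1), [75, 25])   # spread of the mean of 50
iqr_mean = q75 - q25
```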

**Definition (sample variance)** The *sample variance* is the statistic defined by

$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$$

**Lemma (mean, variance of sample variance)**

$$ES^2 = \sigma^2$$

$$\mathrm{Var}(S^2) = \frac{1}{n} \left(\theta_4 - \frac{n-3}{n-1}\theta_2^2\right)$$

where $\theta_i$ is the $i$-th central moment.
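A minimal simulation (illustrative constants) showing why the $n-1$ denominator matters: the $(n-1)$-version is unbiased for $\sigma^2$, while dividing by $n$ underestimates it by a factor $(n-1)/n$:

```python
import numpy as np

# Compare E(S^2) for the (n-1)- and n-denominator variance estimators
# on N(0, 4) samples of size n = 5: expect ~4.0 and ~4.0 * 4/5 = 3.2.
rng = np.random.default_rng(2)
reps, n = 200_000, 5
x = rng.normal(0.0, 2.0, size=(reps, n))
mean_unbiased = x.var(axis=1, ddof=1).mean()  # divide by n-1
mean_biased = x.var(axis=1, ddof=0).mean()    # divide by n
```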

In the case of normal distribution, the sample mean and variance have several good properties as follows:

**Proposition (sample mean, variance of normal distribution)**

Let $X_1, …, X_n$ be a random sample from a $n(\mu, \sigma^2)$ distribution. Then

- $\bar{X}$ and $S^2$ are independent
- $\bar{X}$ has a $n(\mu, \sigma^2/n)$ distribution
- $(n-1)S^2/\sigma^2$ has a chi-squared distribution with $n-1$ degrees of freedom

In general, $\bar{X}$ and $S^2$ are not independent: $\mathrm{Cov}(\bar{X}, S^2) = \theta_3/n$, so a vanishing third central moment is needed even for them to be uncorrelated (full independence in fact characterizes the normal distribution).
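These properties can be checked by simulation (a sketch with arbitrary constants): $(n-1)S^2/\sigma^2$ should have mean $n-1$ and variance $2(n-1)$, and $\bar{X}$ should be uncorrelated with $S^2$:

```python
import numpy as np

# For N(1, 4) samples of size n = 6, Q = (n-1) S^2 / sigma^2 ~ chi^2_5:
# mean 5, variance 10; and corr(X-bar, S^2) should be ~0 (independence).
rng = np.random.default_rng(3)
reps, n, sigma = 100_000, 6, 2.0
x = rng.normal(1.0, sigma, size=(reps, n))
xbar = x.mean(axis=1)
q = (n - 1) * x.var(axis=1, ddof=1) / sigma**2
corr = np.corrcoef(xbar, q)[0, 1]
```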

### Sample Order

**Definition (order statistics)** The *order statistics* of a random sample $X_1, …, X_n$ are the sample values placed in ascending order, denoted by $X_{(1)}, …, X_{(n)}$.

Some common related statistics are the *sample range* $R=X_{(n)}-X_{(1)}$, the *sample median*, and the *lower* and *upper quartiles*.

**Proposition (discrete order statistics)** Let $X_1, …, X_n$ be a random sample from a discrete distribution with pmf $f_X(x_i) = p_i$, where $x_1 < x_2 < …$ are the possible values of $X$ in ascending order. Let $P_i = \sum_{k=1}^i p_k$; then the cdf of the order statistics is

$$P(X_{(j)} \leq x_i) = \sum_{k=j}^n \binom{n}{k} P_i^k (1-P_i)^{(n-k)}$$

and the pmf is

$$P(X_{(j)} = x_i) = \sum_{k=j}^n \binom{n}{k} \left[ P_i^k (1-P_i)^{n-k} - P_{i-1}^k (1-P_{i-1})^{n-k}\right]$$

**Proposition (continuous order statistics)** Consider a continuous population with cdf $F_X(x)$ and pdf $f_X(x)$. Then the cdf of $X_{(j)}$ is

$$F_{X_{(j)}} (x) = \sum_{k=j}^n \binom{n}{k} [F_X(x)]^k [1-F_X(x)]^{n-k}$$

Differentiating gives the pdf

$$f_{X_{(j)}}(x) = \frac{n!}{(j-1)!(n-j)!} f_X(x) [F_X(x)]^{j-1} [1-F_X(x)]^{n-j}$$
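For a uniform population this pdf reduces to a Beta density, which gives an easy check (illustrative sketch): with $F_X(x)=x$ and $f_X(x)=1$ on $(0,1)$, $X_{(j)} \sim \mathrm{Beta}(j, n-j+1)$ with mean $j/(n+1)$:

```python
import numpy as np

# For Uniform(0,1) samples of size n = 5, the 2nd order statistic is
# Beta(2, 4), whose mean is 2 / (5 + 1) = 1/3. Verify by simulation.
rng = np.random.default_rng(4)
reps, n, j = 200_000, 5, 2
u = np.sort(rng.uniform(size=(reps, n)), axis=1)
emp_mean = u[:, j - 1].mean()        # empirical mean of X_(2)
```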

## Data Reduction

Data reduction in terms of a particular statistic $T$ can be thought of as a partition of the sample space into $A_t = \{ x \mid T(x)=t \}$. The partition should not discard important information about the unknown parameter $\theta$. Instead of reporting $x$, we report only $T(x)$ (whose size is much smaller than that of $x$).

### The Sufficiency Principle

A sufficient statistic for a parameter $\theta$ is a statistic that captures all the information about $\theta$ in the sample.

**Principle (sufficiency)** Consider an experiment $E=(X, \theta, f(x|\theta))$ and suppose $T(X)$ is a sufficient statistic for $\theta$. If $x, y$ are sample points satisfying $T(x)=T(y)$, then the conclusions drawn from $x$ and $y$ should be identical; any two samples in the same partition set lead to the same inference.

**Definition (sufficient statistic)** A statistic $T(\mathbf{X})$ is a sufficient statistic for $\theta$ if the conditional distribution of the sample $\mathbf{X}$ given the value of $T(\mathbf{X})$ does not depend on $\theta$

If $\mathbf{x}, \mathbf{y}$ are two sample points such that $T(\mathbf{x}) = T(\mathbf{y})$, then the inference about $\theta$ should be the same whether $\mathbf{X}=\mathbf{x}$ or $\mathbf{X}=\mathbf{y}$ is observed

It turns out that outside of the exponential family of distributions, it is rare to have a sufficient statistic of smaller dimension than the size of the sample.

**Criterion (Fisher-Neyman factorization)** Let $f(\mathbf{x}|\theta)$ denote the joint pdf of a sample $\mathbf{X}$. A statistic $T(\mathbf{X})$ is a sufficient statistic for $\theta$ iff there exist functions $g$ and $h$ such that

$$f(\mathbf{x}|\theta) = g(T(\mathbf{x}) | \theta) h(\mathbf{x})$$
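As a standard illustration (not worked in the text): for a Bernoulli($\theta$) random sample the factorization is immediate with $h(\mathbf{x}) \equiv 1$, showing that $T(\mathbf{X})=\sum_i X_i$ is sufficient:

```latex
f(\mathbf{x}|\theta) = \prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i}
 = \underbrace{\theta^{t}(1-\theta)^{n-t}}_{g(T(\mathbf{x})\,|\,\theta)}
   \cdot \underbrace{1}_{h(\mathbf{x})},
\qquad t = \sum_{i=1}^n x_i
```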

**Criterion (condition of sufficient statistics)** If the ratio $\frac{f(\mathbf{x}|\theta)}{q(T(\mathbf{x})|\theta)}$ is constant as a function of $\theta$, where $q$ is the pmf or pdf of $T(\mathbf{X})$, then $T(\mathbf{X})$ is a sufficient statistic for $\theta$

A sufficient statistic can be interpreted through the notion of a sufficient partition.

**Definition (sufficient partition)** A partition $B_1, …, B_k$ is called a sufficient partition if $f(x \mid X \in B_i)$ does not depend on $\theta$.

A statistic $T$ induces a partition, and $T$ is sufficient iff its partition is sufficient. Any refinement of a sufficient partition is again sufficient (and likewise the corresponding statistic).

There can be many sufficient statistics; the one achieving the most data reduction is the *minimal* sufficient statistic.

Its partition corresponds to the coarsest sufficient partition: any strictly coarser partition would lose sufficiency (the conditional distribution would depend on $\theta$).

**Definition (minimal sufficient statistic)** A sufficient statistic $T(\mathbf{X})$ is called a minimal sufficient statistic if, for any other sufficient statistic $T'(\mathbf{X})$, $T(\mathbf{x})$ is a function of $T'(\mathbf{x})$

**Criterion (condition of minimality)** $T(\mathbf{X})$ is a minimal sufficient statistic for $\theta$ when

the ratio $f(\mathbf{x}|\theta)/f(\mathbf{y}|\theta)$ is constant as a function of $\theta$ iff $T(\mathbf{x})=T(\mathbf{y})$

### The Likelihood Principle

**Principle (likelihood)** If $x, y$ are two sample points whose likelihood functions are proportional, then the conclusions drawn from $x$ and $y$ should be identical

**Definition (likelihood function)** Let $f(x|\theta)$ denote the joint pdf or pmf of the sample $X$. Given that $X=x$ is observed, the function of $\theta$ defined as follows is called the likelihood function

$$L(\theta|x) = f(x|\theta)$$

## Point Estimation

Point estimation consists of two parts: how to find point estimators and how to evaluate them

### Introduction

**Definition (point estimator)** A point estimator is any function $W(X_1, X_2, …, X_n)$ of a sample; that is, any statistic is a point estimator.

An *estimator* is a function of the sample; an *estimate* is the realized value of the estimator. There are often natural candidates for point estimators, but they do not always behave as our intuition suggests

### Methods of Finding Estimators

#### Method of Moments

Equate the sample moments to the corresponding population moments and solve the resulting system of equations for the parameters.

The method is perhaps the oldest way of finding point estimators, and it is a good starting place when other methods prove intractable. It can also be applied to obtain approximations to the distributions of statistics (e.g. the Satterthwaite approximation).
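As a sketch (the parameter names are illustrative): for a Gamma population with shape $a$ and scale $b$, matching the first two moments $EX = ab$ and $\mathrm{Var}\,X = ab^2$ gives $\hat{b} = \hat{\sigma}^2/\bar{x}$ and $\hat{a} = \bar{x}^2/\hat{\sigma}^2$:

```python
import numpy as np

# Method-of-moments estimates for a Gamma(shape=a, scale=b) population.
# Matching mean = a*b and variance = a*b^2 yields b_hat = var/mean and
# a_hat = mean^2/var. Check on simulated data with a = 3, b = 2.
rng = np.random.default_rng(5)
a_true, b_true = 3.0, 2.0
x = rng.gamma(a_true, b_true, size=200_000)
m1 = x.mean()               # first sample moment
v = x.var()                 # second central sample moment
b_hat = v / m1
a_hat = m1**2 / v
```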

#### Maximum Likelihood Estimators

**Definition (maximum likelihood estimator)** The maximum likelihood estimator $\hat{\theta}(\mathbf{X})$ is the value of $\theta$ that maximizes $L(\theta | \mathbf{X})$

The drawbacks of MLE are

- finding and verifying the global maximum can be difficult
- numerical sensitivity (prone to overfitting, i.e. high variance)

How can these two problems be addressed?

- The first can be attacked with numerical optimization methods
- For the second, a Bayesian approach (regularization via a prior) can help, as can collecting more data

For example, in a Markov language model the MLE suffers from the zero-count issue, so smoothing is necessary. Smoothing has a Bayesian interpretation: Laplace smoothing corresponds to a Bayesian estimate under a uniform Dirichlet prior.

The advantages of MLE are

- the likelihood can often be maximized numerically whenever it can be written down
- the invariance property: the MLE of $\tau(\theta)$ is $\tau(\hat{\theta}_{ML})$

**Proposition (properties of MLE)**

- $\hat{\Theta}_{ML}$ is consistent
- $\hat{\Theta}_{ML}$ is asymptotically unbiased
- $\hat{\Theta}_{ML}$ is asymptotically normal

#### Bayes Estimators

In the classical approach, the parameter $\theta$ is thought to be an unknown but fixed quantity. In the Bayesian approach, $\theta$ is considered to be a random variable with a prior distribution $\pi(\theta)$, a subjective distribution reflecting the experimenter's belief.

The prior is updated into the posterior distribution $\pi(\theta|x)$ based on the observed sample

$$\pi(\theta|x) =\frac{f(x|\theta) \pi(\theta)}{\int f(x|\theta)\pi(\theta) d\theta}$$

There is more than one way to obtain a Bayes estimator from the posterior. For example, one can take the posterior mean

$$\hat{\theta} = E[\theta|x]$$

**Definition (MAP)** Another is to take the posterior mode, known as the **maximum a posteriori probability** (**MAP**) **estimate**

$$\hat{\theta} = \mathrm{argmax}_\theta \pi(\theta|x)$$
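A small worked example (the prior and data are made up): with a conjugate Beta prior and Bernoulli data, both Bayes estimators above have closed forms:

```python
# Assumed setup: Beta(2, 2) prior on theta, then k = 14 successes in
# n = 20 Bernoulli trials; the posterior is Beta(2 + k, 2 + n - k).
n, k = 20, 14
a, b = 2 + k, 2 + (n - k)          # posterior parameters
post_mean = a / (a + b)            # posterior-mean estimator
post_map = (a - 1) / (a + b - 2)   # posterior mode (MAP), needs a, b > 1
```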

### Algorithms of Finding estimators

In some cases the MLE can be found analytically, but more often it must be found by numerical methods

#### Newton-Raphson
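A minimal sketch of the idea (toy data; a Poisson model is chosen because the answer is known in closed form, $\hat\lambda = \bar{x}$): iterate $\lambda \leftarrow \lambda - \ell'(\lambda)/\ell''(\lambda)$ on the log-likelihood:

```python
# Newton-Raphson on the score (derivative of the log-likelihood).
# For Poisson counts, log L(lam) = s*log(lam) - n*lam + const, so the
# iteration should converge to the known MLE s / n.
data = [2, 4, 3, 5, 1, 3]          # made-up counts
n, s = len(data), sum(data)

lam = 1.0                           # starting value
for _ in range(50):
    score = s / lam - n             # first derivative of log-likelihood
    hess = -s / lam**2              # second derivative
    lam -= score / hess             # Newton-Raphson step
# lam converges to the MLE s / n = 3.0
```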

#### The EM algorithm

EM is an iterative algorithm that increases the likelihood (or posterior) at every step; it is guaranteed to converge to a stationary point, typically a local maximum, of the MLE or MAP objective
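A compact sketch of EM for a two-component 1-D Gaussian mixture (unit variances assumed known; all constants are illustrative):

```python
import numpy as np

# EM for a mixture w*N(mu1, 1) + (1-w)*N(mu2, 1): the E-step computes each
# point's responsibility for component 1, the M-step re-estimates w and the
# means as responsibility-weighted averages.
rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(-3, 1, 3000), rng.normal(3, 1, 7000)])

w, mu1, mu2 = 0.5, -1.0, 1.0        # initial guesses
for _ in range(200):
    # E-step: responsibility of component 1 for each point
    p1 = w * np.exp(-0.5 * (x - mu1) ** 2)
    p2 = (1 - w) * np.exp(-0.5 * (x - mu2) ** 2)
    r = p1 / (p1 + p2)
    # M-step: re-estimate the weight and means from the responsibilities
    w = r.mean()
    mu1 = (r * x).sum() / r.sum()
    mu2 = ((1 - r) * x).sum() / (1 - r).sum()
```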

### Methods of Evaluating Estimators

The general topic of evaluating statistical procedures belongs to the branch of statistics known as *decision theory*

#### Mean Squared Error

**Definition (mean squared error)** The mean squared error (MSE) of an estimator $W$ of a parameter $\theta$ is the function of $\theta$ defined by $E_{\theta}(W-\theta)^2$

$$E_{\theta} (W(X_1, …, X_n) - \theta)^2 = \int \cdots \int (W(x_1, …, x_n) - \theta)^2 f(x_1|\theta) \cdots f(x_n|\theta) \, dx_1 \cdots dx_n$$

**Lemma (bias-variance decomposition)** The mean squared error has a bias-variance decomposition as follows:

$$E_{\theta} (W-\theta)^2 = (E_\theta W - \theta)^2 + \mathrm{Var}_\theta(W)$$

In ML terms (Andrew Ng's lecture):

**Bias** is error from the algorithm/estimator itself; it corresponds to underfitting (high error on the training set). **Variance** is error from sensitivity to small fluctuations in the training set, which causes poor performance on the dev set; it corresponds to overfitting (high error on the test set).

**Definition (bias)** The bias of a point estimator $W$ of a parameter $\theta$ is the difference between the expected value of $W$ and $\theta$; that is

$$\mathrm{Bias}_{\theta}W = E_{\theta}W - \theta$$

An estimator whose bias is identically equal to 0 is called unbiased and satisfies $E_{\theta} W = \theta$ for all $\theta$
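A simulation sketch (normal data, made-up constants) showing that unbiasedness and low MSE can conflict: the biased $n$-denominator variance estimator has smaller MSE than the unbiased $(n-1)$-denominator one:

```python
import numpy as np

# For N(0, 1) samples of size n = 5, MSE(S^2 with n-1) = 2/(n-1) = 0.5,
# while the biased n-denominator version has MSE (2n-1)/n^2 = 0.36.
rng = np.random.default_rng(7)
reps, n, sigma2 = 200_000, 5, 1.0
x = rng.normal(size=(reps, n))
mse_unbiased = ((x.var(axis=1, ddof=1) - sigma2) ** 2).mean()
mse_biased = ((x.var(axis=1, ddof=0) - sigma2) ** 2).mean()
```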

#### Best Unbiased Estimator

**Definition** **(best unbiased estimator)** An estimator $W^*$ is a best unbiased estimator of $\tau(\theta)$ if it satisfies $E_\theta W^* = \tau(\theta)$ for all $\theta$ and, for any other estimator $W$ with $E_\theta W = \tau(\theta)$, we have $\mathrm{Var}_{\theta} W^* \leq \mathrm{Var}_{\theta} W$ for all $\theta$. $W^*$ is also called a uniform minimum variance unbiased estimator.

There can be infinitely many candidate estimators, so it may be hard to verify that a given one is best. The following inequality can certify that a candidate attains the minimum possible variance

**Theorem (Cramer-Rao lower bound)** Let $X_1, …, X_n$ be a sample with pdf $f(x|\theta)$, and let $W(X)=W(X_1, …, X_n)$ be any estimator satisfying

$$ \frac{d}{d\theta} E_\theta W(X) = \int \frac{\partial}{\partial \theta} [W(x)f(x|\theta)] dx$$

and

$$Var_\theta W(X) < \infty$$

then

$$\mathrm{Var}_\theta (W(X)) \geq \frac{\left(\frac{d}{d\theta} E_\theta W(X)\right)^2}{E_\theta\left(\left(\frac{\partial}{\partial \theta} \log f(X|\theta)\right)^2\right)}$$

**Definition (Fisher information)** The following quantity is called the *Fisher information*

$$E_\theta\left(\left(\frac{\partial}{\partial \theta} \log f(X|\theta)\right)^2\right)$$

Larger Fisher information indicates more information about $\theta$ in the sample, and hence a smaller variance bound for the best unbiased estimator
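A quick check (sketch with arbitrary $p$ and $n$): for Bernoulli($p$), the Fisher information of one observation is $1/(p(1-p))$, so the bound for unbiased estimators of $p$ is $p(1-p)/n$, which the sample proportion attains:

```python
import numpy as np

# The sample proportion of n = 50 Bernoulli(0.3) trials should have variance
# equal to the Cramer-Rao bound p(1-p)/n = 0.0042.
rng = np.random.default_rng(8)
p, n, reps = 0.3, 50, 200_000
phat = rng.binomial(n, p, size=reps) / n
crlb = p * (1 - p) / n              # = 1 / (n * Fisher information)
emp_var = phat.var()
```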

**Theorem (Rao-Blackwell)** Let $W$ be an unbiased estimator for $\tau(\theta)$, and $T$ be a sufficient statistic for $\theta$. Then $\phi(T)$ defined below is a uniformly better (i.e. never worse) unbiased estimator of $\tau(\theta)$

$$\phi(T)=E(W|T)$$

### Decision Theory

Mean squared error is a special case of a *loss function*; the study of the performance of estimators using loss functions is a branch of decision theory

**Definition (risk function)** The quality of an estimator $\delta(X)$ is quantified by its risk function

$$R(\theta, \delta) = E_{\theta} L(\theta, \delta(X))$$

For squared error loss, the risk function is the mean squared error (MSE):

$$R(\theta, \delta) = \mathrm{Var}_\theta \delta(X) + (\mathrm{Bias}_\theta \delta(X))^2$$

## Hypothesis Testing

Like point estimation, hypothesis testing is another statistical inference method.

**Definition (hypothesis)** A *hypothesis* is a statement about a population parameter. The complementary hypotheses are called *null hypothesis* and *alternative hypothesis*, denoted by $H_0, H_1$ respectively.

The general form of a hypothesis about $\theta$ is $H_0: \theta \in \Theta_0$ and $H_1: \theta \in \Theta_0^c$

**Definition (hypothesis test)** A hypothesis test is a rule that specifies

- For which sample values the decision is made to accept $H_0$
- For which sample values $H_0$ is rejected and $H_1$ is accepted as true

Typically, a hypothesis test is specified in terms of a test statistic

### Methods of Finding Tests

#### Likelihood Ratio Tests

**Definition (likelihood ratio test statistic, LRT)** The *likelihood ratio test statistic* for testing $H_0 : \theta \in \Theta_0$ versus $H_1: \theta \in \Theta_0^c$ is

$$\lambda(x) = \frac{\sup_{\Theta_0} L(\theta | \mathbf{x})}{\sup_{\Theta} L(\theta | \mathbf{x})}$$

A *likelihood ratio test (LRT)* is any test that has a rejection region of the form $\{ \mathbf{x} : \lambda(\mathbf{x}) \leq c \}$ where $0 \leq c \leq 1$

**Theorem** If $T(\mathbf{X})$ is a sufficient statistic for $\theta$ and $\lambda^{*}(t), \lambda(\mathbf{x})$ are the LRT statistics based on $T,\mathbf{X}$, then $\lambda^{*}(T(\mathbf{x}))=\lambda(\mathbf{x})$ for every $\mathbf{x}$ in the sample space
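As a standard worked example (not in the text): testing $H_0: \theta=\theta_0$ versus $H_1: \theta \neq \theta_0$ for a normal mean with known $\sigma^2$, the supremum in the denominator is attained at $\hat\theta = \bar{x}$, and the LRT statistic simplifies to

```latex
\lambda(\mathbf{x})
 = \frac{L(\theta_0 \mid \mathbf{x})}{L(\bar{x} \mid \mathbf{x})}
 = \exp\!\left(-\frac{n(\bar{x}-\theta_0)^2}{2\sigma^2}\right)
```

so the rejection region $\{\lambda(\mathbf{x}) \leq c\}$ is equivalent to $\{|\bar{x}-\theta_0| \geq c'\}$ for some constant $c'$.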

### Methods of Evaluating Tests

**Definition (power function)** Suppose $R$ denotes the rejection region of a test. Then the power function of the test is the function of $\theta$ defined by

$$\beta(\theta) = P_\theta (X \in R)$$

Ideally, we would like $\beta(\theta)=0$ for $\theta \in \Theta_0$ (no Type I errors, i.e. false positives) and $\beta(\theta)=1$ for $\theta \in \Theta_0^c$ (no Type II errors, i.e. false negatives)

Typically, the power function of a test will depend on the sample size $n$, therefore by considering the power function, the experimenter can choose $n$ to achieve some test goal.

**Definition (size $\alpha$ test, level $\alpha$ test)** For $0 \leq \alpha \leq 1$, a test with power function $\beta(\theta)$ is called a size $\alpha$ test if $\sup_{\theta \in \Theta_0} \beta(\theta) = \alpha$. It is called a level $\alpha$ test if $\sup_{\theta \in \Theta_0} \beta(\theta) \leq \alpha$
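A small sketch (an assumed one-sided $z$-test setup with $\sigma = 1$): the power function has a closed form, and its supremum over $\Theta_0 = \{\mu \leq 0\}$ is attained at the boundary $\mu = 0$, which gives the size:

```python
from math import sqrt
from statistics import NormalDist

# One-sided z-test of H0: mu <= 0 rejecting when sqrt(n) * xbar > z_{1-alpha};
# its power function is beta(mu) = 1 - Phi(z_{1-alpha} - sqrt(n) * mu).
norm = NormalDist()
alpha, n = 0.05, 25
z_crit = norm.inv_cdf(1 - alpha)

def power(mu):
    return 1 - norm.cdf(z_crit - sqrt(n) * mu)

size = power(0.0)   # sup over Theta_0, attained at the boundary mu = 0
```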

The methods above yield only test statistics and the general form of rejection regions; they do not lead to one specific test. For example, the LRT does not specify $c$. Restricting attention to size $\alpha$ tests pins down the choice of $c$ within the class.

**Definition (unbiased test)** A test with power function $\beta(\theta)$ is called unbiased iff $\beta(\theta') \geq \beta(\theta'')$ for all $\theta' \in \Theta_0^c, \theta'' \in \Theta_0$

**Definition (uniformly most powerful)** Let $\mathcal{C}$ be a class of tests for testing $H_0: \theta \in \Theta_0$ versus $H_1: \theta \in \Theta_0^c$. A test in class $\mathcal{C}$, with power function $\beta(\theta)$, is a uniformly most powerful (UMP) class $\mathcal{C}$ test if $\beta(\theta) \geq \beta'(\theta)$ for every $\theta \in \Theta_0^c$ and every $\beta'(\theta)$ that is a power function of a test in the class.

**Theorem (Neyman-Pearson)** Consider testing $H_0: \theta = \theta_0$ vs $H_1: \theta = \theta_1$, using a test with rejection region $R$ that satisfies

$$f(x | \theta_1) > k f(x|\theta_0) \implies x \in R$$

$$f(x | \theta_1) < k f(x|\theta_0) \implies x \in R^c$$

for some $k \geq 0$ and

$$\alpha = P_{\theta_0}(X \in R)$$

(Sufficiency) Any test that satisfies these conditions is a UMP level $\alpha$ test

(Necessity) If there exists a test satisfying these conditions with $k > 0$, then every UMP level $\alpha$ test is a size $\alpha$ test

**Definition (p-value)** A p-value $p(\mathbf{X})$ is a test statistic satisfying $0 \leq p(\mathbf{x}) \leq 1$ for every sample point $\mathbf{x}$; small values of $p(\mathbf{x})$ give evidence that $H_1$ is true. A p-value is valid iff for every $\theta \in \Theta_0$ and every $0 \leq \alpha \leq 1$

$$P_{\theta}(p(X) \leq \alpha) \leq \alpha$$

## Interval Estimation

**Definition (interval estimation)** Let $X_1, X_2, …, X_n$ be a random sample from a distribution with a parameter $\theta$. An interval estimator with confidence level $1-\alpha$ consists of two estimators $\hat{\Theta}_l(X_1, …, X_n)$ and $\hat{\Theta}_h(X_1, …, X_n)$ such that

$$P(\hat{\Theta}_l(X_1, …, X_n) \leq \theta \leq \hat{\Theta}_h(X_1, …, X_n)) \geq 1 - \alpha$$

**Definition (pivotal quantity)** Let $(X_i)_{i=1}^{n}$ be a random sample from a distribution with parameter $\theta$ that is to be estimated. The random variable $Q$ is said to be a *pivotal quantity* iff:

- It is a function of $(X_i)_{i=1}^{n}$ and the unknown parameter $\theta$, but it does not depend on any other parameters
- The probability distribution of $Q$ does not depend on $\theta$ or any other unknown parameters
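A simulation sketch (illustrative constants, with $\sigma$ assumed known): since $Q = (\bar{X}-\mu)/(\sigma/\sqrt{n})$ is pivotal with a standard normal distribution whatever $\mu$ is, the interval $\bar{X} \pm 1.96\,\sigma/\sqrt{n}$ has about $95\%$ coverage:

```python
import numpy as np

# Coverage of the known-sigma normal interval X-bar +- 1.96 * sigma / sqrt(n),
# derived from the pivot (X-bar - mu) / (sigma / sqrt(n)) ~ N(0, 1).
rng = np.random.default_rng(9)
reps, n, mu, sigma = 100_000, 20, 5.0, 2.0
xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
half = 1.96 * sigma / np.sqrt(n)                      # interval half-width
coverage = ((xbar - half <= mu) & (mu <= xbar + half)).mean()
```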
