# 0x403 Parametric Models

- 1. Estimation
- 2. Hypothesis Testing
- 3. ANOVA
- 4. Regression Analysis
- 5. Variable Selections
- 6. Time Series Analysis
- 7. Causal Inference
- 8. Reference

## 1. Estimation

In general, a parameterization for the mean vector \(E(Y)\) consists of writing \(E(Y)\) as a function of some parameters \(\beta\), say \(E(Y) = f(\beta)\)

A linear model has the form \(Y = X\beta + e\) with \(E(e) = 0\), so that \(E(Y) = f(\beta) = X\beta\)

A parameterization is identifiable if knowing \(E(Y)\) (i.e. \(f(\beta)\)) tells you the parameter vector \(\beta\)

**Definition (identifiable)** the parameter \(\beta\) is identifiable if for any \(\beta_1, \beta_2\), \(f(\beta_1) = f(\beta_2)\) implies \(\beta_1 = \beta_2\). Moreover a vector-valued function \(g(\beta)\) is identifiable if \(f(\beta_1) = f(\beta_2)\) implies \(g(\beta_1) = g(\beta_2)\).

regression model

In a regression model with full rank \(r(X) = p\), \(X^TX\) is non-singular, so \(X\beta_1 = X\beta_2\) implies \(\beta_1 = \beta_2\) and identifiability holds

linear functions of parameters that are identifiable are called estimable

**Definition (estimable)** A vector-valued linear function of \(\beta\), say, \(\Lambda^T \beta\) is estimable if \(\Lambda^T \beta = P^T X\beta\) for some matrix \(P\)

**Proposition (estimable and linear unbiased estimator)** \(\lambda^T \beta\) is estimable iff it has a linear unbiased estimator: there exists \(\rho\) such that \(E(\rho^T Y) = \lambda^T \beta\) for all \(\beta\)
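
A quick numerical sketch of this (the design matrix and \(\lambda\) vectors below are illustrative, not from the text): \(\lambda^T\beta\) is estimable iff \(\lambda^T\) lies in the row space of \(X\), which we can check by comparing ranks.

```python
import numpy as np

# Rank-deficient design: every row is (1, 1), so r(X) = 1 and beta is not identifiable
X = np.array([[1.0, 1.0],
              [1.0, 1.0],
              [1.0, 1.0]])

def is_estimable(lam, X):
    # lambda^T beta is estimable iff lambda^T = rho^T X for some rho,
    # i.e. appending lambda as an extra row does not increase the rank of X
    return np.linalg.matrix_rank(np.vstack([X, lam])) == np.linalg.matrix_rank(X)

print(is_estimable(np.array([1.0, 1.0]), X))  # True: beta_1 + beta_2 is estimable
print(is_estimable(np.array([1.0, 0.0]), X))  # False: beta_1 alone is not
```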

### 1.1. Ordinary Least Square (OLS)

Consider the least square estimation over the standard linear model \(Y = X\beta + e\), \(E(e) = 0\), \(\text{Cov}(e) = \sigma^2 I\)

#### 1.1.1. Coefficient Estimation

we know \(E(Y) = X \beta\) where \(\beta\) is unknown, so we know \(E(Y)\) lies in the column space \(C(X)\)

**Definition (Least Square Estimate, LSE)** we want to take the vector in \(C(X)\) that is closest to \(Y\), i.e. \(\hat{\beta}\) minimizing \((Y - X\beta)^T(Y - X\beta)\). In such a case \(\hat{\beta}\) is called the least square estimate

For a vector \(\Lambda^T \beta\), a least square estimate is defined as \(\Lambda^T \hat{\beta}\)

The LSE in this simple case is also sometimes called the **OLS** or **Ordinary Least Square**

**Theorem (LSE and projection)** \(\hat{\beta}\) is a LSE of \(\beta\) iff \(X\hat{\beta} = MY\), where \(M\) is the perpendicular projection operator onto \(C(X)\)

Note the LSE does not need to be unique if the target is not identifiable

non-unique LSE

Consider the simple case where

Then \(\beta\) is not identifiable, LSE \(\hat{\beta}\) can be, for example, \((0, 1/4)^T\) or \((1/2, 0)^T\)

Recall the projection matrix can be written as \(M = X(X^TX)^g X^T\), where \((X^TX)^g\) is a generalized inverse. While the generalized inverse is not unique (it always exists though), the projection \(M\) is unique regardless of which generalized inverse is used. We can, for example, use the pseudo-inverse \(X^\dagger\), which gives \(M = XX^\dagger\)

**Corollary (LSE with pseudo-inverse)** \(\hat{\beta} = X^\dagger Y\), where \(X^\dagger\) is the pseudo-inverse, is "one" LSE of \(\beta\); there might be other LSEs.
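
A minimal numpy sketch of this (the rank-deficient design below is illustrative, not from the text): \(\hat{\beta} = X^\dagger Y\) is one LSE, the projection \(M = XX^\dagger\) is unique, and any LSE gives the same fitted values \(X\hat{\beta} = MY\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Rank-deficient design: the two columns are identical, so beta is not identifiable
X = np.column_stack([np.ones(5), np.ones(5)])
Y = rng.normal(size=5)

X_pinv = np.linalg.pinv(X)       # Moore-Penrose pseudo-inverse X^dagger
beta_hat = X_pinv @ Y            # one particular LSE of beta
M = X @ X_pinv                   # perpendicular projection onto C(X); unique

# Another LSE (shift mass between the two redundant coefficients) fits identically
beta_other = beta_hat + np.array([0.1, -0.1])
print(np.allclose(X @ beta_hat, M @ Y))     # True
print(np.allclose(X @ beta_other, M @ Y))   # True: same fitted values, different beta
```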

While LSE of \(\beta\) might not be unique, the LSE of identifiable/estimable functions are unique.

**Corollary (LSE and estimable function)** Recall \(\lambda^T \beta\) is estimable if \(\lambda^T = \rho^TX\), so we know the LSE \(\lambda^T\hat{\beta} = \rho^T X\hat{\beta} = \rho^T MY\) is unique. Also recall here that \(M\) is unique

**Proposition (LSE are unbiased)** LSE of estimable functions are unbiased: if \(\lambda^T = \rho^T X\), then \(E(\lambda^T\hat{\beta}) = E(\rho^T MY) = \rho^T MX\beta = \rho^T X\beta = \lambda^T\beta\)

if \(Y \sim N(X \beta, \sigma^2 I)\), we can show \(\lambda^T\hat{\beta} = \rho^T MY \sim N(\lambda^T\beta, \sigma^2\rho^T M\rho)\)

if \(X^TX\) is nonsingular, then \(\beta\) is estimable and \(\hat{\beta} = (X^TX)^{-1}X^TY\)

#### 1.1.2. Variance Estimation

It is obvious the residuals can be written as \(\hat{e} = Y - X\hat{\beta} = (I-M)Y\)

It is reasonable to use \((I-M)Y\) to estimate \(\sigma^2\)

**Theorem (unbiased estimate of variance)** Let \(r(X) = r\), \(Cov(e) = \sigma^2 I\), then \(\hat{\sigma}^2 = \frac{Y^T(I-M)Y}{n-r}\) is an unbiased estimate of \(\sigma^2\)

MLE of variance

The LSE estimates are unbiased, but the MLE of \(\sigma^2\), \(Y^T(I-M)Y/n\), is not

**Definition (SSE, sum of squared error)** The sum of squares of error is \(SSE = Y^T(I-M)Y\)

**Definition (MSE, mean square error)** defined as \(MSE = \frac{SSE}{n-r}\), where \(r\) is the rank of \(X\) (and \(M\)). \(n-r = r(I-M)\) is called the *degrees of freedom for error*, denoted by \(dfE\)

It is an unbiased estimate of \(\sigma^2\)

Suppose \(Y \sim N(X\beta, \sigma^2 I)\), the underlying distribution is \(\frac{SSE}{\sigma^2} = \frac{Y^T(I-M)Y}{\sigma^2} \sim \chi^2_{n-r}\)
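
A small simulation sketch (toy full-rank design and \(\sigma = 2\) chosen only for illustration): compute \(MSE = Y^T(I-M)Y/(n-r)\) and check that it is close to the true \(\sigma^2\).

```python
import numpy as np

rng = np.random.default_rng(1)

n, sigma = 200, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # toy full-rank design, r(X) = 2
beta = np.array([1.0, -0.5])
Y = X @ beta + rng.normal(scale=sigma, size=n)

M = X @ np.linalg.pinv(X)                    # perpendicular projection onto C(X)
r = np.linalg.matrix_rank(X)
SSE = Y @ (np.eye(n) - M) @ Y                # Y^T (I - M) Y
MSE = SSE / (n - r)                          # unbiased estimate of sigma^2
print(MSE)                                   # should be close to sigma^2 = 4
```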

#### 1.1.3. Best Linear Unbiased Estimator

**Definition (Best Linear Unbiased Estimate, BLUE)** \(a^TY\) is called a best linear unbiased estimate of \(\lambda^T \beta\) when \(a^T Y\) is unbiased and, for any other linear unbiased estimate \(b^T Y\), \(\text{Var}(a^TY) \leq \text{Var}(b^TY)\)

The Gauss-Markov theorem states that the LSE are BLUE in the standard linear model

**Theorem (Gauss-Markov)** Recall the standard linear model \(Y = X\beta + e\), \(E(e) = 0\), \(\text{Cov}(e) = \sigma^2 I\). If \(\lambda^T \beta\) is estimable, then the least squares estimate of \(\lambda^T \beta\) is a BLUE of \(\lambda^T \beta\)

#### 1.1.4. Minimum Variance Unbiased Estimate (MVUE)

Consider the model \(Y \sim N(X\beta, \sigma^2 I)\)

A vector-valued sufficient statistic \(T(Y)\) is said to be complete if \(E(h(T(Y))) =0\) for all \(\beta, \sigma^2\) implies \(P[h(T(Y)) = 0] = 1\) for all \(\beta, \sigma^2\)

**Theorem (Lehmann–Scheffé theorem, Minimum Variance Unbiased Estimate)** If \(T(Y)\) is a complete sufficient statistic, then \(f(T(Y))\) is a minimum variance unbiased estimate (MVUE) of \(E[f(T(Y))]\)

### 1.2. Generalized Least Square

A more general version of linear model assumes the following formulation: \(Y = X\beta + e\), \(E(e) = 0\), \(\text{Cov}(e) = \sigma^2 V\)

where \(V\) is some known PSD matrix rather than the identity

weighted least square

weighted least square is the special case in which \(V\) is a diagonal matrix

The least square here minimizes the following objective: \((Y - X\beta)^T V^{-1} (Y - X\beta)\)

\(\hat{\beta}\) is a generalized least square estimate iff it satisfies \(X\hat{\beta} = X(X^TV^{-1}X)^g X^TV^{-1}Y\)

This criterion is a generalized version of \(X \hat{\beta} = MY\) in OLS
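
A minimal numpy sketch of the weighted/generalized least squares estimate (the heteroscedastic variances below are made up for illustration): solve \((X^TV^{-1}X)\hat{\beta} = X^TV^{-1}Y\).

```python
import numpy as np

rng = np.random.default_rng(2)

n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([2.0, 1.0])
v = rng.uniform(0.5, 3.0, size=n)            # known per-observation variances
V = np.diag(v)                               # diagonal V: weighted least squares
Y = X @ beta + rng.normal(scale=np.sqrt(v))  # heteroscedastic errors

# GLS estimate: minimizer of (Y - Xb)^T V^{-1} (Y - Xb)
V_inv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ V_inv @ X, X.T @ V_inv @ Y)
print(beta_gls)                              # close to (2, 1)
```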

### 1.3. Biased Estimators

#### 1.3.1. Bias-Variance Tradeoff

Consider again the standard linear model \(Y = X\beta + e\), \(E(e) = 0\), \(\text{Cov}(e) = \sigma^2 I\)

The MSE of OLS can be measured by \(E\|X\hat{\beta} - X\beta\|^2 = E\|MY - X\beta\|^2 = \sigma^2 r(X)\)

It is an unbiased estimator, so \(\sigma^2 r(X)\) is pure variance with no bias term

Here consider a reduced model \(Y = X_0\gamma + e\) with \(C(X_0) \subset C(X)\), and let \(M_0\) be the perpendicular projection onto \(C(X_0)\)

The MSE of this model can be decomposed into \(E\|M_0Y - X\beta\|^2 = \|X\beta - M_0X\beta\|^2 + \sigma^2 r(X_0)\), where the first term is the (squared) bias and the second term is the variance

It is clear that the MSE improves in the reduced model when \(\|X\beta - M_0 X \beta \|^2 < \sigma^2(r(X) - r(X_0))\)
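
A Monte Carlo sketch of this tradeoff (toy design with a nearly-zero third coefficient; all values are illustrative): the reduced model drops a predictor whose contribution is smaller than the variance it costs.

```python
import numpy as np

rng = np.random.default_rng(3)

n, sigma, reps = 50, 1.0, 2000
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
X0 = X[:, :2]                        # reduced design: drop the last column
beta = np.array([1.0, 1.0, 0.05])    # last coefficient is nearly zero
mu = X @ beta                        # true mean E(Y) = X beta

M = X @ np.linalg.pinv(X)
M0 = X0 @ np.linalg.pinv(X0)

mse_full, mse_red = 0.0, 0.0
for _ in range(reps):
    Y = mu + rng.normal(scale=sigma, size=n)
    mse_full += np.sum((M @ Y - mu) ** 2) / reps   # ~ sigma^2 r(X) = 3
    mse_red += np.sum((M0 @ Y - mu) ** 2) / reps   # bias^2 + sigma^2 r(X0)

print(mse_full, mse_red)   # the reduced model typically wins in this setup
```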

#### 1.3.2. Ridge Regression

Ridge regression shrinks the regression coefficients by imposing a penalty on their size: \(\hat{\beta}^\text{ridge} = \arg\min_\beta \|Y - X\beta\|^2 + \lambda\|\beta\|^2\)

An equivalent way to write it is a constrained optimization problem: minimize \(\|Y - X\beta\|^2\) subject to \(\|\beta\|^2 \leq t\), where there is a 1-to-1 mapping between \(\lambda, t\)

The solution is \(\hat{\beta}^\text{ridge} = (X^TX + \lambda I)^{-1}X^TY\)

Using SVD of \(X = U\Sigma V^T\), we can compare the \(\hat{Y}^\text{LSE}\) and \(\hat{Y}^\text{ridge}\): \(\hat{Y}^\text{LSE} = X\hat{\beta}^\text{LSE} = UU^TY\), while \(\hat{Y}^\text{ridge} = X\hat{\beta}^\text{ridge} = \sum_i u_i \frac{d_i^2}{d_i^2 + \lambda}u_i^TY\), where \(d_i\) are singular values of \(X\)

Ridge regression computes the coordinates of \(Y\) with respect to the orthonormal basis \(U\). It then shrinks these coordinates by the factors \(\frac{d_i^2}{d_i^2 + \lambda}\)
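
A minimal numpy sketch of this shrinkage view (random toy data, no intercept or centering, \(\lambda = 10\) is arbitrary): the direct ridge solution and the SVD form give the same fitted values.

```python
import numpy as np

rng = np.random.default_rng(4)

n, p, lam = 100, 5, 10.0
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

# Direct ridge solution
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Same fitted values via the SVD shrinkage view
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrink = d**2 / (d**2 + lam)                    # shrinkage factor per component
Y_hat_svd = U @ (shrink * (U.T @ Y))

print(np.allclose(X @ beta_ridge, Y_hat_svd))   # True
```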

#### 1.3.3. Lasso

## 2. Hypothesis Testing

### 2.1. Model Testing

**Test (t-test)** Assume that \(Y \sim N(X \beta, \sigma^2 I)\), if \(X^TX\) is nonsingular (i.e. \(X\) is full rank), then we know the following sampling distributions of the estimators \(\hat{\beta}, \hat{\sigma}^2\): \(\hat{\beta} = (X^TX)^{-1}X^TY \sim N(\beta, \sigma^2(X^TX)^{-1})\) and \((n-p)\hat{\sigma}^2/\sigma^2 \sim \chi^2_{n-p}\)

To test a specific coefficient \(\beta_j = 0\), we form the statistic \(t = \frac{\hat{\beta}_j}{\hat{\sigma}\sqrt{[(X^TX)^{-1}]_{jj}}} \sim t_{n-p}\); since \(\sigma\) is unknown, we use the estimate \(\hat{\sigma}\) instead

If \(\sigma\) is known, the statistic follows a normal distribution instead

**Test (F-test)** To test a reduction of the model by a group of variables, consider a full model \(Y = X\beta + e\) and a reduced model \(Y = X_0\gamma + e\) with \(C(X_0) \subset C(X)\)

If the full model is true, then \(F = \frac{Y^T(M - M_0)Y/(r - r_0)}{Y^T(I-M)Y/(n-r)}\) follows a noncentral \(F(r - r_0, n - r)\) distribution, where \(r = r(X)\), \(r_0 = r(X_0)\) and \(M, M_0\) are the perpendicular projections onto \(C(X)\), \(C(X_0)\)

If the reduced model is true, the noncentrality is zero and \(F \sim F(r - r_0, n - r)\)

The last one provides a distribution for the test statistic under the null hypothesis (which is the reduced model), we reject \(H_0\) at level \(\alpha\) if \(F > F(1-\alpha, r - r_0, n - r)\)
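
A numpy/scipy sketch of this F-test (toy design where the reduced model is in fact true; all sizes illustrative): compute the F statistic from the projections \(M, M_0\) and its p-value from the central \(F\) distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

n = 80
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])    # full design
X0 = X[:, :2]                                                 # reduced design
Y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=n)   # reduced model is true

M = X @ np.linalg.pinv(X)
M0 = X0 @ np.linalg.pinv(X0)
r, r0 = np.linalg.matrix_rank(X), np.linalg.matrix_rank(X0)

F = (Y @ (M - M0) @ Y / (r - r0)) / (Y @ (np.eye(n) - M) @ Y / (n - r))
p_value = stats.f.sf(F, r - r0, n - r)       # P(F(r - r0, n - r) > F)
print(F, p_value)                            # H_0 should not be rejected here
```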

### 2.2. Linear Parametric Function Testing

## 3. ANOVA

Analysis of Variance (ANOVA) is not so much about analyzing variances but rather about analyzing variation in means

### 3.1. Oneway Analysis of Variance

**Definition (oneway ANOVA assumption)** Random variables \(Y_{i,j}\) are observed according to the model \(Y_{i,j} = \mu_i + \epsilon_{i,j}\), \(i = 1, \ldots, t\), \(j = 1, \ldots, N_i\)

where

- \(E \epsilon_{i,j} = 0, \text{Var}(\epsilon_{i,j}) = \sigma_i^2 = \sigma^2 < \infty\)
- \(\text{Cov}(\epsilon_{i,j}, \epsilon_{i', j'}) = 0\) for \((i,j) \neq (i',j')\)
- \(\epsilon_{i,j}\) are independent and normally distributed
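
A quick scipy sketch of the one-way ANOVA F-test under these assumptions (the group means and sizes below are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# Three groups with common variance; only the means (possibly) differ
groups = [rng.normal(loc=mu, scale=1.0, size=30) for mu in (0.0, 0.0, 0.8)]

# One-way ANOVA F-test of H_0: all group means are equal
F, p_value = stats.f_oneway(*groups)
print(F, p_value)
```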

### 3.2. Balanced Two-way ANOVA

The balanced two-way ANOVA without interaction model is generally written as \(y_{ijk} = \mu + \alpha_i + \eta_j + e_{ijk}\), \(i = 1, \ldots, a\), \(j = 1, \ldots, b\), \(k = 1, \ldots, N\)

## 4. Regression Analysis

### 4.1. Simple Linear Regression

Consider the simple model \(y_i = \beta_0 + \beta_1 x_i + e_i\)

where \(y_i, e_i\) are random variables, \(x_i\) are fixed values, and \(\beta_0, \beta_1\) are (fixed) parameters.

The solution is \(\hat{\beta}_1 = r_{x,y}\frac{s_y}{s_x}\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\),

where \(s_x, s_y\) are the sample standard deviations and \(r_{x,y}\) is the sample correlation coefficient
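
A minimal numpy sketch (simulated data with true intercept 2 and slope 3, purely illustrative) computing the solution from the sample moments:

```python
import numpy as np

rng = np.random.default_rng(7)

x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(size=100)

r_xy = np.corrcoef(x, y)[0, 1]                   # sample correlation coefficient
s_x, s_y = np.std(x, ddof=1), np.std(y, ddof=1)  # sample standard deviations
beta1 = r_xy * s_y / s_x                         # slope
beta0 = y.mean() - beta1 * x.mean()              # intercept
print(beta0, beta1)                              # close to (2, 3)
```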

### 4.2. Sparse Linear Models

See the SLS book and High Dimensional Statistics chapter 7

Consider the linear model \(y = X\theta^* + w\) with \(X \in \mathbb{R}^{n \times d}\),

where the number of predictors \(d\) is of the same order as the sample size \(n\). It is then necessary to impose additional structure on the regression vector \(\theta^*\)

**Definition (hard sparsity)** the support set \(S(\theta^*) = \{j : \theta^*_j \neq 0\}\) of \(\theta^*\) has cardinality \(s = |S(\theta^*)|\) substantially less than \(d\); i.e. the number of nonzero elements is at most \(s\)

**Definition (weakly sparse)** Weak sparsity relaxes hard sparsity by only requiring \(\theta^*\) to be approximately sparse, e.g. to lie in the \(\ell_q\) ball \(\mathbb{B}_q(R_q) = \{\theta \in \mathbb{R}^d : \sum_{j=1}^d |\theta_j|^q \leq R_q\}\) for some \(q \in [0, 1]\)

In the special case \(q=0\), the weak sparsity becomes the hard sparsity.

#### 4.2.1. Noiseless Inference

#### 4.2.2. Noisy Inference

### 4.3. Bayesian Estimation

In a full Bayesian treatment, the parameter \(w\) has a distribution, the conjugate prior can be given as follows (if the noise variance is known): \(p(w) = \mathcal{N}(w | m_0, S_0)\)

#### 4.3.1. Posterior Distribution

With the likelihood \(p(\mathbf{t} | w) = \prod_{n=1}^N \mathcal{N}(t_n | w^T\phi(x_n), \beta^{-1})\),

the posterior is then given by the linear Gaussian conditional model: \(p(w | \mathbf{t}) = \mathcal{N}(w | m_N, S_N)\)

where \(m_N = S_N(S_0^{-1}m_0 + \beta\Phi^T\mathbf{t})\) and \(S_N^{-1} = S_0^{-1} + \beta\Phi^T\Phi\)

simple case

Consider \(S_0 = \alpha^{-1}I\) (with \(m_0 = 0\)), then the posterior becomes \(m_N = \beta S_N\Phi^T\mathbf{t}\), \(S_N^{-1} = \alpha I + \beta\Phi^T\Phi\)

Maximizing this posterior wrt \(w\) is equivalent to the minimization of the sum-of-squares error with regularization \(\lambda = \alpha/\beta\)
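
A minimal numpy sketch of this posterior (the design \(\Phi\), the precisions \(\alpha, \beta\) and the true weights are all illustrative assumptions): compute \(S_N\) and \(m_N\) for the prior \(\mathcal{N}(0, \alpha^{-1}I)\).

```python
import numpy as np

rng = np.random.default_rng(8)

alpha, beta = 1.0, 25.0                   # prior precision and noise precision
Phi = np.column_stack([np.ones(30), rng.uniform(-1, 1, size=30)])   # design matrix
w_true = np.array([0.5, -0.3])
t = Phi @ w_true + rng.normal(scale=1 / np.sqrt(beta), size=30)

# Posterior N(w | m_N, S_N) with prior N(w | 0, alpha^{-1} I)
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t
print(m_N)     # posterior mean; equals the ridge solution with lambda = alpha/beta
```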

#### 4.3.2. Predictive Distribution

The predictive distribution becomes \(p(t | x, \mathbf{t}, \alpha, \beta) = \int p(t | x, w, \beta)\,p(w | \mathbf{t}, \alpha, \beta)\,dw\)

We can again think of it as a linear conditional model (i.e. \(p(y) = \int p(y|x)p(x)dx\)), and obtain the predictive distribution \(p(t | x, \mathbf{t}, \alpha, \beta) = \mathcal{N}(t | m_N^T\phi(x), \sigma_N^2(x))\)

where \(\sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^TS_N\phi(x)\)

#### 4.3.3. Model Evidence

In a full treatment, we introduce priors on the hyperparameters \(\alpha, \beta\) as well and marginalize wrt them too: \(p(t | \mathbf{t}) = \iiint p(t | w, \beta)\,p(w | \mathbf{t}, \alpha, \beta)\,p(\alpha, \beta | \mathbf{t})\,dw\,d\alpha\,d\beta\)

This integral might be intractable; **empirical Bayes** or **evidence approximation** provides a solution by setting \(\alpha, \beta\) to peaked values \(\hat{\alpha}, \hat{\beta}\)

where \(\hat{\alpha}, \hat{\beta}\) can be obtained by maximizing the marginal likelihood or evidence function \(p(\mathbf{t} | \alpha, \beta)\)

#### 4.3.4. Dual Representations

The regularized error function is \(J(w) = \frac{1}{2}\sum_{n=1}^N(w^T\phi(x_n) - t_n)^2 + \frac{\lambda}{2}w^Tw\)

Setting the derivative with respect to \(w\) to 0, we obtain \(w = \Phi^Ta\),

where \(a_n = -\frac{1}{\lambda}(w^T\phi(x_n) - t_n)\) and \(\Phi\) is the design matrix.

substituting this into the previous formula, we get the **dual representation** \(J(a) = \frac{1}{2}a^TKKa - a^TK\mathbf{t} + \frac{1}{2}\mathbf{t}^T\mathbf{t} + \frac{\lambda}{2}a^TKa\),

where \(K = \Phi\Phi^T\) is the Gram matrix (i.e. \(K_{nm} = \phi(x_n)^T \phi(x_m)\))

Solving this with respect to \(a\), we obtain \(a = (K + \lambda I_N)^{-1}\mathbf{t}\),

and for a new input \(x\), \(y(x) = k(x)^T(K + \lambda I_N)^{-1}\mathbf{t}\), where \(k(x)\) has elements \(k_n(x) = \phi(x_n)^T\phi(x)\)
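
A short numpy sketch of the dual form with a Gaussian kernel (the kernel, its width, \(\lambda\), and the sine target are assumptions for illustration): fit \(a = (K + \lambda I_N)^{-1}\mathbf{t}\) and predict with \(y(x) = k(x)^Ta\).

```python
import numpy as np

rng = np.random.default_rng(9)

def kernel(a, b, gamma=10.0):
    # Gaussian kernel k(a, b) = exp(-gamma (a - b)^2), computed pairwise
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

x_train = rng.uniform(0, 1, size=40)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.1, size=40)
lam = 0.1

K = kernel(x_train, x_train)                                  # Gram matrix
a = np.linalg.solve(K + lam * np.eye(len(x_train)), t_train)  # a = (K + lam I)^{-1} t

x_new = np.linspace(0, 1, 5)
y_new = kernel(x_new, x_train) @ a                            # y(x) = k(x)^T a
print(y_new)
```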

### 4.4. General Prediction Theory

By assuming \(X\) is also a random variable, we are interested in picking a predictor \(f(x)\) that minimizes the mean squared error wrt the joint distribution of \((x,y)\): \(E[(y - f(x))^2]\)

It turns out that the best predictor is the conditional expectation of \(y\) given \(x\)

**Theorem (best predictor)** let \(m(x) = E(y|x)\), then for any other predictor \(f(x)\), \(E[(y - m(x))^2] \leq E[(y - f(x))^2]\)

thus \(m(x)\) is the best predictor of \(y\)

proof of the best predictor

Consider the usual decomposition: \(E[(y - f(x))^2] = E[(y - m(x))^2] + E[(m(x) - f(x))^2] + 2E[(y - m(x))(m(x) - f(x))]\)

The first two terms are nonnegative and the last term is 0 because \(E[(y - m(x))(m(x) - f(x))] = E\big[(m(x) - f(x))\,E[y - m(x) \mid x]\big] = 0\)

### 4.5. Generalized Linear Models

In a generalized linear model, each \(Y\) is assumed to be generated from an exponential family distribution such that \(E(Y | X) = \mu = g^{-1}(X\beta)\),

where \(g\) is the link function: \(g(\mu) = X\beta\)
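
As a concrete sketch, a Poisson GLM with the canonical log link fit by IRLS in plain numpy (the data, link, and iteration count are illustrative choices, not a prescribed implementation):

```python
import numpy as np

rng = np.random.default_rng(10)

n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 0.3])
y = rng.poisson(np.exp(X @ beta_true))           # Poisson responses, log link

beta = np.zeros(2)
for _ in range(25):                              # IRLS / Newton iterations
    mu = np.exp(X @ beta)                        # mu = g^{-1}(X beta) with g = log
    W = mu                                       # working weights (Var(Y) = mu)
    z = X @ beta + (y - mu) / mu                 # working response
    beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
print(beta)                                      # close to beta_true
```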

## 5. Variable Selections

### 5.1. Criterion

#### 5.1.1. R^2

## 6. Time Series Analysis

Follows the book Time Series Analysis With Applications in R

### 6.1. Stationary Models

#### 6.1.1. ARMA

### 6.2. Nonstationary Models

## 7. Causal Inference

Correlation does not imply causation

**Definition (causal inference framework)**

- For each unit there are two possible actions: one is referred to as the **active treatment** (or just treatment), and the other as the **control treatment**.
- For each unit and the two treatments, we associate two **potential outcomes**: \(Y_i(1), Y_i(0)\) (random variables). Every unit only receives one of the two potential outcomes, which is known as the fundamental problem of causal inference

**Definition (average treatment effect)** we can measure the causal association with potential outcomes: \(\tau = E[Y_i(1) - Y_i(0)]\)

Again the problem here is that for each \(i\)-th unit, we only observe \(Y_i(1)\) or \(Y_i(0)\). What we observe is \(Y_i^{\text{obs}} = W_iY_i(1) + (1 - W_i)Y_i(0)\),

where \(W_i\) is the binary treatment random variable

The only quantity we can observe here is \(E[Y_i(1) \mid W_i = 1] - E[Y_i(0) \mid W_i = 0]\)
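
A tiny simulation sketch (a randomized experiment with a constant true effect of 2; all values illustrative): under random assignment, the difference in observed group means recovers the ATE.

```python
import numpy as np

rng = np.random.default_rng(11)

n = 1000
Y0 = rng.normal(size=n)                   # potential outcome under control
Y1 = Y0 + 2.0                             # potential outcome under treatment (true ATE = 2)
W = rng.integers(0, 2, size=n)            # random binary treatment assignment

Y_obs = W * Y1 + (1 - W) * Y0             # we only ever observe one potential outcome
ate_hat = Y_obs[W == 1].mean() - Y_obs[W == 0].mean()
print(ate_hat)                            # close to 2
```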

## 8. Reference

[1] Christensen, Ronald. Plane answers to complex questions. Vol. 35. No. 1. New York: Springer, 2002.

[2] Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. New York: Springer, 2009.

[3] Bishop, Christopher M., and Nasser M. Nasrabadi. Pattern Recognition and Machine Learning. Vol. 4. No. 4. New York: Springer, 2006.