0x403 Parametric Models

1. Estimation
2. Hypothesis Testing
- 2.1. Model Testing
- 2.2. Linear Parmetric Function Testing
3. ANOVA
- 3.1. Oneway Analysis of Variance
- 3.2. Balanced Two-way ANOVA
4. Regression Analysis
5. Variable Selections
- 5.1. Criterion
  - 5.1.1. R^2
6. Time Series Analysis
- 6.1. Stationary Models
  - 6.1.1. ARMA
- 6.2. Nonstationary Models
7. Causal Inference
8. Reference

1. Estimation

In general, a parameterization for the mean vector $E(Y)$ consists of writing $E(Y)$ as a function of some parameters $\beta$, say

\[E(Y) = f(\beta)\]

A linear model has the form

\[E(Y) = X\beta\]

A parameterization is identifiable if knowing $E(Y)$ (i.e. $f(\beta)$) tells you the parameter vector $\beta$

Definition (identifiable) the parameter $\beta$ is identifiable if for any $\beta_1, \beta_2$, $f(\beta_1) = f(\beta_2)$ implies $\beta_1 = \beta_2$. Moreover a vector-valued function $g(\beta)$ is identifiable if $f(\beta_1) = f(\beta_2)$ implies $g(\beta_1) = g(\beta_2)$.

regression model

In regression model with full rank $r(X) = p$, then $X^TX$ is non-singular, $X\beta_1 = X\beta_2$ implies $\beta_1 = \beta_2$, identifiability holds

linear functions of parameters that are identifiable are called estimable

Definition (estimable) A vector-valued linear function of $\beta$, say, $\Lambda^T \beta$ is estimable if $\Lambda^T \beta = P^T X\beta$ for some matrix $P$

Proposition (estimable and linear unbiased estimator) $\lambda^T \beta$ is estimable iff it has linear unbiased estimator: there exists $\rho$ such that $E(\rho^T Y) = \lambda^T \beta$ for all $\beta$

1.1. Ordinary Least Square (OLS)

Consider the least square estimation over the standard linear model

\[Y=X\beta + e, E(e) = 0, Cov(e) = \sigma^2 I\]

1.1.1. Coefficient Estimation

we know $E(Y) = X \beta$ where $\beta$ is undecided, so we know $E(Y)$ is within $C(X)$

Definition (Least Square Estimate, LSE) we want to take the vector in $C(X)$ that is closest to $Y$. in such a case $\hat{\beta}$ is called the least square estimate

\[\hat{\beta} = \text{argmin}_{\beta} (Y - X \beta)^T(Y - X \beta)\]

For a vector $\Lambda^T \beta$, a least square estimate is defined as $\Lambda^T \hat{\beta}$

The LSE in this simple case is also sometimes called the OLS or Ordinary Least Square

Theorem (LSE and projection) $\hat{\beta}$ is a LSE of $\beta$ iff

\[X\hat{\beta} = MY\]

where $M$ is the perpendicular projection operator onto $C(X)$

Note the LSE does not need to be unique if the target is not identifiable

non-unique LSE

Consider the simple case where

\[X = \begin{pmatrix} 1 & 2 \\ 1 & 2 \end{pmatrix}, Y = \begin{pmatrix} 0 \\ 1 \end{pmatrix}\]

Then $\beta$ is not identifiable, LSE $\hat{\beta}$ can be, for example, $(0, 1/4)^T$ or $(1/2, 0)^T$

Recall the projection matrix can be written as

\[M = X(X^TX)^gX^T\]

where $X^g$ is the generalized inverse. While the generalized inverse is not unique (always exists though), the projection $M$ is unique regardless of which generalized inverse. We can, for example, use the pseudo-inverse $X^\dagger$

Corollary (LSE with pseudo-inverse) the following is "one" of the LSE of $\beta$, there might be other LSE

\[\hat{\beta} = (X^TX)^{\dagger} X^T Y\]

where $X^\dagger$ is the pseudo-inverse.

While LSE of $\beta$ might not be unique, the LSE of identifiable/estimable functions are unique.

Corollary (LSE and estimable function) Recall $\lambda^T \beta$ is estimable if $\lambda^T = \rho^TX$, so we know the following LSE is unique:

\[\widehat{\rho^TX\beta} = \rho^T MY\]

Also recall here that $M$ is unique

Proposition (LSE are unbiased) LSE of estimable functions are unbiased: if $\lambda^T = \rho^T X$

\[E(\lambda^T \hat{\beta}) = E(\rho^T MY) = \lambda^T \beta\]

if $Y \sim N(X \beta, \sigma^2 I)$, we can show

\[\Lambda^T \hat{\beta} \sim N(\Lambda^T \beta, \sigma^2 \Lambda^T (X^T X)^g \Lambda)\]

if $X^TX$ is nonsingular, then $\beta$ is estimable and

\[\hat{\beta} \sim N(\beta, \sigma^2 (X^TX)^{-1})\]

1.1.2. Variance Estimation

It is obvious the residuals can be written as

\[\hat{e} = Y - X \hat{\beta} = (I-M)Y\]

It is reasonable to use $(I-M)Y$ to estimate $\sigma^2$

Theorem (unbiased estimate of variance) Let $r(X) = r$, $Cov(e) = \sigma^2 I$, then the following is an unbiased estimate of $\sigma^2$

\[\frac{Y^T(I-M)Y}{n-r}\]

MLE of variance

The LSE estimates are unbiased, but esimtate of $\sigma^2$ is not

\[\hat{\sigma}_2 = \frac{Y^T (I-M)Y}{n}\]

Definition (SSE, sum of squared error) The sum of squares of error is

\[\hat{e}^T\hat{e} = Y^T(I-M)^T(I-M)Y = Y^T(I-M)Y\]

Defintiion (MSE, mean square error) defined as

\[\frac{Y^T(I-M)Y}{n-r}\]

where $r$ is the rank of $X$ (and $M$). $n-r=rank(I-M)$ is called the degrees of freedom for error, denoted b $dfE$

It is an unbiased estimate of $\sigma^2$

Suppose $Y \sim N(X\beta, \sigma^2 I)$, the underlying distribution is

\[Y^T(I-M)Y/\sigma^2 \sim \chi^2(n-r)\]

1.1.3. Best Linear Unbiased Estimator

Definition (Best Linear Unbiased Estimate, BLUE) $a^TY$ is called a best linear unbiased estimate of $\lambda^T \beta$ when $a^T Y$ is unbiased and if for any other linear unbiased estimate $b^T Y$

\[Var(a^TY) \leq Var(b^TY)\]

Gauss-Markov claims the LSE are BLUE in the standard linear model

Definition (Gauss-Markov) Recall the standard linear model

\[Y = X\beta + e, E(e) = 0, Cov(e) = \sigma^2 I\]

If $\lambda^T \beta$ is estimable, then the least squares estimate of $\lambda^T \beta$ is a BLUE of $\lambda^T \beta$

1.1.4. Minimum Variance Unbiased Estimate (MVUE)

Consider the model is

\[Y = X \beta + e, e \sim N(0, \sigma^2 I)\]

A vector-valued sufficient statistic $T(Y)$ is said to be complete if $E(h(T(Y))) =0$ for all $\beta, \sigma^2$ implies $P[h(T(Y)) = 0] = 1$ for all $\beta, \sigma^2$

Theorem (Lehmann–Scheffé theorem, Minimum Variance Unbiased Estimate) If $T(Y)$ is a complete sufficient statistic, then $f(T(Y))$ is a minimum variance unbiased estimate (MVUE) of $E[f(T(Y))]$

1.2. Generalized Least Square

A more general version of linear model assumes the following formulation:

\[Y=X\beta + e, E(e) = 0, Cov(e) = \sigma^2 V\]

where $V$ is some known PSD matrix rather than identity

weighted least square

weighted least square is a special case that $V$ is a diagonal matrix

The least square here minimizes the following objective:

\[\| Y - X\beta \|^2_{V^{-1}}\]

$\hat{\beta}$ is a generalized least square estimate iff it satisfies the following:

\[X \hat{\beta} = X (X^TV^{-1}X)^{g} X^T V^{-1}Y\]

This criterion is a generalized version of $X \hat{\beta} = MY$ in OLS

1.3. Biased Estimators

1.3.1. Bias-Variance Tradeoff

Consider again the standard linear model

\[Y = X\beta + e, E(e) = 0, Cov(e) = \sigma^2 I\]

The MSE of OLS can be measured by

\[E[(X\hat{\beta} - X \beta)^T(X\hat{\beta} - X \beta)] = E[(Y-X\beta)^T M(Y-X\beta)] = \sigma^2 r(X)\]

It is unbiased estimator, so $\sigma^2 r(X)$ is the variance of this model

Here consider a reduced model

\[Y = X_0 \gamma + e \text{ where } C(X_0) \subset C(X)\]

THE MSE of this model can be decomposed into

\[E \|M_0Y - X\beta \|^2 = \|X \beta - M_0 X \beta \|^2 + \sigma^2 r(X_0)\]

where the first term is bias and second term is variance

It is clear thean MSE get improved in the reduced model when $\|X_\beta - M_0 X \beta \|^2 < \sigma^2(r(X) - r(X_0))$$

1.3.2. Ridge Regression

Ridge regression shrinks the regression coefficients by imposing a penalty on their size.

\[\hat{\beta}^\text{ridge} = \text{argmin}_\beta (Y - X \beta)^T(Y - X \beta) + \lambda \| \beta \|_2^2\]

An equivalent way to write it is a constrained optimization problem

\[\hat{\beta}^\text{ridge} = \text{argmin}_\beta (Y - X \beta)^T(Y - X \beta), \text{ subject to } \| \beta \|^2_2 \leq t\]

where there is a 1-to-1 mapping between $\lambda, t$

The solution is

\[\hat{\beta}^\text{ridge} = (X^TX + \lambda I)^{-1}X^TY\]

Using SVD of $X = U\Sigma V^T$, we can compare the $\hat{Y}^\text{LSE}$ and $\hat{Y}^\text{ridge}$

\[\hat{Y}^\text{LSE} = X \hat{\beta} = X(X^TX)^-1 X^TY = UU^TY\]

\[\hat{Y}^\text{ridge} = X(X^TX + \lambda I)^-1 X^TY = \sum_i u_i \frac{d_i^2}{d_i^2 + \lambda} u_i^TY\]

where $d_i$ are singular values of $X$

Ridge regression computes the coordinates of $Y$ with respect to the orthonormal basis $U$. It then shrinks these coordinates by the factors of $\frac{d_i^2}{d_i^2 + \lambda}$

1.3.3. Lasso

2. Hypothesis Testing

2.1. Model Testing

Test (t-test) Assume that $Y \sim N(X \beta, \sigma^2 I)$, if $X^TX$ is nonsingular (i.e. $X$ is full rank), then we know the following sampling distribution of estimators $\hat{\beta}, \hat{\sigma}^2$:

\[\hat{\beta} \sim N(\beta, \sigma^2 (X^TX)^{-1})\]

and

\[(n-p)\hat{\sigma}^2 / \sigma^2 \sim \chi^2(n-p)\]

To test a specific coefficient $\beta_j = 0$, we form the following statistics if $\sigma$ is unknown, we use estimate $\hat{\sigma}$ instead"

\[\frac{\hat{\beta_j}}{\hat{\sigma}\sqrt{diag(X^TX)^{-1}_j}} \sim t_{n-p}\]

If $\sigma$ is known, it becomes a normal distribution

Test (F-test) To test a reduction model with a group of variables. Consider a full model

\[Y = X\beta + e, e \sim N(0, \sigma^2 I)\]

and a reduced model

\[Y = X_0 \gamma + e, e \sim N(0, \sigma^2I), C(X_0) \subset C(X)\]

If the full model is true, then

\[\frac{Y^T(M-M_0)Y/r(M-M_0)}{Y^T(I-M)Y/r(I_M)} \sim F(r(M-M_0), r(I-M), \beta^TX^T(M-M_0)X\beta/2\sigma^2)\]

If the reduced model is true

\[\frac{Y^T(M-M_0)Y/r(M-M_0)}{Y^T(I-M)Y/r(1-M)} \sim F(r(M-M_0), r(I-M), 0)\]

The last one provides a distribution for the test statistic under the null hypothesis (which is the reduced model), we reject $H_0$ at level $\alpha$ if

\[\frac{Y^T(M-M_0)Y/r(M-M_0)}{Y^T(I-M)Y/r(1-M)} > F(1 - \alpha, r(M-M_0), r(I-M))\]

2.2. Linear Parmetric Function Testing

3. ANOVA

Analysis of Variance (ANOVA) is about analyzing variances but rather about analyzing variation in means

3.1. Oneway Analysis of Variance

Definition (oneway ANOVA assumption) Random variables $Y_{i,j}$ are observed according to the model

\[Y_{i,j} = \theta_i + \epsilon_{i,j}\]

where

$E \epsilon_{i,j} = 0, \text{Var}(\epsilon_{i,j}) = \sigma_i^2 = \sigma^2 < \infty$
$\text{Cov}(\epsilon_{i,j}, \epsilon_{i', j'}) = 0$
$\epsilon_{i,j}$ are independent and normally distributed

3.2. Balanced Two-way ANOVA

The balanced two-way ANOVA without interaction model is generally written a

\[y_{ijk} = \mu + \alpha_i + \eta_j + e_{ijk}\]

4. Regression Analysis

4.1. Simple Linear Regression

Consider the simple model

\[Y = X\beta + e\]

where $Y, e$ are random variables, $X$ is the fixed value and $\beta$ are (fixed) index or parameter.

The solution is

\[\hat\beta = \frac{cov(x,y)}{s_x^2} = r_{x,y}\frac{s_y}{s_x}\]

where $s_x, s_y$ are sample std, $r_{x,y}$ are sample correlation coefficient

4.2. Sparse Linear Models

See the SLS book and High Dimensional Statistics chapter 7

Consider the linear model

\[y = X\theta^* + w\]

where the number of predictors $d$ has the same scale of the number of sample size $n$. It is necessary to impose additional structure on the regression vector $\theta^*$

Definition (hard sparsity) the support set of $\theta^*$ has cardinality $s = |S(\theta^*)|$ substancially less than $d$. Nonzero elements should be less than $s$

\[S(\theta^*) = \{ j \in \{ 1, 2, ..., d\} | \theta^*_j \neq 0 \}\]

Definition (weakly sparse) A weak sparsity is to relax the hard sparsity and approximate a sparse vector

\[B_q(R_q) = \{ \theta \in R^d | \sum_j |\theta_j|^q \leq R_q \}\]

In the special case $q=0$, the weak sparsity becomes the hard sparsity.

4.2.1. Noisyless Inference

4.2.2. Noisy Inference

4.3. Bayesian Estimation

In a full Bayesian treatment, the parameter $w$ has a distribution, the conjugate prior can be given as follows (if variance is known)

\[p(w) = \mathcal{N}(m_0, S_0)\]

4.3.1. Posterior Distribution

With the likelihood,

\[p(t|w) = \prod_{n=1}^N \mathcal{N}(t_n | w^T \phi(x_n), \beta^{-1})\]

The posterior is then given by the linear condition model:

\[p(w | t) = N(w | m_N, s_N)\]

where

\[m_N = S_N(S_0^{-1} m_0 + \beta \Phi^T t)\]

\[S_N^{-1} = S_0^{-1} + \beta \Phi^T\Phi\]

simple case

Consider $S_0 = \alpha^{-1}I$, then the posterior becomes

\[m_N = \beta S_N \Phi^T t\]

\[S_N^{-1} = \alpha I + \beta \Phi^T \Phi\]

Maximizing this posterior wrt $w$ is equivalent to the minimization of squre sum error with regularization $\lambda = \alpha/\beta$

\[\log p(w|t) = -\frac{\beta}{2} \sum (t_n - w^T \phi(x_n))^2 - \frac{\alpha}{2} w^Tw\]

4.3.2. Predictive Distribution

The predictive distribution becomes

\[p(t|\mathbf{t}, \alpha, \beta) = \int p(t | w, \alpha, \beta) p(w | \mathbf{t}, \alpha, \beta)\]

We can again think of it as a linear conditional model (i.e.: $p(y) = \int p(y|x) p(x)$), and obtain the predictive distribution

\[p(t |x, \mathbf{t}, \alpha, \beta) = \mathcal{N}(t | m_N^T \phi(x), \sigma^2_N(x))\]

where

\[\sigma_N^2(x) = 1/\beta + \phi(x)^TS_N \phi(x)\]

4.3.3. Model Evidence

In a full treatment, we introduce prior to hyperparameter $\alpha, \beta$ as well and marginalize wrt them as well

\[p(t|\mathbf{t}) = \int p(t | w, \alpha, \beta) p(w | \mathbf{t}, \alpha, \beta)p(\alpha, \beta | \mathbf{t}) dw d\alpha d\beta\]

This integral might be intractable, empricial bayes or evidence approximation provides a solution by setting $\alpha, \beta$ to a peaked value $\hat{\alpha}, \hat{\beta}$

\[p(t|\mathbf{t}) \approx p(t|\mathbf{t}, \hat{\alpha}, \hat{\beta}) = \int p(t | w, \hat{\beta}) p(w | \mathbf{t}, \hat{\alpha}, \hat{\beta})\]

where $\hat{\alpha}, \hat{\beta}$ can be obtained b maximizing marginal likelihood or evidence function $p(\mathbf{t} | \alpha, \beta)$

4.3.4. Dual Representations

The regularized error function is

\[J(w) = \frac{1}{2} \sum_n \{ w^T \phi(x_n) - t_n \}^2 + \frac{\lambda}{2} w^Tw\]

By taking derivative of $w$ to 0, we obtain

\[w = \Phi^T a\]

where $a_n = -\frac{1}{\lambda}(w^T\phi(x_n) - t_n)$ and $\Phi$ is the design matrix.

substituting this into the previous formula, we get the dual representation

\[J(a) = \frac{1}{2}a^TKKa - a^TKt + \frac{1}{2}t^T t + \frac{\lambda}{2}a^TKa\]

where $K$ is the Gram matrix (i.e. $K_{nm} = \phi(x_n)^T \phi(x_m)$)

Solving this with respect to $a$, we obtain

\[a = (K + \lambda I_N)^{-1} t\]

and for a new input $x$

\[y(x) = w^T \phi(x) = a^T \Phi \phi(x) = k(x)^T (K+\lambda I_N)^{-1}t\]

4.4. General Prediction Theroy

By assuming $X$ is also a random variable, we are interested in picking up a predictor $f(x)$ such that minimizing the mean square error wrt the joint distribution of $(x,y)$

\[E_{x,y} [y - f(x)]\]

It turns out that the best predictor is the conditional expectation of $y$ given $x$

Theorem (best predictor) let $m(x) = E(y|x)$, then for any other predictor $f(x)$

\[E[y-m(x)]^2 \leq E[y - f(x)]\]

thus $m(x)$ is the best predictor of $y$

proof of the best predictor

Consider the usualy decomposition:

\[\begin{aligned} E_{x,y}[y-f(x)]^2 &= E_{x,y}[y - m(x) + m(x) - f(x)]^2 \\ &= E[y-m(x)]^2 + E[m(x)-f(x)]^2 + 2E[(y-m(x))(m(x)-f(x))] \end{aligned}\]

Then first two terms are nonnegative and the last term is 0 because

\[\begin{aligned} E_{x,y}[(y-m(x))(m(x)-f(x))] &= E_x[E_y[(y-m(x))(m(x)-f(x))| x]] \\ &= E_x[(m(x) - f(x) E_y[y-m(x) | x]] \\ &= E_x[(m(x) - f(x)) 0] \\ &= 0 \end{aligned}\]

4.5. Generalized Linear Models

In a generalized linear model, each $Y$ is assumed to be generated from a exponential family distribution such that:

\[E(Y) = \mu = g^{-1}(X \beta)\]

where $g$ is the link function: $g(\mu) = X\beta$

5. Variable Selections

5.1. Criterion

5.1.1. R^2

6. Time Series Analysis

Follows the book Time Series Analysis With Applications in R

6.1. Stationary Models

6.1.1. ARMA

6.2. Nonstationary Models

7. Causal Inference

Correlation does not imply causation

Definition (causal inference framework)

there are two possible actions: active treatment (or just treatment), and the other as the control treatment.
For each unit and the two treatment, we associate two potential outcomes: $Y_i(1), Y_i(0)$ (random variables). Every unit only receives one of the two potential outcomes, which is known as the fundamental problem of causal inference

Definition (average treatment effect) we can measure the casual association with potential outcomes

\[\tau = E[Y(1) - Y(0)]\]

Again the problem here is for each $i$-th unit, we only observe $Y(1)$ or $Y(0)$. What we observe is

\[Y_i^{obs} = Y_i(1)W_i + Y_i(0)(1-W_i)\]

where $W_i$ is the binary treatment random variable

The only quantity we can observe here is

\[\alpha = E[Y(1)|W=1] - E[Y(0)|W=0]\]

8. Reference

[1] Christensen, Ronald. Plane answers to complex questions. Vol. 35. No. 1. New York: Springer, 2002.　

[2] Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. "The elements of statistical learning: data mining, inference, and prediction. Springer series in statistics." Springer New York (2009). online pdf

[3] Bishop, Christopher M., and Nasser M. Nasrabadi. Pattern recognition and machine learning. Vol. 4. No. 4. New York: springer, 2006.