0x403 Parametric Models
- 1. Estimation
- 2. Hypothesis Testing
- 3. ANOVA
- 4. Regression Analysis
- 5. Variable Selections
- 6. Time Series Analysis
- 7. Causal Inference
- 8. Reference
1. Estimation
In general, a parameterization for the mean vector \(E(Y)\) consists of writing \(E(Y)\) as a function of some parameters \(\beta\), say \(E(Y) = f(\beta)\)
A linear model takes \(f(\beta) = X\beta\) for a known design matrix \(X\), i.e. \(E(Y) = X\beta\)
A parameterization is identifiable if knowing \(E(Y)\) (i.e. \(f(\beta)\)) tells you the parameter vector \(\beta\)
Definition (identifiable) the parameter \(\beta\) is identifiable if for any \(\beta_1, \beta_2\), \(f(\beta_1) = f(\beta_2)\) implies \(\beta_1 = \beta_2\). Moreover a vector-valued function \(g(\beta)\) is identifiable if \(f(\beta_1) = f(\beta_2)\) implies \(g(\beta_1) = g(\beta_2)\).
regression model
In a regression model with full rank \(r(X) = p\), \(X^TX\) is nonsingular and \(X\beta_1 = X\beta_2\) implies \(\beta_1 = \beta_2\), so identifiability holds
linear functions of parameters that are identifiable are called estimable
Definition (estimable) A vector-valued linear function of \(\beta\), say, \(\Lambda^T \beta\) is estimable if \(\Lambda^T \beta = P^T X\beta\) for some matrix \(P\)
Proposition (estimable and linear unbiased estimator) \(\lambda^T \beta\) is estimable iff it has a linear unbiased estimator: there exists \(\rho\) such that \(E(\rho^T Y) = \lambda^T \beta\) for all \(\beta\)
1.1. Ordinary Least Square (OLS)
Consider the least square estimation over the standard linear model \(Y = X\beta + e\), \(E(e) = 0\), \(Cov(e) = \sigma^2 I\)
1.1.1. Coefficient Estimation
we know \(E(Y) = X \beta\) where \(\beta\) is unknown, so we know \(E(Y)\) lies in the column space \(C(X)\)
Definition (Least Square Estimate, LSE) we want to take the vector in \(C(X)\) that is closest to \(Y\): \(\hat{\beta}\) is called a least square estimate if \(X\hat{\beta}\) minimizes \(\|Y - X\beta\|^2\) over \(\beta\)
For a vector \(\Lambda^T \beta\), a least square estimate is defined as \(\Lambda^T \hat{\beta}\)
The LSE in this simple case is also sometimes called the OLS or Ordinary Least Square
Theorem (LSE and projection) \(\hat{\beta}\) is a LSE of \(\beta\) iff \(X\hat{\beta} = MY\)
where \(M\) is the perpendicular projection operator onto \(C(X)\)
Note the LSE does not need to be unique if the target is not identifiable
non-unique LSE
Consider the simple case where
Then \(\beta\) is not identifiable, LSE \(\hat{\beta}\) can be, for example, \((0, 1/4)^T\) or \((1/2, 0)^T\)
Recall the projection matrix can be written as \(M = X(X^TX)^{g}X^T\)
where \((X^TX)^{g}\) is a generalized inverse. While a generalized inverse is not unique (it always exists), the projection \(M\) is the same regardless of which generalized inverse is used. We can, for example, use the pseudo-inverse \(X^\dagger\), which gives \(M = XX^\dagger\)
Corollary (LSE with pseudo-inverse) \(\hat{\beta} = X^\dagger Y\) is "one" of the LSEs of \(\beta\) (there might be other LSEs), where \(X^\dagger\) is the pseudo-inverse.
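A minimal numerical sketch, using a hypothetical rank-deficient design (not the example above): different LSEs of \(\beta\) exist, but they all give the same fitted values \(MY\).

```python
import numpy as np

# Hypothetical rank-deficient design: the second column is twice the first,
# so r(X) = 1 < p = 2 and beta is not identifiable.
X = np.array([[1.0, 2.0],
              [1.0, 2.0],
              [1.0, 2.0]])
Y = np.array([1.0, 2.0, 0.0])

beta_pinv = np.linalg.pinv(X) @ Y          # one LSE: X^+ Y
M = X @ np.linalg.pinv(X)                  # perpendicular projection onto C(X)
beta_other = np.array([(M @ Y)[0], 0.0])   # another LSE (all weight on column 1)

# Different LSEs, identical fitted values M Y: the estimable part is unique.
print(beta_pinv, beta_other)
print(np.allclose(X @ beta_pinv, M @ Y), np.allclose(X @ beta_other, M @ Y))
```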
While LSE of \(\beta\) might not be unique, the LSE of identifiable/estimable functions are unique.
Corollary (LSE and estimable function) Recall \(\lambda^T \beta\) is estimable if \(\lambda^T = \rho^TX\), so we know the following LSE is unique: \(\lambda^T\hat{\beta} = \rho^T X\hat{\beta} = \rho^T M Y\)
Also recall here that \(M\) is unique
Proposition (LSE are unbiased) LSE of estimable functions are unbiased: if \(\lambda^T = \rho^T X\), then \(E(\lambda^T\hat{\beta}) = E(\rho^T M Y) = \rho^T M X\beta = \rho^T X\beta = \lambda^T\beta\)
if \(Y \sim N(X \beta, \sigma^2 I)\), we can show \(\lambda^T\hat{\beta} \sim N(\lambda^T\beta, \sigma^2\rho^TM\rho)\)
if \(X^TX\) is nonsingular, then \(\beta\) is estimable and the unique LSE is \(\hat{\beta} = (X^TX)^{-1}X^TY\)
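A quick numpy sketch of the full-rank case on simulated data (the design and coefficients are made up): the closed form \((X^TX)^{-1}X^TY\) agrees with a numerically safer least squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated full-rank design: X^T X is nonsingular, so the LSE is unique.
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(scale=0.3, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)        # (X^T X)^{-1} X^T Y
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)  # equivalent, numerically preferable

print(beta_hat)
print(np.allclose(beta_hat, beta_lstsq))
```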
1.1.2. Variance Estimation
It is obvious the residuals can be written as \(\hat{e} = Y - X\hat{\beta} = (I-M)Y\)
It is reasonable to use \((I-M)Y\) to estimate \(\sigma^2\)
Theorem (unbiased estimate of variance) Let \(r(X) = r\), \(Cov(e) = \sigma^2 I\), then the following is an unbiased estimate of \(\sigma^2\): \(\hat{\sigma}^2 = \frac{Y^T(I-M)Y}{n-r}\)
MLE of variance
Under normality the MLE of \(\beta\) coincides with the LSE and is unbiased, but the MLE of \(\sigma^2\), namely \(Y^T(I-M)Y/n\), is biased
Definition (SSE, sum of squared error) The sum of squares of error is \(SSE = Y^T(I-M)Y\)
Definition (MSE, mean square error) defined as \(MSE = \frac{SSE}{n-r}\)
where \(r\) is the rank of \(X\) (and \(M\)). \(n-r=rank(I-M)\) is called the degrees of freedom for error, denoted by \(dfE\)
It is an unbiased estimate of \(\sigma^2\)
Suppose \(Y \sim N(X\beta, \sigma^2 I)\), then the underlying distribution is \(\frac{SSE}{\sigma^2} \sim \chi^2_{n-r}\)
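A short sketch verifying \(MSE = SSE/(n-r)\) on simulated data with a known \(\sigma^2\) (the design and coefficients are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data with known sigma^2 = 4 to sanity-check MSE = SSE / (n - r).
n, p, sigma = 200, 4, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ rng.normal(size=p) + rng.normal(scale=sigma, size=n)

M = X @ np.linalg.pinv(X)        # perpendicular projection onto C(X)
resid = Y - M @ Y                # (I - M) Y

r = np.linalg.matrix_rank(X)
SSE = resid @ resid              # Y^T (I - M) Y
MSE = SSE / (n - r)              # unbiased estimate of sigma^2
print(r, n - r, MSE)             # MSE should be near sigma^2 = 4
```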
1.1.3. Best Linear Unbiased Estimator
Definition (Best Linear Unbiased Estimate, BLUE) \(a^TY\) is called a best linear unbiased estimate of \(\lambda^T \beta\) when \(a^T Y\) is unbiased and, for any other linear unbiased estimate \(b^T Y\), \(\text{Var}(a^TY) \le \text{Var}(b^TY)\)
The Gauss-Markov theorem states that LSEs are BLUEs in the standard linear model
Theorem (Gauss-Markov) Recall the standard linear model \(Y = X\beta + e\), \(E(e) = 0\), \(Cov(e) = \sigma^2 I\)
If \(\lambda^T \beta\) is estimable, then the least squares estimate of \(\lambda^T \beta\) is a BLUE of \(\lambda^T \beta\)
1.1.4. Minimum Variance Unbiased Estimate (MVUE)
Consider the model \(Y \sim N(X\beta, \sigma^2 I)\)
A vector-valued sufficient statistic \(T(Y)\) is said to be complete if \(E(h(T(Y))) =0\) for all \(\beta, \sigma^2\) implies \(P[h(T(Y)) = 0] = 1\) for all \(\beta, \sigma^2\)
Theorem (Lehmann–Scheffé theorem, Minimum Variance Unbiased Estimate) If \(T(Y)\) is a complete sufficient statistic, then \(f(T(Y))\) is a minimum variance unbiased estimate (MVUE) of \(E[f(T(Y))]\)
1.2. Generalized Least Square
A more general version of the linear model assumes the following formulation: \(Y = X\beta + e\), \(E(e) = 0\), \(Cov(e) = \sigma^2 V\)
where \(V\) is some known positive definite matrix rather than the identity
weighted least square
weighted least square is the special case in which \(V\) is a diagonal matrix
The least square here minimizes the following objective: \((Y - X\beta)^T V^{-1} (Y - X\beta)\)
\(\hat{\beta}\) is a generalized least square estimate iff it satisfies \(X\hat{\beta} = X(X^TV^{-1}X)^{g}X^TV^{-1}Y\)
This criterion is a generalized version of \(X \hat{\beta} = MY\) in OLS
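A minimal sketch with hypothetical heteroscedastic data and a known diagonal \(V\) (the weighted least squares special case): whitening by \(V^{-1/2}\) and running OLS minimizes the GLS objective and agrees with solving \(X^TV^{-1}X\hat{\beta} = X^TV^{-1}Y\).

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical heteroscedastic model: Cov(e) = sigma^2 V with V diagonal.
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta = np.array([0.5, 1.5])
v = rng.uniform(0.5, 4.0, size=n)            # known diagonal of V
Y = X @ beta + rng.normal(scale=np.sqrt(v))

# Whitening: scale rows by V^{-1/2}, then OLS minimizes (Y - Xb)^T V^{-1} (Y - Xb)
w = 1.0 / np.sqrt(v)
beta_gls, *_ = np.linalg.lstsq(X * w[:, None], Y * w, rcond=None)

# Equivalently, solve the generalized normal equations X^T V^{-1} X b = X^T V^{-1} Y
Vinv = np.diag(1.0 / v)
beta_ne = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ Y)
print(np.allclose(beta_gls, beta_ne))
```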
1.3. Biased Estimators
1.3.1. Bias-Variance Tradeoff
Consider again the standard linear model \(Y = X\beta + e\), \(E(e) = 0\), \(Cov(e) = \sigma^2 I\)
The MSE of OLS can be measured by \(E\|X\hat{\beta} - X\beta\|^2 = E\|MY - X\beta\|^2\)
Since \(MY\) is an unbiased estimator of \(X\beta\), this is pure variance and equals \(\sigma^2 r(X)\)
Here consider a reduced model \(E(Y) = X_0\gamma\) with \(C(X_0) \subset C(X)\) and perpendicular projection \(M_0\) onto \(C(X_0)\)
The MSE of this model can be decomposed into \(E\|M_0Y - X\beta\|^2 = \|X\beta - M_0X\beta\|^2 + \sigma^2 r(X_0)\)
where the first term is the (squared) bias and the second term is the variance
It is clear that the MSE improves under the reduced model when \(\|X\beta - M_0 X \beta \|^2 < \sigma^2(r(X) - r(X_0))\)
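A small simulation sketch (with a made-up design and a deliberately tiny third coefficient) illustrating the tradeoff: the reduced model's bias term can be smaller than the variance \(\sigma^2(r(X)-r(X_0))\) it saves.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical setup: the dropped coefficient is small, so dropping it helps.
n, sigma = 40, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
X0 = X[:, :2]                              # reduced design drops the last column
beta = np.array([1.0, 1.0, 0.05])          # last effect is tiny
mean_true = X @ beta

M = X @ np.linalg.pinv(X)
M0 = X0 @ np.linalg.pinv(X0)

mse_full, mse_red, reps = 0.0, 0.0, 2000
for _ in range(reps):
    Y = mean_true + rng.normal(scale=sigma, size=n)
    mse_full += np.sum((M @ Y - mean_true) ** 2)
    mse_red += np.sum((M0 @ Y - mean_true) ** 2)

# Theory: full ~ sigma^2 r(X) = 3, reduced ~ ||(I - M0) X beta||^2 + sigma^2 r(X0)
print(mse_full / reps, mse_red / reps)
```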
1.3.2. Ridge Regression
Ridge regression shrinks the regression coefficients by imposing a penalty on their size: \(\hat{\beta}^\text{ridge} = \arg\min_\beta \|Y - X\beta\|^2 + \lambda\|\beta\|^2\)
An equivalent way to write it is a constrained optimization problem: minimize \(\|Y - X\beta\|^2\) subject to \(\|\beta\|^2 \le t\),
where there is a 1-to-1 mapping between \(\lambda, t\)
The solution is \(\hat{\beta}^\text{ridge} = (X^TX + \lambda I)^{-1}X^TY\)
Using the SVD of \(X = U\Sigma V^T\), we can compare \(\hat{Y}^\text{LSE}\) and \(\hat{Y}^\text{ridge}\): \(\hat{Y}^\text{LSE} = UU^TY\) while \(\hat{Y}^\text{ridge} = \sum_i u_i \frac{d_i^2}{d_i^2+\lambda}u_i^TY\)
where \(d_i\) are singular values of \(X\)
Ridge regression computes the coordinates of \(Y\) with respect to the orthonormal basis \(U\), then shrinks these coordinates by the factors \(\frac{d_i^2}{d_i^2 + \lambda}\)
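A numerical check on random data (no intercept or centering, just for simplicity) that the ridge fit equals the SVD shrinkage form.

```python
import numpy as np

rng = np.random.default_rng(4)

# Random data; compare the closed-form ridge fit with the SVD shrinkage view.
n, p, lam = 60, 5, 10.0
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)

beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
Yhat_ridge = X @ beta_ridge

# Same fit through the SVD: shrink the coordinates of Y on the columns of U.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrink = d**2 / (d**2 + lam)
Yhat_svd = U @ (shrink * (U.T @ Y))
print(np.allclose(Yhat_ridge, Yhat_svd))
```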
1.3.3. Lasso
2. Hypothesis Testing
2.1. Model Testing
Test (t-test) Assume that \(Y \sim N(X \beta, \sigma^2 I)\), if \(X^TX\) is nonsingular (i.e. \(X\) is full rank), then we know the following sampling distributions of the estimators \(\hat{\beta}, \hat{\sigma}^2\): \(\hat{\beta} \sim N(\beta, \sigma^2(X^TX)^{-1})\)
and \(\frac{(n-p)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}\)
To test a specific coefficient \(\beta_j = 0\), we form the statistic \(t_j = \frac{\hat{\beta}_j}{\hat{\sigma}\sqrt{v_j}} \sim t_{n-p}\), where \(v_j\) is the \(j\)-th diagonal element of \((X^TX)^{-1}\); since \(\sigma\) is unknown, we use the estimate \(\hat{\sigma}\)
If \(\sigma\) is known, the statistic \(\frac{\hat{\beta}_j}{\sigma\sqrt{v_j}}\) follows a standard normal distribution instead
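A sketch of the coefficient t-test on simulated data (design, coefficients, and the tested index are made up for illustration), using numpy and scipy.stats.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Hypothetical full-rank Gaussian linear model; test H0: beta_1 = 0 (true here).
n, p = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = X @ np.array([1.0, 0.0, 2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ Y
resid = Y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)        # MSE with dfE = n - p

j = 1
t_stat = beta_hat[j] / np.sqrt(sigma2_hat * XtX_inv[j, j])
p_value = 2 * stats.t.sf(abs(t_stat), df=n - p)
print(t_stat, p_value)
```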
Test (F-test) To test the reduction obtained by dropping a group of variables, consider a full model \(Y = X\beta + e\)
and a reduced model \(Y = X_0\gamma + e\) with \(C(X_0) \subset C(X)\). Let \(M, M_0\) be the corresponding perpendicular projections and form \(F = \frac{Y^T(M - M_0)Y / (r(X) - r(X_0))}{Y^T(I-M)Y / (n - r(X))}\)
If the full model is true, then \(F\) follows a noncentral \(F\) distribution
If the reduced model is true, \(F \sim F(r(X) - r(X_0),\, n - r(X))\)
The last one provides a distribution for the test statistic under the null hypothesis (which is the reduced model); we reject \(H_0\) at level \(\alpha\) if \(F > F(1-\alpha,\, r(X) - r(X_0),\, n - r(X))\)
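A sketch of the F-test on simulated data (the designs and coefficients are arbitrary; the reduced model is true by construction, so rejections occur at roughly the nominal rate).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

# Hypothetical test for dropping the last two columns of X.
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
X0 = X[:, :2]                                       # reduced model
Y = X0 @ np.array([1.0, 0.5]) + rng.normal(size=n)  # reduced model is true here

M = X @ np.linalg.pinv(X)
M0 = X0 @ np.linalg.pinv(X0)
r, r0 = np.linalg.matrix_rank(X), np.linalg.matrix_rank(X0)

num = Y @ (M - M0) @ Y / (r - r0)
den = Y @ (np.eye(n) - M) @ Y / (n - r)
F = num / den

alpha = 0.05
F_crit = stats.f.ppf(1 - alpha, r - r0, n - r)
print(F, F_crit, F > F_crit)
```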
2.2. Linear Parametric Function Testing
3. ANOVA
Analysis of Variance (ANOVA) is not so much about analyzing variances, but rather about analyzing variation in means
3.1. Oneway Analysis of Variance
Definition (oneway ANOVA assumption) Random variables \(Y_{i,j}\) are observed according to the model \(Y_{i,j} = \mu_i + \epsilon_{i,j}\), \(i = 1, \dots, t\), \(j = 1, \dots, N_i\)
where
- \(E \epsilon_{i,j} = 0, \text{Var}(\epsilon_{i,j}) = \sigma_i^2 = \sigma^2 < \infty\)
- \(\text{Cov}(\epsilon_{i,j}, \epsilon_{i', j'}) = 0\) for \((i,j) \neq (i',j')\)
- \(\epsilon_{i,j}\) are independent and normally distributed
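A quick sketch of the one-way ANOVA F statistic on simulated groups (group means, sizes, and the common variance are arbitrary choices here).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical one-way layout: t = 3 groups with a common variance.
groups = [rng.normal(loc=mu, scale=1.0, size=30) for mu in (0.0, 0.2, 0.5)]

y_all = np.concatenate(groups)
grand_mean = y_all.mean()

# Between-group (treatment) and within-group (error) sums of squares
SSTrt = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
SSE = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_trt = len(groups) - 1
df_err = len(y_all) - len(groups)
F = (SSTrt / df_trt) / (SSE / df_err)
p_value = stats.f.sf(F, df_trt, df_err)
print(F, p_value)
```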
3.2. Balanced Two-way ANOVA
The balanced two-way ANOVA without interaction model is generally written as \(Y_{ijk} = \mu + \alpha_i + \eta_j + \epsilon_{ijk}\), \(i = 1, \dots, a\), \(j = 1, \dots, b\), \(k = 1, \dots, N\)
4. Regression Analysis
4.1. Simple Linear Regression
Consider the simple model \(Y_i = \beta_0 + \beta_1 X_i + e_i\)
where \(Y_i, e_i\) are random variables, \(X_i\) is a fixed value, and \(\beta_0, \beta_1\) are fixed (unknown) parameters.
The solution is \(\hat{\beta}_1 = r_{x,y}\frac{s_y}{s_x}\), \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\)
where \(s_x, s_y\) are the sample standard deviations and \(r_{x,y}\) is the sample correlation coefficient
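A numerical check of the slope/intercept formulas on made-up data, compared against a generic least squares fit.

```python
import numpy as np

rng = np.random.default_rng(8)

# Made-up data: slope = r_xy * s_y / s_x, intercept = ybar - slope * xbar.
x = rng.normal(size=50)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=50)

r_xy = np.corrcoef(x, y)[0, 1]
s_x, s_y = x.std(ddof=1), y.std(ddof=1)

beta1_hat = r_xy * s_y / s_x
beta0_hat = y.mean() - beta1_hat * x.mean()

print(beta0_hat, beta1_hat)
print(np.polyfit(x, y, deg=1))   # returns [slope, intercept]; should agree
```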
4.2. Sparse Linear Models
See the SLS book and High Dimensional Statistics chapter 7
Consider the linear model \(y = X\theta^* + w\) with \(X \in \mathbb{R}^{n \times d}\),
where the number of predictors \(d\) is of the same order as the sample size \(n\). It is then necessary to impose additional structure on the regression vector \(\theta^*\)
Definition (hard sparsity) the support set \(S(\theta^*) = \{j : \theta^*_j \neq 0\}\) has cardinality \(s = |S(\theta^*)|\) substantially less than \(d\), i.e. \(\theta^*\) has at most \(s\) nonzero entries
Definition (weakly sparse) weak sparsity relaxes hard sparsity by requiring \(\theta^*\) to be well approximated by a sparse vector, e.g. \(\theta^*\) lies in the \(\ell_q\) ball \(\mathbb{B}_q(R_q) = \{\theta \in \mathbb{R}^d : \sum_{j=1}^d |\theta_j|^q \le R_q\}\) for some \(q \in [0, 1]\)
In the special case \(q=0\), the weak sparsity becomes the hard sparsity.
4.2.1. Noiseless Inference
4.2.2. Noisy Inference
4.3. Bayesian Estimation
In a full Bayesian treatment, the parameter \(w\) has a distribution; the conjugate prior can be given as follows (if the noise variance is known): \(p(w) = N(w | m_0, S_0)\)
4.3.1. Posterior Distribution
With the likelihood \(p(\mathbf{t} | w) = \prod_{n} N(t_n | w^T\phi(x_n), \beta^{-1})\),
the posterior is then given by the linear Gaussian conditional model: \(p(w | \mathbf{t}) = N(w | m_N, S_N)\)
where \(m_N = S_N(S_0^{-1}m_0 + \beta\Phi^T\mathbf{t})\) and \(S_N^{-1} = S_0^{-1} + \beta\Phi^T\Phi\)
simple case
Consider a zero-mean prior with \(m_0 = 0\) and \(S_0 = \alpha^{-1}I\); then the posterior becomes \(m_N = \beta S_N\Phi^T\mathbf{t}\), \(S_N^{-1} = \alpha I + \beta\Phi^T\Phi\)
Maximizing this posterior with respect to \(w\) is equivalent to minimizing the sum-of-squares error with regularization \(\lambda = \alpha/\beta\)
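A sketch of the posterior computation in this notation, on hypothetical 1-D data with a polynomial basis for \(\phi\) (the values of \(\alpha, \beta\) and the data are made up); it also checks the MAP/ridge equivalence with \(\lambda = \alpha/\beta\).

```python
import numpy as np

rng = np.random.default_rng(9)

# Zero-mean isotropic prior S_0 = alpha^{-1} I, known noise precision beta.
alpha, beta = 2.0, 25.0
x = rng.uniform(-1, 1, size=30)
t = 0.5 + 1.0 * x - 2.0 * x**2 + rng.normal(scale=1 / np.sqrt(beta), size=30)

Phi = np.column_stack([x**k for k in range(3)])   # design matrix of basis functions

S_N = np.linalg.inv(alpha * np.eye(3) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t                      # posterior mean

# MAP estimate equals ridge regression with lambda = alpha / beta
lam = alpha / beta
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(3), Phi.T @ t)
print(np.allclose(m_N, w_ridge))
```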
4.3.2. Predictive Distribution
The predictive distribution becomes \(p(t | \mathbf{t}, \alpha, \beta) = \int p(t | w, \beta)\, p(w | \mathbf{t}, \alpha, \beta)\, dw\)
We can again think of it as a linear Gaussian conditional model (i.e. \(p(y) = \int p(y|x) p(x)\, dx\)), and obtain the predictive distribution \(p(t | x, \mathbf{t}, \alpha, \beta) = N(t | m_N^T\phi(x), \sigma_N^2(x))\)
where \(\sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^T S_N \phi(x)\)
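A sketch of the predictive mean and variance under the same hypothetical setup as above (polynomial basis, made-up \(\alpha, \beta\), simulated targets).

```python
import numpy as np

rng = np.random.default_rng(9)

# Predictive distribution: N(t | m_N^T phi(x), 1/beta + phi(x)^T S_N phi(x)).
alpha, beta = 2.0, 25.0
x = rng.uniform(-1, 1, size=30)
t = 0.5 + 1.0 * x - 2.0 * x**2 + rng.normal(scale=1 / np.sqrt(beta), size=30)

def phi(u):
    return np.column_stack([u**k for k in range(3)])

Phi = phi(x)
S_N = np.linalg.inv(alpha * np.eye(3) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

x_new = np.linspace(-1, 1, 5)
Phi_new = phi(x_new)
pred_mean = Phi_new @ m_N
pred_var = 1.0 / beta + np.sum(Phi_new @ S_N * Phi_new, axis=1)  # phi^T S_N phi per row
print(pred_mean, pred_var)
```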
4.3.3. Model Evidence
In a full treatment, we introduce priors over the hyperparameters \(\alpha, \beta\) as well and marginalize with respect to them too
The resulting integral is analytically intractable; empirical Bayes (the evidence approximation) provides a solution by setting \(\alpha, \beta\) to point values \(\hat{\alpha}, \hat{\beta}\) at which their posterior is sharply peaked,
where \(\hat{\alpha}, \hat{\beta}\) can be obtained by maximizing the marginal likelihood or evidence function \(p(\mathbf{t} | \alpha, \beta)\)
4.3.4. Dual Representations
The regularized error function is \(J(w) = \frac{1}{2}\sum_{n=1}^N \left(w^T\phi(x_n) - t_n\right)^2 + \frac{\lambda}{2}w^Tw\)
By setting the derivative with respect to \(w\) to 0, we obtain \(w = \Phi^T a\)
where \(a_n = -\frac{1}{\lambda}(w^T\phi(x_n) - t_n)\) and \(\Phi\) is the design matrix.
Substituting this into \(J(w)\), we get the dual representation \(J(a) = \frac{1}{2}a^TKKa - a^TK\mathbf{t} + \frac{1}{2}\mathbf{t}^T\mathbf{t} + \frac{\lambda}{2}a^TKa\)
where \(K\) is the Gram matrix (i.e. \(K_{nm} = \phi(x_n)^T \phi(x_m)\))
Solving this with respect to \(a\), we obtain \(a = (K + \lambda I_N)^{-1}\mathbf{t}\)
and for a new input \(x\), the prediction is \(y(x) = w^T\phi(x) = a^T\Phi\phi(x) = k(x)^T(K + \lambda I_N)^{-1}\mathbf{t}\), where \(k(x)\) has elements \(k_n(x) = \phi(x_n)^T\phi(x)\)
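A sketch of the dual solution with a linear kernel on random data (arbitrary sizes and values), checking that it reproduces the primal ridge prediction.

```python
import numpy as np

rng = np.random.default_rng(10)

# Dual representation with a linear kernel K = Phi Phi^T.
lam = 0.1
Phi = rng.normal(size=(40, 3))          # rows are phi(x_n) for a hypothetical feature map
t = rng.normal(size=40)

K = Phi @ Phi.T                         # Gram matrix K_nm = phi(x_n)^T phi(x_m)
a = np.linalg.solve(K + lam * np.eye(40), t)

x_new = rng.normal(size=3)
y_dual = (Phi @ x_new) @ a              # k(x_new)^T a

# Primal check: w = (Phi^T Phi + lambda I)^{-1} Phi^T t
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(3), Phi.T @ t)
print(np.allclose(y_dual, w @ x_new))
```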
4.4. General Prediction Theory
By assuming \(X\) is also a random variable, we are interested in picking a predictor \(f(x)\) that minimizes the mean squared error \(E[(y - f(x))^2]\) with respect to the joint distribution of \((x,y)\)
It turns out that the best predictor is the conditional expectation of \(y\) given \(x\)
Theorem (best predictor) let \(m(x) = E(y|x)\), then for any other predictor \(f(x)\), \(E[(y - m(x))^2] \le E[(y - f(x))^2]\)
thus \(m(x)\) is the best predictor of \(y\)
proof of the best predictor
Consider the usual decomposition: \(E[(y - f(x))^2] = E[(y - m(x))^2] + E[(m(x) - f(x))^2] + 2E[(y - m(x))(m(x) - f(x))]\)
The first two terms are nonnegative and the last term is 0 because \(E[(y - m(x))(m(x) - f(x))] = E\big[(m(x) - f(x))\, E[(y - m(x)) \mid x]\big] = 0\)
4.5. Generalized Linear Models
In a generalized linear model, each \(Y_i\) is assumed to be generated from an exponential family distribution with mean \(\mu_i = E(Y_i)\), and the mean is related to the linear predictor through the link function \(g\): \(g(\mu) = X\beta\)
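A sketch of fitting one common GLM, logistic regression with the logit link \(g(\mu) = \log\frac{\mu}{1-\mu}\), by iteratively reweighted least squares on simulated data (the design and coefficients are made up).

```python
import numpy as np

rng = np.random.default_rng(11)

# Simulated binary responses from a logistic model.
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([-0.5, 1.0, 2.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

beta = np.zeros(p)
for _ in range(25):                               # IRLS / Newton iterations
    mu = 1 / (1 + np.exp(-X @ beta))              # inverse link
    W = np.clip(mu * (1 - mu), 1e-10, None)       # working weights
    z = X @ beta + (y - mu) / W                   # working response
    beta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))

print(beta)                                       # should be near beta_true
```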
5. Variable Selections
5.1. Criterion
5.1.1. R^2
6. Time Series Analysis
Follows the book Time Series Analysis With Applications in R
6.1. Stationary Models
6.1.1. ARMA
6.2. Nonstationary Models
7. Causal Inference
Correlation does not imply causation
Definition (causal inference framework)
- there are two possible actions: the active treatment (or just treatment) and the control treatment.
- For each unit, we associate two potential outcomes with the two treatments: \(Y_i(1), Y_i(0)\) (random variables). Every unit receives only one of the two treatments, so only one potential outcome is ever observed; this is known as the fundamental problem of causal inference
Definition (average treatment effect) we can measure the causal effect with potential outcomes: \(\tau = E[Y_i(1) - Y_i(0)]\)
Again the problem here is that for each \(i\)-th unit, we only observe \(Y_i(1)\) or \(Y_i(0)\). What we observe is \(Y_i^{\text{obs}} = W_iY_i(1) + (1-W_i)Y_i(0)\)
where \(W_i\) is the binary treatment random variable
The only quantity we can estimate directly here is the difference of observed group means, \(E[Y_i(1) \mid W_i = 1] - E[Y_i(0) \mid W_i = 0]\)
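A small simulation sketch with made-up potential outcomes: under random assignment of \(W\), the difference in observed group means recovers the average treatment effect.

```python
import numpy as np

rng = np.random.default_rng(12)

# Simulated potential outcomes with a true ATE of 2.
n = 10_000
Y0 = rng.normal(loc=0.0, size=n)
Y1 = Y0 + 2.0 + rng.normal(scale=0.5, size=n)
W = rng.binomial(1, 0.5, size=n)              # randomized treatment assignment

Y_obs = W * Y1 + (1 - W) * Y0                 # only one outcome per unit is observed
ate_hat = Y_obs[W == 1].mean() - Y_obs[W == 0].mean()
print(ate_hat)                                # close to 2 under randomization
```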
8. Reference
[1] Christensen, Ronald. Plane Answers to Complex Questions: The Theory of Linear Models. New York: Springer, 2002.
[2] Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. New York: Springer, 2009.
[3] Bishop, Christopher M. Pattern Recognition and Machine Learning. New York: Springer, 2006.