0x404 Nonparametric Models
- 1. Frequentist Estimation
- 2. Bayesian Estimation
- 3. Minimax Risk Lower Bound
- 4. Asymptotics
- 5. Reference
An important limitation of parametric models is that the chosen density might be a poor model of the true distribution (e.g., a Gaussian model is a poor fit for a multimodal distribution). Nonparametric models make few assumptions about the form of the distribution.
In a nonparametric model, we assume the distribution \(p\) belongs to some "massive" class \(\mathcal{P}\) of densities. For example, \(\mathcal{P}\) can be the set of all continuous probability densities, or of all Lipschitz continuous probability densities.
1. Frequentist Estimation
To estimate the probability density at a particular location, we should consider the data points that lie within some local neighbourhood of that point. Let us consider a small region \(\mathcal{R}\) containing \(x\); the probability mass of this region is given by

\[ P = \int_{\mathcal{R}} p(x)\, dx \]
Suppose we have \(N\) data points. The number of points \(K\) falling inside the region is then roughly \(K \approx NP\). Assuming the density is approximately constant over the region, which has volume \(V\), we also have \(P \approx p(x) V\).
Combining these two approximations, we obtain the density estimate

\[ p(x) \approx \frac{K}{NV} \]

where \(K\) and \(V\) are interdependent: fixing \(K\) and determining \(V\) from the data gives the nearest-neighbor estimator, while fixing \(V\) and determining \(K\) gives the kernel density estimator.
1.1. Nearest Neighbor
1.1.1. Density Estimation
The NN approach fixes \(K\) and determines an appropriate \(V\) from the data. In nearest-neighbor density estimation, we allow the radius of a sphere centred on \(x\) to grow until it contains precisely \(K\) data points, then apply the estimate

\[ p(x) \approx \frac{K}{NV} \]

where \(V\) is the volume of that sphere.
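A minimal one-dimensional sketch of this estimator (a toy implementation assuming scalar data; the function name `knn_density` is illustrative, not from any library):

```python
import numpy as np

def knn_density(x_grid, samples, K=10):
    """k-NN density estimate p(x) ~ K / (N * V), where V is the length of the
    smallest interval around x containing the K nearest samples."""
    N = len(samples)
    dists = np.abs(samples[None, :] - x_grid[:, None])   # |X_i - x| for each grid point
    r = np.sort(dists, axis=1)[:, K - 1]                  # radius enclosing exactly K points
    return K / (N * 2 * r)                                # in 1D the "volume" is V = 2r
```

Note that, unlike the kernel estimator later in this section, this estimate is not a proper density (its integral over all space diverges), a known drawback of the nearest-neighbor approach.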
1.1.2. Classification
The density estimation approach can be extended to the classification problem. We draw a sphere centred on \(x\) containing exactly \(K\) points, of which \(K_k\) belong to class \(C_k\) (and the full data set contains \(N_k\) points of class \(C_k\)). The density for each class is then

\[ p(x \mid C_k) = \frac{K_k}{N_k V} \]

Since the unconditional density is \(p(x) = K/(NV)\) and the priors are \(p(C_k) = N_k/N\), Bayes' theorem gives the posterior

\[ p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{p(x)} = \frac{K_k}{K} \]

Minimizing the probability of misclassification therefore amounts to assigning \(x\) to the class with the largest count \(K_k\).
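A minimal sketch of the resulting \(K\)-nearest-neighbour classifier (the function name `knn_classify` is illustrative):

```python
import numpy as np

def knn_classify(x, X_train, y_train, K=5):
    """Classify x by majority vote among its K nearest neighbours.

    This follows the rule above: p(C_k | x) ~ K_k / K, so we pick the class
    with the largest count K_k among the K points inside the sphere.
    """
    dists = np.linalg.norm(X_train - x, axis=1)       # distances to all training points
    nearest = np.argsort(dists)[:K]                   # indices of the K closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                  # class with the largest K_k
```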
1.1.3. Search
In low dimensions, a KD-tree is effective, but in high dimensions exact nearest-neighbor search becomes expensive due to the curse of dimensionality.
paper (jegou2010product) Product quantization for nearest neighbor search
- quantization step: split the vector \(x \in R^D\) into \(m\) distinct subvectors \(u_i(x) \in R^{D^*}\) where \(mD^* = D\), and apply a separate subquantizer \(q_i\) (k-means) to each subvector.
- search step: either quantize both key and query (symmetric) or quantize only the key (asymmetric). The relevant subvector distances are precomputed in lookup tables, so each key-query comparison reduces to \(m\) table lookups and additions (see the sketch after this list).
- non-exhaustive search: following the Video Google approach, two quantizers are applied to the keys (a coarse quantizer and a residual quantizer); the coarse quantizer is used to look up an inverted index, and the retrieved list is then scanned using the residual codes.
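A toy sketch of the quantization step and asymmetric distance computation (illustrative only; the helper names are made up, and production systems such as Faiss implement this far more efficiently):

```python
import numpy as np
from sklearn.cluster import KMeans

def pq_train(X, m=4, ksub=256):
    """Train m k-means subquantizers, one per subvector block."""
    Dsub = X.shape[1] // m                                     # D* = D / m
    return [KMeans(n_clusters=ksub, n_init=4).fit(X[:, i*Dsub:(i+1)*Dsub])
            for i in range(m)]

def pq_encode(X, quantizers):
    """Encode each vector as m centroid indices (one per subquantizer)."""
    Dsub = X.shape[1] // len(quantizers)
    return np.stack([q.predict(X[:, i*Dsub:(i+1)*Dsub])
                     for i, q in enumerate(quantizers)], axis=1)

def pq_asymmetric_distances(query, codes, quantizers):
    """Asymmetric search: the query stays unquantized, the keys are codes.

    Query-to-centroid distances are precomputed per subspace, so each
    key-query comparison is m table lookups and additions.
    """
    Dsub = query.shape[0] // len(quantizers)
    tables = [np.sum((q.cluster_centers_ - query[i*Dsub:(i+1)*Dsub])**2, axis=1)
              for i, q in enumerate(quantizers)]
    return sum(tables[i][codes[:, i]] for i in range(len(quantizers)))
```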
1.2. Kernel Density Estimation
Kernel density estimation instead fixes \(V\) and determines \(K\) from the data. Let the empirical distribution function be

\[ F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \leq x) \]
For sufficiently small \(h\), we can write the approximation \(p(x) \approx \frac{F(x+h) - F(x-h)}{2h}\), which gives rise to the Rosenblatt estimator

\[ \hat{p}_n^R(x) = \frac{F_n(x+h) - F_n(x-h)}{2h} = \frac{1}{nh} \sum_{i=1}^{n} K_0\left(\frac{X_i - x}{h}\right) \]

where \(K_0(u) = \frac{1}{2} I(-1 \leq u \leq 1)\).
Estimator (Kernel Density Estimator, Parzen-Rosenblatt) A simple generalization is given by

\[ \hat{p}_n(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{X_i - x}{h}\right) \]

where \(K\) is called the kernel and should be normalized to 1, \(\int K(u)\, du = 1\), and \(h > 0\) is called the bandwidth.
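A minimal sketch of the Parzen-Rosenblatt estimator with a Gaussian kernel (function names are illustrative):

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel; integrates to 1."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(x_grid, samples, h):
    """Parzen-Rosenblatt estimator: p_hat(x) = (1/nh) sum_i K((X_i - x) / h)."""
    n = len(samples)
    u = (samples[None, :] - x_grid[:, None]) / h      # shape (len(x_grid), n)
    return gaussian_kernel(u).sum(axis=1) / (n * h)

# usage: recover a bimodal density that a single Gaussian would model poorly
samples = np.concatenate([np.random.normal(-2, 1, 500), np.random.normal(3, 0.5, 500)])
p_hat = kde(np.linspace(-6, 6, 200), samples, h=0.3)
```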
1.2.1. MSE Analysis
The MSE at an arbitrary fixed point \(x_0\) is

\[ \mathrm{MSE}(x_0) = \mathbb{E}_p\left[ \left( \hat{p}_n(x_0) - p(x_0) \right)^2 \right] \]

which, as usual, can be decomposed into bias and variance

\[ \mathrm{MSE}(x_0) = b^2(x_0) + \sigma^2(x_0) \]

where the bias is

\[ b(x_0) = \mathbb{E}_p[\hat{p}_n(x_0)] - p(x_0) \]

and the variance is

\[ \sigma^2(x_0) = \mathbb{E}_p\left[ \left( \hat{p}_n(x_0) - \mathbb{E}_p[\hat{p}_n(x_0)] \right)^2 \right] \]
Proposition (variance upper bound) Suppose the density satisfies \(p(x) \leq p_{max} < \infty\) for all \(x \in R\) and \(\int K^2(u)\, du < \infty\). Then for any \(x_0 \in R\), \(h > 0\), and \(n \geq 1\),

\[ \sigma^2(x_0) \leq \frac{C_1}{nh} \]

where \(C_1 = p_{max} \int K^2(u)\, du\).
Definition (Hölder class) Let \(T\) be an interval in \(R\) and let \(\beta, L\) be positive. The Hölder class \(\Sigma(\beta, L)\) on \(T\) is defined as the set of \(l = \lfloor \beta \rfloor\)-times differentiable functions \(f: T \to R\) whose derivative \(f^{(l)}\) satisfies

\[ |f^{(l)}(x) - f^{(l)}(x')| \leq L |x - x'|^{\beta - l} \quad \text{for all } x, x' \in T \]
Definition (order of kernel) We say that \(K\) is a kernel of order \(l\) if the functions \(u \mapsto u^j K(u)\) are integrable for \(j = 0, 1, \ldots, l\), and

\[ \int K(u)\, du = 1, \qquad \int u^j K(u)\, du = 0, \quad j = 1, \ldots, l \]
Proposition (bias upper bound) Assume \(p \in \Sigma(\beta, L)\) and let \(K\) be a kernel of order \(l = \lfloor \beta \rfloor\) satisfying

\[ \int |u|^{\beta} |K(u)|\, du < \infty \]

Then for all \(x_0 \in R\), \(h > 0\), and \(n \geq 1\) we have

\[ |b(x_0)| \leq C_2 h^{\beta} \]

where \(C_2 = \frac{L}{l!} \int |u|^{\beta} |K(u)|\, du\).
Lemma (risk upper bound) Combining the two bounds above, we know

\[ \mathrm{MSE}(x_0) \leq C_2^2 h^{2\beta} + \frac{C_1}{nh} \]

Notice that as \(h\) increases, the bias term increases but the variance term decreases, so choosing the right \(h\) is important.

Optimizing the previous objective over \(h\), we obtain the bandwidth \(h_n^* \propto n^{-1/(2\beta+1)}\) and the rate

\[ \mathrm{MSE}(x_0) = O\left( n^{-\frac{2\beta}{2\beta+1}} \right) \]
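For completeness, here is the standard calculation behind that choice of bandwidth (setting the derivative of the risk bound to zero):

```latex
% Minimize g(h) = C_2^2 h^{2\beta} + C_1/(nh) over h > 0.
g'(h) = 2\beta C_2^2 h^{2\beta - 1} - \frac{C_1}{n h^2} = 0
\quad\Longrightarrow\quad
h_n^* = \left( \frac{C_1}{2\beta C_2^2\, n} \right)^{\frac{1}{2\beta + 1}}
      \propto n^{-\frac{1}{2\beta + 1}}

% Substituting h_n^* back, both terms scale identically:
C_2^2 (h_n^*)^{2\beta} \asymp \frac{C_1}{n h_n^*} \asymp n^{-\frac{2\beta}{2\beta + 1}}
```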
1.3. Kernel Regression
2. Bayesian Estimation
Larry's PDF is very good
2.1. Gaussian Process
Recall that a Gaussian distribution is specified by a mean vector \(\mu\) and a covariance matrix \(\Sigma\): \(x \sim \mathcal{N}(\mu, \Sigma)\).
A Gaussian Process is a generalization of the multivariate Gaussian distribution to infinitely many variables.
Definition (Gaussian Process) A GP is a collection of random variables, any finite number of which have a consistent joint Gaussian distribution. It is fully specified by a mean function \(\mu(x)\) and a covariance function \(K(x, x')\).
2.1.1. GP Regression
In the GP regression viewpoint, instead of modeling the prior over parameters \(w \sim \mathcal{N}(0, \alpha^{-1} I)\), we define the prior probability over \(f(x)\) directly
In practice, we observe \(N\) data points, whose function values \(f = (f(x_1), \ldots, f(x_N))\) have the distribution

\[ p(f) = \mathcal{N}(f \mid 0, K_N) \]

where \(K_N\) is the Gram matrix with entries \((K_N)_{nm} = K(x_n, x_m)\).
Further considering that the observed target values are modeled with additive Gaussian noise,

\[ t_n = f(x_n) + \epsilon_n, \qquad p(t \mid f) = \mathcal{N}(t \mid f, \beta^{-1} I_N) \]

the marginal distribution of the targets becomes

\[ p(t) = \mathcal{N}(t \mid 0, C_N) \]

where the covariance matrix is

\[ C_N = K_N + \beta^{-1} I_N, \qquad (C_N)_{nm} = K(x_n, x_m) + \beta^{-1} \delta_{nm} \]
Further considering a new data point with target \(t_{N+1}\), the joint distribution is

\[ p(t_{1:N+1}) = \mathcal{N}(t_{1:N+1} \mid 0, C_{N+1}) \]

where \(C_{N+1}\) is defined as

\[ C_{N+1} = \begin{pmatrix} C_N & k \\ k^{T} & c \end{pmatrix}, \qquad k_n = K(x_n, x_{N+1}), \quad c = K(x_{N+1}, x_{N+1}) + \beta^{-1} \]

Applying the Gaussian conditioning rules to \(P(t_{N+1} \mid t_{1:N})\), we obtain the predictive mean and variance

\[ m(x_{N+1}) = k^{T} C_N^{-1} t, \qquad \sigma^2(x_{N+1}) = c - k^{T} C_N^{-1} k \]
The central computational operation is the inversion of the \(N \times N\) matrix \(C_N\), which usually costs \(O(N^3)\).
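A minimal sketch of GP regression prediction with an RBF kernel, following the formulas above (the kernel choice, hyperparameters, and function names are illustrative):

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential covariance K(x, x') = exp(-||x - x'||^2 / (2 l^2))."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * sq / length_scale**2)

def gp_predict(X_train, t_train, X_test, beta=25.0):
    """Predictive mean m = k^T C_N^{-1} t and variance sigma^2 = c - k^T C_N^{-1} k."""
    C_N = rbf_kernel(X_train, X_train) + np.eye(len(X_train)) / beta   # C_N = K_N + beta^{-1} I
    k = rbf_kernel(X_train, X_test)                                     # cross-covariances
    c = 1.0 + 1.0 / beta                                                # K(x*, x*) = 1 for this RBF
    C_inv = np.linalg.inv(C_N)                       # the O(N^3) step; use Cholesky in practice
    mean = k.T @ C_inv @ t_train
    var = c - np.einsum('ij,jk,ki->i', k.T, C_inv, k)
    return mean, var
```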
2.2. Dirichlet Process
The Dirichlet Process is a generalization of the Dirichlet distribution; it serves as a prior over distributions, for example in infinite mixture models.
It is specified by
- a base distribution \(H\): the expected value of the process (the mean of the drawn distributions)
- a positive concentration parameter \(\alpha\): a smaller \(\alpha\) concentrates the drawn distributions on fewer atoms, while a larger \(\alpha\) makes draws closer to \(H\)
The process is modeled by

\[ G \sim \mathrm{DP}(\alpha, H) \]

where each draw \(G\) is itself a distribution.
It can be interpreted sequentially (the Chinese-restaurant-process view): having observed \(X_1, \ldots, X_n\) drawn from \(G\), the next draw satisfies

\[ X_{n+1} \mid X_1, \ldots, X_n \sim \frac{1}{\alpha + n} \left( \alpha H + \sum_{i=1}^{n} \delta_{X_i} \right) \]
It can be used, for example, to model clustering without predefining the number of clusters \(k\).
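A small sketch of sampling cluster assignments from this sequential view (the function name `crp_assignments` is illustrative):

```python
import numpy as np

def crp_assignments(n, alpha, seed=None):
    """Sample cluster assignments for n points from a Chinese restaurant process.

    Each point joins an existing cluster with probability proportional to its
    current size, or opens a new cluster with probability proportional to alpha.
    """
    rng = np.random.default_rng(seed)
    assignments, counts = [0], [1]               # the first point starts cluster 0
    for _ in range(1, n):
        probs = np.array(counts + [alpha], dtype=float)
        z = rng.choice(len(probs), p=probs / probs.sum())
        if z == len(counts):                     # open a new cluster
            counts.append(1)
        else:
            counts[z] += 1
        assignments.append(z)
    return np.array(assignments)

# usage: the number of clusters grows with the data rather than being fixed in advance
print(len(set(crp_assignments(1000, alpha=2.0).tolist())))
```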
3. Minimax Risk Lower Bound
4. Asymptotics
5. Reference
[1] Tsybakov, Alexandre B. Introduction to Nonparametric Estimation. Springer Series in Statistics, Springer, 2009.