
0x404 Nonparametric Models

An important limitation of parametric models is that the chosen density might be a poor model of the true distribution (e.g. a Gaussian is a poor model of a multimodal distribution). Nonparametric models make few assumptions about the form of the distribution.

In a nonparametric model, we assume the distribution \(p\) belongs to some "massive" class \(\mathcal{P}\) of densities. For example, \(\mathcal{P}\) can be the set of all continuous probability densities, or of all Lipschitz continuous probability densities.


1. Frequentist Estimation

To estimate the probability density at a particular location, we should consider the data points that lie within some local neighbourhood of that point. Let us consider some small region \(\mathcal{R}\) containing \(x\), the probability of this region is given by

\[P = \int_\mathcal{R} p(x) dx\]

Suppose we have \(N\) data points; the number \(K\) of points falling inside the region is roughly \(K \approx NP\). Assume the density is approximately constant over the region \(\mathcal{R}\) with volume \(V\); then

\[P = p(x)V\]

we have the density estimate

\[p(x) = \frac{K}{NV}\]

where \(K\) and \(V\) are interdependent: fixing one determines the other from the data.

1.1. Nearest Neighbor

1.1.1. Density Estimation

The NN approach fixes \(K\) and finds an appropriate \(V\). In nearest-neighbor density estimation, we grow the radius of a sphere centred on \(x\) until it contains precisely \(K\) data points, then apply the estimate

\[p(x) = \frac{K}{NV}\]
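As a concrete illustration, here is a minimal 1-D sketch of the estimator (the helper name and the toy test distribution are illustrative assumptions):

```python
import numpy as np

def knn_density(x, data, k):
    """k-NN density estimate p(x) = K / (N * V), where V is the volume of
    the smallest interval centred at x containing the k nearest samples."""
    n = len(data)
    dists = np.sort(np.abs(data - x))
    r = dists[k - 1]            # radius that captures exactly k points
    volume = 2.0 * r            # the 1-D "sphere" is the interval [x-r, x+r]
    return k / (n * volume)

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=1000)
est = knn_density(0.0, samples, k=30)   # true N(0,1) density at 0 is ~0.399
```

Note that because \(K\) is fixed, the resulting \(\hat{p}\) is not a proper density: it integrates to infinity over \(R\).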

1.1.2. Classification

The density estimation can be extended to the classification problem. Suppose the dataset contains \(N_k\) points in class \(C_k\), with \(N\) points in total. We draw a sphere centred on \(x\) containing exactly \(K\) points, of which \(K_k\) belong to class \(C_k\). The density for each class is then

\[p(x | C_k) = \frac{K_k}{N_k V}\]

Suppose the unconditional density is \(p(x) = \frac{K}{NV}\) and the class priors are \(p(C_k) = N_k/N\); by Bayes' theorem we obtain the posterior

\[p(C_k | x) = \frac{K_k}{K}\]

Minimizing the probability of misclassification amounts to assigning \(x\) to the class with the largest count \(K_k\).
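A minimal sketch of this classification rule (the function name and toy data are assumptions):

```python
import numpy as np

def knn_classify(x, X, y, k):
    """Assign x to the class with the largest count K_k among the k nearest
    training points -- equivalent to maximizing the posterior K_k / K."""
    dists = np.linalg.norm(X - x, axis=1)
    nearest = y[np.argsort(dists)[:k]]
    classes, counts = np.unique(nearest, return_counts=True)
    return classes[np.argmax(counts)]

rng = np.random.default_rng(1)
# two well-separated Gaussian blobs as toy classes
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
label = knn_classify(np.array([2.0, 2.0]), X, y, k=5)   # lies in the class-1 blob
```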

In low dimensions, a KD-tree makes nearest-neighbor search efficient, but in high dimensions the search becomes expensive due to the curse of dimensionality.

paper (jegou2010product) Product quantization for nearest neighbor search

  • quantization step: split the vector \(x \in R^D\) into \(m\) distinct subvectors \(u_i(x) \in R^{D^*}\) where \(mD^* = D\), and apply a subquantizer \(q_i\) (k-means) to each subvector.
  • search step: quantize both key and query, or quantize only the key. All subvector distances from the query to the centroids are precomputed; in the latter case each key-query comparison reduces to \(m\) table lookups and additions.
  • nonexhaustive search: following the Video-Google paper, apply two quantizers to the keys (a coarse quantizer and a residual quantizer); use the coarse quantizer to look up an inverted index, then scan the list using the residuals.
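The first two steps can be sketched compactly under simplifying assumptions (plain Lloyd k-means per block, asymmetric distance computation, no inverted index; all names are illustrative):

```python
import numpy as np

def train_pq(X, m, ksub, iters=10, seed=0):
    """Train m independent sub-quantizers (k-means, ksub centroids each)
    on the m subvector blocks of X. Assumes D is divisible by m."""
    rng = np.random.default_rng(seed)
    n, D = X.shape
    d = D // m
    codebooks = []
    for i in range(m):
        sub = X[:, i * d:(i + 1) * d]
        C = sub[rng.choice(n, ksub, replace=False)]     # init from data points
        for _ in range(iters):                          # Lloyd iterations
            assign = np.argmin(((sub[:, None] - C[None]) ** 2).sum(-1), axis=1)
            for j in range(ksub):
                pts = sub[assign == j]
                if len(pts):
                    C[j] = pts.mean(axis=0)
        codebooks.append(C)
    return codebooks

def encode(X, codebooks):
    """Quantize each subvector to the index of its nearest centroid."""
    m, d = len(codebooks), X.shape[1] // len(codebooks)
    codes = [np.argmin(((X[:, i*d:(i+1)*d][:, None] - codebooks[i][None]) ** 2).sum(-1),
                       axis=1) for i in range(m)]
    return np.stack(codes, axis=1)

def adc_distances(q, codes, codebooks):
    """Asymmetric distance: precompute query-to-centroid tables per block,
    then score every key with m lookups and additions."""
    m, d = len(codebooks), len(q) // len(codebooks)
    tables = [((codebooks[i] - q[i*d:(i+1)*d]) ** 2).sum(-1) for i in range(m)]
    return sum(tables[i][codes[:, i]] for i in range(m))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
cb = train_pq(X, m=4, ksub=16)
codes = encode(X, cb)
d = adc_distances(X[0], codes, cb)
```

With \(m\) blocks of \(k^*\) centroids each, a key needs only \(m \log_2 k^*\) bits, and each comparison costs \(m\) additions once the tables are filled.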

1.2. Kernel Density Estimation

Kernel density estimation fixes \(V\) and determines \(K\) from the data. Let the empirical distribution function be

\[F_n(x) = \frac{1}{n} \sum_i I(X_i \leq x)\]

For sufficiently small \(h\), we can use the approximation \(p(x) \approx \frac{F(x+h) - F(x-h)}{2h}\), which gives rise to the Rosenblatt estimator

\[\hat{p}_n(x) = \frac{F_n(x+h) - F_n(x-h)}{2h} = \frac{1}{nh}\sum_i K_0 \left(\frac{X_i - x}{h}\right)\]

where \(K_0(u) = \frac{1}{2} I(-1 \leq u \leq 1)\).

Estimator (Kernel Density Estimator, Parzen-Rosenblatt) A simple generalization is given by

\[\hat{p}_n(x) = \frac{1}{nh} \sum_i K(\frac{X_i - x}{h})\]

where \(K\) is called the kernel and should integrate to one, \(\int K(u)\, du = 1\), and \(h\) is called the bandwidth.
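A minimal sketch of the estimator with a Gaussian kernel (the kernel choice and bandwidth here are illustrative assumptions):

```python
import numpy as np

def kde(x, data, h):
    """Parzen-Rosenblatt estimator:
    p_hat(x) = (1 / n h) * sum_i K((X_i - x) / h), with a standard
    normal kernel K, which integrates to one."""
    u = (data[:, None] - np.atleast_1d(x)[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return K.mean(axis=0) / h

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=2000)
p0 = kde(0.0, samples, h=0.3)[0]   # true N(0,1) density at 0 is ~0.399
```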

1.2.1. MSE Analysis

The MSE at an arbitrary fixed point \(x_0\) is

\[MSE(x_0) = E_p [ (\hat{p}_n (x_0) - p(x_0))^2]\]

which, as usual, can be decomposed to bias and variance

\[MSE = b^2(x_0) + \sigma^2(x_0)\]

where bias is

\[b(x_0) = E_p[\hat{p}_n(x_0)] - p(x_0)\]

and variance is

\[\sigma^2(x_0) = E_p[(\hat{p}_n(x_0) - E_p[\hat{p}_n(x_0)])^2]\]

Proposition (variance upper bound) Suppose the density \(p(x)\) satisfies \(p(x) < p_{max} < \infty\) for all \(x \in R\) and \(\int K^2(x) dx < \infty\), then

\[\sigma^2(x_0) \leq \frac{C_1}{nh}\]

where \(C_1 = p_{max}\int K^2(x) dx\)

Definition (Hölder class) Let \(T\) be an interval in \(R\) and \(\beta\), \(L\) be positive. The Hölder class \(\Sigma(\beta, L)\) on \(T\) is defined as the set of \(l = \lfloor \beta \rfloor\)-times differentiable functions \(f\) satisfying

\[|f^{(l)}(x) - f^{(l)}(y)| \leq L |x - y|^{\beta - l}, \quad \forall x, y \in T\]

Definition (order of kernel) We say \(K\) is a kernel of order \(l\) if the functions \(u^j K(u)\), \(j = 0, 1, ..., l\), are integrable and

\[\int u^j K(u)\, du = 0, \quad j = 1, ..., l\]

Proposition (bias upper bound) Assume \(p \in \Sigma(\beta, L)\) and \(K\) be a kernel of order \(l = \lfloor \beta \rfloor\) satisfying

\[\int |u|^\beta |K(u)| du < \infty\]

Then for all \(x_0, h > 0, n\geq 1\) we have

\[|b(x_0)| \leq C_2 h^{\beta}\]

where \(C_2 = \frac{L}{l!} \int |u|^{\beta} |K(u)| du\)

Lemma (risk upper bound) By the previous analysis, we know

\[MSE \leq C_2^2 h^{2\beta} + \frac{C_1}{nh}\]

Notice that as \(h\) increases, the bias bound increases while the variance bound decreases, so choosing \(h\) correctly is important.


Optimizing the bound over \(h\) gives \(h^* \propto n^{-\frac{1}{2\beta+1}}\), for which

\[MSE(x_0) = O(n^{-\frac{2\beta}{2\beta+1}})\]
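The optimal bandwidth follows by setting the derivative of the upper bound to zero:

\[\frac{d}{dh}\left( C_2^2 h^{2\beta} + \frac{C_1}{nh}\right) = 2\beta C_2^2 h^{2\beta-1} - \frac{C_1}{nh^2} = 0 \implies h^* = \left(\frac{C_1}{2\beta C_2^2\, n}\right)^{\frac{1}{2\beta+1}}\]

Substituting \(h^*\) back, both the squared-bias and variance terms are of order \(n^{-\frac{2\beta}{2\beta+1}}\).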

1.3. Kernel Regression

2. Bayesian Estimation

Larry's PDF is very good

2.1. Gaussian Process

Recall a Gaussian distribution is specified by a mean vector \(\mu\) and covariance matrix \(\Sigma\)

\[f = (f_1, f_2, ..., f_n) \sim \mathcal{N}(\mu, \Sigma)\]

Gaussian Process is a generalization of multivariate Gaussian distribution to infinitely many variables.

Definition (Gaussian Process) A GP is a collection of random variables, any finite number of which have a consistent joint Gaussian distribution. It is fully specified by a mean function \(m(x)\) and a covariance function \(K(x, x')\).

\[f(x) \sim \mathcal{GP}(m(x), K(x, x'))\]
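Concretely, any finite set of inputs yields an ordinary multivariate Gaussian, so functions can be sampled from the prior; a minimal sketch assuming a squared-exponential covariance (one common choice, not prescribed by the text):

```python
import numpy as np

def rbf_kernel(xa, xb, length_scale=1.0):
    """Squared-exponential covariance K(x, x') = exp(-(x - x')^2 / 2 l^2)."""
    return np.exp(-0.5 * ((xa[:, None] - xb[None, :]) / length_scale) ** 2)

# Evaluate the covariance on a finite grid and draw 3 function samples.
x = np.linspace(-3, 3, 50)
K = rbf_kernel(x, x)
rng = np.random.default_rng(0)
f = rng.multivariate_normal(np.zeros(50), K + 1e-8 * np.eye(50), size=3)
```

The small jitter added to the diagonal keeps the Gram matrix numerically positive definite.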

2.1.1. GP Regression

In the GP regression viewpoint, instead of modeling the prior over parameters \(w \sim \mathcal{N}(0, \alpha^{-1} I)\), we define the prior probability over \(f(x)\) directly

\[f(x) \sim \mathcal{GP}(0, K(x, x'))\]

In practice, we observe \(N\) data points, which have the distribution

\[y_{1:N} \sim \mathcal{N}(0, K_N)\]

where \(K_N\) is the \(N \times N\) Gram matrix with entries \((K_N)_{nm} = K(x_n, x_m)\)

Further, the observed target values are modeled by

\[p(t_{1:N} | y_{1:N} ) = \mathcal{N}(t_{1:N} | y_{1:N}, \beta^{-1} I_N)\]

The marginal distribution becomes

\[p(t_{1:N}) = \int p(t_{1:N} | y_{1:N})p(y_{1:N}) dy = \mathcal{N}(0, C)\]

where the covariance matrix is

\[C(x_n, x_m) = k(x_n, x_m) + \beta^{-1} \delta_{n,m}\]

Further considering the new data point \(t_{N+1}\), the joint distribution is

\[p(t_{N+1}, t_{1:N}) = \mathcal{N}(0, C_{N+1})\]

where \(C_{N+1}\) is defined as

\[C_{N+1} = \begin{pmatrix} C_N & k \\ k^T & c \\ \end{pmatrix}\]

Applying the conditional rules \(P(t_{N+1} | t_{1:N})\), we know

\[p(t_{N+1} | t_{1:N}) \sim \mathcal{N}(k^T C_N^{-1}t_{1:N}, c - k^T C_N^{-1} k)\]

The central computational cost is the inversion of the \(N \times N\) matrix \(C_N\), which is typically \(O(N^3)\).
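The predictive equations above can be sketched directly (kernel choice, noise precision \(\beta\), and toy data are assumptions; \(C_N^{-1}\) is applied via a linear solve rather than an explicit inverse):

```python
import numpy as np

def rbf(xa, xb, l=1.0):
    """Squared-exponential covariance (an assumed kernel choice)."""
    return np.exp(-0.5 * ((xa[:, None] - xb[None, :]) / l) ** 2)

def gp_predict(X, t, x_star, k, beta=25.0):
    """Posterior mean k^T C_N^{-1} t and variance c - k^T C_N^{-1} k
    for a single test input x_star, with noise precision beta."""
    C = k(X, X) + (1.0 / beta) * np.eye(len(X))     # C_N
    kv = k(X, x_star[None])[:, 0]                   # the vector k
    c = k(x_star[None], x_star[None])[0, 0] + 1.0 / beta
    mean = kv @ np.linalg.solve(C, t)               # the O(N^3) step
    var = c - kv @ np.linalg.solve(C, kv)
    return mean, var

X = np.linspace(0, 2 * np.pi, 20)
t = np.sin(X)                                       # toy targets
m, v = gp_predict(X, t, np.array(np.pi / 2), rbf)   # mean should be near 1
```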


2.2. Dirichlet Process

The Dirichlet process is a generalization of the Dirichlet distribution that models the prior of infinite mixture models.

It is specified by

  • base distribution \(H\): the expected value of the process (the mean of the drawn distributions)

  • concentration parameter \(\alpha > 0\): a smaller \(\alpha\) indicates a higher concentration

The process is modeled by

\[G \sim DP(H, \alpha)\]
\[X_n | G \sim G\]

It can be interpreted sequentially such that

\[X_n | X_1, ..., X_{n-1} = \begin{cases} X_i \text{ with prob }\frac{1}{n-1+\alpha}, & i = 1, ..., n-1 \\ \text{a fresh draw from } H \text{ with prob } \frac{\alpha}{n-1+\alpha}\end{cases}\]

It can be used, for example, to model clustering without specifying the number of clusters \(k\) in advance.
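The sequential scheme above can be sketched as a simple sampler of cluster assignments (the function name and parameter values are illustrative):

```python
import numpy as np

def crp_assignments(n, alpha, seed=0):
    """Sequential draw of cluster labels implied by a DP prior:
    point i joins an existing cluster with prob (cluster size) / (i + alpha)
    or starts a new cluster with prob alpha / (i + alpha)."""
    rng = np.random.default_rng(seed)
    labels = [0]                        # first point starts cluster 0
    for i in range(1, n):
        counts = np.bincount(labels)    # current cluster sizes
        probs = np.append(counts, alpha) / (i + alpha)
        labels.append(rng.choice(len(probs), p=probs))
    return np.array(labels)

z = crp_assignments(500, alpha=2.0)
num_clusters = len(np.unique(z))        # grows like alpha * log(n)
```

Note that the number of clusters is random and unbounded, which is exactly what allows mixture models with an unspecified \(k\).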

3. Minimax Risk Lower Bound

4. Asymptotics

5. Reference

[1] Tsybakov, Alexandre B. Introduction to Nonparametric Estimation. Springer Series in Statistics, 2009.