# 0x404 Nonparametric Models

An important limitation of parametric models is that the chosen density might be a poor model of the true distribution (e.g. a Gaussian is a poor model of a multimodal distribution). Nonparametric models make few assumptions about the form of the distribution.

In a nonparametric model, we assume the distribution $$p$$ belongs to some "massive" class $$\mathcal{P}$$ of densities. For example, $$\mathcal{P}$$ can be the set of all continuous probability densities, or of all Lipschitz continuous probability densities.

## 1. Frequentist Estimation

To estimate the probability density at a particular location, we consider the data points that lie within some local neighbourhood of that point. Let $$\mathcal{R}$$ be a small region containing $$x$$; the probability mass of this region is given by

$P = \int_\mathcal{R} p(x) dx$

Suppose we have $$N$$ data points; the number of points $$K$$ falling inside the region is roughly $$K \approx NP$$. Assume the density is approximately constant over the region, which has volume $$V$$, so that

$P = p(x)V$

Combining the two, we obtain the density estimate

$p(x) = \frac{K}{NV}$

where $$K$$ and $$V$$ trade off against each other: fixing $$K$$ and solving for $$V$$ gives nearest-neighbor methods, while fixing $$V$$ and counting $$K$$ gives kernel methods.

### 1.1. Nearest Neighbor

#### 1.1.1. Density Estimation

The nearest-neighbor approach fixes $$K$$ and determines an appropriate $$V$$ from the data. In nearest-neighbor density estimation, we grow a sphere centred at $$x$$ until it contains exactly $$K$$ data points, then apply the estimate

$p(x) = \frac{K}{NV}$
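
A minimal sketch of this estimator in one dimension (the helper name `knn_density` and the toy data are illustrative, not from the source):

```python
import numpy as np

def knn_density(x, data, k):
    """k-NN density estimate p(x) = K / (N * V), where V is the volume of the
    smallest interval centred at x that contains the k nearest data points."""
    data = np.asarray(data)
    n = len(data)
    radius = np.sort(np.abs(data - x))[k - 1]  # distance to the k-th nearest neighbour
    volume = 2 * radius                        # a 1-D "sphere" is an interval of length 2r
    return k / (n * volume)

# toy example: estimate the density of a standard normal sample at x = 0
rng = np.random.default_rng(0)
sample = rng.normal(size=1000)
print(knn_density(0.0, sample, k=30))  # roughly 0.4, the N(0,1) density at 0
```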

#### 1.1.2. Classification

Density estimation can be extended to the classification problem. Suppose the dataset contains $$N_k$$ points from class $$C_k$$ (with $$N = \sum_k N_k$$). We draw a sphere centred on $$x$$ containing exactly $$K$$ points, of which $$K_k$$ belong to class $$C_k$$. The class-conditional density is then

$p(x | C_k) = \frac{K_k}{N_kV}$

Since the unconditional density is $$p(x) = K/(NV)$$ and the class priors are $$p(C_k) = N_k/N$$, Bayes' theorem gives the posterior

$p(C_k | x) = \frac{K_k}{K}$

To minimize the probability of misclassification, we assign $$x$$ to the class with the largest $$K_k$$, i.e. the majority class among the $$K$$ nearest neighbours.
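
A minimal sketch of this decision rule, assuming Euclidean distance; the helper `knn_classify` and the toy data are illustrative, not from the source:

```python
import numpy as np

def knn_classify(x, data, labels, k):
    """Assign x to the class with the largest count K_k among its k nearest
    neighbours, i.e. the class maximizing the posterior K_k / K."""
    data, labels = np.asarray(data), np.asarray(labels)
    nearest = np.argsort(np.linalg.norm(data - x, axis=1))[:k]  # indices of the K closest points
    classes, counts = np.unique(labels[nearest], return_counts=True)
    return classes[np.argmax(counts)]

# toy example: two 2-D Gaussian blobs
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)
print(knn_classify(np.array([2.5, 2.5]), data, labels, k=5))  # most likely class 1
```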

In low dimensions, a KD-tree makes nearest-neighbor search efficient, but in high dimensions exact search becomes expensive due to the curse of dimensionality.

Paper (jegou2010product): Product Quantization for Nearest Neighbor Search

• quantization step: split the vector $$x \in R^D$$ into $$m$$ distinct subvectors $$u_i(x) \in R^{D^*}$$ where $$mD^* = D$$, and apply a subquantizer $$q_i$$ (k-means) to each subvector.
• search step: either quantize both key and query (symmetric) or quantize only the key (asymmetric). Subvector distances are precomputed into lookup tables, so a key-query comparison reduces to $$m$$ table lookups and additions (see the sketch after this list).
• nonexhaustive search: following the Video-Google approach, apply two quantizers to the keys (a coarse quantizer and a residual quantizer); use the coarse quantizer to look up an inverted index, then scan the retrieved list using the residual codes.
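
Below is a minimal sketch of the asymmetric search variant, assuming scikit-learn's k-means as the subquantizer; the function names and parameter choices are illustrative, not from the paper:

```python
import numpy as np
from sklearn.cluster import KMeans

def train_pq(keys, m, ks=256):
    """Train one k-means subquantizer per subvector block (requires >= ks training keys)."""
    d_sub = keys.shape[1] // m
    return [KMeans(n_clusters=ks, n_init=4).fit(keys[:, i * d_sub:(i + 1) * d_sub])
            for i in range(m)]

def encode(keys, quantizers):
    """Replace each subvector by the index of its nearest centroid."""
    d_sub = keys.shape[1] // len(quantizers)
    return np.stack([q.predict(keys[:, i * d_sub:(i + 1) * d_sub])
                     for i, q in enumerate(quantizers)], axis=1)

def adc_search(query, codes, quantizers):
    """Asymmetric distance computation: the query stays unquantized; per-subvector
    distances to all centroids are precomputed once, so each key then costs only
    m table lookups and additions."""
    d_sub = len(query) // len(quantizers)
    tables = [np.sum((q.cluster_centers_ - query[i * d_sub:(i + 1) * d_sub]) ** 2, axis=1)
              for i, q in enumerate(quantizers)]
    dists = sum(tables[i][codes[:, i]] for i in range(len(quantizers)))
    return np.argsort(dists)  # key indices ranked by approximate distance to the query
```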

### 1.2. Kernel Density Estimation

Kernel density estimation fixes $$V$$ and determines $$K$$ from the data. Let the empirical distribution function be

$F_n(x) = \frac{1}{n} \sum_i I(X_i \leq x)$

For sufficiently small $$h$$, we have the approximation $$p(x) \approx (F(x+h) - F(x-h))/2h$$, which gives rise to the Rosenblatt estimator

$\hat{p}_n(x) = \frac{F_n(x+h) - F_n(x-h)}{2h} = \frac{1}{nh}\sum_i K_0 \left(\frac{X_i - x}{h}\right)$

where $$K_0(u) = \frac{1}{2} I(-1 \leq u \leq 1)$$.

Estimator (Kernel Density Estimator, Parzen-Rosenblatt) A simple generalization is given by

$\hat{p}_n(x) = \frac{1}{nh} \sum_i K(\frac{X_i - x}{h})$

where $$K$$ is called the kernel and must be normalized, $$\int K(u) du = 1$$, and $$h$$ is called the bandwidth.
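
A minimal sketch of the estimator with a Gaussian kernel (the helper name `kde` and the toy data are illustrative):

```python
import numpy as np

def kde(x, data, h):
    """Parzen-Rosenblatt estimator (1/nh) * sum_i K((X_i - x) / h) with a Gaussian kernel."""
    data = np.asarray(data)
    u = (data - x) / h
    return np.sum(np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)) / (len(data) * h)

# toy example: density of a standard normal sample at x = 0
rng = np.random.default_rng(0)
sample = rng.normal(size=1000)
print(kde(0.0, sample, h=0.3))  # roughly 0.4, the N(0,1) density at 0
```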

#### 1.2.1. MSE Analysis

The MSE at an arbitrary fixed point $$x_0$$ is

$MSE(x_0) = E_p [ (\hat{p}_n (x_0) - p(x_0))^2]$

which, as usual, can be decomposed into bias and variance

$MSE(x_0) = b^2(x_0) + \sigma^2(x_0)$

where bias is

$b(x_0) = E_p[\hat{p}_n(x_0)] - p(x_0)$

and variance is

$\sigma^2(x_0) = E_p[(\hat{p}_n(x_0) - E_p[\hat{p}_n(x_0)])^2]$

Proposition (variance upper bound) Suppose the density $$p(x)$$ satisfies $$p(x) < p_{max} < \infty$$ for all $$x \in R$$ and $$\int K^2(x) dx < \infty$$, then

$\sigma^2(x_0) \leq \frac{C_1}{nh}$

where $$C_1 = p_{max}\int K^2(x) dx$$

Definition (Hölder class) Let $$T$$ be an interval in $$R$$ and let $$\beta, L$$ be positive. The Hölder class $$\Sigma(\beta, L)$$ on $$T$$ is defined as the set of $$l = \lfloor \beta \rfloor$$-times differentiable functions $$f$$ whose $$l$$-th derivative satisfies

$|f^{(l)}(x) - f^{(l)}(y)| \leq L |x - y|^{\beta - l}, \quad \forall x, y \in T$

Definition (order of kernel) We say $$K$$ is a kernel of order $$l$$ if the functions $$u^j K(u)$$, $$j = 0, 1, ..., l$$, are integrable and

$\int u^j K(u) du = 0, \quad j = 1, ..., l$

Proposition (bias upper bound) Assume $$p \in \Sigma(\beta, L)$$ and let $$K$$ be a kernel of order $$l = \lfloor \beta \rfloor$$ satisfying

$\int |u|^\beta |K(u)| du < \infty$

Then for all $$x_0, h > 0, n\geq 1$$ we have

$|b(x_0)| \leq C_2 h^{\beta}$

where $$C_2 = \frac{L}{l!} \int |u|^{\beta} |K(u)| du$$

Lemma (risk upper bound) Combining the variance and bias bounds, we obtain

$MSE(x_0) \leq C_2^2 h^{2\beta} + \frac{C_1}{nh}$

Notice that as $$h$$ increases, the bias bound increases but the variance bound decreases; choosing $$h$$ correctly is therefore a bias-variance tradeoff.

Minimizing this bound over $$h$$ (see the calculation below), we obtain

$MSE(x_0) = O(n^{-\frac{2\beta}{2\beta+1}})$
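
Spelling out the minimization step (a standard calculation, not worked in the source): differentiate the bound with respect to $$h$$ and set the derivative to zero,

$\frac{d}{dh}\left(C_2^2 h^{2\beta} + \frac{C_1}{nh}\right) = 2\beta C_2^2 h^{2\beta-1} - \frac{C_1}{nh^2} = 0 \implies h^* = \left(\frac{C_1}{2\beta C_2^2\, n}\right)^{\frac{1}{2\beta+1}} \asymp n^{-\frac{1}{2\beta+1}}$

and substituting $$h^*$$ back makes both the squared-bias and variance terms of order $$n^{-\frac{2\beta}{2\beta+1}}$$.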

## 2. Bayesian Estimation

Larry Wasserman's notes on Bayesian nonparametrics are a very good reference for this part.

### 2.1. Gaussian Process

Recall that a multivariate Gaussian distribution is specified by a mean vector $$\mu$$ and a covariance matrix $$\Sigma$$:

$f = (f_1, f_2, ..., f_n) \sim \mathcal{N}(\mu, \Sigma)$

A Gaussian process is a generalization of the multivariate Gaussian distribution to infinitely many random variables.

Definition (Gaussian Process) A GP is a collection of random variables, any finite number of which have a consistent joint Gaussian distribution. It is fully specified by a mean function $$m(x)$$ and a covariance function $$K(x, x')$$:

$f(x) \sim \mathcal{GP}(m(x), K(x, x'))$
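
A minimal sketch of drawing functions from a GP prior, assuming a squared-exponential covariance and an illustrative jitter term for numerical stability:

```python
import numpy as np

def rbf(x1, x2, length_scale=1.0):
    """Squared-exponential covariance K(x, x') = exp(-(x - x')^2 / (2 l^2))."""
    return np.exp(-0.5 * ((x1[:, None] - x2[None, :]) / length_scale) ** 2)

# any finite set of evaluation points has a joint Gaussian distribution
xs = np.linspace(-5, 5, 100)
K = rbf(xs, xs) + 1e-8 * np.eye(len(xs))  # jitter keeps the covariance positive definite
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(xs)), K, size=3)  # three draws of f at xs
```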

#### 2.1.1. GP Regression

In the GP regression viewpoint, instead of placing a prior over the parameters, $$w \sim \mathcal{N}(0, \alpha^{-1} I)$$, we place a prior over the function $$f(x)$$ directly:

$f(x) \sim \mathcal{GP}(0, K(x, x'))$

In practice, we observe $$N$$ data points; the function values at these inputs have the joint distribution

$y_{1:N} \sim \mathcal{N}(0, K_N)$

where $$K_N$$ is the $$N \times N$$ Gram matrix with entries $$K(x_n, x_m)$$.

The observed target values are then modeled as noisy versions of $$y$$:

$p(t_{1:N} | y_{1:N} ) = \mathcal{N}(t_{1:N} | y_{1:N}, \beta^{-1} I_N)$

The marginal distribution becomes

$p(t_{1:N}) = \int p(t_{1:N} | y_{1:N})p(y_{1:N}) dy_{1:N} = \mathcal{N}(0, C_N)$

where the covariance matrix is

$C(x_n, x_m) = k(x_n, x_m) + \beta^{-1} \delta_{n,m}$

For a new target value $$t_{N+1}$$, the joint distribution is

$p(t_{N+1}, t_{1:N}) = \mathcal{N}(0, C_{N+1})$

where $$C_{N+1}$$ is defined as

$C_{N+1} = \begin{pmatrix} C_N & k \\ k^T & c \\ \end{pmatrix}$

Applying the Gaussian conditioning rule to obtain $$p(t_{N+1} | t_{1:N})$$, we find

$p(t_{N+1} | t_{1:N}) \sim \mathcal{N}(k^T C_N^{-1}t_{1:N}, c - k^T C_N^{-1} k)$

The central computational operation is the inversion of the $$N \times N$$ matrix $$C_N$$, which in general costs $$O(N^3)$$.
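
A minimal sketch of the predictive equations above, assuming a squared-exponential kernel and a naive $$O(N^3)$$ solve (names and noise level are illustrative):

```python
import numpy as np

def rbf(x1, x2, length_scale=1.0):
    """Squared-exponential covariance k(x, x')."""
    return np.exp(-0.5 * ((x1[:, None] - x2[None, :]) / length_scale) ** 2)

def gp_predict(x_new, X, t, beta=25.0):
    """Predictive mean k^T C_N^{-1} t and variance c - k^T C_N^{-1} k for one test input."""
    C = rbf(X, X) + (1.0 / beta) * np.eye(len(X))  # C_N = K + beta^{-1} I
    k = rbf(X, np.array([x_new]))[:, 0]            # k_n = k(x_n, x_new)
    c = 1.0 + 1.0 / beta                           # k(x_new, x_new) + beta^{-1}
    mean = k @ np.linalg.solve(C, t)
    var = c - k @ np.linalg.solve(C, k)
    return mean, var

# toy example: noisy observations of sin(x)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, 20)
t = np.sin(X) + rng.normal(scale=0.2, size=20)
print(gp_predict(0.0, X, t))  # mean near sin(0) = 0, with a small predictive variance
```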

### 2.2. Dirichlet Process

The Dirichlet process is a generalization of the Dirichlet distribution that serves as a prior for infinite mixture models.

It is specified by

• base distribution $$H$$: the expected value of the process (the mean of the drawn distributions)

• concentration parameter $$\alpha > 0$$: a smaller $$\alpha$$ concentrates the mass of a draw on fewer distinct atoms, while a larger $$\alpha$$ makes draws closer to $$H$$

The process is modeled by

$G \sim DP(H, \alpha)$
$X_n | G \sim G$

It can be interpreted sequentially (marginalizing out $$G$$) such that

$X_n | X_1, ..., X_{n-1} = \begin{cases} X_i & \text{with prob } \frac{1}{n-1+\alpha} \text{ for each } i < n \\ \text{a new draw from } H & \text{with prob } \frac{\alpha}{n-1+\alpha}\end{cases}$

It can be used, for example, to model clustering without fixing the number of clusters $$k$$ in advance, as in the sketch below.
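
A minimal sketch of the sequential scheme above (the helper `dp_sample` and the base distribution are illustrative):

```python
import numpy as np

def dp_sample(n, alpha, base_draw, rng):
    """Draw X_1, ..., X_n sequentially: reuse a previous value with probability
    (i-1)/(i-1+alpha), or draw a new value from the base distribution H with
    probability alpha/(i-1+alpha)."""
    xs = []
    for i in range(1, n + 1):
        if not xs or rng.uniform() < alpha / (i - 1 + alpha):
            xs.append(base_draw(rng))              # new value (new cluster) from H
        else:
            xs.append(xs[rng.integers(len(xs))])   # reuse a uniformly chosen previous value
    return np.array(xs)

rng = np.random.default_rng(0)
draws = dp_sample(100, alpha=2.0, base_draw=lambda r: r.normal(), rng=rng)
print(len(np.unique(draws)))  # number of distinct values (clusters) is not fixed in advance
```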

## 5. Reference

[1] Tsybakov, Alexandre B. Introduction to Nonparametric Estimation. Springer Series in Statistics, Springer, 2009.