# 0x410 Foundation

ML VS Statistics

• Statistics is often interested in asymptotic behavior, the theory of ML focuses on finite sample bounds
• Statistics often assums certain presubscribed data models, in ML, the emphasis is on working under a "distribution-free" setting

## 1. Foundation

### 1.1. Deterministic PAC learnable

Definition (input, output space) We denote by $$\mathcal{X}$$ the set of all possible examples or instances, which is referred to as the input space. The set of all possible labels or target values is denoted by $$\mathcal{Y}$$.

Definition (concept) A concept is a mapping $$c: \mathcal{X} \to \mathcal{Y}$$, or equivalently we can identify $$c$$ with the subset of $$\mathcal{X}$$ over which it takes the value 1. A concept class is a set of concepts we may wish to learn and denoted by $$C$$

Definition (hypothesis) The learner is requested to output a prediction rule $$h: \mathcal{X} \to \mathcal{Y}$$, the function is called predictor, hypothesis or classifier

Definition (true error, risk) Given a hypothesis $$h \in H$$, a target concept $$c \in C$$ and an underlying distribution $$D$$, the generalization error or risk of $$h$$ is defined by

$R(h) = P_{x \sim \mathcal{D}}(h(x) \neq c(x)) = E_{x \sim \mathcal{D}}[1_{h(x) \neq c(x)}]$

In a more general setting, the true risk can be written as

$R(h) = E_{z \sim \mathcal{D}}[l(h, z)]$

where $$l$$ is generalized loss function and $$z$$ is input (and output if exists)

This error is not directly accessible to the learner scince both the distribution $$D$$ and target concept $$c$$ are unknown. However, the learner can measure the empirical error of a hypothesis on the labeled sample $$S$$

Definition (empirical error, risk) Given a hypothesis $$h \in H$$, a target concept $$c \in C$$ and a sample $$S=(x_1, ..., x_m)$$, the empirical error or empirical risk of $$h$$ is defined by

$\hat{R}(h) = \frac{1}{m} \sum_{i=1}^m 1_{h(x_i) \neq c(x_9)}$

Similarly, in a more general setting, the empirical risk is

$\hat{R}(h) = \frac{1}{m} \sum_i l(h, z_i)$

It is important to notice that the empirical risk depends on the drawn sample, therefore it is a random variable

There are several relations between the empirical error and the true error, for example

$E[\hat{R}(h)] = R(h)$

Again, we can take expectation because it is a random variable.

For a concept class $$C$$ (or hypothesis space ($$\mathcal{H}$$)), we can define the PAC learnability. PAC framework helps define the class of learnable concepts in terms of the number of sample points needed to achieve an approximate solution (sample complexity)

Definition (PAC learnable) A concept class $$C$$ is said to be PAC-learnable if there exists an algorithm $$\mathcal{A}$$ and a polynomial function $$poly$$ such that $$\forall{\epsilon > 0}, \forall{\delta > 0}$$, forall distributions on $$\mathcal{X}$$ and any target concept $$c \in C$$, the following holds for any sample size $$m \geq poly(1/\epsilon, 1/\sigma, n, size(c))$$

$P_{S \sim D^m}[R(h_S) \leq \epsilon] \geq 1 - \delta$

It is approximately correct (error at most $$\epsilon$$) with high probability (at least $$1-\delta$$)

There are several key points here:

• the framework is a distribution-free model, no assumption on $$D$$
• PAC is defined with respect to the concept class, not a particular concept.

axis-aligned rectangles

Consider the problem where the set of instances are points in the plane, and the concept class $$C$$ is the set of all axis-aligned rectangles lying in the plane. Instance within the rectangle is labeled 1, otherwise 0.

This concept class is PAC-learnable because we can propose an algorithm to find the tightest axis-aligned rectangle containing the points labeled with 1.

This algorithm can be show to gaurantee the sample size $$m \geq \frac{4}{\epsilon} \log \frac{4}{\delta}$$ to achieve $$Pr(R(h_s) > \epsilon) \leq \delta$$

Or equivalently, the error is bounded by the sample size $$R(h_s) \leq \frac{4}{m} \log \frac{4}{\delta}$$ with probability at least $$1-\delta$$

linear separator

Let $$X \in R^d$$ and $$\mathcal{H}$$ be the set of linear functions. This class is also PAC-learnable

#### 1.1.1. Consistent Finite Hypothesis Set

Definition (consistency) When the hypothesis achieving zero training errors, the hypothesis is called consistent

$\hat{R}(h_S) = 0$

In this case, we can bound the true error as follows:

Theorem (finite hypothesis set, consistent case of PAC, Haussler) Suppose a model class has size $$H$$, dataset has $$m$$ iid samples, $$0 < \epsilon < 1$$, for any learned classifier that get 0 training error, the true error is bounded by

$P(R(h) > \epsilon) \leq |H| e^{-m\epsilon}$

Short proof is that when $$R(h) > \epsilon$$, the prob of training error getting 0 is $$(1-\epsilon)^m \leq e^{-m\epsilon}$$, then considering the prob of all hypothesis, we get the conclusion above.

PAC bound holds for all $$h$$ with 0 training error, but does not guarantee the algorithm find the best $$h$$. It shows that when the hypothesis set $$H$$ is finite, a consistent algorithm $$\mathcal{A}$$ is a PAC-learning algorithm.

By flipping the formula, we know the following relations: given $$\epsilon, \delta$$, yields sample complexity

$m \geq \frac{\log{|H|} + \log{(1/\delta)}}{\epsilon}$

Or equivalently given $$m, \delta$$ yields error bound

$\epsilon \geq \frac{\log{|H|} + \log{(1/\delta)}}{m}$

$$\log |H|$$ can be interpreted as the number of bits to represent $$H$$.

#### 1.1.2. Inconsistent Finite Hypothesis Set

In the case of non-zero training errors, we can use training error $$\hat{\theta}$$ to estimate the true error $$\theta$$. For example, we can use the Hoeffding's bound.

$P(|\theta - \frac{1}{m}\sum_i x_i| \geq \epsilon) \leq 2e^{-2m\epsilon^2}$

For a single classifier $$h$$

$P(|R(h) - \hat{R}(h)| \geq \epsilon) \leq 2e^{-2m\epsilon^2}$

Theorem (finite hypothesis set, inconsistent case of PAC) For any learned classifier $$h \in H$$:

$P(|R(h) - \hat{R}(h)| \geq \epsilon) \leq 2|H|e^{-2m\epsilon^2} \leq \delta$

With probability $$1-\delta$$,

$|R(h) - \hat{R}(h)| \leq \epsilon = \sqrt{\frac{\log{|H|} + \log{2/\delta} }{2m}}$

ompared with the consistent bound, this has a square root effect, which is expensive: suppose we fixed $$|H|$$, we need to attain same guarantee with quadratically larger sample size.

This also indicates some bias-variance tradeoffs, when we fixed $$m$$

• for large $$|H|$$, we have better fitting, so the training error $$\hat{R}(h)$$ is small, but right side is larger, causing larger variance
• for small $$|H|$$, training error is large (i.e: large bias), but smaller variance

### 1.2. Stochastic PAC learnable

In the previous section, we consider a deterministic scenario: a distribution $$D$$ over $$\mathcal{X}$$ and a concept $$c: \mathcal{X} \to \mathcal{Y}$$

In a more general stochastic scenario, we put a joint distribution $$D$$ over $$\mathcal{X} \times \mathcal{Y}$$

The generation error now becomes the following,

Definition (true error, risk) it should be contrasted with the determistic version

$R(h) = P_{(x,y) \sim D}[h(x) \neq y] = E_{(x,y) \sim D}[1_{h(x) \neq y}]$

it should be contrasted with the determistic version: $$R(h) = P_{x \sim \mathcal{D}}(h(x) \neq c(x))$$

The PAC learning is also extended as

Definition (agnostic PAC-learning) Let $$\mathcal{H}$$ be a hypothesis set. $$\mathcal{A}$$ is an agnostic PAC-learning algorithm if it runs in polynomial time with a polynomial size sample $$S$$ from all possible $$D$$ over $$\mathcal{X} \times \mathcal{Y}$$:

$P_{S \in D^m}[R(h_S) - \min_{h \in H} R(h) \leq \epsilon ] \geq 1-\delta$

In the deterministic case, there exists a target function $$f$$ with zero generalization error such that $$R(f) = 0$$, in stochastic case, the best hypothesis is called the Bayes hypothesis has non-zero error (Bayes error)

Definition (Bayes error, Bayes classifier) Given a distribution $$D$$ over $$\mathcal{X} \times \mathcal{Y}$$, the Bayes error $$R^*$$ is defined as the infimum of errors achieved by measurable functions $$h: \mathcal{X} \to \mathcal{Y}$$

$R^* = \inf_{h} R(h)$

A hypothesis $$h$$ with $$R(h) = R^*$$ is called a Bayes hypothesis or Bayes classifier,

Definition (Bayes classifier) It can be defined in terms of the condition

$h_{bayes}(x) = \text{argmax}_{y \in \{0, 1\}} P(y|x)$

Definition (estimation error, approximation error) The error can be decomposed with

$R(h) - R^* = (R(h) - R(h^*)) + (R(h^*) - R^*)$

where the $$h^* = \inf_{h \in H} R(h)$$ is the best hypothesis within the hypothesis set $$H$$

The decomposition has

• estimation error ($$R(h) - R(h^*)$$): measures the quality of $$h$$ wrt the best-in-class hypothesis
• approximation error ($$R(h^*) - R^*$$): measures how well $$H$$ approximates Bayes error. usually not accessible

In this note, we mainly focus on the estimation error, and the approximation error is assumed to be small (e.g: consistent hypothesis set). To estimate the approximation error, we can use, for example, sieve estimators. Check this lecture note

#### 1.2.1. Finite Hypothesis class

Finite hypothesis class is agnostic PAC learnable

### 1.3. Decision Theory

Mean square error loss is a special case of loss function, the study of the performance of estimators using loss function is a branch of decision theory. Decision theory, when combined with probability theory, allows us to make optimal decisions involving uncertainty.

It usually has two steps:

• determination of $$p(x,t)$$ from the training set is the inference step
• based on the distribution, we make the optimal choice in some appropriate sense, this is the decision step

Definition (regression function) In the regression setting, it is still common to use the squared error loss, therefore the criterion is

$R(f) = E(Y-f(X))^2 = \int [y-f(x)]^2 Pr(dx, dy)$

By conditioning on $$X$$, we write

$R(f) = E_X E_{Y|X} ([Y-f(X)]^2 | X)$

which we can minimze pointwisely by

$f(x) = E(Y|X=x)$

This is known as the regression function

nearest neighbor

The nearnest neighbor approximate this regression function by

$\hat{f}(x) = Ave(y_i | x_i \in N_k(x))$

with two approximations:

• expectation is approximated by averaging over sample data
• conditioning at a point is relatexed to conditioning on some close region

Under some regularity condition, it can be shown that $$\hat{f}(x) \to E(Y|X=x)$$ given $$N, k \to \infty$$

linear model

The linear model assumes the regression function $$f(x)$$ is approximately linear

$f(x) = x^T \beta$

by minimizing the previous risk, we can solve for $$\beta$$ theoretically

$\beta = [E(XX^T)]^{-1} E(XY)$

Definition (Bayes classifier) In the classification setting, the risk function can be represented by a $$K \times K$$ matrix $$L$$ which represents the penalty, in the simple case, we use zero-one loss function where all misclassifications are charged 1

$R(g) = E[L(Y, g(X))] = E_X \sum_{k=1}^K L[Y=k, g(X)] P(Y=k | X)$

Similarly, this can also be minimized pointwisely

$\hat{g}(x) = argmin_k 1 - P(Y=k |X=x) = max_k P(Y=k | X=x)$

This reasonable solution is known as the Bayes classifier, recall Naive Bayes classifier is an example of this. It is maximizing the posterior (MAP)

### 1.4. Model Selection

#### 1.4.1. Empirical Risk Minimization

It makes sense to search a solution that works well on the training set, so we can find a predictor $$h$$ that minimizes $$\hat{R}(h)$$. This is called Empirical Risk Minimization or ERM.

However, this approach might lead to overfitting, a common solution is to apply the ERM learning rule over some restricted search space $$\mathcal{H}$$

$ERM_{\mathcal{H}}(S) \in argmin_{h \in \mathcal{H}} \hat{R}(h)$

Such restrictions are often called an inductive bias, which should be based on some prior knowledge about the problem to be learned.

A fundamental question in learning theory is, over which hypothesis classes $$\mathcal{H}$$, will not result in overfitting.

#### 1.4.2. Structural Risk Minimization

Considering an infinite sequence of hypothesis sets with increasing sizes

$H_0 \subset H_1 \subset ... \subset H_n ...$

and find ERM solution for each $$Hn$$, the hypothesis selected is the one with an additional complexity term which depends on the size of $$H_n$$ and sample size $$m$$

$h_S^{SRM} = argmin_{h \in H_n, n \in \mathcal{N}} \hat{R}_S(h) + complexity(H_n, m)$

## 2. Generalization

PAC learning framework above only discuss finite hypothesis classes (i.e. $$|H| < \infty$$), for infinite hypothesis class, it is still available to apply PAC as well.

We need to measure the complexity of hypothesis class (something like $$|H|$$), we can use the following tools

The Rademacher complexity captures the richness of a family of functions by measuring the degree to which a hypothesis set can fit random noise

The Rademacher complexity is defined with respect to the family of functions

Definition (empirical Rademacher complexity) Let $$G$$ be a family of functions mapping from $$Z$$ to $$[a,b]$$, $$S = (z_1, ..., z_m)$$ a fixed sample of size $$m$$ with elements in $$Z$$. Then the empirical Rademacher complexity of $$G$$ with respect to the sample $$S$$ is defined as

$\hat{\mathcal{R}}_S(G) = E_{\sigma} [\sup_{g \in G} \frac{1}{m} \sum_i \sigma_i g(z_i)]$

where $$\sigma=(\sigma_1, ..., \sigma_m)$$ are independent uniform random variable taking values in $$\{ -1, +1 \}$$. Intuitively, it measures the on average how well the function class $$G$$ can correlate with random noise on $$S$$.

There are several good properties of this complexity, first, it can be upper bound with the model size $$|H|$$, often significantly better. It can be used to replace the right side of $$|H|$$ in the PAC bound.

$\hat{\mathcal{R}}_m(H) = \leq \sqrt{\frac{2\log(|H|)}{m}}$

linear models with bounded norm

Cosider $$h(x) = w^Tx$$ with $$||w||_2 \leq A, || x ||_2 \leq B$$, then

$\hat{\mathcal{R}}_m(H) \leq \frac{AB}{\sqrt{m}}$

Complexity increases with the number of parameters $$d$$ and norm of weights

Lemma (Lipschitz composition) If $$\phi$$ is L-lipschitz,

$|\phi(t) - \phi(t')| \leq L||t - t'||$

Then,

$\hat{\mathcal{R}}_m(\phi \circ H) \leq L\hat{\mathcal{R}}_m(H)$

It allows us to map between $$F$$ (loss of model class) and $$H$$ (model class) for Lipschitz losses.

Definition (Rademacher complexity) Let $$D$$ denote the distribution according to which samples are drawn. For any integer $$m \geq 1$$, the Rademacher complexity of $$G$$ is

$\mathcal{R}_m(G) = E_{S \sim D^m}[\hat{\mathcal{R}}_{S}(G)]$

Theorem (Rademacher complexity bound) Let $$G$$ be a family of functions mapping from $$Z$$ to $$[0,1]$$, then for any $$\delta > 0$$ , with probability at least $$1 - \delta$$, each of the following holds for all $$g \in G$$

$E[g(z)] \leq \frac{1}{m}\sum_i g(z_i) + 2\mathcal{R}_m(G) + \sqrt{\frac{\log{1/\delta}}{2m}}$

and

$E[g(z)] \leq \frac{1}{m}\sum_i g(z_i) + 2\hat{\mathcal{R}}_S(G) + 3 \sqrt{\frac{\log{2/\delta}}{2m}}$

For Lipschitz losses (e.g: hinge loss is 1-Lipschitz), it translates to

$|R(h) - \hat{R}(h)| \leq 2L\hat{\mathcal{R}}_S(G) + 3 \sqrt{\frac{\log{2/\delta}}{m}}$

### 2.2. VC Dimension

VC dimension is a purely combinatorial notion, but is easier to compute than the grouth function or Rademacher Complexity.

Definition (shattering) Let $$\{ z_1, ..., z_n \}$$ be a finite set of $$n$$ points. $$\mathcal{A}$$ is a hypothesis set. We let $$N_{\mathcal{A}}(z_1, ..., z_n)$$ be the number of distinct sets in the collection of sets

$|\{ \{ z_1, ..., z_n \} \cap A | A \in \mathcal{A} \}|$

Definition (shattering coefficient) The $$n$$-th shatter coefficient of $$\mathcal{A}$$ is the maximal number of different subsets of $$n$$ points that can be picked out by the collection $$\mathcal{A}$$

$s(\mathcal{A}, n) = \max_{(z_1, ..., z_n)} N_{\mathcal{A}}(z_1, ..., z_n)$

shattering coefficient of left intervals

Consider the collection of left intervals (which is used for CDF)

$\mathcal{A} = \{ (-\infty, t] | t \in R \}$

The shattering coefficient is

$s(\mathcal{A}, n) = n+1$

Theorem (VC theorem) For any distribution $$P$$ and class of sets $$\mathcal{A}$$ we have

$P(\Delta(\mathcal{A}) \geq t) \leq 8s(\mathcal{A}, n)\exp(-nt^2/32)$

where

$\Delta(\mathcal{A}) = \sup_{A \in \mathcal{A}} | P_n(A) - P(A) |$

Ignoring the constant, this gives the bound

$\Delta(\mathcal{A}) \leq \sqrt{\frac{\log(s(\mathcal{A}, n)/\delta)}{n}}$

This inequality reduces the question of uniform convergence to a combinatorial question about the collection of sets.

VC Theorem implies Glivenko-Cantelli Theorem

VC theorem immediately implies GC theorem:

$P(\sup_x |\hat{F}_n(x) - F_X(x)| \geq t) \leq 8(n+1) \exp(-nt^2/32)$

Take the limit shows the uniform convergence

Definition (VC dimension) The VC dimension $$d$$ wrt $$\mathcal{A}$$ is the largest integer $$d$$ which can shatters the $$\mathcal{A}$$

$\max_d \{ d | s(\mathcal{A}, d) = 2^d \}$

VC dimension examples

Here is a list of the VC dimension of some typical classes

class VC dimension
$$\{A_1, ..., A_N \}$$ $$\leq \log_2N$$
$$\{ [a,b]: a, b \in R \}$$ 2
discs in $$R^2$$ 3
closed balls in $$R^d$$ $$\leq d+2$$
rectangles in $$R^d$$ 2d
halfspaces in $$R^d$$ d+1
convex polygons in $$R^2$$ infinity
convex polygons with $$d$$ vertices 2d+1

VC dimension indicates a phase transition of shattering coefficient: once it is no longer exponential, the shattering coefficients become polynomial in $$n$$

Theorem (Saucer's Lemma) If $$\mathcal{A}$$ has finite VC dimension $$d$$, then for $$n > d$$ we have that

$s(\mathcal{A}, n) \leq (n+1)^d$

Combining the Saucer's lemma and VC theorem, we know the empirical process is bounded by

$P(\Delta{\mathcal{A}} \geq t) \leq 8(n+1)^d \exp(-nt^2/32)$

With probability $$1-\delta$$

$\Delta(\mathcal{A}) \leq \sqrt{\frac{32}{n}[d\log(n+1)+\log(8/\delta)]}$

Roughly this is saying for a class with VC dimension $$d$$, we have

$\Delta(\mathcal{A}) \sim \sqrt{\frac{d\log(n)}{n}}$

Binary classification

In the binary classification task, we have a collection of classifier $$\mathcal{F}$$, which induces a set system:

$\{ \{ \{ x: f(x) = 1 \} \times \{ 1 \} \} \cup \{ \{ x : f(x) = 0 \} \times \{ 0 \} \}: f \in \mathcal{F} \}$

If this set has VC dimension $$d$$, then we can apply the VC theorem to obtain the bound. For example, this leads to a uniform convergence guarantee for ERM over linear classifiers.

## 3. Optimizations

Summary of this review paper and this blog

In machine learning, we usually want to compute the gradient of an expectation:

$\eta = \nabla_\theta E_{x \sim p(x; \theta)} f(x)$

The form shows in the policy gradient in the reinforcement learning, variational inference and sensitivity analysis, however it might not be written in the closed form, so not easy to compute.

There are three approach to achieve this by MC samplings

#### 3.1.1. score function estimator

log-derivative trick used in the REINFORCE, rewritten the form as

$\eta = E_{x \sim p(x; \theta)} [f(x) \nabla_\theta \log p(x; \theta)]$

See my REINFORCE note for derivation

#### 3.1.2. pathwise estimator

reparametrization trick used in the VAE, rewritten the form as

$\eta = E_{\epsilon \sim p(\epsilon)} [\nabla_\theta f(g(\epsilon; \theta))]$

Gaussian reparametrization

sample from continuous distribution. it is used in VAE

Gumbel softmax

sample from discrete distribution (multinomial categorical distribution). See the derivation in appendix from the paper

## 4. Optimal Transport

Definition (pushforward measure) Given measurable spaces $$(X_1, \Sigma_1), (X_2, \Sigma_2)$$, a measurable function $$T: X_1 \to X_2$$ and a measure $$\mu_1: \Sigma_1 \to [0, +\infty]$$, the pushforward of $$\mu$$ is defined to be the measure $$T_*(\mu): \Sigma_2 \to [0, +\infty]$$. For all $$B \in \Sigma_2$$

$T_*(\mu)(B) = \mu(T^{-1}(B))$

Intuitively, the optimal transport distance measures how far we have to move the mass of $$P$$ to turn it into $$Q$$

Definition (optimal transport distance, Monge formulation)

$\inf_{T: T_*(P) = Q} \int \| x - T(x) \|^T dP(x)$

A minimizer $$T^*$$ if exists, is called the optimal transport map

This formulation might not always work. A more general formulation is

Definition (Wasserstein distance, Kantorovich formulation) Let $$\mathcal{J}(P, Q)$$ denote all joint distributions $$J$$ for $$(X,Y)$$ that have marginals $$P, Q$$, then the Wasserstein distance is

$W_p(P,Q) = ( \inf_{J \in \mathcal{J}(P,Q)} \int \| x-y \|^p dJ(x,y) )^{1/p}$

when $$p=1$$, this distance is also called the earth mover distance

When $$d=1$$, the distance has a closed form

$W_p(P,Q) = ( \int_0^1 | F^{-1}(z) - G^{-1}(z) |^p )^{1/p}$

When $$P$$ is the empirical distribution of a 1-dimensional sample $$X_1, ..., X_n$$, $$Q$$ is of a sample $$Y_1, ..., Y_n$$, then the distance is easy to compute using the order statistics:

$W_p(P, Q) = ( \sum_i \| X_{(i)} - Y_{(i)} \|^p )^{1/p}$

In higher dimension, it is difficult to compute the distance (except some special cases such as Gaussian), we use Hungarian algorithm with $$O(n^3)$$

Wasserstein distance of Gaussian

If $$P=N(\mu_1, \Sigma_1), Q=N(\mu_2, \Sigma_2)$$, then

$W_2(P,Q) = \| \mu_1 - \mu_2 \|^2 + tr(\Sigma_1) + tr(\Sigma_2) - 2tr[(\Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2})^{1/2}]$

## 5. Reference

• [1] Murphy, Kevin P. Machine learning: a probabilistic perspective. MIT press, 2012.
• [2] CMU 36705 Intermediate Statistics
• [3] Understanding Machine Learning
• [4] Foundations of Machine Learning