# 0x420 Foundation

## 1. Theory

Recall from the PAC framework, assume we have a training dataset $$D_n = \{(x_i, y_i)\}^n_{i=1}$$ drawn from the joint distribution $$X,Y \sim \mathcal{D}$$

Definition (empirical risk) The empirical risk is defined wrt to a neural network $$h \in \mathcal{H}$$

$\hat{R}(h) = \frac{1}{n} \sum_i l(y_i, h(x_i))$

By trying to minimize the empirical risk, we trained an network $$\hat{\theta}_{\text{ERM}}$$ such that

$\hat{h}_{\text{ERM}} = \text{argmin}_{h \in \mathcal{H}} \hat{R}(h)$

Definition (true risk, generalization error) The true risk is defined wrt to any estimator $$\hat{\theta}$$ and the underlying distribution

$R(\hat{h}) = E_{X,Y \sim \mathcal{D}} [l(Y, f_{\hat{h}}(X))]$

By considering the Bayes error $$R^*$$, which is the achievable lowest error by any measurable function: $$X \to Y$$

$R^* = \inf_{h \in \mathcal{H}: X \to Y} R(h)$

Definition (risk decomposition) The risk can be decomposed into

$R(h) - R^* = |R(h^*) - R^* | + |R(h) - R(h^*)|$

where $$h^* = \inf_{h \in \mathcal{H}} R(h)$$

• approximation error: first term $$|R(h^*) - R^*|$$ shows how good is the empirical risk minimization of $$\mathcal{H}$$ can fit to the training set $$D$$, usually inaccessible
• estimation error second term $$|R(h) - R(h^*)|$$ measures the quality of the hypothesis vs the best one in the hypothesis set $$\mathcal{H}$$

Under the ERM framework: $$h = \hat{h}_{\text{ERM}}$$, the estimation error can be bound by complexity error: $$\sup_{h \in \mathcal{H}} | R(h) - \hat{R}(h)|$$

\begin{aligned}R(\hat{h}_{\text{ERM}}) - R(h^*) &\leq (R(\hat{h}_{\text{ERM}}) - \hat{R}(\hat{h}_{\text{ERM}})) + (\hat{R}({h}_{\text{ERM}}) - \hat{R}(h^*)) + (\hat{R}(h^*) - R(h^*)) \\ &= 2\sup_{h \in \mathcal{H}} |R(h) - \hat{R}(h)| \end{aligned}

The complexity error can be bound by Rademacher complexity, which is further bound by VC dimension, covering number, Dudley integral etc

### 1.1. Approximation Theory

This subsection is related to the approximation error:

$|R(h^*) - R^*|$

Theorem (universal approximation) MLP are universal approximators. This paper is one of the paper proving this statement

boolean function

An MLP is a universal boolean function

It can represent a given function only if it is sufficiently wide and sufficiently deep. Sometimes the depth can be traded off for exponential growth of width

Optimal width and depth depends on the number of variables and the complexity of the Boolean function

continuous function

Neural network with relu activation can approximate a random continous function. The approximation is achieved by connecting many piecewise linear functions. Intuitively, 1 neuron can provide at least 1 linear piecewise, which might be increased significantly in the deep layer.

In the case of 1 layer shallow network to approxmiate a random L-Lipshitz function, we can break the domain into $$O(L/\epsilon)$$ piecewise linear function where $$\epsilon$$ is the maximum tolerance.

In the case of deep neural network, we can build following blocks to approximate any random function

• $$y=x^2$$ can be efficiently approximated by deep networks
• $$y=x_1 x_2$$ can be built from previous block
• $$y=x^n$$ can be built from the second block
• A random polynomial can be built from the 3rd block
• Any random continous function can be approximated by the Weierstrass approximation theorem

### 1.2. Generalization Theory

This subsection is related to the estimation error or its bound (complexity error)

$\sup_{h \in \mathcal{H}} |R(h) - \hat{R}(h)|$

One traditional bound of this is by the Dudley's integral

$\sup_{h \in \mathcal{H}} |R(h) - \hat{R}(h)| = O(\frac{1}{\sqrt{n}} \int_0^\infty \sqrt{\log N_\delta} d\delta)$

In neural network, the rate wrt to the number of parameter $$W$$ and number of layer $$L$$ is

$\sup_{h \in \mathcal{H}} |R(h) - \hat{R}(h)| = O(\sqrt{\frac{W \log L}{n}})$

This bound shows larger parameter and layer leads to larger estimation error, which contradicts with the current trend of large models with large parameters.

A few modern directions to interpret the neural network's generalization ability are:

#### 1.2.1. Implicit Regularization

Neural networks can be regularized implicitly in many forms,

For example, by the learning algorithms:

Model (SGD for linear model) For linear models, SGD always converges to a solution with small norm.

Model (SGD for matrix factorization) gradient descent on factorization converges to the minimum nuclear norm solution.

This work

## 2. Representation Learning

Check this chapter of deep learning book

### 2.1. Discrete Representation

Model (cross entropy, NLL + softmax) Let $$t$$ denote the correct label, $$z$$ denote logits, then $$y = softmax(z)$$, cross entropy optimizes the negative loglikelihood

$l = -\log P(t | z)$

Taking the derivative leads to a simplified form:

$\frac{\partial{l}}{\partial z} = y - t$

However, cross-entropy have several generalization shortcomings such as

• lack of robustness to noisy labels
• poor margins.

### 2.2. Continuous Representation

#### 2.2.1. Classical Metric Learning

Check the first part of this slide

Metric learning learns a distance function $$D(x, x'; \theta)$$ using datasets such as $$\mathcal{S} = \{ (x_i, x_j): x_i, x_j\text{ are similar} \}, \mathcal{D} = \{ (x_i, x_j): x_i, x_j\text{ are not similar} \}, \mathcal{R} = \{ (x_i, x_j, x_k): x_i\text{ is more similar to } x_j\text{ than to } x_k \}$$

The goal is to solve the optimization problem with some loss function $$l$$ and regularization $$reg$$

$\text{argmin}_\theta [ l(\theta, \mathcal{S}, \mathcal{D}, \mathcal{R}) + reg(\theta)]$

Distance (Mahalanobis) A classical linear metric learning is to learn a matrix $$L$$

$D(x, x'; L) = \sqrt{(Lx - Lx')^T(Lx - Lx')}$

or equivalently using a PSD matrix $$M$$

$D(x, x'; M) = \sqrt{(x-x')^TM(x-x')}$

Model (linear Clustering, Xing et al 2002) The following formulation

\begin{aligned}\max \sum_{(x_i, x_j) \in \mathcal{D}} D(x_i, x_j) \\ s.t. \sum_{(x_i, x_j) \in \mathcal{S}} D^2(x_i, x_j) \leq 1 \end{aligned}

Model (large margin nearest neighbor, Weinberg et al 2009) The following SDP

\begin{aligned}\min (1-\mu) \sum_{(x_i, x_j) \in \mathcal{S}} D^2(x_i, x_j) + \mu \sum_{i,j,k} \xi_{i,j,k}\\ s.t. D^2(x_i, x_k) - D^2(x_i, x_j) \leq 1 - \xi_{i,j,k} \end{aligned}

where the slack variable $$\xi_{ijl} \geq 0$$ is used to measure the amount by which the large margin inequality

#### 2.2.2. Contrastive Learning

Contrastive learning learns continuous representations $$f(x)$$ of input $$x$$ from both positive labels and negative labels.

See this blog

Model (contrastive loss) minimize the metric $$\| f_\theta(x_1) - f_\theta(x_2)\|$$ when they are from the same class, otherwise maximize.

The most simple way is the linear model $$f_\theta(x) = Lx$$ with L2 distance

Model (triplet loss)

Model (NCE loss, noise contrastive estimation) see the bayesian notes

Model (Info NCE, CPC) We want to find a way to preserve mutual information between future representation $$z_t$$ and current context representation $$c_t$$

It has three representations:

• raw input $$x_t$$
• local latent feature $$z_t = g_{enc}(x_t)$$
• context feature $$c_t = g_{ar}(z_{<t})$$

CPC tries to learn a representation future $$x$$ and context $$c$$ such that preserves the mutual information between them

$I(x; c) = \sum_{x, c} p(x,c) \log \frac{p(x,c)}{p(x)p(c)} = \sum_{x, c} p(x,c) \log \frac{p(x|c)}{p(x)}$

where the density ratio is modeled by

$f(x_{t+k, c_t}) = \exp(z^T_{t+k}W_k c_t) \propto \frac{p(x|c)}{p(x)}$

It is then optimized by the NCE loss

$\mathcal{L} = E_X [{\log \frac{f_k(x_{t+k}, c_t)}{\sum_j f_k(x_j, c_t)}}]$

Optimizing this loss is equivalent to optimizing categorical cross entropy whose probability is $$P(d=i | X, c_t)$$, and optimizing this loss will result in $$f_k(x_{t+k}, c_t)$$ estimating the density ratio

Model (supervised contrastive loss) more robust than cross-entropy by pushing embeddings from same class together

### 2.3. Set Representation

Model (set-input problems) input that are permutation invariant.

A function $$f(x)$$ is permutation variant over countable domain can be decomposed into

$f(X) = \rho (\sum_{x \in X} \phi(x))$

This is both sufficient and necessary. Proof is simple and available at its appendix

$$\phi, \rho$$ can be model with neural network, and can be regarded as encoder, decoder in this work

## 3. Architecture

### 3.1. Implicit Models

check the tutorial here

This work

This work

## 4. Optimization

This note only covers deep-learning related optimization, this is the note for more general continuous convex optimization discussion

### 4.1. Online Learning

Unlike PAC learning framework:

• The online learning framework mixes the training and testing instead of separating them
• does not make any distributional assumption (in PAC framework, the data distribution is considered fixed over time)

The performance of online learning algorithms is measured using a mistake model and the notion of regret, which are based on a worst-case or adversarial assumption

The objective is to minimize the cumulative loss $$\sum_{t=1}^T L(\hat{y}_t, y_t)$$ over $$T$$ round

Definition (regret) regret measure the difference between current model and the best model (inf over i in the following formula)

$R_T = \sum_{t=1}^T L(\hat{y}_t - y_t) - \inf_i \sum_{t=1}^T L(\hat{y}_{t, i}, y_t)$

backprop is unfortunately

• considered as “biologically implausible”
• incompatible with a massive level of model parallelism

Model (weight perturbation, directional gradient descent) update the parameters of the representation by using the directional derivative along a candidate direction

Another relevant work is this one

Model (activity perturbation) yields lower-variance gradient estimates than weight perturbation and provide a continuous-time rate-based interpretation of our algorithm

Most of the popular optimizers are roughly variants of the following framework

General Framework:

• compute grad $$g_t = \nabla f(w_t)$$
• compute 1st and 2nd moments $$m_t, V_t$$ based on previous grads $$g_1, ..., g_t$$
• compute updating grad $$g_t = m_t/\sqrt{V_t}$$ and grad descent

Model (SGD) no moments

#### 4.3.1. Momentum (1st moments)

Model (SGD with momentum) update momentum as follows

$m_t = \beta m_{t-1} + (1-\beta) g_t$

Model (Nesterov Accelerated Gradient, NAG) move first and then compute grad and make a correction, check Hinton's slide

$g_t = \nabla f(w_t - \alpha m_t)$

#### 4.3.2. Adaptive learning (2nd moments, magnitude)

Model (adagrad) works in sparse dataset. using magnitude to change the learning rate: $$\alpha \to \alpha / \sqrt{V_t}$$

$V_t = \sum_t g^2_t$

Model (RMSProp) using moving average for magnitude

$V_t = \beta V_{t-1} + (1-\beta) g^2_t$

#### 4.3.3. Batching

This paper shows the batch size should be proportional to learning rate

Analysis (large batch is bad at generalization) This work claims large-batch methods tend to converge to sharp minimizers, which has poorer generalization.

### 4.4. Hessian Methods

Even hessian is not explicitly computed, information of Hessian such as trace, spectrum can be extracted using numerical algebra or its randomized version.

These information is helpful, for example, to analyze model's generalization

#### 4.4.1. Matrix-free Methods

##### 4.4.1.1. Trace

Suppose we can query $$Ax$$ without knowing $$A \in \mathbb{R}^{n \times n}$$ explicitly, we want to approximate $$tr(A)$$

One simple idea is to

$tr(A) = \sum_{i=1}^n e_i A e_i$

which requires $$n$$ queries and might not be efficient

Model (Hutchinson's method)

draw $$x_1, ..., x_m$$ from random $$\{ -1, 1 \}$$ (i.e., Rademacher distribution), then the estimator of $$tr(A)$$ is

$\widehat{tr(A)} = \frac{1}{m} \sum_{i=1}^m x_j^T A x_j$

The validity of this estimator can be derived using Hanson-Wright inequality, which states the concentration result for quadratic forms in sub-Gaussian random variables

#### 4.4.2. Loss Topology

landscape, flatness, sharpness

Model (sharpness, Keskar) largest eigenvalue of loss to characterize the sharpness, large batch size tends to generalize less well

### 4.5. Bayesian Optimization

Bayesian Optimization can be used to find hyperparameters. It builds a probabilistic proxy model for the objective using outcomes of past experiments as training data.

## 5. Regularization

### 5.2. Normalization

Model (batch norm) normalize over the batch dim

$y = \frac{x - E[x]}{\sqrt{Var(x) + \epsilon}} \gamma + \beta$

where $$\gamma, \beta$$ are learnable parameters

Benefits of using batch norm are

• reduce covariate shift as claimed by the original paper
• enable larger learning rate without exploision, shown by this work
• makes the optimization landscape significantly smoother: improvement in the Lipschitzness of the loss function as shown by this work

Batch norm is not appropriate when

• used in RNN where stats across timestamps vary
• batch size is small
• reduces robustness to small adversarial input perturbations as shown by this work

Model (weight norm) normalize inputs with L2 norm of weights

Model (layer norm) normalize over the hidden dim

Model (group norm) divides the channels into groups and computes within each group the mean and variance for normalization

### 5.3. Label Smoothing

Model (confidence penalty) penalizing low-entropy

$\mathcal{L}(\theta) = -\sum \log p_\theta (y|x) - \beta H(p_\theta(y|x))$