0x420 Foundation
1. Theory
Recall the PAC framework: assume we have a training dataset \(D_n = \{(x_i, y_i)\}_{i=1}^n\) drawn i.i.d. from the joint distribution \((X, Y) \sim \mathcal{D}\)
Definition (empirical risk) The empirical risk is defined wrt a neural network \(h \in \mathcal{H}\): \(\hat{R}(h) = \frac{1}{n}\sum_{i=1}^n l(h(x_i), y_i)\)
By minimizing the empirical risk, we train a network \(\hat{h}_{\text{ERM}} = \arg\min_{h \in \mathcal{H}} \hat{R}(h)\) (with parameters \(\hat{\theta}_{\text{ERM}}\))
Definition (true risk, generalization error) The true risk is defined wrt any hypothesis \(h\) and the underlying distribution: \(R(h) = \mathbb{E}_{(x,y) \sim \mathcal{D}}[\,l(h(x), y)\,]\)
The Bayes error \(R^*\) is the lowest error achievable by any measurable function \(f: X \to Y\)
Definition (risk decomposition) The excess risk can be decomposed into
\(R(h) - R^* = \underbrace{(R(h^*) - R^*)}_{\text{approximation error}} + \underbrace{(R(h) - R(h^*))}_{\text{estimation error}}\)
where \(h^* = \arg\inf_{h \in \mathcal{H}} R(h)\)
- approximation error: the first term \(|R(h^*) - R^*|\) measures how closely the best hypothesis in \(\mathcal{H}\) can approach the Bayes error; it depends only on the expressiveness of \(\mathcal{H}\) and is usually inaccessible
- estimation error: the second term \(|R(h) - R(h^*)|\) measures the quality of the chosen hypothesis relative to the best one in the hypothesis set \(\mathcal{H}\)
Under the ERM framework (\(h = \hat{h}_{\text{ERM}}\)), the estimation error can be bounded by (twice) the complexity error \(\sup_{h \in \mathcal{H}} | R(h) - \hat{R}(h)|\)
The complexity error can be bounded by the Rademacher complexity, which is in turn bounded via the VC dimension, covering numbers, the Dudley entropy integral, etc.
1.1. Approximation Theory
This subsection is related to the approximation error:
Theorem (universal approximation) MLPs are universal approximators. This paper is one of several proving the statement
Boolean functions
An MLP is a universal Boolean function: it can represent any Boolean function of its inputs
It can represent a given function only if it is sufficiently wide and sufficiently deep; depth can sometimes be traded for width, at the cost of exponential growth in width
The optimal width and depth depend on the number of variables and the complexity of the Boolean function, as the sketch below illustrates for XOR
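To make the Boolean claim concrete, here is a minimal numpy sketch (not from the source; the weights are hand-constructed rather than learned) of a one-hidden-layer threshold MLP computing XOR, which no single threshold unit can represent:

```python
import numpy as np

def step(z):
    # Heaviside threshold unit
    return (z > 0).astype(float)

def xor_mlp(x1, x2):
    x = np.array([x1, x2], dtype=float)
    # Hidden layer: one neuron computes OR, the other AND
    W1 = np.array([[1.0, 1.0],   # OR:  fires when x1 + x2 - 0.5 > 0
                   [1.0, 1.0]])  # AND: fires when x1 + x2 - 1.5 > 0
    b1 = np.array([-0.5, -1.5])
    h = step(W1 @ x + b1)
    # Output: OR AND NOT(AND) = XOR
    return step(np.array([1.0, -1.0]) @ h - 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, int(xor_mlp(a, b)))  # prints the XOR truth table
```

The same trick generalizes: any Boolean function can be written in disjunctive normal form and implemented with one (possibly exponentially wide) hidden layer, which is exactly the width-depth trade-off mentioned above.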
continuous functions
A neural network with ReLU activations can approximate an arbitrary continuous function. The approximation is achieved by stitching together many piecewise-linear segments. Intuitively, one neuron contributes at least one linear piece, and the number of pieces can grow significantly with depth.
In the case of a shallow network with one hidden layer approximating an arbitrary L-Lipschitz function, we can break the domain into \(O(L/\epsilon)\) linear pieces, where \(\epsilon\) is the maximum tolerated error.
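A minimal numpy sketch of this construction, assuming a uniform grid and the sum-of-ReLUs parameterization \(\hat{f}(x) = f(x_0) + \sum_i c_i \operatorname{ReLU}(x - x_i)\); the helper name `relu_interpolant` is ours:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def relu_interpolant(f, a, b, n_pieces):
    """One-hidden-layer ReLU net interpolating f at n_pieces + 1 grid points."""
    knots = np.linspace(a, b, n_pieces + 1)
    vals = f(knots)
    slopes = np.diff(vals) / np.diff(knots)
    # Coefficient of ReLU(x - knot_i) is the change of slope at knot_i
    coefs = np.concatenate([[slopes[0]], np.diff(slopes)])
    def fhat(x):
        x = np.asarray(x, dtype=float)
        return vals[0] + relu(x[..., None] - knots[:-1]) @ coefs
    return fhat

f = np.sin                                 # a 1-Lipschitz target
fhat = relu_interpolant(f, 0.0, 2 * np.pi, n_pieces=64)
xs = np.linspace(0.0, 2 * np.pi, 1000)
print(np.max(np.abs(f(xs) - fhat(xs))))    # sup error shrinks as pieces increase
```

Each ReLU contributes one kink, so \(n\) hidden units give \(n\) linear pieces, matching the \(O(L/\epsilon)\) count above.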
In the case of deep neural networks, we can build the following blocks to approximate any continuous function:
- \(y = x^2\) can be efficiently approximated by deep networks
- \(y = x_1 x_2\) can be built from the previous block via the identity \(x_1 x_2 = \frac{1}{2}\left((x_1 + x_2)^2 - x_1^2 - x_2^2\right)\)
- \(y = x^n\) can be built by composing the second block
- an arbitrary polynomial can be built from the third block
- any continuous function on a compact set can then be approximated, by the Weierstrass approximation theorem
1.2. Generalization Theory
This subsection is related to the estimation error or its bound (complexity error)
One traditional bound is Dudley's entropy integral:
\(\mathfrak{R}_n(\mathcal{H}) \leq \inf_{\alpha > 0}\left(4\alpha + \frac{12}{\sqrt{n}} \int_\alpha^\infty \sqrt{\log \mathcal{N}(\epsilon, \mathcal{H}, \|\cdot\|_{L_2(P_n)})}\, d\epsilon\right)\)
For neural networks, the rate with respect to the number of parameters \(W\) and the number of layers \(L\) is on the order of \(O\big(\sqrt{WL \log W / n}\big)\), e.g., via the VC-dimension bound \(\Theta(WL \log W)\) for piecewise-linear networks
This bound suggests that more parameters and more layers lead to larger estimation error, which contradicts the current trend of heavily overparameterized models that generalize well.
A few modern directions to interpret the neural network's generalization ability are:
1.2.1. Implicit Regularization
Neural networks can be regularized implicitly in many forms, for example by the learning algorithm itself:
Model (SGD for linear model) For linear models (e.g., least squares), (S)GD initialized at zero converges to the minimum-norm interpolating solution.
Model (SGD for matrix factorization) Gradient descent on matrix factorization converges to the minimum nuclear norm solution (under suitable conditions).
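A small numpy experiment illustrating the linear-model claim, assuming zero initialization and an overparameterized least-squares problem (full-batch gradient descent stands in for SGD):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                      # overparameterized: more features than samples
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w = np.zeros(d)                     # zero init keeps iterates in the row span of X
for _ in range(10_000):
    w -= 1e-2 * X.T @ (X @ w - y) / n

w_min_norm = np.linalg.pinv(X) @ y  # minimum-L2-norm interpolating solution
print(np.linalg.norm(X @ w - y))       # ~0: the model interpolates
print(np.linalg.norm(w - w_min_norm))  # ~0: GD picked the min-norm interpolant
```

Among the infinitely many interpolating solutions, gradient descent silently selects the smallest-norm one: regularization without an explicit regularizer.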
1.2.2. PAC Bayes
1.2.3. Double Descent
This work studies double descent: as model capacity grows, test error first follows the classical U-shape, peaks near the interpolation threshold, and then decreases again in the overparameterized regime.
2. Representation Learning
Check this chapter of the Deep Learning book
2.1. Discrete Representation
Model (cross entropy, NLL + softmax) Let \(t\) denote the correct label and \(z\) the logits, with \(y = \operatorname{softmax}(z)\); cross entropy optimizes the negative log-likelihood
\(L = -\log y_t = -z_t + \log \sum_j e^{z_j}\)
Taking the derivative leads to a simplified form:
\(\frac{\partial L}{\partial z_i} = y_i - \mathbb{1}[i = t]\)
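A quick numpy check of this gradient against central finite differences (the logits below are arbitrary):

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def nll(z, t):
    return -np.log(softmax(z)[t])

z, t = np.array([2.0, -1.0, 0.5]), 0
grad = softmax(z)
grad[t] -= 1.0                   # analytic gradient: y_i - 1[i == t]

eps = 1e-6
num = np.array([(nll(z + eps * e, t) - nll(z - eps * e, t)) / (2 * eps)
                for e in np.eye(3)])
print(np.max(np.abs(grad - num)))  # agreement up to ~1e-9
```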
However, cross-entropy has several generalization shortcomings, such as
- lack of robustness to noisy labels
- poor margins.
2.2. Continuous Representation
2.2.1. Classical Metric Learning
Check the first part of these slides
Metric learning learns a distance function \(D(x, x'; \theta)\) using datasets such as
- \(\mathcal{S} = \{ (x_i, x_j): x_i, x_j\text{ are similar} \}\)
- \(\mathcal{D} = \{ (x_i, x_j): x_i, x_j\text{ are not similar} \}\)
- \(\mathcal{R} = \{ (x_i, x_j, x_k): x_i\text{ is more similar to } x_j\text{ than to } x_k \}\)
The goal is to solve an optimization problem of the form
\(\min_\theta\; l(\theta; \mathcal{S}, \mathcal{D}, \mathcal{R}) + \lambda \cdot \text{reg}(\theta)\)
with some loss function \(l\) and regularizer \(\text{reg}\)
Distance (Mahalanobis) A classical linear metric learning approach is to learn a matrix \(L\):
\(D_L(x, x') = \|L(x - x')\|_2\)
or equivalently, using a PSD matrix \(M = L^\top L\):
\(D_M(x, x') = \sqrt{(x - x')^\top M (x - x')}\)
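A small numpy sketch checking the equivalence of the two forms (`mahalanobis` and `mahalanobis_psd` are our helper names):

```python
import numpy as np

def mahalanobis(x, xp, L):
    # D_L(x, x') = ||L (x - x')||_2
    return np.linalg.norm(L @ (x - xp))

def mahalanobis_psd(x, xp, M):
    # Equivalent form with the PSD matrix M = L^T L
    d = x - xp
    return np.sqrt(d @ M @ d)

rng = np.random.default_rng(0)
L = rng.standard_normal((3, 5))     # low-rank L also projects to 3 dims
x, xp = rng.standard_normal(5), rng.standard_normal(5)
print(mahalanobis(x, xp, L), mahalanobis_psd(x, xp, L.T @ L))  # identical
```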
Model (linear clustering, Xing et al 2002) The following formulation pulls similar pairs together while keeping dissimilar pairs spread out:
\(\min_{M \succeq 0} \sum_{(x_i, x_j) \in \mathcal{S}} D_M^2(x_i, x_j) \quad \text{s.t.} \sum_{(x_i, x_j) \in \mathcal{D}} D_M(x_i, x_j) \geq 1\)
Model (large margin nearest neighbor, Weinberger et al 2009) The following SDP:
\(\min_{M \succeq 0,\, \xi \geq 0} \sum_{(i,j)} D_M^2(x_i, x_j) + \mu \sum_{(i,j,l)} \xi_{ijl} \quad \text{s.t.}\; D_M^2(x_i, x_l) - D_M^2(x_i, x_j) \geq 1 - \xi_{ijl}\)
where the sums run over target neighbors \(j\) of \(x_i\) and differently labeled \(x_l\), and the slack variable \(\xi_{ijl} \geq 0\) measures the amount by which the large-margin inequality is violated
2.2.2. Contrastive Learning
Contrastive learning learns continuous representations \(f(x)\) of an input \(x\) from both positive and negative pairs.
See this blog
Model (contrastive loss) Minimize the distance \(\| f_\theta(x_1) - f_\theta(x_2)\|\) when the two inputs are from the same class; otherwise push it apart, typically up to a margin.
The simplest choice is the linear model \(f_\theta(x) = Lx\) with the L2 distance, which recovers Mahalanobis metric learning.
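A minimal numpy sketch of a common pairwise form of this loss (quadratic pull term, hinge push term with a margin; the margin value is illustrative):

```python
import numpy as np

def contrastive_loss(z1, z2, same_class, margin=1.0):
    """Pairwise contrastive loss on two embeddings z1, z2."""
    d = np.linalg.norm(z1 - z2)
    if same_class:
        return 0.5 * d ** 2                    # pull similar pairs together
    return 0.5 * max(0.0, margin - d) ** 2     # push dissimilar pairs past the margin

z1, z2 = np.array([0.1, 0.9]), np.array([0.2, 0.8])
print(contrastive_loss(z1, z2, same_class=True))   # small: already close
print(contrastive_loss(z1, z2, same_class=False))  # large: inside the margin
```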
Model (triplet loss) Given an anchor \(x_a\), a positive \(x_p\), and a negative \(x_n\), minimize \(\max\left(0,\; \|f(x_a) - f(x_p)\|^2 - \|f(x_a) - f(x_n)\|^2 + \alpha\right)\) with margin \(\alpha\)
Model (NCE loss, noise contrastive estimation) See the Bayesian notes
Model (Info NCE, CPC) We want to preserve the mutual information between the future latent representation \(z_{t+k}\) and the current context representation \(c_t\)
It has three representations:
- raw input \(x_t\)
- local latent feature \(z_t = g_{enc}(x_t)\)
- context feature \(c_t = g_{ar}(z_{<t})\)
CPC learns representations of the future \(x_{t+k}\) and the context \(c_t\) that preserve the mutual information between them
\(I(x; c) = \sum_{x, c} p(x, c) \log \frac{p(x \mid c)}{p(x)}\)
where the density ratio is modeled by a log-bilinear score
\(f_k(x_{t+k}, c_t) = \exp\left(z_{t+k}^\top W_k c_t\right) \propto \frac{p(x_{t+k} \mid c_t)}{p(x_{t+k})}\)
It is then optimized by the NCE loss over a set \(X\) of one positive and \(N-1\) negative samples
\(\mathcal{L} = -\mathbb{E}_X\left[\log \frac{f_k(x_{t+k}, c_t)}{\sum_{x_j \in X} f_k(x_j, c_t)}\right]\)
Optimizing this loss is equivalent to optimizing a categorical cross-entropy whose probability is \(P(d = i \mid X, c_t)\), and doing so makes \(f_k(x_{t+k}, c_t)\) estimate the density ratio
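A minimal numpy sketch of the InfoNCE loss for a single context, assuming the log-bilinear score above; `W_c` stands in for \(W_k c_t\) and all vectors are random placeholders:

```python
import numpy as np

def info_nce(scores, pos_idx=0):
    """InfoNCE for one context: scores[i] = log f_k(x_i, c_t); scores[pos_idx] is the positive."""
    scores = scores - scores.max()          # numerical stability
    e = np.exp(scores)
    return -np.log(e[pos_idx] / e.sum())    # categorical CE with the positive as label

rng = np.random.default_rng(0)
z_pos = rng.standard_normal(8)              # future latent z_{t+k}
z_neg = rng.standard_normal((15, 8))        # negatives drawn from other sequences
W_c = rng.standard_normal(8)                # stand-in for W_k c_t
scores = np.concatenate([[z_pos @ W_c], z_neg @ W_c])  # log-bilinear scores
print(info_nce(scores))
```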
Model (supervised contrastive loss) Pushes embeddings from the same class together (and away from other classes); reported to be more robust than cross-entropy
2.3. Set Representation
Model (set-input problems) Inputs that are permutation invariant.
A function \(f(X)\) that is permutation invariant over a countable domain can be decomposed into
\(f(X) = \rho\left(\sum_{x \in X} \phi(x)\right)\)
This decomposition is both sufficient and necessary. The proof is simple and available in the paper's appendix
\(\phi, \rho\) can be modeled with neural networks, and can be regarded as the encoder and decoder in this work
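A minimal numpy sketch of the decomposition with random, untrained \(\phi\) and \(\rho\), just to demonstrate that sum pooling makes the output permutation invariant:

```python
import numpy as np

rng = np.random.default_rng(0)
W_phi = rng.standard_normal((16, 1))        # phi: per-element encoder
W_rho = rng.standard_normal((1, 16))        # rho: decoder on the pooled sum

def deep_set(X):
    # f(X) = rho( sum_x phi(x) ): the sum is what buys permutation invariance
    phi = np.tanh(X[:, None] @ W_phi.T)     # (n, 16) per-element features
    return (W_rho @ phi.sum(axis=0)).item()

X = np.array([3.0, 1.0, 2.0])
print(deep_set(X), deep_set(X[::-1]))       # identical under permutation
```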
3. Architecture
3.1. Implicit Models
check the tutorial here
3.1.1. Neural ODE
3.1.2. Deep Equilibrium Model
4. Optimization
This note only covers deep-learning-related optimization; see this note for a more general discussion of continuous convex optimization
4.1. Online Learning
Unlike the PAC learning framework, online learning:
- mixes training and testing instead of separating them
- makes no distributional assumption (in the PAC framework, the data distribution is considered fixed over time)
The performance of online learning algorithms is measured using a mistake model and the notion of regret, which are based on a worst-case or adversarial assumption
The objective is to minimize the cumulative loss \(\sum_{t=1}^T L(\hat{y}_t, y_t)\) over \(T\) rounds
Definition (regret) Regret measures the difference between the cumulative loss of the learner and that of the best fixed model in hindsight:
\(R_T = \sum_{t=1}^T L(\hat{y}_t, y_t) - \inf_i \sum_{t=1}^T L(h_i(x_t), y_t)\)
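A small numpy sketch of online gradient descent on a stream of linear losses over the unit ball, tracking regret against the best fixed decision in hindsight (random losses stand in for an adversary; \(\eta_t = 1/\sqrt{t}\) is the standard step size):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 1000, 5
G = rng.standard_normal((T, d))     # loss at round t is L_t(w) = <g_t, w>

w = np.zeros(d)
cum_loss = 0.0
for t, g in enumerate(G, start=1):
    cum_loss += g @ w               # predict, then suffer the loss
    w -= g / np.sqrt(t)             # online gradient descent step
    w /= max(1.0, np.linalg.norm(w))  # project back onto the unit ball

# Best fixed comparator: min over the unit ball of <sum_t g_t, u> = -||sum_t g_t||
best = -np.linalg.norm(G.sum(axis=0))
print(cum_loss - best)              # regret, sublinear in T
```

For convex losses, OGD with this step size guarantees \(O(\sqrt{T})\) regret, i.e., vanishing average regret.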
4.2. Forward Gradient
Backprop is unfortunately
- considered "biologically implausible"
- incompatible with a massive level of model parallelism
Model (weight perturbation, directional gradient descent) Update the parameters using the directional derivative along a random candidate direction \(v\): the forward gradient \(g(\theta) = (\nabla f(\theta) \cdot v)\, v\) is an unbiased estimate of \(\nabla f(\theta)\) when \(v \sim \mathcal{N}(0, I)\)
Another relevant work is this one
Model (activity perturbation) Perturbs activations rather than weights; yields lower-variance gradient estimates than weight perturbation and provides a continuous-time, rate-based interpretation of the algorithm
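A minimal numpy sketch of the weight-perturbation estimator; a forward finite difference stands in for forward-mode autodiff (which would compute the directional derivative exactly), and the toy quadratic loss makes the true gradient known:

```python
import numpy as np

def forward_gradient(f, theta, rng, eps=1e-6):
    """Single-direction forward-gradient estimate g = (df/dv) v."""
    v = rng.standard_normal(theta.shape)
    dd = (f(theta + eps * v) - f(theta)) / eps   # directional derivative along v
    return dd * v                                 # unbiased: E[(g.v) v] = g

f = lambda w: 0.5 * np.sum(w ** 2)               # toy loss; true gradient is w
theta = np.array([1.0, -2.0, 3.0])

rng = np.random.default_rng(0)
est = np.mean([forward_gradient(f, theta, rng) for _ in range(5000)], axis=0)
print(est, theta)                                # averaged estimate ≈ true gradient
```

Note the trade-off: each estimate needs only forward passes (no backward graph), but its variance grows with the parameter dimension, hence the interest in lower-variance variants such as activity perturbation.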
4.3. Backward Gradient
Most of the popular optimizers are roughly variants of the following framework
General Framework:
- compute the gradient \(g_t = \nabla f(w_t)\)
- compute first and second moments \(m_t, V_t\) from the gradient history \(g_1, \ldots, g_t\)
- take the descent step \(w_{t+1} = w_t - \alpha\, m_t / \sqrt{V_t}\)
Model (SGD) No moments: the update is simply \(w_{t+1} = w_t - \alpha g_t\); see the sketch below
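A minimal numpy sketch of this framework, with SGD, momentum, and Adam as special cases of the same update rule (the `step` helper and its defaults are ours, not from the source):

```python
import numpy as np

def step(w, g, state, kind="adam", lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One update of the generic framework: moments m, V, then w -= lr * m / sqrt(V)."""
    m, V, t = state
    t += 1
    if kind == "sgd":                # no moments
        return w - lr * g, (m, V, t)
    if kind == "momentum":           # first moment only
        m = b1 * m + g
        return w - lr * m, (m, V, t)
    # adam: both moments plus bias correction
    m = b1 * m + (1 - b1) * g
    V = b2 * V + (1 - b2) * g ** 2
    mhat, Vhat = m / (1 - b1 ** t), V / (1 - b2 ** t)
    return w - lr * mhat / (np.sqrt(Vhat) + eps), (m, V, t)

w = np.array([5.0, -3.0])
state = (np.zeros_like(w), np.zeros_like(w), 0)
for _ in range(2000):
    g = w                            # gradient of the toy loss 0.5 ||w||^2
    w, state = step(w, g, state, kind="adam", lr=1e-2)
print(w)                             # ≈ 0
```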
4.3.1. Momentum (1st moments)
Model (SGD with momentum) Update the momentum and then the weights: \(m_t = \beta m_{t-1} + g_t\), \(w_{t+1} = w_t - \alpha m_t\)
Model (Nesterov Accelerated Gradient, NAG) Move first, then compute the gradient and make a correction; check Hinton's slides
4.3.2. Adaptive learning (2nd moments, magnitude)
Model (adagrad) Works well on sparse data; uses accumulated gradient magnitudes to adapt the learning rate: \(\alpha \to \alpha / \sqrt{V_t}\) with \(V_t = \sum_{s \le t} g_s^2\)
Model (RMSProp) Uses an exponential moving average for the magnitude: \(V_t = \beta V_{t-1} + (1 - \beta) g_t^2\)
Model (adam) RMSProp + momentum with bias correction: \(\hat{m}_t = m_t / (1 - \beta_1^t)\), \(\hat{V}_t = V_t / (1 - \beta_2^t)\)
Model (nadam) adam + nesterov
Model (adamw) adam + decoupled weight decay
4.3.3. Batching
This paper shows that the batch size should be proportional to the learning rate
Analysis (large batch is bad at generalization) This work claims that large-batch methods tend to converge to sharp minimizers, which generalize more poorly.
4.4. Hessian Methods
Even when the Hessian is not explicitly computed, information about it such as its trace and spectrum can be extracted using numerical linear algebra or its randomized variants.
This information is helpful, for example, for analyzing a model's generalization
4.4.1. Matrix-free Methods
4.4.1.1. Trace
Suppose we can query \(Ax\) without knowing \(A \in \mathbb{R}^{n \times n}\) explicitly, and we want to approximate \(\operatorname{tr}(A)\)
One simple idea is to query \(A e_i\) for each standard basis vector \(e_i\) and sum the diagonal entries \(e_i^\top A e_i\), which requires \(n\) queries and might not be efficient
Model (Hutchinson's method) Draw \(x_1, \ldots, x_m\) with i.i.d. entries from \(\{-1, 1\}\) (i.e., the Rademacher distribution); then the estimator of \(\operatorname{tr}(A)\) is
\(\widehat{\operatorname{tr}}(A) = \frac{1}{m} \sum_{i=1}^m x_i^\top A x_i\)
which is unbiased since \(\mathbb{E}[x^\top A x] = \operatorname{tr}(A)\)
The concentration of this estimator can be derived from the Hanson-Wright inequality, which gives concentration results for quadratic forms of sub-Gaussian random variables
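A minimal numpy sketch; an explicit PSD matrix stands in for the matrix-free matvec (for a neural network, `matvec` would be a Hessian-vector product computed by autodiff):

```python
import numpy as np

def hutchinson_trace(matvec, n, m, rng):
    """Estimate tr(A) from m queries x -> Ax."""
    est = 0.0
    for _ in range(m):
        x = rng.choice([-1.0, 1.0], size=n)  # Rademacher probe vector
        est += x @ matvec(x)                 # E[x^T A x] = tr(A)
    return est / m

rng = np.random.default_rng(0)
B = rng.standard_normal((500, 500))
A = B @ B.T                                  # PSD stand-in with a large trace
print(hutchinson_trace(lambda v: A @ v, 500, 100, rng), np.trace(A))
```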
4.4.1.2. Spectrum
Model (Stochastic Lanczos Quadrature)
4.4.2. Loss Topology
landscape, flatness, sharpness
Model (sharpness, Keskar) Uses the largest eigenvalue of the loss Hessian to characterize sharpness; large batch sizes tend to find sharper minima, which generalize less well
4.5. Bayesian Optimization
Bayesian Optimization can be used to find hyperparameters. It builds a probabilistic proxy model for the objective using outcomes of past experiments as training data.
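A minimal sketch of the loop, assuming a Gaussian-process proxy (scikit-learn) and an expected-improvement acquisition; the 1D `objective` is a toy stand-in for a real training run:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):                       # toy stand-in for a validation-loss experiment
    return np.sin(3 * x) + 0.1 * x ** 2

X = np.array([[0.5], [2.0], [4.0]])     # outcomes of past experiments
y = objective(X).ravel()
grid = np.linspace(0.0, 5.0, 500)[:, None]

for _ in range(10):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6,
                                  normalize_y=True).fit(X, y)
    mu, sd = gp.predict(grid, return_std=True)
    imp = y.min() - mu                  # expected improvement (minimization)
    z = imp / (sd + 1e-12)
    ei = imp * norm.cdf(z) + sd * norm.pdf(z)
    x_next = grid[np.argmax(ei)]        # query where EI is largest
    X = np.vstack([X, [x_next]])
    y = np.append(y, objective(x_next[0]))

print(X[np.argmin(y)].item(), y.min())  # best hyperparameter found
```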
5. Regularization
5.1. Dropout
5.2. Normalization
Model (batch norm) Normalize over the batch dimension:
\(\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta\)
where \(\gamma, \beta\) are learnable parameters and \(\mu_B, \sigma_B^2\) are the per-feature batch mean and variance
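A minimal numpy sketch of the training-mode forward pass (inference uses running statistics instead, omitted here):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm for a (batch, features) activation."""
    mu = x.mean(axis=0)                 # statistics over the batch dimension
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta         # learnable scale and shift

x = np.random.default_rng(0).standard_normal((32, 4)) * 5 + 2
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0), y.std(axis=0))    # ~0 and ~1 per feature
```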
Benefits of using batch norm are
- reduces internal covariate shift, as claimed by the original paper
- enables larger learning rates without explosion, as shown by this work
- makes the optimization landscape significantly smoother (improved Lipschitzness of the loss), as shown by this work
Batch norm is not appropriate when
- used in RNNs, where statistics vary across timesteps
- the batch size is small
- robustness matters: it reduces robustness to small adversarial input perturbations, as shown by this work
Model (weight norm) Reparameterize each weight vector by its L2 norm: \(w = \frac{g}{\|v\|} v\), with scalar \(g\) and vector \(v\) learned
Model (layer norm) normalize over the hidden dim
Model (group norm) divides the channels into groups and computes within each group the mean and variance for normalization
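A minimal numpy sketch of group norm on an NCHW tensor; setting `num_groups=1` recovers layer norm over (C, H, W), and `num_groups=C` recovers instance norm:

```python
import numpy as np

def group_norm(x, num_groups, gamma, beta, eps=1e-5):
    """Group norm over a (batch, channels, height, width) tensor."""
    n, c, h, w = x.shape
    xg = x.reshape(n, num_groups, c // num_groups, h, w)
    mu = xg.mean(axis=(2, 3, 4), keepdims=True)    # per-sample, per-group stats
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    xg = (xg - mu) / np.sqrt(var + eps)
    return xg.reshape(n, c, h, w) * gamma + beta

x = np.random.default_rng(0).standard_normal((2, 8, 4, 4))
y = group_norm(x, num_groups=4,
               gamma=np.ones((1, 8, 1, 1)), beta=np.zeros((1, 8, 1, 1)))
print(y.mean(), y.std())    # normalized to ~0 mean, ~1 std
```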
5.3. Label Smoothing
Model (confidence penalty) Penalizes low-entropy (over-confident) predictions by adding a negative-entropy term to the NLL: \(L = \text{NLL} - \beta H(p_\theta(y \mid x))\)