
0x501 Optimization

1. Optimization

This note only covers deep-learning-related optimization; this is the note for a more general discussion of continuous convex optimization

1.1. Online Learning

Unlike the PAC learning framework, the online learning framework:

  • mixes training and testing instead of separating them
  • does not make any distributional assumption (in the PAC framework, the data distribution is considered fixed over time)

The performance of online learning algorithms is measured using a mistake model and the notion of regret, which are based on a worst-case or adversarial assumption

The objective is to minimize the cumulative loss \(\sum_{t=1}^T L(\hat{y}_t, y_t)\) over \(T\) rounds

Definition (regret) regret measures the difference between the cumulative loss of the current model and that of the best fixed model in hindsight (the infimum over \(i\) in the formula below)

\[R_T = \sum_{t=1}^T L(\hat{y}_t, y_t) - \inf_i \sum_{t=1}^T L(\hat{y}_{t, i}, y_t)\]

1.2. Forward Gradient

Backprop is unfortunately:

  • considered “biologically implausible”
  • incompatible with a massive level of model parallelism

Model (weight perturbation, directional gradient descent) updates the parameters using the directional derivative along a random candidate direction (see the sketch below)
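A minimal sketch of one such update, assuming a loss oracle `f` and using a finite-difference directional derivative in place of the forward-mode JVP used in practice; `lr` and `eps` are illustrative hyperparameters:

```python
import numpy as np

def forward_gradient_step(f, w, lr=1e-2, eps=1e-6):
    """One weight-perturbation update along a random direction."""
    v = np.random.randn(*w.shape)          # candidate direction
    dv = (f(w + eps * v) - f(w)) / eps     # directional derivative f'(w; v)
    return w - lr * dv * v                 # (g . v) v estimates the gradient g

# usage: minimize a quadratic without ever calling backprop
f = lambda w: ((w - 3.0) ** 2).sum()
w = np.zeros(4)
for _ in range(2000):
    w = forward_gradient_step(f, w)
```

Since \(\mathbb{E}[(g^T v) v] = g\) for \(v \sim \mathcal{N}(0, I)\), the step follows the true gradient in expectation, at the cost of higher variance.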

Another relevant work is this one:

Model (activity perturbation) yields lower-variance gradient estimates than weight perturbation and admits a continuous-time, rate-based interpretation of the algorithm

1.3. Backward Gradient

Most of the popular optimizers are roughly variants of the following framework

General Framework:

  • compute grad \(g_t = \nabla f(w_t)\)
  • compute 1st and 2nd moments \(m_t, V_t\) based on previous grads \(g_1, ..., g_t\)
  • compute the update direction \(u_t = m_t/\sqrt{V_t}\) and take the gradient-descent step \(w_{t+1} = w_t - \alpha u_t\) (sketched below)
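A minimal sketch of this loop, with exponential moving averages as one concrete choice for the moments (names and defaults are illustrative):

```python
import numpy as np

def framework_step(w, grad_fn, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    g = grad_fn(w)                                         # 1. compute gradient
    state["m"] = beta1 * state["m"] + (1 - beta1) * g      # 2. first moment
    state["V"] = beta2 * state["V"] + (1 - beta2) * g * g  #    second moment
    u = state["m"] / (np.sqrt(state["V"]) + eps)           # 3. update direction m / sqrt(V)
    return w - lr * u                                      #    gradient-descent step
```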

Model (SGD) no moments

1.3.1. Momentum (1st moments)

Model (SGD with momentum) updates the momentum as follows

\[m_t = \beta m_{t-1} + (1-\beta) g_t\]

Model (Nesterov Accelerated Gradient, NAG) moves first, then computes the gradient at the lookahead point to make a correction; check Hinton's slides

\[g_t = \nabla f(w_t - \alpha m_t)\]

1.3.2. Adaptive learning (2nd moments, magnitude)

Model (adagrad) works well on sparse datasets, using the accumulated gradient magnitude to adapt the learning rate: \(\alpha \to \alpha / \sqrt{V_t}\)

\[V_t = \sum_{\tau=1}^t g^2_\tau\]

Model (RMSProp) uses an exponential moving average for the magnitude

\[V_t = \beta V_{t-1} + (1-\beta) g^2_t\]

Model (adam) RMSProp + momentum with bias correction: \(\hat m_t = m_t / (1 - \beta_1^t)\), \(\hat V_t = V_t / (1 - \beta_2^t)\)
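A sketch of one Adam update under these definitions (the `eps` in the denominator is the standard numerical-stability term):

```python
import numpy as np

def adam_step(w, g, m, V, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g
    V = beta2 * V + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)   # bias correction: moments are zero-initialized,
    V_hat = V / (1 - beta2 ** t)   # so early estimates are biased toward zero
    return w - lr * m_hat / (np.sqrt(V_hat) + eps), m, V
```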

Model (nadam) adam + nesterov

Model (adamw) adam with decoupled weight decay, applied directly to the weights rather than through the gradient

1.3.3. Batching

Analysis (large batch is bad at generalization) This work claims large-batch methods tend to converge to sharp minimizers, which generalize more poorly.

Goyal et al. 2017 show that the learning rate should be scaled linearly with the global batch size, together with a warmup stage. There is also discussion that the learning rate should instead scale with the square root of the batch size, to keep the update variance constant
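A sketch of the resulting schedule; the base values roughly follow the ImageNet setting in Goyal et al. (0.1 per 256 images), but all numbers here are illustrative:

```python
def scaled_lr(step, base_lr=0.1, base_batch=256, batch=2048, warmup_steps=500):
    """Linear-scaling rule: ramp up to base_lr * (batch / base_batch)."""
    target = base_lr * batch / base_batch           # linear scaling with batch size
    if step < warmup_steps:
        return target * (step + 1) / warmup_steps   # linear warmup from near zero
    return target
```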

This paper suggests increasing the batch size instead of decaying the learning rate

1.4. Hessian Methods

Even when the Hessian is not explicitly computed, information about it such as the trace and spectrum can be extracted using numerical linear algebra or its randomized variants.

This information is helpful, for example, for analyzing a model's generalization

1.4.1. Matrix-free Methods

1.4.1.1. Trace

Suppose we can query the matrix-vector product \(Ax\) without knowing \(A \in \mathbb{R}^{n \times n}\) explicitly; we want to approximate \(tr(A)\)

One simple idea is to use

\[tr(A) = \sum_{i=1}^n e_i^T A e_i\]

which requires \(n\) queries and might not be efficient

Model (Hutchinson's method)

draw \(x_1, \ldots, x_m\) with i.i.d. entries uniform on \(\{ -1, 1 \}\) (i.e., the Rademacher distribution); then the estimator of \(tr(A)\) is

\[\widehat{tr(A)} = \frac{1}{m} \sum_{j=1}^m x_j^T A x_j\]

The validity of this estimator can be derived from the Hanson-Wright inequality, which gives a concentration result for quadratic forms in sub-Gaussian random variables
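A minimal sketch, assuming only a black-box `matvec` oracle for \(A\):

```python
import numpy as np

def hutchinson_trace(matvec, n, m=100, seed=0):
    """Estimate tr(A) from m matrix-vector queries with Rademacher probes."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(m):
        x = rng.choice([-1.0, 1.0], size=n)  # Rademacher probe vector
        total += x @ matvec(x)               # x^T A x; E[x^T A x] = tr(A)
    return total / m

# usage: hide a matrix behind the oracle
A = np.diag(np.arange(1.0, 101.0))           # tr(A) = 5050
print(hutchinson_trace(lambda v: A @ v, n=100))
```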

1.4.1.2. Spectrum

Model (Stochastic Lanczos Quadrature) estimates the spectral density of \(A\) from matrix-vector queries alone: run a few Lanczos iterations from random probe vectors and use the eigenvalues of the resulting small tridiagonal matrix as quadrature nodes for the spectrum
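A sketch of the basic procedure, assuming a symmetric \(A\) behind a `matvec` oracle; it omits the reorthogonalization a careful implementation needs:

```python
import numpy as np

def slq(matvec, n, steps=30, probes=4, seed=0):
    """Per probe, return quadrature nodes (Ritz values) and weights whose
    average over probes approximates the eigenvalue density of A."""
    rng = np.random.default_rng(seed)
    nodes, weights = [], []
    for _ in range(probes):
        v = rng.standard_normal(n)
        v /= np.linalg.norm(v)
        v_prev, beta = np.zeros(n), 0.0
        alphas, betas = [], []
        for _ in range(steps):               # Lanczos tridiagonalization
            w = matvec(v) - beta * v_prev
            alpha = v @ w
            w = w - alpha * v
            beta = np.linalg.norm(w)
            alphas.append(alpha)
            betas.append(beta)
            if beta < 1e-12:
                break
            v_prev, v = v, w / beta
        off = betas[:len(alphas) - 1]
        T = np.diag(alphas) + np.diag(off, 1) + np.diag(off, -1)
        theta, U = np.linalg.eigh(T)         # Ritz values = quadrature nodes
        nodes.append(theta)
        weights.append(U[0] ** 2)            # weights from first eigvec components
    return nodes, weights
```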

1.4.2. Loss Topology

Key concepts: loss landscape, flatness, sharpness

Model (sharpness, Keskar) uses the largest eigenvalue of the loss Hessian to characterize sharpness; large batch sizes tend to find sharper minima and generalize less well


1.5. Bayesian Optimization

Bayesian Optimization can be used to find hyperparameters. It builds a probabilistic proxy model for the objective using outcomes of past experiments as training data.
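A minimal sketch of one proposal step with a Gaussian-process proxy and the expected-improvement acquisition (for minimization); `candidates` is an assumed pre-sampled pool of hyperparameter vectors:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def suggest_next(X_obs, y_obs, candidates):
    """Fit a GP to past (hyperparams, loss) pairs, return the candidate
    maximizing expected improvement over the best loss so far."""
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X_obs, y_obs)
    mu, sigma = gp.predict(candidates, return_std=True)
    best = y_obs.min()
    z = (best - mu) / (sigma + 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)  # expected improvement
    return candidates[np.argmax(ei)]
```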

1.6. Others

Model (meta learning, MAML) optimizes the parameters to minimize the loss after one further gradient step on each new task

\[\theta \leftarrow \theta - \beta \nabla_\theta \sum_{\tau_i} \mathcal{L}_{\tau_i}(f_{\theta'_i})\]

where \(\theta'_i = \theta - \alpha \nabla_\theta \mathcal{L}_{\tau_i}(f_\theta)\) is the parameter after one inner gradient step
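A first-order sketch (FOMAML-style): full MAML also differentiates through the inner update, which the gradient oracle here ignores. `grad(task, params)` is an assumed oracle returning \(\nabla \mathcal{L}_{\tau}(f_{params})\):

```python
import numpy as np

def maml_step(theta, tasks, grad, alpha=0.01, beta=0.001):
    """Adapt to each task with one inner step, then update theta
    using the gradients evaluated at the adapted parameters."""
    outer = np.zeros_like(theta)
    for task in tasks:
        theta_i = theta - alpha * grad(task, theta)  # inner one-step adaptation
        outer += grad(task, theta_i)                 # first-order outer gradient
    return theta - beta * outer
```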

2. Regularization

2.1. Dropout

Model (stochastic depth) randomly drops whole layers, with the drop probability increasing linearly with depth


2.2. Normalization

Model (batch norm) normalizes over the batch dimension

\[y = \frac{x - E[x]}{\sqrt{Var(x) + \epsilon}} \gamma + \beta\]

where \(\gamma, \beta\) are learnable parameters
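A training-time forward pass for a `(batch, features)` input, as a minimal sketch (inference instead uses running averages of the statistics):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch dimension, then scale and shift."""
    mean = x.mean(axis=0)                    # statistics over the batch dim
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```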

Benefits of using batch norm are

  • reduces internal covariate shift, as claimed by the original paper
  • enables larger learning rates without explosion, as shown by this work
  • makes the optimization landscape significantly smoother (improving the Lipschitzness of the loss), as shown by this work

Batch norm is not appropriate when:

  • used in RNNs, where statistics vary across timesteps
  • the batch size is small; try batch renormalization instead (at the cost of extra hyperparameters)
  • adversarial robustness matters: it reduces robustness to small adversarial input perturbations, as shown by this work

Model (weight norm) reparameterizes each weight vector by its L2 norm, \(w = \frac{g}{\lVert v \rVert} v\), with a learnable scale \(g\)

Model (layer norm) normalizes over the hidden dimension

Model (group norm) divides the channels into groups and computes the mean and variance within each group for normalization

2.3. Data Smoothing

Model (label smoothing) softens the one-hot target by mixing in the uniform distribution: \(y'_k = (1 - \epsilon) y_k + \epsilon / K\) for \(K\) classes

Model (confidence penalty) penalizes low-entropy (overconfident) output distributions

\[\mathcal{L}(\theta) = -\sum \log p_\theta (y|x) - \beta H(p_\theta(y|x))\]
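A sketch of this loss for integer labels `y` and raw `logits`:

```python
import numpy as np

def confidence_penalty_loss(logits, y, beta=0.1):
    """Cross-entropy minus beta times the prediction entropy."""
    logits = logits - logits.max(axis=1, keepdims=True)    # stable softmax
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nll = -log_p[np.arange(len(y)), y].mean()              # -sum log p(y|x)
    entropy = -(np.exp(log_p) * log_p).sum(axis=1).mean()  # H(p(y|x))
    return nll - beta * entropy
```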

Model (mixup) Instead of minimizing the empirical risk, mixup minimizes the vicinal risk (introduced in this paper), where virtual examples are drawn from a vicinity distribution

\[P_\nu = \frac{1}{n} \sum_i \nu(\tilde x, \tilde y | x_i, y_i)\]

where \((\tilde x, \tilde y)\) are in the vicinity of the training feature-target pair \((x_i, y_i)\).

In this work, \((\tilde x, \tilde y)\) was implemented as a linear combination of two data points \((x_1,y_1)\) and \((x_2, y_2)\)

\[\tilde x = \lambda x_1 + (1-\lambda) x_2, \quad \tilde y = \lambda y_1 + (1-\lambda) y_2\]

where \(\lambda\) is sampled from a beta distribution, \(\lambda \sim \mathrm{Beta}(\alpha, \alpha)\)
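A minimal sketch for a pair of examples with one-hot labels; `alpha` is illustrative (small values keep most mixes close to one endpoint):

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=np.random.default_rng(0)):
    """Return the lambda-weighted convex combination of two examples."""
    lam = rng.beta(alpha, alpha)             # lambda ~ Beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```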

Model (cutmix) cuts a rectangular patch from one image and pastes it onto another, mixing the labels in proportion to the patch area; see the following diagram

[figure: cutmix]
