0x502 Efficiency
1. Efficiency
1.1. Architecture Search
Check this blog
1.2. Gradient
Model (Gradient Checkpoint)
- part of the forward activations is discarded during the forward pass to save memory
- the discarded activations are recomputed when needed during the backward pass
- check the gif here
See here for PyTorch's implementation
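A minimal sketch of activation checkpointing with `torch.utils.checkpoint`; the block structure and sizes are stand-ins, not part of any particular model:

```python
import torch
from torch.utils.checkpoint import checkpoint

class Block(torch.nn.Module):
    """A stand-in block; any nn.Module works here."""
    def __init__(self, dim):
        super().__init__()
        self.net = torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.ReLU())

    def forward(self, x):
        return self.net(x)

class CheckpointedModel(torch.nn.Module):
    def __init__(self, dim=256, depth=8):
        super().__init__()
        self.blocks = torch.nn.ModuleList(Block(dim) for _ in range(depth))

    def forward(self, x):
        for block in self.blocks:
            # activations inside `block` are not stored; they are recomputed
            # during the backward pass, trading extra compute for memory
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = CheckpointedModel()
x = torch.randn(4, 256, requires_grad=True)
model(x).sum().backward()
```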
1.3. Pruning
Model (Lottery Ticket Hypothesis) after training and pruning, each unpruned connection's weight is reset to its value at initialization, and the resulting sparse subnetwork is retrained
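A minimal sketch of one prune-and-reset round in PyTorch. `train_fn` (a caller-supplied training loop) and `init_state` (a snapshot of the weights taken before any training) are assumptions of this sketch, not names from the paper:

```python
import torch

def lottery_ticket_round(model, init_state, train_fn, prune_frac=0.2, masks=None):
    """Train, prune the smallest-magnitude surviving weights, then reset the
    remaining weights to their values at initialization (one pruning round)."""
    params = {n: p for n, p in model.named_parameters() if "weight" in n}
    if masks is None:
        masks = {n: torch.ones_like(p) for n, p in params.items()}
    train_fn(model)  # hypothetical: caller-supplied training loop
    with torch.no_grad():
        for name, param in params.items():
            surviving = param[masks[name].bool()].abs()
            cutoff = torch.quantile(surviving, prune_frac)     # magnitude threshold
            masks[name] *= (param.abs() > cutoff).float()
            param.copy_(init_state[name] * masks[name])        # reset to init, keep sparsity
    return masks

# init_state would be captured once, before any training:
#   init_state = {n: p.detach().clone() for n, p in model.named_parameters()}
# During retraining, the masks must be re-applied (e.g. after each optimizer step)
# so that pruned connections stay at zero.
```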
1.4. Distillation
1.5. Ensemble
Model (model soup) averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness
greedy soup, where models are added to the soup sequentially only if they improve accuracy on held-out data, outperforms uniform averaging
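A minimal sketch of both recipes in PyTorch, assuming the checkpoints share one architecture; `evaluate` is a hypothetical callable that returns held-out accuracy for a state dict:

```python
import torch

def uniform_soup(state_dicts):
    """Uniform soup: average the weights of several fine-tuned checkpoints."""
    soup = {}
    for key, value in state_dicts[0].items():
        if value.is_floating_point():
            soup[key] = torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
        else:
            soup[key] = value.clone()  # integer buffers: keep the first copy
    return soup

def greedy_soup(state_dicts, evaluate):
    """Greedy soup: checkpoints (ideally pre-sorted by individual accuracy)
    are added only if they improve held-out accuracy of the averaged model."""
    kept = [state_dicts[0]]
    best = evaluate(uniform_soup(kept))
    for sd in state_dicts[1:]:
        candidate = uniform_soup(kept + [sd])
        acc = evaluate(candidate)
        if acc >= best:
            kept.append(sd)
            best = acc
    return uniform_soup(kept)
```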
2. Quantization
Current NVIDIA Tensor Cores support several precisions: TF32, bfloat16, FP16, FP8, and INT8.
2.1. Post-training Quantization
From the TensorFlow website: post-training quantization is a conversion technique that can reduce model size while also improving CPU and hardware-accelerator latency, with little degradation in model accuracy.
- simplest form: convert the weights to 8-bit precision. At inference time, the 8-bit weights are converted back to floating point and inference is performed in float
- dynamic range quantization: activations are also quantized to 8 bits and computation is done in 8-bit precision where possible
- full integer quantization: everything is quantized to integers. A calibration step is needed to estimate the float range (min, max), so a representative dataset is required
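A toy NumPy sketch of the calibrate-then-quantize idea (not TensorFlow Lite's implementation): the float range (min, max) is estimated from representative samples, then values are mapped to 8-bit codes and back.

```python
import numpy as np

def calibrate(samples):
    """Estimate the float range (min, max) from a representative dataset."""
    lo = min(float(s.min()) for s in samples)
    hi = max(float(s.max()) for s in samples)
    return lo, hi

def quantize(x, lo, hi):
    """Affine quantization of floats to uint8 using the calibrated range."""
    scale = (hi - lo) / 255.0
    q = np.clip(np.round((x - lo) / scale), 0, 255).astype(np.uint8)
    return q, scale

def dequantize(q, scale, lo):
    """Map uint8 codes back to approximate float values."""
    return q.astype(np.float32) * scale + lo

# toy usage: quantize a weight matrix and check the reconstruction error
w = np.random.randn(64, 64).astype(np.float32)
lo, hi = calibrate([w])
q, scale = quantize(w, lo, hi)
print(np.abs(dequantize(q, scale, lo) - w).max())  # bounded by ~scale / 2
```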
2.2. Quantization-Aware Training
Pro: achieves higher accuracy. Cons: requires a training pipeline, labeled data, and hyperparameter tuning.
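QAT typically works by inserting "fake quantization" ops that simulate int8 rounding in the forward pass while letting gradients pass through unchanged (a straight-through estimator). A minimal PyTorch sketch, with a fixed illustrative scale rather than a learned or calibrated one:

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Simulate int8 rounding in the forward pass; pass gradients straight
    through in the backward pass (straight-through estimator)."""
    @staticmethod
    def forward(ctx, x, scale):
        q = torch.clamp(torch.round(x / scale), -128, 127)
        return q * scale  # dequantize so downstream layers stay in float

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # straight-through: ignore the rounding

# usage inside a forward pass, with an illustrative fixed scale
x = torch.randn(8, 16, requires_grad=True)
y = FakeQuant.apply(x, 0.05)
y.sum().backward()
```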
2.3. Float16 Quantization
Mixed Precision Training
Tensor Cores on GPUs support efficient execution of convolutions and matrix multiply-accumulate (AB + C) in float16.
Model (mixed precision) forward/backward passes are done in FP16; the resulting gradients are then used to update a reference copy of the weights kept in FP32
See NVIDIA's manual for more details
One issue with 16-bit formats is overflow and underflow. For example, many activation gradients fall below FP16's representable range and become 0.
To prevent this, a scaling factor S is applied to the loss before backprop; this factor can be chosen dynamically.
The overall training procedure (sketched in PyTorch after this list) is
Maintain a primary copy of weights in FP32.
For each iteration:
- Make an FP16 copy of the weights.
- Forward propagation (FP16 weights and activations).
- Multiply the resulting loss with the scaling factor S.
- Backward propagation (FP16 weights, activations, and their gradients).
- Multiply the weight gradient with 1/S.
- Complete the weight update (including gradient clipping, etc.).
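A minimal sketch of this loop with PyTorch's automatic mixed precision utilities (`torch.cuda.amp`); the model, optimizer, and data are stand-ins, and a CUDA device is assumed.

```python
import torch

model = torch.nn.Linear(512, 512).cuda()        # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()            # dynamic loss scaling (factor S)

for _ in range(10):                             # stand-in training loop
    x = torch.randn(32, 512, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():             # forward in FP16 where safe
        loss = model(x).float().pow(2).mean()
    scaler.scale(loss).backward()               # multiply the loss by S, then backprop
    scaler.unscale_(optimizer)                  # multiply gradients by 1/S before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # optional clipping
    scaler.step(optimizer)                      # skips the update if grads overflowed
    scaler.update()                             # adjust S dynamically
```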
BF16
Another way is to use bfloat16, which has the same exponent width as float32 (8 bits); its larger dynamic range means loss scaling is usually unnecessary, at the cost of a shorter mantissa.
2.4. Int8 Quantization
Check this blog