0x502 Efficiency

1. Efficiency

Check this blog

1.2. Gradient

Model (Gradient Checkpoint)

  • some of the forward activations are discarded to save memory
  • those activations are recomputed when needed during the backward pass
  • check the gif here

See here for pytorch's implementation
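As a rough illustration of the idea (not PyTorch's implementation), here is a minimal pure-Python sketch on a chain of toy scalar layers y = tanh(w * x). The layer, the function names and the `every` parameter are assumptions made up for this example. Only the input of every `every`-th layer is kept; the rest is recomputed segment by segment during backward.

```python
import math

def layer(w, x):
    """One toy layer: y = tanh(w * x)."""
    return math.tanh(w * x)

def layer_grad(w, x, grad_y):
    """Backprop through one layer: returns (dL/dw, dL/dx)."""
    y = math.tanh(w * x)
    local = (1.0 - y * y) * grad_y          # tanh'(w * x) times the upstream gradient
    return local * x, local * w

def forward_checkpointed(weights, x, every):
    """Run the chain, saving only the input of every `every`-th layer."""
    checkpoints = {}
    for i, w in enumerate(weights):
        if i % every == 0:
            checkpoints[i] = x              # saved activation (segment input)
        x = layer(w, x)                     # all other activations are dropped
    return x, checkpoints

def backward_checkpointed(weights, checkpoints, every, grad_out):
    """Recompute each segment from its checkpoint, then backprop through it."""
    grads = [0.0] * len(weights)
    for start in sorted(checkpoints, reverse=True):
        end = min(start + every, len(weights))
        inputs = [checkpoints[start]]       # inputs of layers start .. end-1
        for i in range(start, end - 1):
            inputs.append(layer(weights[i], inputs[-1]))
        for i in range(end - 1, start - 1, -1):
            grads[i], grad_out = layer_grad(weights[i], inputs[i - start], grad_out)
    return grads

def backward_full(weights, x, grad_out):
    """Reference: keep every activation, no recomputation."""
    inputs = [x]
    for w in weights[:-1]:
        inputs.append(layer(w, inputs[-1]))
    grads = [0.0] * len(weights)
    for i in range(len(weights) - 1, -1, -1):
        grads[i], grad_out = layer_grad(weights[i], inputs[i], grad_out)
    return grads

weights = [0.5, -1.2, 0.8, 1.5, -0.7, 0.3]
out, ckpts = forward_checkpointed(weights, x=0.9, every=3)
g_ckpt = backward_checkpointed(weights, ckpts, every=3, grad_out=1.0)
g_full = backward_full(weights, 0.9, 1.0)
```

With 6 layers and `every=3`, only 2 activations are stored instead of 6, at the cost of one extra forward pass per segment; the gradients are the same as in the full-storage version.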

1.3. Pruning

Model (Lottery Ticket Hypothesis) after training and pruning, each unpruned connection's value is reset to its initialization, and the pruned network is then retrained
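A minimal sketch of the prune-and-reset step, assuming magnitude pruning as the pruning criterion (the note above does not say which criterion is used). The "trained" weights are random stand-ins for a real training run, and all names are hypothetical.

```python
import random

def magnitude_mask(trained, keep_fraction):
    """Keep the largest-magnitude trained weights, prune the rest."""
    n_keep = max(1, round(len(trained) * keep_fraction))
    ranked = sorted(range(len(trained)), key=lambda i: abs(trained[i]), reverse=True)
    keep = set(ranked[:n_keep])
    return [1 if i in keep else 0 for i in range(len(trained))]

def reset_to_init(init, mask):
    """Surviving weights go back to their initial values; pruned ones are 0."""
    return [w0 * m for w0, m in zip(init, mask)]

random.seed(0)
init = [random.gauss(0.0, 0.1) for _ in range(10)]
# Stand-in for real training: hypothetical "trained" values.
trained = [w + random.gauss(0.0, 0.5) for w in init]

mask = magnitude_mask(trained, keep_fraction=0.3)
ticket = reset_to_init(init, mask)
# `ticket` would now be retrained with `mask` held fixed.
```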

1.4. Distillation

1.5. Ensemble

Model (model soup) averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness

Greedy soups, where models are sequentially added to the soup if they improve accuracy on held-out data, outperform uniform averaging.
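The two variants can be sketched as follows, with each model reduced to a flat list of weights. The `evaluate` function is a hypothetical held-out metric invented for this example, and trying candidates from best to worst is an assumption about the ordering.

```python
def average(models):
    """Element-wise mean of several weight vectors."""
    return [sum(ws) / len(models) for ws in zip(*models)]

def uniform_soup(models):
    return average(models)

def greedy_soup(models, evaluate):
    """Try models from best to worst; keep one only if the soup improves."""
    ranked = sorted(models, key=evaluate, reverse=True)
    soup = [ranked[0]]
    best = evaluate(ranked[0])
    for candidate in ranked[1:]:
        score = evaluate(average(soup + [candidate]))
        if score > best:
            soup.append(candidate)
            best = score
    return average(soup)

# Hypothetical held-out metric: higher is better, peak at weights (1.0, 2.0).
def evaluate(weights):
    return -((weights[0] - 1.0) ** 2 + (weights[1] - 2.0) ** 2)

models = [[0.9, 2.1], [1.1, 1.9], [3.0, 0.0]]   # the third run is a poor one
uniform = uniform_soup(models)
greedy = greedy_soup(models, evaluate)
```

In this toy setup the greedy soup leaves out the poor third model, while the uniform soup is pulled away from the optimum by it.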

2. Quantization

Current NVIDIA Tensor Cores support many precisions, depending on the GPU generation: TF32, bfloat16, FP16, FP8 and INT8


2.1. Post-training Quantization

According to the TensorFlow website, post-training quantization is a conversion technique that can reduce model size while also improving CPU and hardware accelerator latency, with little degradation in model accuracy

  • simplest form: convert weights to 8-bit precision. At inference time, convert the 8-bit values back to floating point and perform float inference
  • dynamic range quantization: activations are also quantized to 8 bits, dynamically at inference time, and computation is done in 8-bit precision
  • full integer quantization: everything is quantized to integers. A calibration process is needed to estimate the float range (min, max) of the tensors, so a representative dataset is required.
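To make the calibration step concrete, here is a small sketch of affine (scale and zero-point) int8 quantization, which is one common scheme and an assumption here rather than the exact TensorFlow procedure. The range is estimated from a tiny, made-up representative dataset; all function names are hypothetical.

```python
def calibrate(samples):
    """Estimate the float range (min, max) from representative data."""
    lo = min(min(s) for s in samples)
    hi = max(max(s) for s in samples)
    return min(lo, 0.0), max(hi, 0.0)       # make sure 0.0 is inside the range

def quant_params(lo, hi, qmin=-128, qmax=127):
    """Map [lo, hi] onto the integer range [qmin, qmax]."""
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))          # clamp to the int8 range

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

representative = [[-1.0, 0.2, 0.7], [0.1, 1.5, -0.4]]
lo, hi = calibrate(representative)
scale, zp = quant_params(lo, hi)

values = [-1.0, -0.33, 0.0, 0.5, 1.5]
q_values = [quantize(v, scale, zp) for v in values]
restored = [dequantize(q, scale, zp) for q in q_values]
```

Values inside the calibrated range come back with an error of at most about one quantization step (`scale`); values outside it would be clamped, which is why the representative dataset matters.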

2.2. Quantization-Aware Training

Pro: it achieves higher accuracy. Cons: it requires a training pipeline, labeled data and hyperparameter tuning.

2.3. Float16 Quantization

Mixed Precision Training

Tensor Cores in GPUs support efficient execution of convolutions and matrix multiplications (AB+C) with float16.

Model (mixed precision) the forward and backward passes are done in FP16; the gradients are then used to update the FP32 reference weights

See NVIDIA's manual for more details

One issue with 16-bit floats is overflow and underflow, as illustrated in the following figure. For example, many activation gradients become 0 because they fall below FP16's representable range.


To prevent this issue, the loss should be multiplied by a scaling factor before backprop. This factor can be chosen dynamically.
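The underflow and the effect of scaling can be checked with Python's `struct` module, whose `'e'` format is IEEE half precision. The gradient value and the scaling factor `S` below are arbitrary choices for illustration.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 value (struct's 'e' format)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

grad = 1e-8                   # a small activation gradient
S = 1024.0                    # hypothetical loss-scaling factor

unscaled = to_fp16(grad)      # below FP16's smallest subnormal (~6e-8): becomes 0
scaled = to_fp16(grad * S)    # scaling the loss scales every gradient by S
recovered = scaled / S        # unscale in higher precision before the update
```

Here `unscaled` is exactly 0.0, while `recovered` stays within about 1% of the original gradient.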

The overall training procedure is:

Maintain a primary copy of weights in FP32.

For each iteration:

  • Make an FP16 copy of the weights.
  • Forward propagation (FP16 weights and activations).
  • Multiply the resulting loss with the scaling factor S.
  • Backward propagation (FP16 weights, activations, and their gradients).
  • Multiply the weight gradient with 1/S.
  • Complete the weight update (including gradient clipping, etc.).
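The steps above can be sketched for a single scalar weight as follows. FP16 is simulated with `struct`; `grad_fn` is a hypothetical model with a deliberately tiny gradient, and halving `S` and skipping the update on overflow is one common heuristic for choosing the factor dynamically, assumed here rather than taken from the text.

```python
import math
import struct

def to_fp16(x):
    """Round to FP16; values beyond FP16's range become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

def grad_fn(w):
    """Hypothetical model: a tiny gradient pulling w toward 3.0."""
    return 1e-8 * (w - 3.0)

def train_step(w32, S, lr=1e7, clip=1.0):
    w16 = to_fp16(w32)                       # make an FP16 copy of the weight
    grad16 = to_fp16(S * grad_fn(w16))       # backward on the scaled loss, stored in FP16
    if not math.isfinite(grad16):
        return w32, S / 2.0                  # overflow: skip this update and shrink S
    grad32 = grad16 / S                      # multiply the weight gradient by 1/S
    grad32 = max(-clip, min(clip, grad32))   # gradient clipping on the unscaled gradient
    return w32 - lr * grad32, S              # update the FP32 primary copy

w_plain, _ = train_step(1.0, S=1.0)            # gradient underflows to 0 in FP16
w_scaled, _ = train_step(1.0, S=65536.0)       # gradient survives the FP16 round trip
w_skip, S_new = train_step(1.0, S=2.0 ** 60)   # scaled gradient overflows, step skipped
```

Without scaling the weight does not move at all; with `S=65536` it moves by roughly `lr * 2e-8 = 0.2`; with an oversized `S` the step is skipped and `S` is halved.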


Another option is to use bfloat16, which has the same exponent size (8 bits) as float32 and therefore the same dynamic range, at the cost of fewer mantissa bits
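A quick way to see the difference in range is to build bfloat16 from the top 16 bits of a float32. The sketch below truncates (rounds toward zero), which is a simplification; actual hardware typically rounds to nearest. The helper names are made up for this example.

```python
import math
import struct

def to_bf16(x):
    """Keep the top 16 bits of the float32 encoding (sign, 8 exponent, 7 mantissa bits)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

def to_fp16(x):
    """Round to FP16; values beyond FP16's range become +/-inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

tiny, big = 1e-8, 70000.0

tiny_fp16, big_fp16 = to_fp16(tiny), to_fp16(big)   # underflows to 0, overflows to inf
tiny_bf16, big_bf16 = to_bf16(tiny), to_bf16(big)   # both stay finite and non-zero
```

Both values survive in bfloat16, but only to about 2 to 3 significant decimal digits, which is the precision traded for the wider range.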

2.4. Int8 Quantization

Check this blog

2.5. 4-bit Quantization

4. Reference