0x551 Decoder

1. Encoder-Decoder Model

1.1. BART

Model (BART)

BART is a denoising encoder-decoder model trained by

  • corrupting text with noise
  • learning a model to reconstruct the original text

(figure: BART)

BART's noising transformations are as follows

(figure: BART noising transformations)
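
A toy sketch of one such transformation, text infilling, is below; the span-length rule and mask symbol are simplifications, not BART's exact recipe (which samples span lengths from a Poisson distribution).

```python
# A toy sketch of BART-style text infilling: replace a random span with a single <mask>
# token; the model would be trained to reconstruct the original sequence from this input.
import random

def text_infilling(tokens, mask_token="<mask>"):
    start = random.randrange(len(tokens))
    length = random.randint(0, 3)                     # simplified span length
    return tokens[:start] + [mask_token] + tokens[start + length:]

original = "the quick brown fox jumps over the lazy dog".split()
corrupted = text_infilling(original)                  # encoder input
print(corrupted)                                      # decoder target is `original`
```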

1.2. T5

Model (T5, Text-to-Text Transfer Transformer)

(figure: T5)

Also check the Blog

2. Decoder Model (Causal Language Model)

2.1. GPT

GPT is a language model based on the transformer decoder. Check Mu Li's video

Model (GPT) 0.1B parameters

Check the next section for details

Model (GPT-2) 1.5B parameters

Model (GPT-3) 175B parameters

2.2. Transformer XL

Model (Transformer XL) overcomes the fixed-length context issue with two mechanisms (a toy sketch follows the figure below)

  • segment-level recurrence: hidden states of the previous segment are cached and provided as additional context to the next segment
  • relative positional encoding: fixed sinusoidal embeddings combined with learnable transformations

See this blog

(figure: Transformer XL)
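
A toy sketch of segment-level recurrence: a plain attention layer stands in for a full Transformer-XL block, the cached previous segment is detached so gradients do not flow across segments, and relative positional encoding is omitted; all shapes are arbitrary assumptions.

```python
# Queries come from the current segment; keys/values also include the cached memory.
import torch

attn = torch.nn.MultiheadAttention(embed_dim=16, num_heads=2, batch_first=True)

mem = torch.zeros(1, 4, 16)                         # cached hidden states of the previous segment
seg = torch.randn(1, 4, 16)                         # hidden states of the current segment

context = torch.cat([mem.detach(), seg], dim=1)     # extended context: [memory ; current segment]
out, _ = attn(query=seg, key=context, value=context)
mem = seg.detach()                                  # current segment becomes the next segment's memory
print(out.shape)                                    # torch.Size([1, 4, 16])
```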

2.3. XLNet

Model (XLNet) Permutation language model: trained to maximize the expected likelihood over permutations of the factorization order, so it sees bidirectional context while remaining autoregressive

2.4. Distributed Models

Model (LaMDA) A decoder-only dialog model

  • pretrained on next-word prediction
  • fine-tuned using the "context sentinel response" format

See this Blog

Model (PaLM, Pathways LM)

See this blog

3. Decoding

Naive ways to generate are the following (a toy sketch follows this list)

  • greedy search: pick the highest-probability token at each timestep
  • beam search: keep the top-k most likely hypotheses at each timestep
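
A minimal sketch of both strategies; `next_logprobs` is a stand-in for a real language model, and its values are made up for illustration.

```python
import heapq

VOCAB = {"a": -0.9, "b": -1.1, "</s>": -1.5}     # toy per-step log-probabilities

def next_logprobs(prefix):
    return VOCAB                                  # a real LM would condition on the prefix

def greedy_search(max_len=5):
    prefix, score = [], 0.0
    for _ in range(max_len):
        tok, lp = max(next_logprobs(prefix).items(), key=lambda kv: kv[1])
        prefix.append(tok)
        score += lp
        if tok == "</s>":
            break
    return prefix, score

def beam_search(k=2, max_len=5):
    beams = [([], 0.0)]                                                   # (hypothesis, cumulative log-prob)
    for _ in range(max_len):
        candidates = [(p, s) for p, s in beams if p and p[-1] == "</s>"]  # finished hypotheses pass through
        for p, s in beams:
            if p and p[-1] == "</s>":
                continue
            for t, lp in next_logprobs(p).items():
                candidates.append((p + [t], s + lp))
        beams = heapq.nlargest(k, candidates, key=lambda ps: ps[1])       # keep the top-k hypotheses
    return beams

print(greedy_search())
print(beam_search())
```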

Model (grid beam search, lexical constraints) extends beam search to enforce lexical constraints by adding a new axis that tracks how many constraint tokens each hypothesis has covered

Check this huggingface blog

(figure: grid beam search)
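
A hedged usage sketch of lexically constrained generation with the `force_words_ids` argument of Hugging Face `transformers`' `generate()` (the API covered in the linked blog); the model, prompt, and forced word are illustrative assumptions.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Lexical constraint: the output must contain the word "Berlin".
force_words_ids = tokenizer(["Berlin"], add_special_tokens=False).input_ids

inputs = tokenizer("translate English to German: the capital is beautiful.", return_tensors="pt")
outputs = model.generate(
    **inputs,
    force_words_ids=force_words_ids,   # constraints are handled by constrained beam search
    num_beams=5,                       # constrained generation requires beam search
    num_return_sequences=1,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```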

Model (nucleus sampling, top-p sampling) top-p sampling builds the candidate set \(V^{(p)}\) as the smallest set of top tokens whose cumulative probability crosses a threshold \(p\):

\[\sum_{x \in V^{(p)}} p(x | x_{1:i-1}) \geq p\]

The probability mass is then redistributed within this set. Unlike fixed top-k sampling, this adapts to the shape of the distribution (see the sketch below).
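
A minimal NumPy sketch of top-p sampling: sort tokens by probability, keep the smallest prefix whose cumulative mass crosses p, renormalize, and sample.

```python
import numpy as np

def top_p_sample(logits, p=0.9, rng=np.random.default_rng()):
    """Sample a token id from the smallest top set whose cumulative probability crosses p."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(-probs)                             # tokens by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1       # smallest prefix crossing the threshold
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # redistribute mass inside the nucleus
    return int(rng.choice(nucleus, p=nucleus_probs))

# A peaked distribution keeps few candidates; a flat one keeps many.
print(top_p_sample(np.array([3.0, 2.5, 0.1, -1.0, -2.0]), p=0.9))
```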

Speculative Decoding (Leviathan et al., 2023)1 samples a draft of several tokens from a small model \(q(x)\) and uses the correct (large) model \(p(x)\) to verify them, deciding at which timestep to cut the draft off and resample
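
A toy sketch of the acceptance rule from Leviathan et al. for a single draft token over a tiny vocabulary: accept the draft token \(x\) with probability \(\min(1, p(x)/q(x))\), otherwise resample from the normalized residual \(\max(0, p - q)\). The distributions below are made-up placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p, q):
    """Return a token distributed exactly according to p, using one draft sample from q."""
    x = rng.choice(len(q), p=q)                  # draft token from the small model q
    if rng.random() < min(1.0, p[x] / q[x]):     # accept with probability min(1, p(x)/q(x))
        return int(x)
    residual = np.maximum(p - q, 0.0)            # otherwise resample from norm(max(0, p - q))
    return int(rng.choice(len(p), p=residual / residual.sum()))

p = np.array([0.6, 0.3, 0.1])                    # assumed large-model distribution
q = np.array([0.4, 0.4, 0.2])                    # assumed small draft-model distribution
print(speculative_step(p, q))
```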

Model (Jacobi decoding) iteratively decodes the entire sequence in parallel until it converges to a fixed point. This can be enhanced by combining it with n-gram trajectories, as in lookahead decoding (see the sketch below)
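
A toy sketch of Jacobi decoding with a trivial deterministic `next_token` rule standing in for a greedy language model: every position is refreshed in parallel from the previous iteration's guess, and the loop stops at a fixed point, which matches sequential greedy decoding.

```python
def next_token(prefix):
    return (sum(prefix) + 1) % 10                 # toy "model": deterministic next-token rule

def jacobi_decode(prompt, length, max_iters=50):
    guess = [0] * length                          # arbitrary initial guess for every position
    for _ in range(max_iters):
        # refresh all positions in parallel, conditioning on the previous iteration's guess
        new = [next_token(prompt + tuple(guess[:i])) for i in range(length)]
        if new == guess:                          # fixed point: matches sequential greedy decoding
            return new
        guess = new
    return guess

print(jacobi_decode(prompt=(3,), length=5))       # converges in at most length + 1 iterations
```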

4. Calibration

Model (confidence calibration) the probability associated with the predicted class label should reflect its ground-truth correctness

Suppose the neural network is \(h(X) = (\hat{Y}, \hat{P})\), where \(\hat{Y}\) is the prediction and \(\hat{P}\) is the associated confidence; perfect calibration should satisfy

\[P(\hat{Y} = Y | \hat{P} = p) = p\]

A measure of calibration is ECE (Expected Calibration Error), defined as the expected difference between confidence and actual accuracy:

\[E_{\hat{P}} \left[ \left| P(\hat{Y} = Y | \hat{P} = p) - p \right| \right]\]
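
A minimal sketch of estimating ECE from samples with equal-width confidence bins; the bin count and binning scheme are assumptions, not fixed by the definition above.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - confidence| over confidence bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap              # weight by the fraction of samples in the bin
    return float(ece)

conf = np.array([0.9, 0.95, 0.8, 0.85])           # predicted confidences
hit = np.array([1, 0, 0, 1])                      # whether each prediction was correct
print(expected_calibration_error(conf, hit))      # overconfident predictions give a large ECE
```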

Analysis (larger models are well-calibrated) larger models are well-calibrated when prompted in the right format


  1. Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. Fast inference from transformers via speculative decoding. In International conference on machine learning, pages 19274–19286. PMLR.