
0x433 Neural Model

1. Foundation Language Model

1.1. Encoder Model (Masked Model)

1.1.1. Traditional Contextual Models

Model (CoVe, Context Vector) Trains contextualized embeddings on a supervised dataset (e.g. machine translation)

\[\text{CoVe}(x) = \text{BLSTM}(\text{GloVe}(x))\]

The concatenation of GloVe and CoVe is used as the representation for downstream tasks

Model (ELMo, Embeddings from Language Model) Learns the contextualized word representation using a bidirectional LSTM language model.


The model minimizes the NLL in both directions

\[\mathcal{L} = -\sum_i (\log p(x_i | x_1, ..., x_{i-1}) + \log p(x_i | x_{i+1}, ..., x_n))\]

param (100M)


After training, ELMo obtains the representation of token \(i\) by combining the hidden vectors \(h_{i,l}\) from both directions across all layers \(l\) with task-specific layer weights \(w_l\)

\[v_i = \sum_{l} w_l h_{i,l}\]
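The layer-weighted combination can be sketched in a few lines of NumPy. This is a minimal illustration, not the ELMo implementation: the function name `mix_layers` is hypothetical, and the softmax normalization of the weights and the global scale `gamma` follow the ELMo paper rather than these notes.

```python
import numpy as np

def mix_layers(hidden, layer_logits, gamma=1.0):
    """Combine per-layer hidden states into one vector per token.

    hidden:       array of shape (num_layers, seq_len, dim)
    layer_logits: array of shape (num_layers,), task-specific and learnable
    gamma:        task-specific global scale (assumption, from the ELMo paper)
    """
    # Softmax over layers so the weights are positive and sum to 1.
    w = np.exp(layer_logits - layer_logits.max())
    w = w / w.sum()
    # Weighted sum over the layer axis -> (seq_len, dim).
    return gamma * np.tensordot(w, hidden, axes=1)

# Two layers, two tokens, three dimensions.
hidden = np.stack([np.zeros((2, 3)), np.ones((2, 3))])
mixed = mix_layers(hidden, layer_logits=np.zeros(2))
```

With equal logits both layers contribute equally; a downstream task would learn `layer_logits` together with its own parameters.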

Different tasks seem to have different weights across different layers.

contextualized embedding

Topic models and naive embedding models assign a fixed embedding or representation to each word. However, a word may have different meanings in different contexts, so recent models use contextualized representations instead of fixed embeddings.

The contextualized models are usually trained with generative language modeling or masked language modeling

1.1.2. BERT

Model (BERT) a multi-layer bidirectional Transformer


  • BERT base: 110M (L=12, H=768, A=12), this config is chosen to be the same size as GPT
  • BERT large: 340M (L=24, H=1024, A=16)



BERT is pretrained with two objectives, Masked LM and Next Sentence Prediction, which are used at the same time.

  • Masked LM: randomly mask 15% of the tokens and use the other tokens to predict them. The final hidden vectors of the masked tokens are fed into a softmax over the vocabulary, and cross-entropy loss is applied.
  • Next sentence prediction: classify whether one sentence comes after the other sentence
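The Masked LM corruption step can be sketched as follows. This is a toy version under stated assumptions: the helper name `mask_tokens` is hypothetical, and the 80/10/10 split (replace with `[MASK]`, replace with a random token, keep unchanged) is a detail from the BERT paper that these notes do not spell out.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """Pick ~15% of positions as prediction targets and corrupt them."""
    rng = random.Random(seed)
    n_pick = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_pick))
    corrupted = list(tokens)
    for pos in positions:
        roll = rng.random()
        if roll < 0.8:
            corrupted[pos] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[pos] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return corrupted, positions

tokens = "the quick brown fox jumps over the lazy dog again and again".split()
corrupted, targets = mask_tokens(tokens, vocab=["cat", "runs", "blue"])
```

The model is then trained to predict the original token at each position in `targets`; the other positions do not contribute to the loss.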


Model (SpanBERT) SpanBERT improves on BERT by masking contiguous spans instead of individual tokens

Objectives are

  • Span Masked LM: similar to the BERT model but a randomly selected span is masked
  • Span Boundary Objective: encourage boundary tokens to predict each word in the span with positional encoding

Figure: SpanBERT (reference: the SpanBERT paper)

Model (RoBERTa) The differences from BERT are

  • BERT uses a static mask (masks are decided during preprocessing), while RoBERTa chooses the mask dynamically every epoch
  • each input contains up to 512 tokens, sampled as contiguous sentences joined with a separator
  • the next sentence prediction task is dropped
  • bigger batches, larger learning rate, bigger training set

1.1.3. Cross-Lingual Model

Model (mBERT) Same arch as BERT, but it is trained on the Wikipedia pages of 104 languages

Analysis (mBERT transfer)

  • approach: fine-tune mBERT on a specific task in one language, then test it in another language
  • it generalizes well across languages, especially lexically similar ones, but it also works for languages written in different scripts (e.g. Urdu written in Arabic script, transferred to Hindi written in Devanagari)

Model (monolingual BERT transfer)

  • adapt a monolingual BERT to a new language by freezing the encoder and retraining the embedding layer


Model (XLM)

Translation Language Model

  • predict a masked English word by attending to both the English sentence and its French translation, which encourages the model to align their representations.


Model (adapter-based transfer, MAD-X) using language, task adapters

1.2. Encoder-Decoder Model

1.2.1. BART

Model (BART)

BART is a denoising encoder-decoder model trained by

  • corrupting text with noise
  • learning a model to reconstruct the original text


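BART's noising transformations are token masking, token deletion, text infilling, sentence permutation and document rotation.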


1.2.2. T5

Model (T5, Text-to-Text Transfer Transformer)


Also check the Blog

1.3. Decoder Model (Causal Language Model)

1.3.1. GPT

GPT is a language model using transformer. Check Mu Li's video

Model (GPT) 0.1B

Check the next section for details

Model (GPT2) 1.5B

Model (GPT3) 175B

1.3.2. Transformer XL

Model (Transformer XL) overcomes the fixed-length context issue by

  • segment-level recurrence: hidden states of the previous segment are cached and provided to the next segment
  • relative positional encoding: use fixed embedding with learnable transformation

See this blog


1.3.3. XLNet

Model (XLNet) Permutation language model

1.3.4. Distributed Models

Model (LaMDA) A decoder only dialog model

  • pretrain on next word prediction
  • fine-tuned using "context sentinel response" format

See this Blog

Model (PaLM, Pathway LM)

See this blog

1.4. Analysis

1.4.1. Scaling

Check this lecture series

Analysis (do you need billions of words of pretraining data) An LM requires only 10M or 100M words to learn syntactic/semantic features, but a much larger dataset (1B, 30B) is required to acquire common sense knowledge

Analysis (scaling, scaling laws for neural language model) cross-entropy loss scales as a power law with respect to model size, dataset size and training compute.
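A power law is a straight line in log-log space, so its exponent can be estimated with a linear fit. The sketch below assumes the simple form loss = c * size^(-alpha) with no irreducible loss term; the function name `fit_power_law` and the data points are synthetic and purely illustrative, not measurements from the paper.

```python
import numpy as np

def fit_power_law(sizes, losses):
    """Fit loss ~ c * size**(-alpha) by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
    return -slope, np.exp(intercept)  # (alpha, c)

# Synthetic points generated from a known power law (illustrative only).
sizes = np.array([1e6, 1e7, 1e8, 1e9])
losses = 50.0 * sizes ** -0.1
alpha, c = fit_power_law(sizes, losses)
```

On real measurements the fit is only meaningful in the range where the other two factors (for example data and compute, when varying model size) are not the bottleneck.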


Analysis (U-shape scaling) A few tasks show worse performance with larger models. Those tasks, however, actually follow a U-shaped scaling curve, where the decreased performance of medium-sized models might be explained by a "distractor task"

1.4.2. Sampling

Model (nucleus sampling)

Both deterministic approach (beam search) and random approach (pure sampling) have cons:

  • beam search: degenerate repetition, it is less surprising
  • pure sampling: incoherent generation
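Nucleus (top-p) sampling sits between these two extremes: it samples only from the smallest set of tokens whose cumulative probability reaches a threshold p, then renormalizes. A minimal NumPy sketch, where the function names and the default `top_p=0.9` are illustrative choices:

```python
import numpy as np

def nucleus_filter(probs, top_p=0.9):
    """Zero out the low-probability tail and renormalize."""
    order = np.argsort(probs)[::-1]            # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    # Smallest prefix whose total mass reaches top_p.
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def nucleus_sample(probs, top_p=0.9, rng=None):
    """Draw one token id from the truncated distribution."""
    rng = rng or np.random.default_rng(0)
    return int(rng.choice(len(probs), p=nucleus_filter(probs, top_p)))

probs = np.array([0.5, 0.3, 0.15, 0.05])
token = nucleus_sample(probs, top_p=0.7)   # only tokens 0 and 1 can be drawn
```

Because the cutoff adapts to the shape of the distribution, a peaked distribution keeps few candidates while a flat one keeps many.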

1.4.3. Calibration

Model (confidence calibration) the probability associated with the predicted class label should reflect its ground-truth correctness likelihood

Suppose the neural network is \(h(X) = (\hat{Y}, \hat{P})\), where \(\hat{Y}\) is the prediction, \(\hat{P}\) is the associated confidence, a perfect calibration should satisfy

\[P(\hat{Y} = Y | \hat{P} = p) = p\]

A measurement of calibration is ECE (Expected Calibration Error), defined as the expected difference between confidence and actual accuracy

\[E_{\hat{P}} [ |P (\hat{Y} = Y | \hat{P} = p ) - p| ]\]
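In practice ECE is estimated by grouping predictions into confidence bins and averaging the gap between accuracy and mean confidence per bin, weighted by bin size. The sketch below assumes equal-width bins; the function name and `n_bins=10` are illustrative choices.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE estimate: sum over bins of (bin weight) * |accuracy - confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so bin ids fall in 0..n_bins-1 (confidence 1.0 -> last bin).
    bin_ids = np.digitize(confidences, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap   # in_bin.mean() is the fraction of samples in the bin
    return ece

# Four predictions made with confidence 0.9, three of them correct.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

Here the model claims 90% confidence but is right 75% of the time, so the estimate is the gap between the two.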

Analysis (larger models are well-calibrated) larger models are well-calibrated in the right format

2. Adaptation

2.1. Supervised Tuning

Model (GPT) fine-tune the pretrained model on the labeled dataset \((x,y)\) by feeding the final activation \(h_l^m\) into an added linear layer \(W_y\)

\[P(y | x_1, ..., x_m) = \text{softmax}(h_l^m W_y)\]

The objective is

\[L_2 = \sum_{(x,y)} \log P(y | x_1, ..., x_m)\]

The final objective also includes the language modeling objective \(L_1\) as an auxiliary objective (to help convergence and generalization)

\[L = L_2 + \lambda L_1\]

To transform other tasks into classification tasks, it applies task-specific input transformations that turn structured inputs into a single token sequence.


Model (BERT) the downstream task formulation is as follows:

  • sentence -> sentence class: connect the CLS token's embedding to a linear classifier to predict. BERT is fine-tuned, while the linear classifier is trained from scratch

  • sentence -> per word class: connect every word's embedding to a classifier and train it.

  • two sentences -> single class: join the two sentences with the SEP token and use the CLS token to predict (e.g. NLI task)

  • sentence -> sentence extraction: for extraction-based QA, suppose the document is \(D=\{d_1, d_2, ..., d_N\}\) and the query is \(Q= \{q_1, q_2,...,q_M\}\), then train a model that uses \(D,Q\) to predict two integers \(s,e\) indicating that the answer is \(\{d_s, ..., d_e\}\). \(s,e\) can be found by training two embeddings (a start vector and an end vector) which should be close to the embeddings of the words at the target start and end indices respectively.
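The extraction step can be sketched as scoring every token against a start vector and an end vector and picking the best pair with s <= e. This is a toy illustration with hypothetical names (`best_span`, `max_len`); indices are 0-based here, and in a real model the token embeddings would come from the encoder's final layer.

```python
import numpy as np

def best_span(token_embeddings, start_vec, end_vec, max_len=30):
    """Return (s, e) maximizing start_score[s] + end_score[e] with s <= e."""
    start_scores = token_embeddings @ start_vec
    end_scores = token_embeddings @ end_vec
    n = len(token_embeddings)
    best, best_score = (0, 0), -np.inf
    for s in range(n):
        # Only consider spans that end at or after s and are at most max_len long.
        for e in range(s, min(n, s + max_len)):
            score = start_scores[s] + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best

# Four document tokens with one-hot embeddings, so the scores are easy to read.
doc = np.eye(4)
span = best_span(doc, start_vec=np.array([0.0, 1.0, 0.0, 0.0]),
                 end_vec=np.array([0.0, 0.0, 1.0, 0.0]))
```

During training the start and end vectors are learned so that a softmax over the start (end) scores puts its mass on the gold start (end) position.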

2.2. Instruction/Demonstration-Tuning

Dataset (self-instruct) prepare an instruction set in the following manner:

  • prepare some seed tasks with input/output instances
  • prompt the LM with the seed tasks to generate more tasks
  • prompt the LM with seed tasks and their inputs/outputs to generate inputs/outputs for the new tasks
  • filter the outputs to encourage diversity

See the appendix for the prompt examples

2.3. Reward Tuning

Reinforcement Learning from Human Feedback (RLHF) used in ChatGPT

Model (Human Preference Reward)

Begin with an autoregressive language model \(\rho\), which can be considered as a policy:

\[\rho(y|x) = \rho(xy)/\rho(x)\]

where \(x\) is a sequence of input, \(y\) is a sequence of output.

We want to fine-tune a policy \(\pi\) initialized from \(\rho\). If a reward function \(r: X \times Y \to \mathbb{R}\) is defined, then we can use RL to directly optimize the expected reward \(E_\pi(r)\). However, such a reward function might be difficult to design, so we approximate the reward using human labels.

In this work, we ask humans to choose the best option \(b\) from 4 options \(y_0, y_1, y_2, y_3\), then we fit a reward model \(r\) using the following loss

\[E [\log \frac{\exp(r(x, y_b))}{\sum_i \exp(r(x,y_i))}]\]
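The expression is the log-probability that a softmax over the rewards of the candidate options assigns to the human-chosen one, so fitting \(r\) amounts to maximizing it (minimizing its negative). A minimal NumPy sketch with a hypothetical function name:

```python
import numpy as np

def reward_model_loss(rewards, chosen):
    """Negative mean log-softmax probability of the human-preferred option.

    rewards: array (batch, n_options) of scalar rewards r(x, y_i)
    chosen:  array (batch,) with the index b of the preferred option
    """
    rewards = np.asarray(rewards, dtype=float)
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = rewards - rewards.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(rewards)), chosen]
    return -picked.mean()

# One comparison with four options; the labeler picked option 2.
loss_uninformed = reward_model_loss([[0.0, 0.0, 0.0, 0.0]], [2])
loss_better = reward_model_loss([[0.0, 0.0, 3.0, 0.0]], [2])
```

A reward model that scores all options equally pays log 4 per comparison; raising the reward of the chosen option lowers the loss.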

Then we fine-tune \(\pi\) with respect to the reward model \(r\), and also add a penalty to keep \(\pi\) from moving too far from \(\rho\).

The modified reward is

\[R(x,y) = r(x,y) - \beta \log\frac{\pi(y | x)}{\rho(y | x)}\]
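For an autoregressive model the sequence log-probability is the sum of the per-token log-probabilities, so the penalty term is a difference of two sums. A small sketch, where the function name and the value of `beta` are illustrative assumptions:

```python
def penalized_reward(reward, token_logp_pi, token_logp_rho, beta=0.1):
    """R(x, y) = r(x, y) - beta * log(pi(y|x) / rho(y|x)).

    token_logp_pi / token_logp_rho: per-token log-probabilities of the sampled
    response y under the fine-tuned policy and the original model.
    """
    log_ratio = sum(token_logp_pi) - sum(token_logp_rho)
    return reward - beta * log_ratio

# The policy finds the response more likely than the original model does,
# so part of the reward is taken away.
R = penalized_reward(1.0, [-1.0, -1.0], [-1.5, -1.5], beta=0.5)
```

In expectation over responses sampled from \(\pi\), the log-ratio term is the KL divergence between \(\pi\) and \(\rho\), which is why it keeps the policy close to the original model.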

Some related implementation can be found in the TRL repo

2.4. Parameter-Efficient Fine-Tuning (PEFT)

check this talk

Let a neural network \(f_\theta: \mathcal{X} \to \mathcal{Y}\) be decomposed into a composition of functions \(f_{\theta_1} \odot f_{\theta_2} \odot \dots \odot f_{\theta_l}\), where each \(f_{\theta_i}\) has parameters \(\theta_i\).

A module with parameters \(\phi\) can modify the \(i\)-th subfunction as follows:

  • Parameter composition: \(f'_i(x) = f_{\theta_i \oplus \phi}(x)\)
  • Input composition: \(f'_i(x) = f_{\theta_i}([x, \phi])\)
  • Function composition: \(f'_i(x) = f_{\theta_i} \odot f_\phi(x)\)

Parameter Composition

Model (adapter) only add a few trainable parameters per task compared with fine-tuning top-layers


Model (prefix tuning) optimizes a small set of continuous task-specific vectors (i.e. the prefix)


Model (LoRA, low rank adaptation) constraining the updated parameter to be low-rank

\[W_0 + \Delta W = W_0 + BA\]

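where \(B \in \mathbb{R}^{d \times r}\) and \(A \in \mathbb{R}^{r \times k}\) with rank \(r \ll \min(d, k)\), so the update \(BA\) is low-rank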
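A low-rank update of a single linear layer can be sketched in NumPy as follows. The class name `LoRALinear` is hypothetical; the zero initialization of `B`, the small random initialization of `A`, and the `alpha / rank` scaling follow the LoRA paper rather than these notes, and no training loop is shown.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W0 plus a trainable low-rank update B @ A."""

    def __init__(self, W0, rank, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W0.shape
        self.W0 = W0                                         # frozen pretrained weight
        self.A = rng.normal(scale=0.01, size=(rank, d_in))   # trainable
        self.B = np.zeros((d_out, rank))                     # trainable, starts at zero
        self.scale = alpha / rank

    def __call__(self, x):
        # Base output plus the low-rank correction; W0 is never modified.
        return x @ self.W0.T + self.scale * (x @ self.A.T) @ self.B.T

    def merged_weight(self):
        # After training the update can be folded into a single matrix.
        return self.W0 + self.scale * self.B @ self.A

W0 = np.random.default_rng(1).normal(size=(4, 6))
layer = LoRALinear(W0, rank=2)
y = layer(np.ones((3, 6)))   # equals the frozen layer's output while B is zero
```

Only `A` and `B` are trained, which is `rank * (d_in + d_out)` parameters instead of `d_in * d_out`, and merging the update means no extra cost at inference time.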

Model (T-few) multiplies activations with learned vectors

Model (setfit) two stage adaptation

  • contrastive fine-tuning
  • head training


2.5. In-Context Learning (prompting)

Survey (Prompt Methods)

pretrain -> prompt -> predict

It has the following steps:

prompt addition: given an input text \(x\), we apply a template to it

\[x' = f_{\text{prompt}}(x)\]

answer search: then we search for the text \(\hat{z}\) which maximizes the pretrained LM score

\[\hat{z} = \text{search}_{z \in Z} P(f_{\text{fill}}(x', z); \theta)\]

answer mapping: the highest-scoring answer \(\hat{z}\) is mapped to the output \(\hat{y}\)
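The three steps can be sketched end to end with a toy scorer. Everything named here is hypothetical: `prompt_predict`, the `[X]`/`[Z]` slot markers, the template and the answer set are illustrative, and `toy_score` is a stand-in for the pretrained LM probability \(P(\cdot; \theta)\), not a real model.

```python
def prompt_predict(x, template, answers, label_map, score_fn):
    # 1. Prompt addition: put the input into the template, leaving the answer slot [Z].
    prompt = template.replace("[X]", x)

    # 2. Answer search: fill the slot with each candidate and keep the best-scoring one.
    def filled_score(z):
        return score_fn(prompt.replace("[Z]", z))
    best_z = max(answers, key=filled_score)

    # 3. Answer mapping: translate the answer word into the task label.
    return best_z, label_map[best_z]

def toy_score(text):
    # Stand-in for an LM score: rewards texts whose sentiment cues agree.
    cues = [("love", "great"), ("hate", "terrible")]
    return float(any(a in text and b in text for a, b in cues))

TEMPLATE = "[X] Overall, it was a [Z] movie."
LABELS = {"great": "positive", "terrible": "negative"}

answer, label = prompt_predict("I love this film.", TEMPLATE, list(LABELS), LABELS, toy_score)
```

With a real LM, `score_fn` would return the (log-)probability of the filled prompt, and the search could run over a larger answer space than a fixed list.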

This survey has a table of many good examples


Model (chain of thought) the prompt includes demonstrations that show the intermediate reasoning steps


3. Task

3.1. Information Retrieval

Classical lexical/statistical IR methods are summarized in the search engine note

3.1.1. Dense Retrieval

Model (ME-BERT) Multi-Vector Encoding

Model (ColBERT) late interaction
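Late interaction keeps one vector per token for both query and document and only combines them at scoring time: each query token takes its maximum similarity over all document tokens, and these maxima are summed. A minimal sketch of that scoring rule, where the function name is hypothetical and the L2 normalization (so that dot products are cosine similarities) is an assumption:

```python
import numpy as np

def late_interaction_score(query_vecs, doc_vecs):
    """Sum over query tokens of the max cosine similarity to any document token."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T                      # (num_query_tokens, num_doc_tokens)
    return float(sim.max(axis=1).sum())

query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # matches both query tokens
doc_b = np.array([[1.0, 0.0]])                           # matches only the first
score_a = late_interaction_score(query, doc_a)
score_b = late_interaction_score(query, doc_b)
```

Because document vectors do not depend on the query, they can be computed and indexed offline, which is the main practical appeal over a full cross-encoder.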


3.1.2. Reranking

3.1.3. Generation

Model (promptagator) leverages LLM as a few-shot query generator, and creates task-specific retrievers based on the generated data

Model (WebGPT) allows GPT to search/navigate the web

  • Behavior cloning
  • Reward Modeling
  • Reinforcement learning
  • Rejection sampling


3.2. Translation

Model (Google translation 2017)

Model (knowledge distillation) Distill knowledge from multiple teachers (each trained separately on the data of one language pair) into a single multilingual student. The loss combines the NLL loss and a distillation loss (cross entropy between the student and teacher distributions)

Model (massively multilingual model)

Transfer vs Interference:

the goal is to achieve

  • high transfer (positive transfer) to low-resource languages
  • low interference (negative transfer) for high-resource languages.

Sampling strategy:

  • sampling from the original language distribution gives low transfer/low interference
  • equal sampling (by upsampling low-resource languages) gives high transfer/high interference
  • this work suggests that temperature-based sampling gives a good balance between transfer and interference. The sampling probability is proportional to \(p_l^{1/T}\), where \(p_l\) is the original distribution and \(T=5\)
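The effect of the temperature is easy to see numerically. In the sketch below the function name and the corpus sizes are made up for illustration; T=1 recovers the original distribution and a very large T approaches equal sampling.

```python
import numpy as np

def temperature_sampling(sizes, T=5.0):
    """Sampling probabilities proportional to p_l ** (1 / T)."""
    p = np.asarray(sizes, dtype=float)
    p = p / p.sum()            # original data distribution p_l
    q = p ** (1.0 / T)
    return q / q.sum()         # renormalize so the probabilities sum to 1

# One high-resource and one low-resource language (illustrative sizes).
sizes = [1000, 10]
original = temperature_sampling(sizes, T=1.0)
smoothed = temperature_sampling(sizes, T=5.0)
```

With T=5 the low-resource language is sampled far more often than its share of the data, but still less often than under equal sampling.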

4. Reference

[0] original papers. All images are taken from the original papers or blogs

[1] CMU 11-737 Multilingual Natural Language Processing