0x433 Neural Model
- 1. Foundation Language Model
- 2. Adaptation
- 3. Task
- 4. Reference
1. Foundation Language Model
1.1. Encoder Model (Masked Model)
1.1.1. Traditional Contextual Models
Model (CoVe, Context Vectors) trains contextualized embeddings on a supervised dataset (e.g. machine translation);
the concatenation of GloVe and CoVe vectors is used as the representation for downstream tasks
Model (ELMo, Embeddings from Language Models) learns contextualized word representations using a bidirectional LSTM language model.
pretraining
The model is trained to minimize the negative log-likelihood (NLL) in both directions.
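A sketch of the bidirectional objective (reconstructed from the ELMo paper, with shared token-embedding and softmax parameters):
\[ \mathcal{L} = -\sum_{k=1}^{N} \Big( \log p(t_k \mid t_1, \dots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \dots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \Big) \]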
param (100M)
downstream
After training, ELMo obtains representations by combining hidden vectors from both directions and across all layers with task-specific weights \(w_i\):
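A hedged reconstruction of the combination, using the note's weight notation \(w_i\) (the ELMo paper also includes a task-specific scale \(\gamma\)):
\[ \text{ELMo}_k^{task} = \gamma^{task} \sum_{i=0}^{L} w_i^{task} \, h_{k,i} \]
where \(h_{k,i}\) is the hidden vector of token \(k\) at layer \(i\) (layer 0 being the token embedding) and the \(w^{task}\) are softmax-normalized.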
Different tasks seem to have different weights across different layers.
contextualized embedding
Topic models and naive embedding models assign a fixed embedding or representation to each word. However, a word may have different meanings in different contexts, so recent models use contextualized representations instead of fixed embeddings.
Contextualized models are usually trained with generative (causal) language modeling or masked language modeling.
1.1.2. BERT
Model (BERT) a multi-layer bidirectional Transformer encoder
Config:
- BERT base: 110M (L=12, H=768, A=12), this config is chosen to be the same size as GPT
- BERT large: 340M (L=24, H=1024, A=16)
Pretraining
BERT is pretrained with two objectives, Masked LM and Next Sentence Prediction, which are applied jointly.
- Masked LM: 15% of tokens are randomly masked and predicted from the surrounding tokens. The final hidden states of the masked positions are fed into a softmax over the vocabulary, and a cross-entropy loss is applied (a minimal masking sketch follows this list).
- Next Sentence Prediction: classify whether one sentence comes after the other.
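A minimal sketch of the masked-LM corruption step (illustrative helper, not the official BERT code; the paper replaces a chosen position with [MASK] 80% of the time, a random token 10%, and leaves it unchanged 10%):

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15):
    """Corrupt a token sequence for masked LM training (illustrative sketch)."""
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 marks positions ignored by the loss
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tok                 # only masked positions contribute to the loss
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_id         # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the original token
    return inputs, labels
```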
Model (SpanBERT) SpanBERT improves BERT by masking contiguous spans instead of individual tokens
Objectives are
- Span Masked LM: similar to BERT's masked LM, but a randomly selected contiguous span is masked
- Span Boundary Objective: encourages the span's boundary tokens, together with a position embedding, to predict each word inside the span (see the formula after this list)
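A hedged reconstruction of the SBO from the SpanBERT paper: each token inside a masked span \((s, e)\) is predicted from the two boundary tokens and a relative position embedding,
\[ y_i = f(x_{s-1}, x_{e+1}, p_{i-s+1}) \]
where \(f\) is a 2-layer feed-forward network and \(y_i\) is used to predict token \(x_i\) with a cross-entropy loss.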
Reference: from the SpanBERT paper
Model (RoBERTa) the differences from BERT are:
- BERT uses static masking (masks are decided during preprocessing), while RoBERTa chooses masks dynamically at every epoch
- each input of up to 512 tokens is packed with contiguous sentences, joined with a separator token
- the next sentence prediction task is dropped
- larger batches, a higher learning rate, and a larger training set
1.1.3. Cross-Lingual Model
Model (mBERT) same architecture as BERT, but trained on the Wikipedia pages of 104 languages
Analysis (mBERT transfer)
- approach: fine-tune mBERT on a specific task in one language, then test it in another language
- it generalizes well cross-lingually, especially for lexically similar languages, but also works across different scripts (e.g. Urdu written in Arabic script transferring to Hindi written in Devanagari)
Model (monolingual BERT transfer)
- adapt a monolingual BERT to a new language by freezing the encoder and retraining the embedding layer
Model (XLM)
Translation Language Model
- predict a masked English word by attending to both the English and the French sentence, which encourages the model to align their representations.
Model (adapter-based transfer, MAD-X) uses separate language adapters and task adapters
1.2. Encoder-Decoder Model
1.2.1. BART
Model (BART)
BART is a denoising encoder-decoder model trained by
- corrupting text with noise
- learning a model to reconstruct the original text
The BART noising transformations are token masking, token deletion, text infilling (spans replaced by a single mask token), sentence permutation, and document rotation.
1.2.2. T5
Model (T5, Text-to-Text Transfer Transformer)
Also check the Blog
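Every task is cast as text-to-text with a task prefix; examples roughly following the figure in the T5 paper:
- input "translate English to German: That is good." -> target "Das ist gut."
- input "cola sentence: The course is jumping well." -> target "not acceptable"
- input "summarize: <article text>" -> target "<summary>"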
1.3. Decoder Model (Causal Language Model)
1.3.1. GPT
GPT is a causal language model built on the Transformer decoder. Check Mu Li's video
Model (GPT) 0.1B
Check the next section for details
Model (GPT2) 1.5B
Model (GPT3) 175B
1.3.2. Transformer XL
Model (Transformer XL) overcomes the fixed-length context issue with
- segment-level recurrence: hidden states of the previous segment are cached and provided to the next segment (see the sketch after this list)
- relative positional encoding: uses fixed sinusoidal embeddings with learnable transformations
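A minimal sketch of segment-level recurrence (the `attention_layer` interface is hypothetical, not the original implementation):

```python
import torch

def layer_with_recurrence(attention_layer, x_segment, memory):
    """One Transformer-XL style layer step with segment-level recurrence (illustrative sketch).

    x_segment: hidden states of the current segment, shape (batch, seg_len, dim)
    memory:    cached hidden states from the previous segment, same feature dim
    """
    # Keys/values attend over the cached memory plus the current segment;
    # the memory is detached so no gradient flows into the previous segment.
    kv = torch.cat([memory.detach(), x_segment], dim=1)
    out = attention_layer(query=x_segment, key=kv, value=kv)
    new_memory = x_segment.detach()  # cached for the next segment
    return out, new_memory
```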
See this blog
1.3.3. XLNet
Model (XLNet) Permutation language model
1.3.4. Distributed Models
Model (LaMDA) A decoder only dialog model
- pretrain on next word prediction
- fine-tuned using "context sentinel response" format
See this Blog
Model (PaLM, Pathway LM)
See this blog
1.4. Analysis
1.4.1. Scaling
Check this lecture series
Analysis (do you need billions of words of pretraining data) LMs need only about 10M–100M words to learn syntactic/semantic features, but a much larger dataset (1B–30B words) is required to acquire common-sense knowledge
Analysis (scaling, scaling laws for neural language models) cross-entropy loss scales as a power law with respect to model size, dataset size, and compute:
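Approximate power laws reconstructed from the scaling-laws paper (treat the exponents as rough values):
\[ L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C} \]
with \(\alpha_N \approx 0.076\), \(\alpha_D \approx 0.095\), \(\alpha_C \approx 0.05\), where \(N\) is parameters, \(D\) is tokens, and \(C\) is compute.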
Analysis (U-shaped scaling) a few tasks show worse performance with larger models; those tasks, however, actually follow a U-shaped scaling curve, where the decreased performance of medium-sized models can be explained by a "distractor task"
1.4.2. Sampling
Model (nucleus sampling)
Nucleus (top-p) sampling draws from the smallest set of tokens whose cumulative probability exceeds a threshold \(p\). Both the deterministic approach (beam search) and the pure random approach have drawbacks (see the sketch after this list):
- beam search: degenerate repetition; the text is less surprising
- pure sampling: incoherent generation
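A minimal top-p (nucleus) sampling sketch over a next-token distribution (illustrative, not the paper's code):

```python
import numpy as np

def nucleus_sample(probs, p=0.9, rng=None):
    """Sample a token id from the smallest set of tokens whose cumulative probability exceeds p.

    probs: 1-D numpy array of next-token probabilities (sums to 1).
    """
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]                     # token ids sorted by probability, descending
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1    # smallest prefix whose mass reaches p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # renormalize within the nucleus
    return int(rng.choice(nucleus, p=nucleus_probs))
```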
1.4.3. Calibration
Model (confidence calibration) the probability associated with the predicted class label should reflect its ground-truth correctness
Suppose the neural network is \(h(X) = (\hat{Y}, \hat{P})\), where \(\hat{Y}\) is the prediction and \(\hat{P}\) is the associated confidence; a perfectly calibrated model should satisfy:
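A reconstruction of the condition from the calibration paper:
\[ P(\hat{Y} = Y \mid \hat{P} = p) = p, \quad \forall p \in [0, 1] \]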
A measurement of calibration is ECE (Expected Calibration Error), defined as the expected difference between confidence and actual accuracy:
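With predictions grouped into \(M\) confidence bins \(B_m\), a standard form (reconstructed) is
\[ \text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \text{acc}(B_m) - \text{conf}(B_m) \right| \]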
Analysis (larger models are well-calibrated) larger models are well-calibrated when questions are presented in the right format
2. Adaptation
2.1. Supervised Tuning
Model (GPT) fine-tune the pretrained model on a labeled dataset \((x, y)\) by feeding the final transformer activation into an added linear output layer.
The objective is
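A reconstruction from the GPT paper:
\[ L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \dots, x^m) \]
where \(P\) is computed by the added linear layer (plus softmax) on top of the final transformer activation.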
The final objective also adds the language modeling loss \(L_1\) as an auxiliary objective (to help convergence and generalization):
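A reconstruction of the combined objective:
\[ L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \, L_1(\mathcal{C}) \]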
To transform other tasks into this classification format, it applies input transformations as follows:
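A summary of the transformations from the GPT paper (the bracketed tokens and the $ delimiter are schematic):
- classification: [start] text [extract]; the [extract] representation feeds the classifier
- entailment: [start] premise $ hypothesis [extract]
- similarity: both orderings of the two sentences are encoded and their representations added before the classifier
- multiple choice / QA: [start] context $ answer_k [extract] for each candidate, with a softmax over the candidates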
Model (BERT) the downstream task formulation is as follows:
- sentence -> sentence class: connect the CLS token's embedding to a linear classifier for prediction. BERT can be fine-tuned; the linear classifier is trained from scratch.
- sentence -> per-word class: connect every word's embedding to a classifier.
- two sentences -> single class: connect the two sentences with the SEP token and use the CLS embedding to predict (e.g. the NLI task).
- sentence -> span extraction: for extraction-based QA, given a document \(D = \{d_1, d_2, ..., d_N\}\) and a query \(Q = \{q_1, q_2, ..., q_M\}\), train a model that uses \(D, Q\) to predict two integers \(s, e\) indicating that the answer is \(\{d_s, ..., d_e\}\). \(s, e\) are found by training two embeddings (start and end) that should be close to the representations of the target boundary tokens (see the formula below).
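A reconstruction of the start-position probability from the BERT paper:
\[ P(s = i) = \frac{e^{S \cdot T_i}}{\sum_j e^{S \cdot T_j}} \]
where \(S\) is a learned start vector and \(T_i\) is BERT's final hidden vector for document token \(d_i\); the end position \(e\) uses a separate learned vector in the same way.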
2.2. Instruction/Demonstration-Tuning
Dataset (self-instruct) prepare an instruction set in the following manner:
- prepare some seed tasks and input, output instances
- prompt with the seed tasks to generate more tasks
- prompt with seed tasks and their input/output instances to generate inputs/outputs for the new tasks
- filter the outputs to encourage diversity
See the appendix for the prompt examples
2.3. Reward Tuning
Reinforcement Learning from Human Feedback (RLHF) is used in ChatGPT
Model (Human Preference Reward)
Begin with an autoregressive language model \(\rho\); it can be considered a policy:
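A standard autoregressive factorization (reconstructed):
\[ \rho(y \mid x) = \prod_{i} \rho(y_i \mid x, y_{1:i-1}) \]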
where \(x\) is a sequence of input, \(y\) is a sequence of output.
We want to fine-tune a policy \(\pi\) starting from \(\rho\). If a reward function \(r: X \times Y \to \mathbb{R}\) is defined, then we can use RL to directly optimize the expected reward \(E_\pi[r]\). However, such a reward function might be difficult to design, so we approximate it using human labels.
In this work, humans are asked to choose the best option \(b\) among four options \(y_0, y_1, y_2, y_3\); a reward model \(r\) is then fit with the following loss:
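A reconstruction of the preference loss from the human-preferences paper:
\[ \text{loss}(r) = -\,\mathbb{E}_{(x, \{y_i\}_i, b)} \left[ \log \frac{e^{r(x, y_b)}}{\sum_i e^{r(x, y_i)}} \right] \]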
Then we fine-tune \(\pi\) with respect to the reward model \(r\), adding a penalty to keep \(\pi\) from moving too far from \(\rho\).
The modified reward is
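A reconstruction from the same paper:
\[ R(x, y) = r(x, y) - \beta \log \frac{\pi(y \mid x)}{\rho(y \mid x)} \]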
Some related implementations can be found in the TRL repo.
2.4. Parameter-Efficient Fine-Tuning (PEFT)
check this talk
Let a neural network \(f_\theta: \mathcal{X} \to \mathcal{Y}\) be decomposed into a composition of functions \(f_{\theta_1} \odot f_{\theta_2} \odot \dots \odot f_{\theta_l}\), where each \(f_{\theta_i}\) has parameters \(\theta_i\).
A module with parameters \(\phi\) can modify the \(i\)-th subfunction as follows:
- Parameter composition: \(f'_i(x) = f_{\theta_i \oplus \phi}(x)\)
- Input composition: \(f'_i(x) = f_{\theta_i}([x, \phi])\)
- Function composition: \(f'_i(x) = f_{\theta_i} \odot f_\phi(x)\)
Parameter Composition
Model (adapter) adds only a few trainable parameters per task, compared with fine-tuning the top layers
Model (prefix tuning) optimizes a small set of continuous task-specific vectors (i.e. a prefix) prepended to the input
Model (LoRA, low-rank adaptation) constrains the parameter update to be low-rank:
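A reconstruction from the LoRA paper (for a pretrained weight \(W_0\)):
\[ W = W_0 + \Delta W = W_0 + BA \]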
where \(B \in \mathbb{R}^{d \times r}\) and \(A \in \mathbb{R}^{r \times k}\) have a much lower rank \(r \ll \min(d, k)\) than \(W_0 \in \mathbb{R}^{d \times k}\) (a code sketch follows)
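A minimal LoRA-style linear layer in PyTorch (illustrative sketch, not the reference implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = x W0^T + x (BA)^T * (alpha / r), with W0 frozen and only A, B trainable."""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)                      # frozen pretrained weight W0
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)   # low-rank factor, random init
        self.B = nn.Parameter(torch.zeros(out_features, r))         # zero init so the update starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```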
Model (T-few) multiplies activations with learned vectors
Model (setfit) two stage adaptation
- contrastive fine-tuning
- head training
2.5. In-Context Learning (prompting)
Survey (Prompt Methods)
pretrain -> prompt -> predict
It has the following steps:
prompt addition: given an input text \(x\), apply a template to obtain a prompt \(x'\)
answer search: search for the answer text \(\hat{z}\) that maximizes the pretrained LM score when filled into the prompt
answer mapping: the highest-scoring answer \(\hat{z}\) is mapped to the corresponding output \(\hat{y}\)
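A worked sentiment example in the survey's style (the template and answer set are illustrative):
- input \(x\): "I love this movie."
- prompt \(x'\): "I love this movie. Overall, it was a [Z] movie."
- answer search over {"great", "terrible"} picks \(\hat{z}\) = "great"
- answer mapping: "great" -> positive, "terrible" -> negative, so \(\hat{y}\) = positive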
This survey has a table of many good examples
Model (chain of thought) prompt with exemplars that demonstrate intermediate reasoning steps:
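An illustrative exemplar in the style of the chain-of-thought paper (paraphrased):
- Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls, each with 3 balls. How many tennis balls does he have now?
- A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. 5 + 6 = 11. The answer is 11.
Including such reasoning chains in the prompt, rather than only final answers, leads the model to emit intermediate steps for new questions.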
3. Task
3.1. Information Retrieval
Classical lexical/statistical IR methods are summarized in the search engine note
3.1.1. Dense Retrieval
Model (ME-BERT) Multi-Vector Encoding
Model (ColBERT) late interaction
3.1.2. Reranking
3.1.3. Generation
Model (promptagator) leverages an LLM as a few-shot query generator and creates task-specific retrievers from the generated data
Model (WebGPT) allows GPT to search/navigate the web
- Behavior cloning
- Reward Modeling
- Reinforcement learning
- Rejection sampling
3.2. Translation
Model (Google translation 2017)
Model (knowledge distillation) distill knowledge from multiple teachers (each trained separately on one language pair's data) into a single multilingual student. The loss combines the NLL loss and a distillation loss (cross-entropy between the student and teacher distributions).
Model (massively multilingual model)
Transfer vs Interference:
the goal is to achieve
- high transfer (positive transfer) to low-resource languages
- low interference (negative transfer) for high-resource languages.
Sampling strategy:
- sampling from the original language distribution gives low transfer / low interference
- equal sampling (upsampling low-resource languages) gives high transfer / high interference
- this work suggests that temperature-based sampling gives a good balance between transfer and interference: the sampling probability is proportional to \(p_l^{1/T}\), where \(p_l\) is the original distribution and \(T = 5\) (see the sketch below)
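A tiny sketch of the temperature-based sampling distribution described above (illustrative):

```python
import numpy as np

def temperature_sampling_probs(p, T=5.0):
    """Flatten a language distribution p with temperature T: proportional to p_l^(1/T)."""
    q = np.asarray(p, dtype=float) ** (1.0 / T)
    return q / q.sum()

# Example: one high-resource and two low-resource languages
print(temperature_sampling_probs([0.90, 0.07, 0.03]))
```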
4. Reference
[0] original papers. All images are taken from the original papers or blogs
[1] CMU 11-737 Multilingual Natural Language Processing