0x433 Neural Model
 1. Foundation Language Model
 2. Adaptation
 3. Task
 4. Reference
1. Foundation Language Model
1.1. Encoder Model (Masked Model)
1.1.1. Traditional Contextual Models
Model (CoVe, Context Vector) Trains contextualized embeddings on a supervised dataset (e.g. machine translation).
The concatenation of GloVe and CoVe vectors is used as the representation for downstream tasks.
Model (ELMo, Embeddings from Language Models) Learns contextualized word representations using a bidirectional LSTM language model.
pretraining
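The model is trained to minimize the negative log-likelihood (NLL) in both directions:
\[ \mathcal{L} = -\sum_{k=1}^{N} \left( \log p(t_k \mid t_1, \dots, t_{k-1}) + \log p(t_k \mid t_{k+1}, \dots, t_N) \right) \]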
param (100M)
downstream
After training, ELMo obtains the representation of each token by combining the hidden vectors from both directions and across all layers, using task-specific weights \(w_i\).
Different tasks seem to have different weights across different layers.
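The layer combination can be sketched as a softmax-weighted sum of the per-layer vectors of one token. This is a minimal sketch, assuming the weights \(w_i\) are softmax-normalized and multiplied by a task-specific scalar (called `gamma` here); the function names and toy vectors are illustrative only.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def elmo_mix(layer_vectors, raw_weights, gamma=1.0):
    """Combine the hidden vectors of one token across layers into a single vector."""
    weights = softmax(raw_weights)  # task-specific layer weights w_i
    dim = len(layer_vectors[0])
    mixed = [0.0] * dim
    for w, vec in zip(weights, layer_vectors):
        for i in range(dim):
            mixed[i] += w * vec[i]
    return [gamma * v for v in mixed]

# Toy example: three layers, two dimensions, equal raw weights.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = elmo_mix(layers, raw_weights=[0.0, 0.0, 0.0])
```

With equal raw weights every layer contributes one third; a downstream task would learn `raw_weights` and `gamma` jointly with its own parameters.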
contextualized embedding
Topic models and naive embedding models assign a fixed embedding or representation to each word. However, a word can have different meanings in different contexts, so recent models use contextualized representations instead of fixed embeddings.
Contextualized models are usually trained with generative language modeling or masked language modeling.
1.1.2. BERT
Model (BERT) a multilayer bidirectional Transformer
Config:
 BERT base: 110M (L=12, H=768, A=12); this config was chosen to match the size of GPT
 BERT large: 340M (L=24, H=1024, A=16)
Pretraining
BERT is pretrained with two objectives, Masked LM and Next Sentence Prediction, which are used at the same time.
 Masked LM: 15% of the tokens are randomly selected for masking and predicted from the remaining tokens. The final hidden states of the masked tokens are fed into a softmax over the vocabulary, and a cross-entropy loss is applied.
 next sentence prediction: classify whether one sentence comes after the other sentence
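The masking step can be sketched as follows. This is a simplified sketch, not the exact BERT procedure: BERT replaces a selected token with the mask token only 80% of the time (10% random token, 10% unchanged), whereas this version always inserts the mask token. The helper name `mask_tokens` and the toy tokens are assumptions.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Mask roughly mask_prob of the tokens; return the masked input and the targets."""
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    labels = {}  # position -> original token, the prediction targets
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, labels

tokens = [f"w{i}" for i in range(20)]
masked, labels = mask_tokens(tokens)
```

Only the positions stored in `labels` contribute to the cross-entropy loss; all other positions are inputs only.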
Model (SpanBERT) SpanBERT improves on BERT by masking contiguous spans instead of individual tokens
Objectives are
 Span Masked LM: similar to BERT's masked LM, but a randomly selected span is masked
 Span Boundary Objective: encourages the boundary tokens, combined with a positional encoding of the target, to predict each token in the span
Reference: from the SpanBERT paper
Model (RoBERTa) the differences from BERT are
 BERT uses a static mask (the mask is decided once during preprocessing), while RoBERTa chooses the mask dynamically each time a sequence is used
 each input of up to 512 tokens is packed with contiguous sentences, using a separator token
 the next sentence prediction task is dropped
 larger batches, a higher learning rate, and a larger training set
1.1.3. Cross-Lingual Model
Model (mBERT) Same arch as BERT, but it is trained on the Wikipedia pages of 104 languages
Analysis (mBERT transfer)
 approach: finetune mBERT on a specific task in one language, then test it in another language
 it generalizes well cross-lingually, especially between lexically similar languages, but it also works for languages using different scripts (Urdu written in Arabic script, transferred to Hindi written in Devanagari)
Model (monolingual BERT transfer)
 adapts a monolingual BERT to a new language by freezing the encoder and retraining only the embedding layer
Model (XLM)
Translation Language Model
 predict a masked English word by attending to both the English sentence and its French translation, which encourages the model to align their representations.
Model (adapter-based transfer, MAD-X) uses language adapters and task adapters
1.2. Encoder-Decoder Model
1.2.1. BART
Model (BART)
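BART is a denoising encoder-decoder model trained by
 corrupting text with noise
 learning a model to reconstruct the original text
The BART noising transformations are token masking, token deletion, text infilling, sentence permutation, and document rotation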
1.2.2. T5
Model (T5, Text-to-Text Transfer Transformer)
Also check the Blog
1.3. Decoder Model (Causal Language Model)
1.3.1. GPT
GPT is a Transformer-based language model. Check Mu Li's video
Model (GPT) 0.1B
Check the next section for details
Model (GPT2) 1.5B
Model (GPT3) 175B
1.3.2. Transformer XL
Model (Transformer XL) overcomes the fixed-length context limitation with
 segment-level recurrence: the hidden states of the previous segment are cached and provided to the next segment
 relative positional encoding: use fixed embedding with learnable transformation
See this blog
1.3.3. XLNet
Model (XLNet) Permutation language model
1.3.4. Distributed Models
Model (LaMDA) A decoder-only dialog model
 pretrain on next word prediction
 finetuned using "context sentinel response" format
See this Blog
Model (PaLM, Pathways Language Model)
See this blog
1.4. Analysis
1.4.1. Scaling
Check this lecture series
Analysis (do you need billions of words of pretraining data) An LM requires only 10M or 100M words to learn syntactic/semantic features; a much larger dataset (1B, 30B) is required to acquire common sense knowledge
Analysis (scaling, scaling laws for neural language model) cross-entropy loss scales as a power law wrt model size \(N\), dataset size \(D\), and compute \(C\), as long as the other two factors are not the bottleneck:
\[ L(N) = (N_c / N)^{\alpha_N}, \quad L(D) = (D_c / D)^{\alpha_D}, \quad L(C) = (C_c / C)^{\alpha_C} \]
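A power law of the form \(L(N) = (N_c / N)^{\alpha_N}\) implies that every doubling of model size multiplies the loss by the same factor \(2^{-\alpha_N}\). The sketch below illustrates this; the constants are roughly the values reported for model size in the scaling-laws paper, but they are used here only as illustrative assumptions, and the function name is hypothetical.

```python
def power_law_loss(n, n_c, alpha):
    """Loss predicted by a power law L(n) = (n_c / n) ** alpha."""
    return (n_c / n) ** alpha

# Illustrative constants (assumptions, see the note above).
N_C = 8.8e13
ALPHA_N = 0.076

loss_1b = power_law_loss(1e9, N_C, ALPHA_N)
loss_2b = power_law_loss(2e9, N_C, ALPHA_N)
ratio = loss_2b / loss_1b  # equals 2 ** -ALPHA_N regardless of the starting size
```

Because the exponent is small, each doubling of the model reduces the predicted loss by only about five percent.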
Analysis (U-shaped scaling) a few tasks show worse performance with larger models; those tasks, however, actually follow a U-shaped scaling curve, where the decreased performance of medium-sized models might be explained by a "distractor task"
1.4.2. Sampling
Model (nucleus sampling)
Both the deterministic approach (beam search) and the random approach (pure sampling) have drawbacks:
 beam search: degenerate repetition; the output is less surprising than human text
 pure sampling: incoherent generation
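Nucleus (top-p) sampling sits between the two: it samples only from the smallest set of most probable tokens whose cumulative probability reaches a threshold \(p\), after renormalizing. A minimal sketch over a toy next-token distribution follows; the function names and the distribution are assumptions for illustration.

```python
import random

def nucleus_filter(probs, top_p=0.9):
    """Keep the smallest set of top tokens with total mass >= top_p, renormalized."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        mass += p
        if mass >= top_p:
            break
    return {token: p / mass for token, p in kept}

def nucleus_sample(probs, top_p=0.9, rng=random):
    """Draw one token from the renormalized nucleus."""
    nucleus = nucleus_filter(probs, top_p)
    tokens = list(nucleus)
    weights = [nucleus[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

next_token_probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "zebra": 0.05}
nucleus = nucleus_filter(next_token_probs, top_p=0.9)
token = nucleus_sample(next_token_probs, top_p=0.9, rng=random.Random(0))
```

The low-probability tail ("zebra" here) is cut off, which limits incoherent output, while the remaining randomness avoids the repetition seen with beam search.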
1.4.3. Calibration
Model (confidence calibration) the probability associated with the predicted class label should reflect its ground truth correctness
Suppose the neural network is \(h(X) = (\hat{Y}, \hat{P})\), where \(\hat{Y}\) is the prediction and \(\hat{P}\) is the associated confidence; perfect calibration should satisfy
\[ P(\hat{Y} = Y \mid \hat{P} = p) = p, \quad \forall p \in [0, 1] \]
A measure of calibration is ECE (Expected Calibration Error), defined as the expected difference between confidence and actual accuracy
\[ \mathrm{ECE} = E_{\hat{P}} \left[ \left| P(\hat{Y} = Y \mid \hat{P} = p) - p \right| \right] \]
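In practice ECE is usually estimated by grouping predictions into confidence bins and averaging the gap between accuracy and mean confidence in each bin, weighted by bin size. A minimal sketch, assuming equal-width bins; the function name and the toy predictions are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE estimate: weighted average of |accuracy - confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, is_correct))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece

confidences = [0.95, 0.95, 0.55, 0.55]
correct = [True, True, True, False]
ece = expected_calibration_error(confidences, correct)
```

Here the high-confidence bin is slightly underconfident (accuracy 1.0 vs 0.95) and the other bin slightly overconfident (0.5 vs 0.55), so both contribute a gap of 0.05.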
Analysis (larger models are well-calibrated) larger models are well-calibrated when the questions are posed in the right format
2. Adaptation
2.1. Supervised Tuning
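Model (GPT) finetunes the pretrained model on a labeled dataset \((x,y)\), feeding the final layer's activation \(h_l^m\) into a new linear output layer \(W_y\)
The objective is
\[ P(y \mid x^1, \dots, x^m) = \mathrm{softmax}(h_l^m W_y), \qquad L_2 = \sum_{(x,y)} \log P(y \mid x^1, \dots, x^m) \]
The final objective adds the language modeling objective \(L_1\) as an auxiliary objective (to help convergence and generalization)
\[ L_3 = L_2 + \lambda L_1 \]
To transform other tasks into a classification task, it applies task-specific input transformations, e.g. concatenating the premise and hypothesis with a delimiter token for entailment.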
Model (BERT) the downstream task formulation is as follows:

sentence > sentence class: connect the CLS token's embedding to a linear classifier for prediction. BERT is finetuned, while the linear classifier is trained from scratch

sentence > per word class: connect every word's embedding to a classifier and train it.

two sentences > single class: connect the two sentences with the SEP token and use the CLS token's embedding to predict (e.g. NLI task)

sentence > sentence extraction: for extraction-based QA, suppose the document is \(D=\{d_1, d_2, ..., d_N\}\) and the query is \(Q= \{q_1, q_2,...,q_M\}\); the model uses \(D,Q\) to predict two integers \(s,e\), indicating that the answer is \(\{d_s, ..., d_e\}\). \(s,e\) are found by training two embeddings (one for the start, one for the end), each of which should be near the embedding of its target index word.
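The span prediction in the extraction case can be sketched as scoring every token against a start vector and an end vector, then picking the best pair with \(s \le e\). This is a minimal sketch, assuming dot-product scores and a sum of start and end scores; the names `best_span`, `start_vec`, `end_vec`, and the toy embeddings are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def best_span(token_embeddings, start_vec, end_vec, max_len=30):
    """Return (s, e) maximizing start_score[s] + end_score[e] with s <= e."""
    start_scores = [dot(t, start_vec) for t in token_embeddings]
    end_scores = [dot(t, end_vec) for t in token_embeddings]
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, min(s + max_len, len(end_scores))):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

# Toy document of four tokens with 2-dimensional embeddings.
token_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
start_vec = [-1.0, 2.0]  # learned start embedding (toy values)
end_vec = [1.0, 1.0]     # learned end embedding (toy values)
span = best_span(token_embeddings, start_vec, end_vec)
```

In the toy example the answer is the span covering tokens 1 and 2; during finetuning only the two vectors are new parameters, everything else comes from the pretrained encoder.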
2.2. Instruction/Demonstration-Tuning
Dataset (self-instruct) prepare an instruction set in the following manner:
 prepare some seed tasks with input/output instances
 prompt the LM with the seed tasks to generate more tasks
 prompt the LM with the seed tasks and their input/output instances to generate input/output instances for the new tasks
 filter the outputs to encourage diversity
See the appendix for the prompt examples
2.3. Reward Tuning
Reinforcement Learning from Human Feedback (RLHF) used in ChatGPT
Model (Human Preference Reward)
Begin with an autoregressive language model \(\rho\), which can be considered as a policy:
\[ \rho(y \mid x) = \rho(xy) / \rho(x) \]
where \(x\) is an input sequence and \(y\) is an output sequence.
We want to finetune a policy \(\pi\) initialized from \(\rho\). If a reward function \(r: X \times Y \to R\) is defined, then we can use RL to directly optimize the expected reward \(E_\pi(r)\). However, such a reward function might be difficult to design, so we approximate the reward using human labels
In this work, we ask humans to choose the best option \(b\) from 4 options \(y_0, y_1, y_2, y_3\), then we fit a reward model \(r\) by minimizing the following loss
\[ \mathrm{loss}(r) = -E_{(x, \{y_i\}, b)} \left[ \log \frac{e^{r(x, y_b)}}{\sum_i e^{r(x, y_i)}} \right] \]
Then we finetune \(\pi\) wrt the reward model \(r\), and also add a penalty to keep \(\pi\) from moving too far from \(\rho\)
The modified reward is
\[ R(x, y) = r(x, y) - \beta \log \frac{\pi(y \mid x)}{\rho(y \mid x)} \]
Some related implementation can be found in the TRL repo
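The two quantities in this recipe can be sketched in a few lines. This is a minimal sketch, assuming the reward model is fit with a softmax cross-entropy over the four options and that the penalty has the form \(\beta \log(\pi(y \mid x) / \rho(y \mid x))\); the function names and the value of `beta` are assumptions, not taken from the TRL repo.

```python
import math

def reward_model_loss(option_rewards, chosen_index):
    """Negative log-probability of the human-chosen option under a softmax over rewards."""
    m = max(option_rewards)
    log_norm = m + math.log(sum(math.exp(r - m) for r in option_rewards))
    return -(option_rewards[chosen_index] - log_norm)

def penalized_reward(reward, logp_policy, logp_reference, beta=0.1):
    """Reward minus a penalty for drifting away from the reference model rho."""
    return reward - beta * (logp_policy - logp_reference)

# Four options with identical rewards: the loss equals log(4).
uniform_loss = reward_model_loss([0.0, 0.0, 0.0, 0.0], chosen_index=2)

# The policy assigns a higher log-probability than the reference, so it is penalized.
modified = penalized_reward(1.0, logp_policy=-2.0, logp_reference=-3.0, beta=0.5)
```

When the policy and the reference agree on a sample the penalty vanishes and the modified reward equals the reward model's output.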
2.4. Parameter-Efficient Fine-Tuning (PEFT)
check this talk
Let a neural network \(f_\theta: \mathcal{X} \to \mathcal{Y}\) be decomposed into a composition of functions \(f_{\theta_1} \odot f_{\theta_{2}} \odot \dots \odot f_{\theta_{l}}\), each with parameters \(\theta_i\)
A module with parameters \(\phi\) can modify the \(i\)th subfunction as follows:
 Parameter composition: \(f'_i(x) = f_{\theta_i \oplus \phi}(x)\)
 Input composition: \(f'_i(x) = f_{\theta_i}([x, \phi])\)
 Function composition: \(f'_i(x) = f_{\theta_i} \odot f_\phi(x)\)
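The three composition types can be illustrated on a toy sub-function. This is only a sketch under strong assumptions: \(f_\theta\) is an elementwise scaling, \(\oplus\) is taken to be addition, and \([x, \phi]\) is list concatenation; all names are hypothetical.

```python
def layer(theta, x):
    """Toy sub-function f_theta: scale every input element by theta."""
    return [theta * v for v in x]

def parameter_composition(theta, phi, x):
    """f'(x) = f_{theta (+) phi}(x): the module changes the parameters."""
    return layer(theta + phi, x)

def input_composition(theta, phi, x):
    """f'(x) = f_theta([x, phi]): the module is appended to the input."""
    return layer(theta, x + [phi])

def function_composition(theta, phi, x):
    """f'(x) = f_theta(f_phi(x)): the module is an extra function in the chain."""
    return layer(theta, layer(phi, x))

x, theta, phi = [1.0, 2.0], 2.0, 0.5
by_parameter = parameter_composition(theta, phi, x)
by_input = input_composition(theta, phi, x)
by_function = function_composition(theta, phi, x)
```

In all three cases only `phi` would be trained while `theta` stays frozen.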
Parameter Composition
Model (adapter) adds only a few trainable parameters per task, in contrast to finetuning the top layers
Model (prefix tuning) optimizes a small set of continuous task-specific vectors (i.e. a prefix)
Model (LoRA, low rank adaptation) constrains the parameter update to be low-rank
\[ W_0 + \Delta W = W_0 + BA \]
where \(B \in R^{d \times r}\) and \(A \in R^{r \times k}\) have a much lower rank \(r \ll \min(d, k)\)
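A low-rank update can be sketched with plain nested lists. This is a minimal sketch, assuming the adapted weight is the frozen weight \(W_0\) plus the product \(BA\) of a \(d \times r\) and an \(r \times k\) matrix; the helper names and toy matrices are assumptions, and real LoRA initializes \(B\) to zero so that training starts from \(W_0\).

```python
def matmul(left, right):
    """Multiply two matrices given as nested lists."""
    inner, cols = len(right), len(right[0])
    return [[sum(row[k] * right[k][j] for k in range(inner)) for j in range(cols)]
            for row in left]

def lora_weight(w0, b, a, scale=1.0):
    """Return W_0 + scale * (B @ A); only B and A would be trained."""
    delta = matmul(b, a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]

d, k, r = 4, 4, 1
w0 = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(d)]  # frozen weight
b = [[1.0], [0.0], [0.0], [0.0]]  # d x r
a = [[0.0, 2.0, 0.0, 0.0]]        # r x k
w = lora_weight(w0, b, a)

trainable_params = r * (d + k)  # parameters in B and A
full_params = d * k             # parameters in a full update of W
```

Even in this tiny example the low-rank update has half as many trainable parameters as a full update; the gap grows quickly with \(d\) and \(k\).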
Model (T-Few) multiplies activations with learned vectors
Model (SetFit) two-stage adaptation
 contrastive finetuning
 head training
2.5. In-Context Learning (prompting)
Survey (Prompt Methods)
pretrain > prompt > predict
It has the following steps:
prompt addition: given an input text \(x\), we apply a template to it to obtain a prompt \(x'\) with an unfilled answer slot
answer search: then we search for the text \(\hat{z}\) that, when filled into the slot, maximizes the pretrained LM score
answer mapping: the highest scoring answer \(\hat{z}\) is mapped to the output \(\hat{y}\)
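The three steps can be sketched end to end for a sentiment task. This is a toy sketch: `toy_lm_score` is a hypothetical stand-in for a pretrained LM score (it only counts cue words), and the template, answer words, and label names are assumptions chosen for illustration.

```python
TEMPLATE = "{x} Overall, it was a [Z] movie."
ANSWER_TO_LABEL = {"great": "positive", "terrible": "negative"}

POSITIVE_CUES = {"love", "loved", "enjoyed"}
NEGATIVE_CUES = {"hate", "hated", "boring"}

def add_prompt(x):
    """Prompt addition: wrap the input in a template with an answer slot [Z]."""
    return TEMPLATE.format(x=x)

def toy_lm_score(text):
    """Hypothetical LM score: agreement between cue words and the filled answer."""
    words = set(text.lower().replace(".", " ").replace(",", " ").split())
    polarity = len(words & POSITIVE_CUES) - len(words & NEGATIVE_CUES)
    if "great" in words:
        return polarity
    if "terrible" in words:
        return -polarity
    return 0

def predict(x):
    prompt = add_prompt(x)
    # Answer search: pick the answer whose filled prompt scores highest.
    z_hat = max(ANSWER_TO_LABEL, key=lambda z: toy_lm_score(prompt.replace("[Z]", z)))
    # Answer mapping: map the answer word to the task label.
    return ANSWER_TO_LABEL[z_hat]

label_pos = predict("I loved this film.")
label_neg = predict("I hated it, so boring.")
```

With a real LM, `toy_lm_score` would be replaced by the model's probability of the filled prompt, and no parameters would be updated.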
This survey has a table of many good examples
Model (chain of thought) the prompt includes demonstrations that spell out the intermediate reasoning steps before the answer
3. Task
3.1. Information Retrieval
Classical lexical/statistical IR methods are summarized in the search engine note
3.1.1. Dense Retrieval
Model (ME-BERT) Multi-Vector Encoding
Model (ColBERT) late interaction
3.1.2. Reranking
3.1.3. Generation
Model (Promptagator) leverages an LLM as a few-shot query generator, and creates task-specific retrievers based on the generated data
Model (WebGPT) allows GPT to search/navigate the web
 Behavior cloning
 Reward Modeling
 Reinforcement learning
 Rejection sampling
3.2. Translation
Model (Google translation 2017)
Model (knowledge distillation) Distills knowledge from multiple teachers (each trained separately on the data of one language pair) into a single multilingual student. The loss combines the NLL loss and a distillation loss (the cross entropy between the student and teacher distributions)
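The combined loss for a single target token can be sketched as follows. This is a minimal sketch, assuming the two terms are mixed with a weight `alpha` (a hypothetical hyperparameter) and that both models output a probability distribution over the same vocabulary.

```python
import math

def distillation_loss(student_probs, teacher_probs, gold_index, alpha=0.5):
    """Mix the NLL of the gold token with the student/teacher cross entropy."""
    nll = -math.log(student_probs[gold_index])
    distill = -sum(t * math.log(s)
                   for t, s in zip(teacher_probs, student_probs) if t > 0)
    return (1 - alpha) * nll + alpha * distill

# Toy two-word vocabulary where student and teacher agree.
loss_agree = distillation_loss([0.5, 0.5], [0.5, 0.5], gold_index=0)

# A student that contradicts a confident teacher pays a larger distillation cost.
loss_disagree = distillation_loss([0.1, 0.9], [0.9, 0.1], gold_index=0)
```

In the multilingual setting the teacher distribution for each sentence pair would come from the teacher trained on that language pair.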
Model (massively multilingual model)
Transfer vs Interference:
the goal is to achieve
 high transfer (positive transfer) to low-resource languages
 low interference (negative transfer) for high-resource languages.
Sampling strategy:
 the original language distribution has low transfer/low interference
 equal sampling (by upsampling low-resource languages) has high transfer/high interference
 this work suggests that temperature-based sampling gives a good balance between transfer and interference. The sampling probability is proportional to \(p_l^{1/T}\), where \(p_l\) is the original distribution and \(T=5\)
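The effect of the temperature can be seen on a toy corpus. This is a minimal sketch, assuming the sampling probabilities are \(p_l^{1/T}\) renormalized to sum to one; the corpus sizes and the function name are hypothetical.

```python
def temperature_sampling_probs(sizes, temperature=5.0):
    """Rescale the data distribution p_l to p_l ** (1 / T) and renormalize."""
    total = sum(sizes.values())
    scaled = {lang: (n / total) ** (1.0 / temperature) for lang, n in sizes.items()}
    norm = sum(scaled.values())
    return {lang: v / norm for lang, v in scaled.items()}

# Hypothetical sentence counts for a high-, mid-, and low-resource language.
sizes = {"en": 1_000_000, "hi": 10_000, "yo": 100}

original = temperature_sampling_probs(sizes, temperature=1.0)  # T=1: original distribution
balanced = temperature_sampling_probs(sizes, temperature=5.0)  # T=5: flatter distribution
```

\(T=1\) reproduces the original distribution and a very large \(T\) approaches equal sampling, so \(T=5\) lies between the two extremes listed above.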
4. Reference
[0] original papers. All images are taken from the original papers or blogs
[1] CMU 11-737 Multilingual Natural Language Processing