0x552 Adaptation
- 1. In-Context Learning
- 2. Parameter Efficient Fine-tuning
- 3. Supervised Fine-tuning
- 4. Reinforcement Learning
1. In-Context Learning
1.1. Prompting
Survey (Prompt Methods)
pretrain -> prompt -> predict
It has the following steps:
- prompt addition: given an input text \(x\), we apply a template to it to form a prompt
- answer search: we then search for the text \(z'\) that maximizes the pretrained LM score
- answer mapping: the highest-scoring answer \(\hat{z}\) is mapped to the corresponding output \(\hat{y}\)
This survey has a table of many good examples
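To make the three steps concrete, here is a minimal sketch of prompt-based sentiment classification. It assumes the Hugging Face transformers library with GPT-2 as the scoring LM; the template and answer words are illustrative choices, not taken from the survey.

```python
# Minimal pretrain -> prompt -> predict sketch for sentiment classification.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def lm_log_prob(text: str) -> float:
    """Total log-probability of `text` under the pretrained LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean per-token negative log-likelihood.
    return -out.loss.item() * (ids.shape[1] - 1)

x = "The acting was wooden and the plot made no sense."
template = "Review: {x} Overall, the movie was {z}."          # prompt addition
answer_map = {"great": "positive", "terrible": "negative"}    # answers -> outputs

scores = {z: lm_log_prob(template.format(x=x, z=z)) for z in answer_map}  # answer search
z_hat = max(scores, key=scores.get)
y_hat = answer_map[z_hat]                                     # answer mapping
print(z_hat, "->", y_hat)
```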
Model (Chain of Thought): prompt with worked examples whose answers spell out the intermediate reasoning steps before giving the final answer (e.g., "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.").
1.2. RAG
2. Parameter Efficient Fine-tuning
3. Supervised Fine-tuning
GPT fine-tunes the pretrained model on a labeled dataset \((x, y)\): the final activation of the pretrained model is fed into an added linear output layer to predict \(y\).
The objective is to maximize the log-likelihood of the label given the input tokens. The final objective also includes the language modeling loss \(L_1\) as an auxiliary objective (to help convergence and generalization).
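Written out (a reconstruction following the notation of the GPT paper, where \(\mathcal{C}\) is the labeled dataset, \(x^1, \ldots, x^m\) are the input tokens, and \(L_1\) is the language modeling loss):
\[
L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m),
\qquad
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \, L_1(\mathcal{C})
\]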
To transform other tasks into a classification task, it applies input transformations (e.g., for entailment, concatenating the premise and hypothesis with a delimiter token).
3.1. Instruction/Demonstration-Tuning
Instruction tuning fine-tunes language models on a collection of datasets described via instructions. It improves performance on unseen tasks when the model is large enough (e.g., over 100B parameters in FLAN) and when more instruction clusters are included.
Self-Instruct prepares an instruction set in the following manner:
- prepare some seed tasks with input/output instances
- prompt the model with seed tasks to generate more tasks
- prompt the model with seed (task, input, output) examples to generate input/output instances for the new tasks
- filter the outputs to encourage diversity
See the appendix for the prompt examples
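A rough sketch of the bootstrapping loop, under stated assumptions: `generate` is a dummy stand-in for an LLM completion call (not the paper's API), and a `difflib` similarity check approximates the paper's ROUGE-L based diversity filter.

```python
# Sketch of the Self-Instruct bootstrapping loop.
import random
from difflib import SequenceMatcher

def generate(prompt: str) -> str:
    """Placeholder LLM call; returns a canned string so the sketch runs.
    Replace with a real model API."""
    return "Summarize the following paragraph in one sentence."

def too_similar(new: str, pool: list[str], threshold: float = 0.7) -> bool:
    return any(SequenceMatcher(None, new, old).ratio() > threshold for old in pool)

seed_tasks = [
    {"instruction": "Translate the sentence into French.",
     "input": "Good morning.", "output": "Bonjour."},
]
task_pool = [t["instruction"] for t in seed_tasks]
instances = []

for _ in range(100):  # bootstrapping iterations
    # 1. prompt with instructions sampled from the pool to get a new task
    demos = "\n".join(random.sample(task_pool, min(8, len(task_pool))))
    new_instruction = generate(f"Come up with a new task.\n{demos}\nNew task:")
    # 2. filter for diversity before accepting the new instruction
    if too_similar(new_instruction, task_pool):
        continue
    # 3. prompt again to generate an input/output instance for the new task
    instance = generate(f"Task: {new_instruction}\nGive one input and its output:")
    task_pool.append(new_instruction)
    instances.append({"instruction": new_instruction, "instance": instance})

print(len(task_pool), "instructions collected")
```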
4. Reinforcement Learning
One motivation is given in this lecture video from John Schulman.
- Neural networks store a knowledge graph in their weights, with some confidence level attached to each fact. If we supervised fine-tune the model on facts that are not in this knowledge graph, we are teaching it to hallucinate.
- RL is a solution to this issue.
How to fix with RL:
- adjust the output distribution so that the model is allowed to express uncertainty, challenge the premise, and admit errors
- use RL to precisely learn the behavior boundary
4.1. Reward Modeling
4.2. RLHF
Reinforcement Learning from Human Feedback (RLHF) is the technique used in ChatGPT.
Model (Human Preference Reward)
Begin with an autoregressive language model \(\rho\); it can be considered a policy \(\rho(y \mid x) = \prod_i \rho(y_i \mid x, y_{<i})\), where \(x\) is an input sequence and \(y\) is an output sequence.
We want to fine-tune a policy \(\pi\) initialized from \(\rho\). If a reward function \(r: X \times Y \to \mathbb{R}\) is defined, we can use RL to directly optimize the expected reward \(\mathbb{E}_\pi[r]\). However, such a reward function is difficult to design by hand, so we approximate the reward using human labels.
In this work, humans are asked to choose the best option \(b\) from 4 options \(y_0, y_1, y_2, y_3\); we then fit a reward model \(r\) using the following loss.
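(Reconstructed in the paper's notation; \(S\) denotes the set of collected comparisons. It is a softmax cross-entropy that pushes the chosen option's reward above the other three.)
\[
\operatorname{loss}(r) = - \mathbb{E}_{(x, \{y_i\}_i, b) \sim S} \left[ \log \frac{e^{r(x, y_b)}}{\sum_i e^{r(x, y_i)}} \right]
\]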
Then we fine-tune \(\pi\) with respect to the reward model \(r\), also adding a penalty to keep \(\pi\) from moving too far from \(\rho\).
The modified reward is the learned reward minus a KL penalty term.
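Written out (a reconstruction in the paper's notation, where \(\beta\) controls the strength of the KL penalty):
\[
R(x, y) = r(x, y) - \beta \log \frac{\pi(y \mid x)}{\rho(y \mid x)}
\]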
A related implementation can be found in the TRL repo.
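As a companion to the formula above, here is a minimal, framework-free sketch of the KL-shaped reward. This is not TRL's API; the tensors stand in for quantities you would compute with your own policy, reference model, and reward model.

```python
import torch

def kl_shaped_reward(
    reward_model_score: torch.Tensor,   # r(x, y): scalar per sequence, shape [batch]
    policy_logprobs: torch.Tensor,      # log pi(y_t | x, y_<t), shape [batch, seq_len]
    ref_logprobs: torch.Tensor,         # log rho(y_t | x, y_<t), shape [batch, seq_len]
    beta: float = 0.1,                  # KL penalty coefficient
) -> torch.Tensor:
    """R(x, y) = r(x, y) - beta * log(pi(y|x) / rho(y|x)).

    Sequence log-probabilities are sums of per-token log-probabilities,
    so the penalty is the summed per-token log-ratio.
    """
    kl_per_sequence = (policy_logprobs - ref_logprobs).sum(dim=-1)
    return reward_model_score - beta * kl_per_sequence

# Toy usage with placeholder numbers standing in for real model outputs.
batch, seq_len = 2, 5
scores = torch.randn(batch)              # reward model scores
pi_lp = -torch.rand(batch, seq_len)      # placeholder per-token log-probs under pi
rho_lp = -torch.rand(batch, seq_len)     # placeholder per-token log-probs under rho
print(kl_shaped_reward(scores, pi_lp, rho_lp))
```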