
0x552 Adaptation

1. In-Context Learning

1.1. Prompting

Survey (Prompt Methods)

pretrain -> prompt -> predict

It has the following steps:

prompt addition: given an input text \(x\), we apply a template to it

\[x' = f_{\text{prompt}}(x)\]

answer search: then we search for the answer \(\hat{z}\) that maximizes the pretrained LM score

\[\hat{z} = \text{search}_{z \in Z} P(f_{\text{fill}}(x', z); \theta)\]

answer mapping: the highest-scoring answer \(\hat{z}\) is then mapped to the output \(\hat{y}\)
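As a toy illustration of these three steps, here is a minimal sketch using Hugging Face's fill-mask pipeline. The sentiment template, the answer set \(Z\), and the answer-to-label mapping are hypothetical choices for illustration, not taken from the survey.

```python
# Minimal prompting pipeline sketch: prompt addition -> answer search -> answer mapping.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

def f_prompt(x):
    # prompt addition: wrap the input x in a cloze template to get x'
    return f"{x} Overall, it was a [MASK] movie."

# answer mapping: answer space Z -> output space Y (hypothetical choices)
answer_to_label = {"great": "positive", "terrible": "negative"}

def predict(x):
    x_prime = f_prompt(x)
    # answer search: restrict the LM's fill-in candidates to the answer set Z
    candidates = fill(x_prime, targets=list(answer_to_label))
    z_hat = max(candidates, key=lambda c: c["score"])["token_str"]
    return answer_to_label[z_hat.strip()]

print(predict("The acting was superb and the plot kept me hooked."))  # -> "positive"
```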

This survey has a table of many good examples

(figure: example prompts from the survey)

Model (chain of thought)

A chain-of-thought prompt shows the model how to reason step by step before giving the final answer:

(figure: a chain-of-thought prompt and response)
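For concreteness, a few-shot chain-of-thought prompt typically looks like the following; the arithmetic exemplar is the standard running example from the chain-of-thought paper, and the exact wording here is illustrative.

```python
# A few-shot chain-of-thought prompt: the exemplar includes the intermediate
# reasoning, nudging the model to produce its own reasoning before the answer.
cot_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more,
how many apples do they have?
A:"""
```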

1.2. RAG

2. Parameter Efficient Fine-tuning

3. Supervised Fine-tuning

GPT fine-tunes the pretrained model on a labeled dataset \((x, y)\): the final transformer activation \(h_l^m\) is fed into an added linear output layer \(W_y\)

\[P(y | x_1, ..., x_m) = \text{softmax}(h_l^m W_y)\]

The objective is

\[L_2 = \sum_{(x,y)} \log P(y | x_1, ..., x_m)\]

The final objective also includes the language modeling loss \(L_1\) as an auxiliary objective (to help convergence and generalization)

\[L = L_2 + \lambda L_1\]
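A minimal PyTorch sketch of this setup is below. The module and argument names are illustrative, not from the GPT codebase, and cross-entropy is used as the negative of the log-likelihood objectives above (so minimizing it maximizes \(L_1\) and \(L_2\)).

```python
# Sketch of the combined objective L = L2 + lambda * L1.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPTClassifier(nn.Module):
    def __init__(self, backbone, d_model, vocab_size, num_classes, lam=0.5):
        super().__init__()
        self.backbone = backbone          # pretrained transformer, returns (B, T, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)    # for L1
        self.cls_head = nn.Linear(d_model, num_classes, bias=False)  # W_y, for L2
        self.lam = lam

    def forward(self, x, y):
        h = self.backbone(x)                      # (B, T, d_model); h[:, -1] is h_l^m
        cls_logits = self.cls_head(h[:, -1])      # P(y | x_1..x_m) = softmax(h_l^m W_y)
        l2 = F.cross_entropy(cls_logits, y)       # supervised objective L2
        lm_logits = self.lm_head(h[:, :-1])       # next-token prediction for L1
        l1 = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)),
                             x[:, 1:].reshape(-1))
        return l2 + self.lam * l1                 # L = L2 + lambda * L1
```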

To turn other tasks into this classification format, it applies input transformations as follows:

(figure: GPT input transformations for classification, entailment, similarity, and multiple choice)
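A sketch of these transformations in Python; the special tokens and function names are illustrative stand-ins for the paper's start, delimiter, and extract tokens.

```python
# Every task is serialized into token sequences so the same pretrained
# transformer plus a linear head can handle it.
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"

def classification_input(text):
    return f"{START} {text} {EXTRACT}"

def entailment_input(premise, hypothesis):
    return f"{START} {premise} {DELIM} {hypothesis} {EXTRACT}"

def similarity_inputs(text_a, text_b):
    # similarity has no natural ordering, so both orderings are scored
    # and their final hidden states are combined before the linear head
    return [f"{START} {text_a} {DELIM} {text_b} {EXTRACT}",
            f"{START} {text_b} {DELIM} {text_a} {EXTRACT}"]

def multiple_choice_inputs(context, answers):
    # one sequence per candidate answer; a softmax over per-sequence scores
    # picks the answer
    return [f"{START} {context} {DELIM} {a} {EXTRACT}" for a in answers]
```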

3.1. Instruction/Demonstration-Tuning

Instruction tuning fine-tunes language models on a collection of datasets described via instructions. It improves performance on unseen tasks when the model is large enough (e.g., over 100B parameters in FLAN) and when more instruction clusters are given.

Self-Instruct prepares an instruction set in the following manner (see the sketch after this list):

  • prepare some seed tasks with input/output instances
  • prompt the model with sampled seed tasks to generate more tasks
  • prompt with task, input, and output examples to generate input/output instances for the new tasks
  • filter the outputs to encourage diversity

See the appendix for the prompt examples
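A rough sketch of this bootstrapping loop, assuming a hypothetical generate() call to the base LM and a rouge_l() similarity function for the diversity filter; neither is the paper's exact code.

```python
# Self-Instruct-style bootstrapping loop (illustrative sketch).
import random

def self_instruct(seed_tasks, generate, rouge_l, num_rounds=100, sim_threshold=0.7):
    # each task: {"instruction": ..., "instances": [(input, output), ...]}
    task_pool = list(seed_tasks)
    for _ in range(num_rounds):
        # 1) prompt with a few sampled tasks to propose a new instruction
        demos = random.sample(task_pool, k=min(8, len(task_pool)))
        new_instruction = generate("Come up with a new task:\n" +
                                   "\n".join(t["instruction"] for t in demos))
        # 2) prompt with the new instruction to generate input/output instances
        instance = generate(f"Task: {new_instruction}\nInput and output:")
        # 3) filter: drop instructions too similar to ones already in the pool
        if max(rouge_l(new_instruction, t["instruction"]) for t in task_pool) < sim_threshold:
            task_pool.append({"instruction": new_instruction, "instances": [instance]})
    return task_pool
```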

4. Reinforcement Learning

One of the motivations is given in this lecture video from John Schulman.

  • Neural networks store a knowledge graph in their weights with some confidence level. If we supervised-fine-tune the model on facts not in that knowledge graph, we are teaching it to hallucinate.
  • RL is a solution to this issue.

How to fix with RL:

  • adjust the output distribution so that the model is allowed to express uncertainty, challenge the premise, and admit errors
  • use RL to precisely learn the behavior boundary

4.1. Reward Modeling

4.2. RLHF

Reinforcement Learning from Human Feedback (RLHF) is the approach used in ChatGPT.

Model (Human Preference Reward)

Begin with an autoregressive language model \(\rho\); it can be considered a policy:

\[\rho(y|x) = \rho(xy)/\rho(x)\]

where \(x\) is the input sequence and \(y\) is the output sequence.
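In practice \(\log \rho(y|x)\) is computed as the sum of the token log-probabilities of \(y\) conditioned on \(x\). A sketch with a Hugging Face causal LM follows; the model name is only an example, and tokenizing \(x\) and \(xy\) separately can misalign at the boundary for some tokenizers.

```python
# log rho(y|x) = log rho(xy) - log rho(x), as a sum of per-token log-probs of y.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
rho = AutoModelForCausalLM.from_pretrained("gpt2")

def log_prob_y_given_x(x, y):
    x_ids = tok(x, return_tensors="pt").input_ids
    xy_ids = tok(x + y, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = rho(xy_ids).logits.log_softmax(-1)           # (1, T, V)
    # log-prob of each token given its prefix; keep only the positions of y
    token_lp = logits[0, :-1].gather(1, xy_ids[0, 1:, None]).squeeze(-1)
    return token_lp[x_ids.shape[1] - 1:].sum()                # sum over y's tokens
```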

We want to fine-tune a policy \(\pi\) initialized from \(\rho\). If a reward function \(r: X \times Y \to \mathbb{R}\) is defined, we can use RL to directly optimize the expected reward \(E_\pi(r)\). However, such a reward function might be difficult to design, so we approximate it from human labels.

In this work, humans are asked to choose the best option \(b\) from 4 options \(y_0, y_1, y_2, y_3\); we then fit a reward model \(r\) by maximizing the following log-likelihood

\[E [\log \frac{\exp(r(x, y_b))}{\sum_i \exp(r(x,y_i))}]\]
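Since this is a softmax over the four candidates' scores with the human choice \(b\) as the target, fitting \(r\) reduces to cross-entropy on the chosen index; a minimal sketch (names are illustrative):

```python
# 4-way preference loss: minimizing cross-entropy on the chosen index b
# equals maximizing E[log exp(r(x, y_b)) / sum_i exp(r(x, y_i))].
import torch
import torch.nn.functional as F

def reward_loss(rewards, chosen):
    # rewards: (batch, 4) reward-model scores r(x, y_i); chosen: (batch,) index b
    return F.cross_entropy(rewards, chosen)
```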

Then we fine-tune \(\pi\) with respect to the reward model \(r\), adding a KL penalty to keep \(\pi\) from moving too far from \(\rho\).

The modified reward is

\[R(x,y) = r(x,y) - \beta \log\frac{\pi(y | x)}{\rho(y | x)}\]
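A minimal sketch of this shaped reward, reusing per-sequence log-probabilities as computed in the sketch above; the value of \(\beta\) here is arbitrary.

```python
# KL-shaped reward fed to the RL optimizer: the learned reward minus a
# per-sample penalty for drifting away from the original model rho.
def shaped_reward(r_xy, logp_pi, logp_rho, beta=0.02):
    # r_xy: reward model score r(x, y)
    # logp_pi, logp_rho: log pi(y|x) and log rho(y|x)
    return r_xy - beta * (logp_pi - logp_rho)
```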

Related implementations can be found in the TRL repo.