0x542 Adaptation

1. In-Context Learning
- 1.1. Prompting
- 1.2. RAG
2. Parameter Efficient Fine-tuning
3. Fine-tuning

1. In-Context Learning

1.1. Prompting

Survey (Prompt Methods)

pretrain -> prompt -> predict

It has the following steps:

prompt addition: given an input text \(x\), we apply a template to it

\[x' = f_{\text{prompt}}(x)\]

answer search: then we search the candidate set \(Z\) for the answer \(\hat{z}\) that maximizes the pretrained LM score of the filled prompt

\[\hat{z} = \text{search}_{z \in Z} P(f_{\text{fill}}(x', z); \theta)\]

answer mapping: the highest-scoring answer \(\hat{z}\) is mapped to the corresponding output \(\hat{y}\)
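
The three steps can be sketched in Python as follows. This is a minimal illustration, not the survey's implementation: the template, the candidate answer set `Z`, the answer-to-label mapping, and `toy_lm_score` (a word-counting stand-in for the pretrained LM score \(P(\cdot; \theta)\)) are all assumptions made for the example.

```python
# Toy sketch of pretrain -> prompt -> predict. Every name here is hypothetical.

ANSWER_TO_LABEL = {"great": "positive", "fine": "positive", "terrible": "negative"}
Z = list(ANSWER_TO_LABEL)  # candidate answers z

POSITIVE = {"love", "great", "fine"}
NEGATIVE = {"hate", "terrible"}


def f_prompt(x: str) -> str:
    """Prompt addition: wrap the input x in a template with an answer slot [Z]."""
    return f"{x} Overall, it was a [Z] movie."


def f_fill(x_prime: str, z: str) -> str:
    """Fill the answer slot of the prompt x' with a candidate answer z."""
    return x_prime.replace("[Z]", z)


def toy_lm_score(text: str) -> float:
    """Stand-in for P(text; theta): rewards text whose sentiment words agree.

    A real system would query a language model here.
    """
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return float(pos * pos + neg * neg)


def predict(x: str) -> str:
    x_prime = f_prompt(x)                                            # prompt addition
    z_hat = max(Z, key=lambda z: toy_lm_score(f_fill(x_prime, z)))   # answer search
    return ANSWER_TO_LABEL[z_hat]                                    # answer mapping


print(predict("I love this film."))
print(predict("I hate this film."))
```

Several answers may map to the same output (here both "great" and "fine" map to "positive"), which is why answer mapping is a separate step from answer search.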

This survey has a table of many good examples

[Figure: table of prompt examples from the survey]

Model (chain of thought) prompts the model with exemplars that show how to reason step by step:

[Figure: chain-of-thought prompting example]
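
As a rough illustration, a chain-of-thought prompt can be assembled by prepending worked exemplars whose answers spell out the intermediate reasoning. The exemplar text and the helper name `build_cot_prompt` are made up for this sketch.

```python
# Hypothetical helper: build a few-shot chain-of-thought prompt as plain text.

EXEMPLARS = [
    {
        "question": "A box holds 3 red pens and 4 blue pens. How many pens are in 2 boxes?",
        "reasoning": "One box holds 3 + 4 = 7 pens. Two boxes hold 2 * 7 = 14 pens.",
        "answer": "14",
    },
]


def build_cot_prompt(question: str, exemplars=EXEMPLARS) -> str:
    """Prepend exemplars that show step-by-step reasoning before the answer."""
    parts = []
    for ex in exemplars:
        parts.append(
            f"Q: {ex['question']}\nA: {ex['reasoning']} The answer is {ex['answer']}."
        )
    # The new question ends with an open "A:" so the model continues
    # with its own reasoning before giving an answer.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


print(build_cot_prompt("A shelf has 5 books. How many books are on 3 shelves?"))
```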

1.2. RAG

2. Parameter Efficient Fine-tuning

3. Fine-tuning

3.1. Supervised Tuning

Model (GPT) fine-tunes the pretrained model on a labeled dataset \((x,y)\): the final transformer block's activation \(h_l^m\) is fed into an added linear output layer \(W_y\)

\[P(y | x_1, ..., x_m) = \text{softmax}(h_l^m W_y)\]

The objective is

\[L_2 = \sum_{(x,y)} \log P(y | x_1, ..., x_m)\]

The final objective adds the language-modeling objective \(L_1\) as an auxiliary objective (to help convergence and generalization)

\[L = L_2 + \lambda L_1\]
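
The objective above can be sketched in plain Python. This is a toy illustration under stated assumptions: `h` stands for the final activation \(h_l^m\), `W_y` is a small hand-written weight matrix, the language-modeling term \(L_1\) is passed in as a precomputed number, and `lam=0.5` is only an illustrative default. Both terms are log-likelihoods, so the combined objective is maximized.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def class_log_prob(h, W_y, y):
    """log P(y | x_1..x_m) = log softmax(h W_y)[y], where h plays the role of h_l^m."""
    n_classes = len(W_y[0])
    logits = [sum(h[i] * W_y[i][k] for i in range(len(h))) for k in range(n_classes)]
    return math.log(softmax(logits)[y])


def combined_objective(batch, W_y, lm_log_likelihood, lam=0.5):
    """L = L_2 + lambda * L_1, with L_2 summed over labeled pairs (h, y)."""
    L2 = sum(class_log_prob(h, W_y, y) for h, y in batch)
    return L2 + lam * lm_log_likelihood


# Toy data: 2-dimensional activations, 2 classes (all values are made up).
W_y = [[2.0, 0.0], [0.0, 2.0]]
batch = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]
print(combined_objective(batch, W_y, lm_log_likelihood=-4.0))
```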

To cast other tasks as classification, it applies the following input transformations:

[Figure: GPT input transformations]

Model (BERT) formulates downstream tasks as follows:

sentence -> sentence class: feed the CLS token's embedding into a linear classifier to predict. BERT is fine-tuned, while the linear classifier is trained from scratch.
sentence -> per-word class: feed every word's embedding into a classifier.
two sentences -> single class: join the two sentences with a SEP token and use the CLS token's embedding to predict (e.g., the NLI task).
sentence -> sentence extraction: for extraction-based QA, given a document \(D=\{d_1, d_2, ..., d_N\}\) and a query \(Q=\{q_1, q_2, ..., q_M\}\), train a model that uses \(D, Q\) to predict two integers \(s, e\) indicating that the answer is \(\{d_s, ..., d_e\}\). \(s\) and \(e\) are found by learning two vectors, one for the start and one for the end, each of which should score highest against the embedding of the word at its target index.
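
The span-extraction case can be sketched as follows. This is a toy version under stated assumptions: the token embeddings are hand-written 2-dimensional vectors rather than BERT outputs, `start_vec` and `end_vec` stand for the two learned vectors, and indices are 0-based (the text above counts from 1).

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def predict_span(token_embeddings, start_vec, end_vec):
    """Return (s, e) maximizing start_score[s] + end_score[e] subject to s <= e."""
    start_scores = [dot(start_vec, t) for t in token_embeddings]
    end_scores = [dot(end_vec, t) for t in token_embeddings]
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_scores):
        for e in range(s, len(end_scores)):
            score = s_score + end_scores[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Made-up embeddings for a 4-token document.
doc = [[0.0, 0.0], [1.0, 0.0], [0.2, 0.1], [0.0, 1.0]]
print(predict_span(doc, start_vec=[1.0, 0.0], end_vec=[0.0, 1.0]))
```

Searching over pairs with `s <= e` rules out spans that end before they start, which taking the two argmaxes independently would not guarantee.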

3.2. Instruction/Demonstration-Tuning

Dataset (self-instruct) builds an instruction set as follows:

prepare some seed tasks with input/output instances
prompt the LM with the seed tasks to generate more tasks
prompt the LM with the seed tasks and their input/output instances to generate input/output instances for the new tasks
filter the generated outputs to encourage diversity
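
The task-generation and filtering steps can be sketched as below. This is a simplified illustration, not the self-instruct code: `propose_tasks` is a hypothetical callable standing in for the LM call, word-level Jaccard overlap is a stand-in similarity measure, the 0.7 threshold is an assumed value, and instance generation is omitted.

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two instructions."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def filter_diverse(pool, candidates, threshold=0.7):
    """Keep a candidate only if it is not too similar to anything already kept."""
    kept = list(pool)
    for cand in candidates:
        if all(jaccard(cand, existing) < threshold for existing in kept):
            kept.append(cand)
    return kept


def self_instruct_round(task_pool, propose_tasks, threshold=0.7):
    """One bootstrapping round: generate new tasks from the pool, then filter."""
    candidates = propose_tasks(task_pool)  # hypothetical LM call
    return filter_diverse(task_pool, candidates, threshold)


# A fake "LM" that returns one duplicate and one new task.
def fake_lm(pool):
    return ["Translate the sentence into French.", "Summarize the paragraph in one line."]


seeds = ["Translate the sentence into French."]
print(self_instruct_round(seeds, fake_lm))
```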

See the appendix for the prompt examples

3.3. Reward Tuning

Reinforcement Learning from Human Feedback (RLHF), as used in ChatGPT

Model (Human Preference Reward)

Begin with an autoregressive language model \(\rho\), which can be considered a policy:

\[\rho(y|x) = \rho(xy)/\rho(x)\]

where \(x\) is an input sequence and \(y\) is an output sequence.
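
To make the policy view concrete, here is a toy sketch in which a hand-written bigram table plays the role of \(\rho\). The table and function names are assumptions for illustration; a real language model conditions on the whole prefix rather than only the previous token.

```python
# Hypothetical bigram model: P(next token | previous token); "<s>" starts a sequence.
BIGRAM = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.1, "b": 0.9},
    "b": {"a": 0.5, "b": 0.5},
}


def seq_prob(tokens):
    """rho(tokens): product of next-token probabilities (chain rule)."""
    prob, prev = 1.0, "<s>"
    for tok in tokens:
        prob *= BIGRAM[prev][tok]
        prev = tok
    return prob


def cond_prob(x, y):
    """rho(y | x) = rho(xy) / rho(x)."""
    return seq_prob(x + y) / seq_prob(x)


# P(y = [b, a] | x = [a]) = 0.9 * 0.5
print(cond_prob(["a"], ["b", "a"]))
```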

We want to fine-tune a policy \(\pi\) starting from \(\rho\). If a reward function \(r: X \times Y \to \mathbb{R}\) is defined, we can use RL to directly optimize the expected reward \(E_\pi(r)\). However, such a reward function can be difficult to design, so we approximate the reward using human labels.

In this work, we ask humans to choose the best option \(b\) from 4 options \(y_0, y_1, y_2, y_3\), then we fit a reward model \(r\) by maximizing the following expected log-likelihood of the human choice (equivalently, minimizing its negation as the loss)

\[E [\log \frac{\exp(r(x, y_b))}{\sum_i \exp(r(x,y_i))}]\]
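
The quantity inside the expectation is the log-probability that a softmax over the rewards assigns to the human-chosen option. A minimal sketch for a single comparison, with made-up reward values and a hypothetical function name:

```python
import math


def preference_log_likelihood(rewards, b):
    """log( exp(r_b) / sum_i exp(r_i) ) for one comparison.

    rewards: reward-model scores r(x, y_i) for the candidate outputs
    b: index of the option the human picked
    """
    m = max(rewards)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(r - m) for r in rewards))
    return rewards[b] - log_norm


# With equal rewards, the chosen option gets probability 1/4.
print(preference_log_likelihood([0.0, 0.0, 0.0, 0.0], b=2))
# Raising the chosen option's reward raises the log-likelihood.
print(preference_log_likelihood([2.0, 0.0, 0.0, 0.0], b=0))
```

Training the reward model then amounts to adjusting \(r\) so that this value, averaged over the labeled comparisons, goes up.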

Then we fine-tune \(\pi\) with respect to the reward model \(r\), and add a penalty that keeps \(\pi\) from moving too far from \(\rho\)

The modified reward is

\[R(x,y) = r(x,y) - \beta \log\frac{\pi(y | x)}{\rho(y | x)}\]
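
A small sketch of the modified reward, taking the two log-probabilities as inputs. The function name and the value `beta=0.1` are illustrative assumptions, not values from the source.

```python
def penalized_reward(r_xy, logp_pi, logp_rho, beta=0.1):
    """R(x, y) = r(x, y) - beta * log( pi(y|x) / rho(y|x) ).

    r_xy: reward-model score r(x, y)
    logp_pi: log pi(y | x) under the policy being fine-tuned
    logp_rho: log rho(y | x) under the original language model
    """
    return r_xy - beta * (logp_pi - logp_rho)


# If pi assigns y more probability than rho does, the reward is reduced.
print(penalized_reward(1.0, logp_pi=-2.0, logp_rho=-3.0, beta=0.5))
# If pi and rho agree on y, the reward is unchanged.
print(penalized_reward(1.0, logp_pi=-2.0, logp_rho=-2.0, beta=0.5))
```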

Related implementations can be found in the TRL repo