
0x542 Adaptation

1. In-Context Learning

1.1. Prompting

Survey (Prompt Methods)

pretrain -> prompt -> predict

The prompting pipeline has the following steps:

prompt addition: given an input text \(x\), we apply a template to it

\[x' = f_{\text{prompt}}(x)\]

answer search: search for the answer \(z\) that maximizes the pretrained LM score

\[\hat{z} = \text{search}_{z \in Z} P(f_{\text{fill}}(x', z); \theta)\]

answer mapping: the highest-scoring answer \(\hat{z}\) is mapped to the highest-scoring output \(\hat{y}\)
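A minimal sketch of the three steps, assuming a hypothetical `lm_score` function that stands in for the pretrained LM and a hand-made sentiment template (both are illustrative, not from the survey):

```python
# Sketch of the prompt -> search -> map pipeline (hypothetical lm_score function).

def lm_score(text: str) -> float:
    """Placeholder for P(text; theta) from a pretrained LM."""
    return 0.0  # assumption: replaced by a real model call

def f_prompt(x: str) -> str:
    # prompt addition: wrap the input in a template with a slot [Z] for the answer
    return f"{x} Overall, it was a [Z] movie."

def f_fill(x_prime: str, z: str) -> str:
    # fill the answer slot with a candidate answer z
    return x_prime.replace("[Z]", z)

def predict(x: str, answers: dict[str, str]) -> str:
    x_prime = f_prompt(x)
    # answer search: pick the candidate z with the highest LM score
    z_hat = max(answers, key=lambda z: lm_score(f_fill(x_prime, z)))
    # answer mapping: map the answer token back to an output label
    return answers[z_hat]

# example: sentiment classification via prompting
print(predict("I love this film.", {"great": "positive", "terrible": "negative"}))
```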

This survey has a table of many good examples

(figure: table of prompt examples from the survey)

Model (chain of thought) a prompt that demonstrates how to reason step by step:

(figure: chain-of-thought prompt example)
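A hand-written few-shot prompt in the style of the paper's arithmetic exemplars; the model is expected to continue with a reasoning chain before giving the final answer:

```python
# One exemplar whose answer spells out the intermediate reasoning steps,
# followed by the actual question the model should answer in the same style.
cot_prompt = """\
Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each.
How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11.
The answer is 11.

Q: The cafeteria had 23 apples. They used 20 and bought 6 more.
How many apples do they have?
A:"""
```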

1.2. RAG

2. Parameter Efficient Fine-tuning

3. Fine-tuning

3.1. Supervised Tuning

Model (GPT) fine-tune the pretrained model on a labeled dataset \((x, y)\) by attaching a linear output layer to the final transformer activation:

\[P(y \mid x_1, \ldots, x_m) = \text{softmax}(h_l^m W_y)\]

The objective is

\[L_2 = \sum_{(x,y)} \log P(y | x_1, ..., x_m)\]

The final objective adds the language-modeling loss \(L_1\) as an auxiliary objective (to help convergence and generalization):

\[L = L_2 + \lambda L_1\]
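A minimal PyTorch sketch of the two heads and the combined objective, with made-up shapes and a single encoder layer standing in for the pretrained transformer:

```python
import torch
import torch.nn as nn

# Assumed shapes: batch B, sequence length m, hidden size d, vocab V, classes C.
B, m, d, V, C = 4, 16, 64, 1000, 2

embed = nn.Embedding(V, d)
transformer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)  # stand-in for the pretrained LM
W_y = nn.Linear(d, C)      # task-specific classification head, trained from scratch
lm_head = nn.Linear(d, V)  # language-modeling head for the auxiliary objective

tokens = torch.randint(V, (B, m))
labels = torch.randint(C, (B,))

h = transformer(embed(tokens))      # h_l: final-layer activations, (B, m, d)
logits_cls = W_y(h[:, -1])          # P(y | x_1..x_m) = softmax(h_l^m W_y)
L2 = nn.functional.cross_entropy(logits_cls, labels)

# auxiliary LM loss: predict token t+1 from position t
logits_lm = lm_head(h[:, :-1])
L1 = nn.functional.cross_entropy(logits_lm.reshape(-1, V), tokens[:, 1:].reshape(-1))

lam = 0.5                           # lambda weighting the auxiliary objective
loss = L2 + lam * L1
loss.backward()
```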

To transform other tasks into this classification format, GPT applies input transformations as follows:

(figure: GPT input transformations for downstream tasks)
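A rough sketch of those input transformations, using hypothetical string tokens in place of the learned start/delimiter/extract embeddings:

```python
# Hypothetical special tokens; the paper uses learned start/delimiter/extract embeddings.
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"

def classification(text):
    return f"{START} {text} {EXTRACT}"

def entailment(premise, hypothesis):
    return f"{START} {premise} {DELIM} {hypothesis} {EXTRACT}"

def similarity(text1, text2):
    # order-agnostic: both orderings are scored and their representations combined
    return [f"{START} {text1} {DELIM} {text2} {EXTRACT}",
            f"{START} {text2} {DELIM} {text1} {EXTRACT}"]

def multiple_choice(context, answers):
    # one sequence per candidate answer; a linear layer scores each, softmax normalizes
    return [f"{START} {context} {DELIM} {a} {EXTRACT}" for a in answers]
```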

Model (BERT) the downstream task formulations are as follows:

  • sentence -> sentence class: feed the CLS token's embedding into a linear classifier to predict the class; BERT is fine-tuned while the linear classifier is trained from scratch

  • sentence -> per-word class: attach a classifier to every token's embedding and train it

  • two sentences -> single class: join the two sentences with a SEP token and predict from the CLS token (e.g., the NLI task)

  • sentence -> sentence extraction: for extraction-based QA, given a document \(D=\{d_1, d_2, ..., d_N\}\) and a query \(Q=\{q_1, q_2, ..., q_M\}\), train a model that uses \(D, Q\) to predict two integers \(s, e\) indicating that the answer is \(\{d_s, ..., d_e\}\). \(s, e\) are found by training two embeddings that should be close to the start and end tokens' embeddings respectively (see the sketch after this list).
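A minimal PyTorch sketch of that span-extraction head: two learned vectors are dotted with every token representation to score start and end positions (shapes are made up):

```python
import torch
import torch.nn as nn

B, N, d = 2, 128, 768                     # batch, sequence length (query + document), hidden size
h = torch.randn(B, N, d)                  # BERT output embeddings for [CLS] q_1..q_M [SEP] d_1..d_N

start_vec = nn.Parameter(torch.randn(d))  # learned start embedding
end_vec = nn.Parameter(torch.randn(d))    # learned end embedding

start_logits = h @ start_vec              # (B, N): similarity of each token to the start vector
end_logits = h @ end_vec                  # (B, N): similarity of each token to the end vector

s = start_logits.argmax(dim=1)            # predicted start index
e = end_logits.argmax(dim=1)              # predicted end index; the answer span is tokens s..e
```

Training uses cross-entropy between these logits and the gold start/end indices.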

3.2. Instruction/Demonstration-Tuning

Dataset (self-instruct) prepare an instruction set in the following manner:

  • prepare some seed tasks with input/output instances
  • prompt with the seed tasks to generate more tasks
  • prompt with seed tasks and their inputs/outputs to generate inputs/outputs for the new tasks
  • filter the outputs to encourage diversity

See the appendix for the prompt examples
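A rough sketch of the generation loop, with a hypothetical `generate` function standing in for the LLM call and a crude word-overlap filter in place of the ROUGE-L filter used in the paper:

```python
import random

def generate(prompt: str) -> str:
    """Placeholder for an LLM completion call."""
    return "New task: ..."

def too_similar(task: str, pool: list[str]) -> bool:
    # crude word-overlap filter; self-instruct uses ROUGE-L against the existing pool
    words = set(task.lower().split())
    return any(len(words & set(t.lower().split())) / max(len(words), 1) > 0.7 for t in pool)

seed_tasks = ["Write a short poem about the sea.", "Classify the sentiment of a tweet."]
task_pool = list(seed_tasks)

for _ in range(10):
    # 1) prompt with a few in-context seed/generated tasks to propose a new task
    examples = random.sample(task_pool, k=min(2, len(task_pool)))
    new_task = generate("Come up with a new task:\n" + "\n".join(examples))
    # 2) filter for diversity before adding it to the pool
    if not too_similar(new_task, task_pool):
        task_pool.append(new_task)
    # 3) separately, prompt again to generate input/output instances for each new task
```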

3.3. Reward Tuning

Reinforcement Learning from Human Feedback (RLHF) is used in ChatGPT.

Model (Human Preference Reward)

Begin with an autoregressive language model \(\rho\); it can be viewed as a policy:

\[\rho(y|x) = \rho(xy)/\rho(x)\]

where \(x\) is the input sequence and \(y\) is the output sequence.

We want to fine-tune a policy \(\pi\) initialized from \(\rho\). If a reward function \(r: X \times Y \to \mathbb{R}\) is defined, then we can use RL to directly optimize the expected reward \(E_\pi[r]\). However, such a reward function might be difficult to design, so we approximate it using human labels.

In this work, humans are asked to choose the best option \(b\) from 4 options \(y_0, y_1, y_2, y_3\); then we fit a reward model \(r\) by maximizing the following log-likelihood of the human choice

\[E [\log \frac{\exp(r(x, y_b))}{\sum_i \exp(r(x,y_i))}]\]
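A sketch of fitting this in PyTorch: the four candidate rewards act as logits and the human choice \(b\) is the target class, so cross-entropy is exactly the negative of the expression above (shapes are made up):

```python
import torch
import torch.nn.functional as F

B = 8
rewards = torch.randn(B, 4, requires_grad=True)  # r(x, y_i) for the 4 candidate completions
b = torch.randint(4, (B,))                       # index of the human-preferred completion

# cross_entropy(rewards, b) = -E[log softmax(r)_b], so minimizing it
# fits the reward model to the human choices
loss = F.cross_entropy(rewards, b)
loss.backward()
```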

Then we fine-tune \(\pi\) with respect to the reward model \(r\), and also add a penalty to keep \(\pi\) from moving too far from \(\rho\).

The modified reward is

\[R(x,y) = r(x,y) - \beta \log\frac{\pi(y | x)}{\rho(y | x)}\]
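A sketch of the penalized reward, assuming the summed log-probabilities of \(y\) under \(\pi\) and \(\rho\) are already available:

```python
import torch

beta = 0.1
r = torch.tensor([1.3, 0.2])            # reward-model scores r(x, y) for a batch of 2 samples
logp_pi = torch.tensor([-12.0, -15.5])  # log pi(y | x), summed over output tokens
logp_rho = torch.tensor([-13.1, -15.0]) # log rho(y | x) from the frozen initial model

# R(x, y) = r(x, y) - beta * log(pi(y|x) / rho(y|x))
R = r - beta * (logp_pi - logp_rho)
```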

Related implementations can be found in the TRL repo.