# 0x413 Reinforcement Learning

- 1. Foundation
- 2. Dynamic Programming (Model-based)
- 3. Monte Carlo Methods
- 4. Temporal-Difference Learning

This note is mostly about tabular methods from Part 1 of Sutton's textbook (Sutton and Barto, 2018)^{1} and David Silver's UCL lecture (Silver, 2015)^{2}

## 1. Foundation

### 1.1. Bandit Problem

Bandit problems can be considered as a simplification of MDP. They are stateless problems where each arm has a fixed distribution of rewards regardless of previous actions.

#### 1.1.1. Multi-arm Bandit

Suppose we play a game in \(n\) rounds, at each round \(t\), player chooses an action \(A_t=a\) from a finite set of possible choices called arms \(\{ 1, ..., m \}\), and the corresponding reward as \(R_t\). The value of an arbitary action \(a\), denoted \(q_*(a)\) is the expected reward

which is unknown to us, otherwise it is a trivial problem.

We denote the estimated value of action \(a\) at timestep \(t\) as \(Q_t(a)\), which we would like to be close to \(q_*(a)\)

At the end of each turn \(t\), we can estimate the empirical mean reward of arm \(a\) by

where \(T_a(t-1)\) is the total number that action \(a\) has been played.

where \(q_*(a)\) is the expected payoff of action \(a\), and \(q_*\) is the expected payoff best arm.

Applications of multi-armed bandit

There are many applications for multi-armed bandit, for example NY Times uses it to determine which news to show. It can be considered as an alternative to massive AB testing, to test many alternative options at the same time (instead of doing it pairwise)

#### 1.1.2. Contextual Bandit

There is also a generalization of bandit problem called contextual bandit which have access to (state-like) contexts.

**Definition (contextual bandit)** At each round,

- a context (i.e. some clues to states) is selected and revealed to agent
- agent has to choose an action based on the context.
- environment reveals a reward

Unlike full MDP, there is no transition between states. future states are not affected by user's action, they are independent across each round

Application of contextual bandit

In online news recommendation, we can select news to user based on user's context (e.g. location, browser, history, reading habits...)

#### 1.1.3. Algorithms

Our goal is to minimize the **regret**

**Algorithm (\(\epsilon\)-greedy)** With probability \(\epsilon_t\), play an arm randomly, otherwise select the "best" arm found so far

There are many works to show the regret bound of this algorithm, it is at most logarithmic in \(n\).

**Paper (regret bound, Lai, Robbins 1985)** the optimal machine is played exponentially more often than any other machine asymptotically.

For any suboptimal machine \(j\), we have

where \(p_j\) is the reward density for this suboptimal machine.

**Paper (regret bound, Auer 2022)** shows the optimal logarithmic regret is also achievable uniformly over time.

**Algorithm (UCB, upper confidence bound)** At each round, chooses the arm with the highest upper confidence interval given by

The regret bound of this also grows logarithmically in \(n\)

**Algorithm (Gradient Bandit Algorithm)** choose the action based on preference \(H\) instead of value function.

Preference is updated by

depending on \(a = A_t\) or \(a \neq A_t\)

It is a stochastic approximation to gradient ascent

### 1.2. Reinforcement Learning

The most important aspect in RL is that it uses training info that **evaluates** the action taken rather than **instructs** by giving correct action: it only indicates how good an taken action was, but do not tell whether it was the best or worst. In the instructive setting, the instruction is independent of the action taken.

A lot of machine learning problems are one-shot, in reinforcement learning, we care about **sequential decisions**: each decision we make have consequences in the future.

RL problems can be expressed as a system consisting an agent and an environment. The system repeats the following circles,

- An environment produces information which describes the
**state**\(s_t\) of the system - An agent interacts with the environment by observing the state and using this information to select an
**action**\(a_t\). - The environment accepts the action and transitions into the next state, then return the next state and a
**reward**\(r_t\) to the agent

A few concepts in RL are

**Concept (prediction vs control)**:

- prediction: evaluate the value function, also called evaluation problem
- control: find the optimal policy

**Concept (direct vs indirect RL)**

- direct RL: directly improve the value/policy from the real experience
- indirect RL: learn the model first and then use it to do the planning

**Concept (model-based vs model-free)** Whether an algorithm uses or estimates model's environment: transition probability distribution \(P(s_{t+1}| s_t, a_t)\) and reward function.

By a model of the environment, we mean anything that an agent can use to predict how the environment will respond to its actions

- distribution models: model produce a description of all possibilities and probabilities
- sample model: a distribution model produce an individual possibility

Purely model based approaches are commonly applied to games with a target state: winning or lossing in a game. (e.g: MCTS (monte carlo tree search))

Advantage:

- It allows the agent to plan by thinking ahead
- sample efficiency

Disadvantage:

- models are hard to come by
- may learn irrelevant details

**Concept (on-policy vs off-policy)** Whether an algorithm learns with data gathered using just the current policy

#### 1.2.1. MDP

MDPs is a classical formulation of sequential decision making, it extends bandit problems by introducing states where actions influence not just immediate rewards but also future states and rewards.

In bandit problems, we only estimate \(q(a)\), in MDPs, we estimate \(q(s,a)\) and \(v(s)\).

We assume the the transition to the next state \(s_{t+1}\) only depends on the previous state \(s_{t}\) and action \(a_t\), which is known as the markov property

**Definition (Markov Decision Process)** Markov Decision Process is defined as a 4-tuple \((\mathcal{S}, \mathcal{A}, \mathcal{R}, P)\)

- \(S\): state space (depend on previous state and action)
- \(A\): action space
- \(R\): reward space
- \(P\): transition prob summarizing MLP's dynamics: \(p: \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \to [0, 1]\)

This four-arg dynamics function \(p\) can be used to derive many other things:

For example, the transition prob distribution:

The expected reward

The interaction between environment and agent can be expressed in Python as

```
# Given an env (environment) and an agent
for episode in range(MAX_EPISODE):
state = env.reset()
agent.reset()
for t in range(T):
action = agent.act(state)
state, reward = env.step(action)
agent.update(action, state, reward)
if env.done():
break
```

The **reward function** is defined with respect to an episode \(\tau = (s_0, a_0, r_0), ..., (s_T, a_T, r_T)\) the discount ratio \(\gamma\)

The objective \(J(\pi)\) is the expectations of the returns over many trajectories

A return \(R_t\) of a trajectory is defined as the discounted sum of award from timestamp \(t\) to the end of the trajectory

where both \(G_t, R_t\) are random variables.

**Definition (POMDP: partially observable MDP)** is a generalization of MDP, which the agent cannot observe the underlying state. it only observes state partially through sensor, so there are two states:

- observable state: the state observed by the agent
- internal state: the state inside the environment

This is the actual model in the real-world

#### 1.2.2. Policies and Value Functions

**Definition (policy)** A policy \(\pi\) is how an agent produces the action in the environment to maximize the objective, it is a distribution over actions given state \(\pi(a|x)\)

**Definition (state value Function)** The state value function \(v\) evaluates a state **under a certain policy**, it evaluates how good or bad a state is

**Definition (action value function, q)** The action value function evaluates a (state, action) pair under a certain policy.

Those functions can be estimated by taking average of randomly sample of actual returns (e.g: Monte Carlo) when there are few states. In the case there are very many states, it is practical to maintain \(v_\pi, q_\pi\) as parametrized functions instead.

**Definition (advantage function)** advantage function is defined as

**Definition (Bellman Expectation Equation)** Bellman Expectation Equation is defined with respect to a policy \(\pi\)

state value function can be derived from action value function

action value function can be derived from next state value function

Merging those two together, we get

#### 1.2.3. Optimal Policies and Optimal Value Functions

**Definition (Optimal Value Function)** the optimal state value function is the maximum value function over all policies

MDP is solved if we know the optimal action value function

It is well known that there exists an optimal policy \(\pi^*\)

**Theorem (existence of Optimal Policy)** a policy is better than another when its value function on all states should be better than the value function for the other. Note its proof is non-trivial since policy is partial order

**Definition (Bellman Optimality Equation)** optimal value function are recursively related by the Bellman optimality equations

optimal action function can be expanded in the same manner

Unfolding the action value function

Notice that Bellman expectation equation is taking sum, and Bellman optimality equation is taking max

Solution of Bellman Optimality Equation

- nonlinear, no closed form solution in general
- Iterative solution methods

## 2. Dynamic Programming (Model-based)

DP solves a fully **known** MDP by statically analyzing it (i.e. do not need to dynamically generating episodes)

Problem | Bellman Equation | Algorithm |
---|---|---|

Prediction | Bellman Expectation Equation | Iterative Policy Evaluation |

Control | Bellman Expectation Equation + Greedy Policy Improvement | Policy Iteration |

Control | Bellman Optimality Equation | Value Iteration |

### 2.1. Policy Evaluation (Prediction)

`Policy evaluation`

solves the following *prediction problem*:

- Input: MDP, policy \(\pi\)
- output: value function \(v_\pi\)

First we consider how to compute the state-value function \(v_\pi\) for an arbitrary policy \(\pi\).

If the dynamics \(p\) (i.e. MDP) is completely known, then it can be iteratively solved using Bellman expectation backup

This algorithm is called **iterative policy evaluation**

More concretely, given a policy \(\pi\), we iteratively update value function `synchronously`

:

- At each iteration, for all states \(s \in S\)
- we update \(v_{k+1}\) from \(v_k(s')\) where \(s'\) is a successor state of \(s\)

It can be shown the sequence \(v_k\) converges to the fixed point \(v_\pi\).

### 2.2. Policy Improvement

After policy iteration, we can improve the policy by acting greedily wrt \(v_{\pi}\)

where \(\pi' \geq \pi\)

**Theorem (policy improvement theorem)** Let \(\pi, \pi'\) be any deterministic policies such that if for all \(s\)

then the policy \(\pi'\) must be as good as, or better than \(\pi\):

Using this theorem, we can suggest the following **policy improvement** algorithm

**Model (policy improvment)** Given a policy \(\pi\), Evaluate the policy to get \(v_{\pi}(s)\) Improve the policy to \(\pi'\) by acting greedily with respect to \(v_{\pi}\)

### 2.3. Policy Iteration (Control)

Policy Iteration solves the following *control problem*:

- Input: MDP
- output: optimal value function \(v_*\) and optimal policy \(\pi_*\)

**Model (policy iteraion)** By repeating policy evaluation and policy improvment, we get the policy iteration algorithm.

No matter how we choose $\pi_0 and \(v_{0}\), this policy iteration always converges to \(\pi_{*}\): when the new greedy policy has the same value function as the old policy

This satisfies the Bellman optimality equation, which indicates \(v_\star = v_{\pi'}\)

A simple proof of convergence can be found in David Silver's lecture slide

One drawback of policy iteration is the policy evaluation is expensive to converge. In general, we do not need to wait until the value function converge.

**Model (generalized policy iteration)** we let the policy-evaluation and policy-improvment processes interact, independent of the granularity and other details of the two processes.

Value Iteration (control)

value iteration can be seen as an instance of generalized policy iteration where we we update policy every time updating the value function (i.e. only doing one sweep of policy evaluation).

Policy improvement and truncated policy evaluation can be combined into the following Bellman optimality backup:

For arbitrary \(v_0\), the sequence \(\{ v_k \}\) converges to \(v_*\)

Intuitively, value iteration start with final rewards and work backwards.

Note the difference between value iteration and policy evaluation is this one is taking max instead of sum.

### 2.4. Asynchronous Dynamic Programming

DP above are *synchronous* backups where all states are updated in a 'lockstep' (i.e. expensive)

*Asynchronous DP* updates states individually in any order. It is guaranteed to converge if all states continue to be selected.

Three simple ideas for async DP:

- in-place DP: update state in-place (i.e. only stores one copy of value function)
- prioritised sweeping: use a priority queue to keep track of most likely affected states after each update
- realtime DP: use agent experience (episode) to select states to be updated

## 3. Monte Carlo Methods

MC method can work with both model-based and model-free settings

Unlike the DP, in MC method, we do not need to assume complete knowledge of the environment: we only require experience (i.e. sample *complete* episode from the actual experience.)

### 3.1. On-policy Prediction

MC methods learn from the *full* episode

In DP, we *computed* value functions from knowledge of MDP, here we *learned* value functions from sample returns using *empirical mean return*

**Model (MC prediction)** updates of value function happen after end of every episode and complete return

There are two flavors:

- in
*first-visit*MC, update is happening when state \(s\) is visited for the first time \(t\) in an episode, \(N(s)\) is the total number of episodes when state \(s\) is ever visited so far - in
*every-visit*MC, update is happening evertime state \(s\) is visited at timestamp \(t\) where \(N(s)\) is the total number of appearances in all episodes so far

By law of large numbers, when \(N(s) \to \infty\),

Note that if a model is available, we still use the state value functions \(v_\pi(s)\)

When the model is not available, so we typically estimate action value function \(Q_\pi(s,a)\) instead of \(V_\pi(s)\)

We sample a full episode using \(\pi\): \(\{ S_1, A_1, R_1, S_2, ... \} \sim \pi\). For each state \(S_t\) and action \(A_t\) in an episode, we update \(Q_\pi(s, a)\) as following

### 3.2. On-policy Control

In the model-free settings, we cannot do policy improvement using state value function \(V_\pi(s)\) since we do not know MDP dynamics.

Instead of using \(V_\pi(s)\), we use action value function \(Q_\pi(s, a)\) where policy improvment is done using:

We can do similar pilicy iteration in the following manner:

Greedy policy improvement in this case cannot explore all states and actions and fails to be optimized.

To guarantee the convergence, it requires the assumption of the **exploring starts**

To be efficient, we do not need to wait until \(Q_\pi\)'s convergence. We can update policy after every episode

### 3.3. Off-policy Prediction

Off-policy methods are of greater variance and are slower to converge, but they are more powerful and general. It allows the agent to learn from a more diverse set of experiences, which can help it to explore the environment more effectively

There are two policies involved:

**target policy**: policy being learned**behaviror policy**: the policy used to generate behavior

We want to estimate \(v_\pi\) or \(q_\pi\) (target policy) but episodes are following another policy \(b\) (behavior policy).

There is a constraint called coverage which says \(\pi(a|s) > 0\) must imply \(b(a | s) > 0\)

Almost all off-policy methods are using **importance sampling** (my note)

Given a episode \(A_t, S_{t+1}, A_{t+1}, ..., S_T\) and its return \(G_t\), first, notice that the expectation gives value function of the behavior policy, not the target policy:

To apply importance sampling, we need to consider the relative probability ratio under two difference policies

Given a episode \(A_t, S_{t+1}, A_{t+1}, ..., S_T\)

Using this ratio, we obtain the value function of target policy

To estimate \(v_\pi(s)\) using previous formula, we can do the *ordinary importance sampling*

where \(\tau(s)\) is the set of all timestamps in which state \(s\) is visited

or the *weighted importance sampling*

In the first-visit methods, ordinary importance sampling is unbiased with large variance, weighted importance sampling is biased (asymptotically unbiased) but with much smaller variance and is strongly preferred. In the every-visit method, both are biased.

### 3.4. Off-policy Control

why we want off-policy:

- learn from human and other agents
- re-use experience from old policies: \(\pi_0, \pi_1, ..., \pi_{t-1}\)
- learn about optimal policy while following exploratary policy
- learn from multiple policies while following one policy

## 4. Temporal-Difference Learning

TD learning is also model-free methods, which do not require knowledge of MDP transitions/rewards. Note it works for model-based method as well.

Instead of using complete episode, TD learning is using *incomplete* episodes (or partial episodes), by bootstraping

TD vs MC

MC has high variance, zero bias

- good convergence properties (even with function approximation)
- not sensitive to initial value

TD has low variance, some bias

- usually more efficient than MC
- TD(0) converges to \(v_\pi(s)\) but not always with function approximation
- more sensitive to initial value
- usually limited to discrete actions

TD vs DP

TD can be seen as the sampling version of DP as the following figure shows:

### 4.1. TD Prediction

Recall Monte Carlo do the following prediction

where \(G_t\) is the actual return following time \(t\). It means we must wait until the end of the episode to determine the increment to \(V(S_t)\)

In contrast, TD methods use cached value instead of the waiting complete *actual return*. For example,

**Model (TD(0))** it updates value \(V(S_t)\) toward *estimated return* \(R_{t+1} + \gamma V(S_{t+1})\) instead of the *actual return* \(G_t\)

where \(R_{t+1}+\gamma V(S_{t+1})\) is called *TD target* and \(R_{t+1}+\gamma V(S_{t+1})-V(S_t)\) is referred to as *TD error*

Because it bases its update in part on an existing estimate, it is called **bootstrapping** method.

The comparison between MC (left) and TD (right) can be illustrated as following

(Reference: from RL Book 2^{nd} version)

There is a bias/variance tradeoff

- return \(G_t\) is an unbiased estimate of \(v_\pi(S_t)\), the
*true*TD target is also unbiased \(R_{t+1} + \gamma v_\pi(S_{t+1})\) - The actual TD target \(R_{t+1} + \gamma V(S_{t+1})\) is biased
- however, TD target has much lower variance
- this is because return \(G_t=R_{t+1}+\gamma R_{t+2}+..\) depends on many random variable \(R_t\), but TD target has only one random variable

It can be generalize to look n steps into the future

**Model (n-step TD prediction)**: TD can be extended into n-step methods

- \(n=1\) (TD): \(G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1})\)
- \(n=2\): \(G_t^{(2)} = R_{t+1} + \gamma V(S_{t+2}) + \gamma^2 V(S_{t+2})\)
- \(n=\infty\): (MC) \(G_t^{(\infty)} = R_{t+1} + \gamma V(S_{t+2}) + \gamma^2 R_{t+3} + ...\)

where the n-step return

and n-step TD prediction is

we can average multiple n steps

**Model (TD(\(\lambda\)) prediction)** TD(lambda) combines all n-steps return with decaying weight.

### 4.2. On-policy TD Control (SARSA)

Instead of using TD to update state values, we use TD to update action values with a tuple \((S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})\)

See the following algorithm:

SARSA converges to the optimal action-value function under some regularities

**Model (n-step SARSA)** SARSA can be extended to n-step as well where

- \(n=1\) (SARSA): \(q_t^{(1)} = R_{t+1} + \gamma Q(S_{t+1})\)
- \(n=2\): \(q_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 Q(S_{t+2})\)
- \(n=\infty\) (MC): \(q_t^{(\infty)}=R_{t+1} + \gamma R_{t+2} + ...\)

The n-step SARSA updates \(Q(s,a)\) towards the n-step Q-return

**Model (SARSA(\(\lambda\)))** we also have the Sarsa(\(\lambda\)), which combines all n-step \(q_t^{(n)}\) as well

where

### 4.3. Off-policy TD Control (Q-Learning)

The most general Q-learning has the following form

while \(A'\) is sampled from the behavior policy \(A' \sim b(a | S_{t+1})\) instead of the target policy \(A_{t+1} \sim \pi(a | S_{t+1})\)

**Model (Q-learning)** The most commonly used Q-learning (Watkins, 1989)^{3} was one of the early breakthroughs in RL

It learned the action value function \(Q\) to directly approximate the optimal action-value function \(q_*\), independent of the policy being followed:

The following term is called Bellman error, which we want to minimize. It becomes 0 in Bellman optimality equation

Under some regularities, it is shown that \(Q\) converges with probability 1 to \(q^*\)

Q-learning is an off-policy as the next action \(a\) is obtained by maximizing \(Q\) instead of following current policy \(\pi\). This characteristic also allows us to do \(\epsilon\)-greedy policy to explore (since it is off-policy)

**Model (expected SARSA)** Another variant is to take expectation for backup

It is more computational expensive, but eliminates the variance due to the random selection of \(A_{t+1}\)

Expected Sarsa can be used on-policy, but in general, the policy \(\pi\) can be different from the target policy to generate behavior, in which case it becomes an off-policy.