Action Value Estimation: A Bandit Perspective

A gentle introduction to the fundamentals of reinforcement learning

How to calculate action value estimates

How do we compute the estimate \(Q(a)\) of the true action value \(q_{*}(a)\)?

There are many ways to estimate \(q_{*}(a)\); one of them is the sample-average method.

By the sample-average method, the estimate at the \(n\)th time-step is:

\[Q_{n}(a) = \frac{\sum_{i=1}^{n-1} R_{i} \cdot \mathbb{Y}_{A_{i} = a}}{\sum_{i=1}^{n-1}\mathbb{Y}_{A_{i} = a}}\]

Here \(\mathbb{Y}_{A_{i} = a} = \begin{cases}1 & \text{if } A_{i} = a \\ 0 & \text{otherwise} \end{cases}\)

\(R_{i} =\) reward received at time-step \(i\)

If the denominator is zero, we define the default value as \(Q_{n}(a) = 0\).
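The sample-average method is straightforward to implement. Below is a minimal sketch in Python, assuming NumPy, integer-indexed actions, and a history of `(action, reward)` pairs; the helper name `sample_average_q` is an illustrative choice, not something defined above.

```python
import numpy as np

def sample_average_q(history, num_actions):
    """Sample-average estimates Q(a) from a history of (action, reward) pairs."""
    q = np.zeros(num_actions)        # defaults to Q(a) = 0 for actions never taken
    counts = np.zeros(num_actions)
    for a, r in history:
        counts[a] += 1
        q[a] += r                    # accumulate rewards per action
    taken = counts > 0
    q[taken] /= counts[taken]        # average only where the denominator is nonzero
    return q

# Example: three pulls of a 2-armed bandit
print(sample_average_q([(0, 1.0), (1, 0.0), (0, 0.5)], num_actions=2))  # [0.75 0.  ]
```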

To maximize reward, the agent can select the action with the highest estimated value at each time-step:

\[A_{t} \leftarrow \underset{a}{\operatorname{argmax}} \ Q_{t}(a)\]

This is called Greedy Action Selection.

Alternatively, the agent can choose the greedy action with probability \(1-\epsilon\) and a uniformly random action with probability \(\epsilon\). This is called epsilon-greedy action selection.

\[A_t \leftarrow \begin{cases}\underset{a}{\operatorname{argmax}} \ Q_{t}(a) & \text{with probability } 1- \epsilon \\ a \sim \text{Uniform}(\{a_1 \dots a_k\}) & \text{with probability } \epsilon\end{cases}\]
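A short sketch of this selection rule under the same assumptions (Python with NumPy, integer-indexed actions); setting \(\epsilon = 0\) recovers pure greedy action selection.

```python
import numpy as np

def epsilon_greedy(q, epsilon, rng=np.random.default_rng()):
    """Return argmax_a Q(a) with probability 1 - epsilon, else a uniform random action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))   # explore: uniform over all k actions
    return int(np.argmax(q))               # exploit: greedy action

# Example
q = np.array([0.75, 0.0])
print(epsilon_greedy(q, epsilon=0.1))      # usually 0, occasionally a random action
```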

Incremental Estimate of Action Values:

\[Q_{n+1} = \frac{\sum_{i=1}^{n}R_{i}\cdot\mathbb{Y}_{A_{i} = a}}{\sum_{i=1}^{n}\mathbb{Y}_{A_{i} = a}}\]

\[= \frac{R_{n}\cdot\mathbb{Y}_{A_{n} = a} + \sum_{i=1}^{n-1}\mathbb{Y}_{A_{i} = a} \cdot \frac{\sum_{i=1}^{n-1}R_{i}\cdot\mathbb{Y}_{A_{i} = a}}{\sum_{i=1}^{n-1}\mathbb{Y}_{A_{i} = a}}}{\sum_{i=1}^{n}\mathbb{Y}_{A_{i} = a}}\]

\[= \frac{R_{n}\cdot \mathbb{Y}_{A_{n} = a} + Q_{n} \cdot \sum_{i=1}^{n-1}\mathbb{Y}_{A_{i} = a}}{\mathbb{Y}_{A_{n} = a} + \sum_{i = 1}^{n-1}\mathbb{Y}_{A_{i} = a}}\]

Since we update the estimate only for the action \(a\) taken at time-step \(n\), we have \(A_{n} = a\) and hence \(\mathbb{Y}_{A_{n} = a} = 1\).

Let \(k\) be the number of times the action \(a\) has been taken up to and including time-step \(n\), given by \(\sum_{i=1}^{n}\mathbb{Y}_{A_{i} = a}\).

Then the number of times the action \(a\) has been taken up to time-step \(n-1\), given by \(\sum_{i=1}^{n-1} \mathbb{Y}_{A_{i} = a}\), is \(k-1\), and the estimate becomes:

\[Q_{n+1} = \frac{R_{n} + Q_{n} \cdot (k-1)}{k}\]

\[\implies Q_{n+1} = Q_n + \frac{1}{k} (R_{n} - Q_{n})\]

If \(\frac{1}{k}\) is treated as a step-size \(\alpha_{n}\), then the equation becomes:

\[Q_{n+1} = Q_{n} + \alpha_{n}(R_{n} - Q_{n})\]
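This incremental form is easy to check numerically. The sketch below (with an illustrative `incremental_update` helper) confirms that using the step-size \(\alpha_{n} = \frac{1}{k}\) reproduces the plain sample average.

```python
def incremental_update(q_n, reward, step_size):
    """One update of the form Q_{n+1} = Q_n + alpha_n * (R_n - Q_n)."""
    return q_n + step_size * (reward - q_n)

# With step_size = 1/k the running estimate equals the sample average exactly
rewards = [1.0, 0.5, 0.0, 2.0]
q = 0.0
for k, r in enumerate(rewards, start=1):
    q = incremental_update(q, r, 1.0 / k)
print(q, sum(rewards) / len(rewards))   # 0.875 0.875
```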

General Estimation of Action Values:

\[Q_{n+1} = Q_{n} + \alpha_{n}(R_{n}-Q_{n})\]

We expand \(Q_{n}\) using the same incremental update, \(Q_{n} = Q_{n-1} + \alpha_{n-1}(R_{n-1} - Q_{n-1})\):

\[Q_{n+1} = Q_{n-1} +\alpha_{n-1}(R_{n-1}-Q_{n-1}) + \alpha_{n}(R_{n} - Q_{n-1} -\alpha_{n-1}(R_{n-1}-Q_{n-1}))\]

If \(\alpha_{n}\) is a fixed value, say \(\alpha\), this simplifies to:

\[Q_{n+1} = \alpha R_{n}+ (\alpha -\alpha^{2})R_{n-1}+(1-\alpha)^2Q_{n-1}\]

\[=\alpha R_{n}+(1-\alpha)\alpha R_{n-1} + (1-\alpha)^2 Q_{n-1}\]

Substituting \(Q_{n-1}\) recursively in the same way, we obtain:

\[Q_{n+1} = (1-\alpha)^n Q_{1} + \sum_{i=1}^{n}\alpha(1-\alpha)^{n-i}R_{i}\]
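As a quick sanity check (not part of the derivation itself), the sketch below compares the recursive constant-\(\alpha\) updates against this unrolled form for an arbitrary reward sequence; the two agree up to floating-point error.

```python
import numpy as np

alpha, q1 = 0.3, 0.0
rewards = np.array([1.0, 0.0, 2.0, 0.5, 1.5])

# Recursive constant-step-size updates
q = q1
for r in rewards:
    q = q + alpha * (r - q)

# Unrolled form: (1 - alpha)^n * Q_1 + sum_i alpha * (1 - alpha)^(n - i) * R_i
n = len(rewards)
weights = alpha * (1 - alpha) ** (n - np.arange(1, n + 1))
q_unrolled = (1 - alpha) ** n * q1 + np.dot(weights, rewards)

print(np.isclose(q, q_unrolled))   # True
```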

This unrolled expression is also called the general estimation rule for action values. The weight on each reward decays exponentially with its age, so the most recent rewards have the highest influence on the estimate after \(n\) time-steps; it is an exponential recency-weighted average. This representation is also very useful for estimating values in nonstationary problems, i.e. problems where the reward distribution \(p(r\vert a)\) for a given action changes with time.
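To illustrate the nonstationary case, here is a small hypothetical experiment: a single action whose expected reward jumps from 0 to 1 halfway through. The sample-average estimate, which weights all past rewards equally, lags behind, while the constant-\(\alpha\) estimate tracks the change.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.1
q_sample, q_const, k = 0.0, 0.0, 0

for t in range(2000):
    true_mean = 0.0 if t < 1000 else 1.0          # reward distribution shifts at t = 1000
    r = true_mean + rng.normal(0, 0.1)
    k += 1
    q_sample += (r - q_sample) / k                # sample average: equal weight on old rewards
    q_const += alpha * (r - q_const)              # constant alpha: recent rewards dominate

print(round(q_sample, 2), round(q_const, 2))      # roughly 0.5 vs roughly 1.0
```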