# Action Value Estimation: A Bandit Perspective

A gentle introduction to the fundamentals of reinforcement learning

## How to calculate action value estimates

There are many ways to compute an estimate $$Q(a)$$ of the true action value $$q_{*}(a)$$. One of them is the sample-average method.

Under this method, the estimate at the $$n$$th time-step is:

$Q_{n}(a) = \frac{ \sum_{i=1}^{n-1} R_{i} \cdot \mathbb{Y}_{A_{i} =a} \ }{\sum_{i=1}^{n-1}\mathbb{Y}_{A_{i} = a}}$

Here $$\mathbb{Y}_{A_{i}= a} = \begin{cases}1 \ \ \ \text{if} \ \ \ A_{i} = a \\ 0 \ \ \ \text{otherwise} \end{cases}$$

and $$R_{i}$$ is the reward received at time-step $$i$$.

If the denominator is zero, i.e. action $$a$$ has never been taken, we define the default value as $$Q(a) = 0$$.
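To make this concrete, here is a minimal sketch in Python (NumPy and the function name `sample_average` are my own choices for illustration) that computes the estimate from a recorded history of actions and rewards:

```python
import numpy as np

def sample_average(actions, rewards, a):
    """Sample-average estimate of Q(a) from a history of (action, reward) pairs."""
    actions = np.asarray(actions)
    rewards = np.asarray(rewards)
    taken = actions == a              # the indicator Y_{A_i = a}
    if taken.sum() == 0:              # action never taken: use the default value
        return 0.0
    return rewards[taken].sum() / taken.sum()

# Action 0 was taken three times with rewards 1, 0, 1, so Q(0) = 2/3
print(sample_average([0, 1, 0, 0], [1.0, 5.0, 0.0, 1.0], a=0))
```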

The agent's goal is to maximize the reward it receives, so a simple strategy is to always select the action with the highest estimated value:

$A_{t} \leftarrow \underset{a}{argmax} \ \ Q_{t}(a)$

This is called greedy action selection.

Alternatively, the agent can choose the greedy action with probability $$1-\epsilon$$ and, with probability $$\epsilon$$, an action drawn uniformly at random from all $$k$$ actions. This is called epsilon-greedy action selection.

$A_t \leftarrow \begin{cases}\underset{a}{argmax} \ Q_{t}(a) & \text{with probability} \ \ 1-\epsilon \\ a \sim \text{Uniform}(\{a_1 \dots a_k\}) & \text{with probability} \ \ \epsilon \end{cases}$
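As a quick sketch (again Python/NumPy; `epsilon_greedy` and the seeded generator are illustrative choices, not part of any fixed API):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, epsilon):
    """Greedy w.p. 1 - epsilon, a uniformly random arm w.p. epsilon."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))   # explore: uniform over all k arms
    return int(np.argmax(Q))               # exploit: the greedy action

Q = np.array([0.2, 0.9, 0.5])
print(epsilon_greedy(Q, epsilon=0.1))      # usually arm 1, occasionally random
```

Note that `np.argmax` breaks ties by picking the lowest-indexed action; a common refinement is to break ties uniformly at random.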

## Incremental Estimate of Action Values:

$Q_{n+1} = \frac{\sum_{i=1}^{n}R_{i} \cdot \mathbb{Y}_{A_{i} = a}}{\sum_{i=1}^{n}\mathbb{Y}_{A_{i} = a}}$

$= \frac{R_{n} \cdot \mathbb{Y}_{A_{n} = a} + \sum_{i=1}^{n-1}\mathbb{Y}_{A_{i} = a} \cdot \frac{\sum_{i=1}^{n-1}R_{i} \cdot \mathbb{Y}_{A_{i} = a}}{\sum_{i=1}^{n-1}\mathbb{Y}_{A_{i} = a}}}{\sum_{i=1}^{n}\mathbb{Y}_{A_{i} = a}}$

$= \frac{R_{n} \cdot \mathbb{Y}_{A_{n} = a} + Q_{n} \cdot \sum_{i=1}^{n-1}\mathbb{Y}_{A_{i} = a}}{\mathbb{Y}_{A_{n} = a} + \sum_{i=1}^{n-1}\mathbb{Y}_{A_{i} = a}}$

Since we only update the estimate for action $$a$$ at a time-step where it was actually taken, $$A_{n} = a$$ and therefore $$\mathbb{Y}_{A_{n} = a} = 1$$.

Let $$k = \sum_{i=1}^{n}\mathbb{Y}_{A_{i} = a}$$ be the number of times action $$a$$ has been taken up to and including time-step $$n$$.

Then $$\sum_{i=1}^{n-1} \mathbb{Y}_{A_{i} = a} = k-1$$ is the number of times action $$a$$ had been taken before time-step $$n$$, and the expression simplifies to:

$Q_{n+1} = \frac{R_{n} + Q_{n} \cdot (k-1)}{k}$

$\implies Q_{n+1} = Q_{n} + \frac{1}{k} (R_{n} - Q_{n})$

If $$\frac{1}{k}$$ is treated as a step-size $$\alpha_{n}$$, the update becomes:

$Q_{n+1} = Q_{n} + \alpha_{n}(R_{n} - Q_{n})$
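A short sanity check in Python (the helper name `incremental_update` is just for this sketch): iterating the update with $$\alpha_{n} = \frac{1}{k}$$ reproduces the plain sample average.

```python
def incremental_update(q, reward, k):
    """Q_{n+1} = Q_n + (1/k) * (R_n - Q_n), with k = number of pulls of this arm."""
    return q + (reward - q) / k

rewards = [1.0, 0.0, 1.0, 1.0]
q = 0.0
for k, r in enumerate(rewards, start=1):
    q = incremental_update(q, r, k)

print(q, sum(rewards) / len(rewards))  # both print 0.75
```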

## General Estimation of Action Values:

$Q_{n+1} = Q_{n} + \alpha_{n}(R_{n}-Q_{n})$

We expand $$Q_{n}$$ by applying the same update equation one step earlier, $$Q_{n} = Q_{n-1} + \alpha_{n-1}(R_{n-1} - Q_{n-1})$$:

$Q_{n+1} = Q_{n-1} +\alpha_{n-1}(R_{n-1}-Q_{n-1}) + \alpha_{n}(R_{n} - Q_{n-1} -\alpha_{n-1}(R_{n-1}-Q_{n-1}))$

If $$\alpha_{n}$$ is a fixed value, say $$\alpha$$, this simplifies to:

$Q_{n+1} = \alpha R_{n}+ (\alpha -\alpha^{2})R_{n-1}+(1-\alpha)^2Q_{n-1}$

$=\alpha R_{n}+(1-\alpha)\alpha R_{n-1} + (1-\alpha)^2 Q_{n-1}$

$$Q_{n-1}$$ can be expanded recursively in the same way, all the way down to the initial estimate $$Q_{1}$$, which yields:

$Q_{n+1} = (1-\alpha)^n Q_{1} + \sum_{i=1}^{n}\alpha(1-\alpha)^{n-i}R_{i}$
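A quick numerical check of this identity (a throwaway Python snippet; the reward sequence is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, Q1 = 0.1, 0.0
rewards = rng.normal(1.0, 1.0, size=50)   # an arbitrary reward sequence R_1..R_n

# Iterate the incremental rule: Q_{n+1} = Q_n + alpha * (R_n - Q_n)
q = Q1
for r in rewards:
    q += alpha * (r - q)

# Evaluate the closed form directly
n = len(rewards)
closed = (1 - alpha) ** n * Q1 + sum(
    alpha * (1 - alpha) ** (n - i) * r for i, r in enumerate(rewards, start=1)
)
print(q, closed)  # identical up to floating-point error
```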

This expanded form is also called the General Estimation Rule of action values. The weight $$\alpha(1-\alpha)^{n-i}$$ shrinks as a reward gets older, so the most recent rewards have the highest influence on the estimate; this is an exponential recency-weighted average. The representation is also very useful for nonstationary problems, i.e. problems where the reward distribution $$p(r \vert a)$$ for a given action changes with time.
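To see why, here is a small illustrative experiment (assuming NumPy; the drift and noise parameters are made up for the demo) with a single arm whose true value $$q_{*}$$ random-walks over time:

```python
import numpy as np

rng = np.random.default_rng(1)

q_true, q_avg, q_const, alpha = 0.0, 0.0, 0.0, 0.1
for k in range(1, 10_001):
    q_true += rng.normal(0.0, 0.01)    # nonstationary: the true value drifts
    r = q_true + rng.normal(0.0, 1.0)  # noisy reward sample
    q_avg += (r - q_avg) / k           # sample average, alpha_n = 1/k
    q_const += alpha * (r - q_const)   # constant step-size

print(f"true={q_true:.2f}  sample-avg={q_avg:.2f}  constant-alpha={q_const:.2f}")
```

Because the sample average weights all past rewards equally, it lags behind the drifting true value, while the constant step-size estimate, which forgets old rewards exponentially fast, typically tracks it much more closely.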