Action Value Estimation: A Bandit Perspective

A gentle introduction to the fundamentals of reinforcement learning

How to calculate action value estimates

How do we compute the estimate \(Q(a)\) of the true action value \(q_{*}(a)\)?

There are many ways to estimate \(q_{*}(a)\); one of them is the sample-average method.

By the sample-average method, the estimate at the \(n\)th time-step is:

\[Q_{n}(a) = \frac{\sum_{i=1}^{n-1} R_{i} \cdot \mathbb{Y}_{A_{i} = a}}{\sum_{i=1}^{n-1}\mathbb{Y}_{A_{i} = a}}\]

Here \(\mathbb{Y}_{A_{i} = a} = \begin{cases}1 & \text{if } A_{i} = a \\ 0 & \text{otherwise} \end{cases}\)

\(R_{i} =\) reward received at time-step \(i\)

If the denominator is zero, we define the default value as \(Q_{n}(a) = 0\).
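The sample-average method is straightforward to implement. Below is a minimal sketch in Python, assuming NumPy, integer-indexed actions, and a history of `(action, reward)` pairs; the helper name `sample_average_q` is an illustrative choice, not something defined above.

```python
import numpy as np

def sample_average_q(history, num_actions):
    """Sample-average estimates Q(a) from a history of (action, reward) pairs."""
    q = np.zeros(num_actions)        # defaults to Q(a) = 0 for actions never taken
    counts = np.zeros(num_actions)
    for a, r in history:
        counts[a] += 1
        q[a] += r                    # accumulate rewards per action
    taken = counts > 0
    q[taken] /= counts[taken]        # average only where the denominator is nonzero
    return q

# Example: three pulls of a 2-armed bandit
print(sample_average_q([(0, 1.0), (1, 0.0), (0, 0.5)], num_actions=2))  # [0.75 0.  ]
```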

To maximize reward, the agent can select the action with the highest estimated value at each time-step:

\[A_{t} \leftarrow \underset{a}{\operatorname{argmax}} \ Q_{t}(a)\]

This is called Greedy Action Selection.

Alternatively, the agent can choose the greedy action with probability \(1-\epsilon\) and a uniformly random action with probability \(\epsilon\). This is called epsilon-greedy action selection.

\[A_t \leftarrow \begin{cases}\underset{a}{\operatorname{argmax}} \ Q_{t}(a) & \text{with probability } 1- \epsilon \\ a \sim \text{Uniform}(\{a_1 \dots a_k\}) & \text{with probability } \epsilon\end{cases}\]
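A short sketch of this selection rule under the same assumptions (Python with NumPy, integer-indexed actions); setting \(\epsilon = 0\) recovers pure greedy action selection.

```python
import numpy as np

def epsilon_greedy(q, epsilon, rng=np.random.default_rng()):
    """Return argmax_a Q(a) with probability 1 - epsilon, else a uniform random action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))   # explore: uniform over all k actions
    return int(np.argmax(q))               # exploit: greedy action

# Example
q = np.array([0.75, 0.0])
print(epsilon_greedy(q, epsilon=0.1))      # usually 0, occasionally a random action
```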

Incremental Estimate of Action Values:

\[Q_{n+1} = \frac{\sum_{i=1}^{n}R_{i}\cdot\mathbb{Y}_{A_{i} = a}}{\sum_{i=1}^{n}\mathbb{Y}_{A_{i} = a}}\]

\[= \frac{R_{n}\cdot\mathbb{Y}_{A_{n} = a} + \sum_{i=1}^{n-1}\mathbb{Y}_{A_{i} = a} \cdot \frac{\sum_{i=1}^{n-1}R_{i}\cdot\mathbb{Y}_{A_{i} = a}}{\sum_{i=1}^{n-1}\mathbb{Y}_{A_{i} = a}}}{\sum_{i=1}^{n}\mathbb{Y}_{A_{i} = a}}\]

\[= \frac{R_{n}\cdot \mathbb{Y}_{A_{n} = a} + Q_{n} \cdot \sum_{i=1}^{n-1}\mathbb{Y}_{A_{i} = a}}{\mathbb{Y}_{A_{n} = a} + \sum_{i = 1}^{n-1}\mathbb{Y}_{A_{i} = a}}\]

Since we update the estimate only for the action \(a\) taken at time-step \(n\), we have \(A_{n} = a\) and hence \(\mathbb{Y}_{A_{n} = a} = 1\).

Let \(k\) be the number of times the action \(a\) has been taken up to and including time-step \(n\), given by \(\sum_{i=1}^{n}\mathbb{Y}_{A_{i} = a}\).

Then the number of times the action \(a\) has been taken up to time-step \(n-1\), given by \(\sum_{i=1}^{n-1} \mathbb{Y}_{A_{i} = a}\), is \(k-1\), and the estimate becomes:

\[Q_{n+1} = \frac{R_{n} + Q_{n} \cdot (k-1)}{k}\]

\[\implies Q_{n+1} = Q_n + \frac{1}{k} (R_{n} - Q_{n})\]

If \(\frac{1}{k}\) is treated as a step-size \(\alpha_{n}\), then the equation becomes:

\[Q_{n+1} = Q_{n} + \alpha_{n}(R_{n} - Q_{n})\]
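This incremental form is easy to check numerically. The sketch below (with an illustrative `incremental_update` helper) confirms that using the step-size \(\alpha_{n} = \frac{1}{k}\) reproduces the plain sample average.

```python
def incremental_update(q_n, reward, step_size):
    """One update of the form Q_{n+1} = Q_n + alpha_n * (R_n - Q_n)."""
    return q_n + step_size * (reward - q_n)

# With step_size = 1/k the running estimate equals the sample average exactly
rewards = [1.0, 0.5, 0.0, 2.0]
q = 0.0
for k, r in enumerate(rewards, start=1):
    q = incremental_update(q, r, 1.0 / k)
print(q, sum(rewards) / len(rewards))   # 0.875 0.875
```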

General Estimation of Action Values:

\[Q_{n+1} = Q_{n} + \alpha_{n}(R_{n}-Q_{n})\]

We expand \(Q_{n}\) using the same incremental update, \(Q_{n} = Q_{n-1} + \alpha_{n-1}(R_{n-1} - Q_{n-1})\):

\[Q_{n+1} = Q_{n-1} +\alpha_{n-1}(R_{n-1}-Q_{n-1}) + \alpha_{n}(R_{n} - Q_{n-1} -\alpha_{n-1}(R_{n-1}-Q_{n-1}))\]

If \(\alpha_{n}\) is a fixed value, say \(\alpha\), this simplifies to:

\[Q_{n+1} = \alpha R_{n}+ (\alpha -\alpha^{2})R_{n-1}+(1-\alpha)^2Q_{n-1}\]

\[=\alpha R_{n}+(1-\alpha)\alpha R_{n-1} + (1-\alpha)^2 Q_{n-1}\]

Substituting \(Q_{n-1}\) recursively in the same way, we obtain:

\[Q_{n+1} = (1-\alpha)^n Q_{1} + \sum_{i=1}^{n}\alpha(1-\alpha)^{n-i}R_{i}\]
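As a quick sanity check (not part of the derivation itself), the sketch below compares the recursive constant-\(\alpha\) updates against this unrolled form for an arbitrary reward sequence; the two agree up to floating-point error.

```python
import numpy as np

alpha, q1 = 0.3, 0.0
rewards = np.array([1.0, 0.0, 2.0, 0.5, 1.5])

# Recursive constant-step-size updates
q = q1
for r in rewards:
    q = q + alpha * (r - q)

# Unrolled form: (1 - alpha)^n * Q_1 + sum_i alpha * (1 - alpha)^(n - i) * R_i
n = len(rewards)
weights = alpha * (1 - alpha) ** (n - np.arange(1, n + 1))
q_unrolled = (1 - alpha) ** n * q1 + np.dot(weights, rewards)

print(np.isclose(q, q_unrolled))   # True
```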

This unrolled expression is also called the general estimation rule for action values. The weight on each reward decays exponentially with its age, so the most recent rewards have the highest influence on the estimate after \(n\) time-steps; it is an exponential recency-weighted average. This representation is also very useful for estimating values in nonstationary problems, i.e. problems where the reward distribution \(p(r\vert a)\) for a given action changes with time.
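To illustrate the nonstationary case, here is a small hypothetical experiment: a single action whose expected reward jumps from 0 to 1 halfway through. The sample-average estimate, which weights all past rewards equally, lags behind, while the constant-\(\alpha\) estimate tracks the change.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 0.1
q_sample, q_const, k = 0.0, 0.0, 0

for t in range(2000):
    true_mean = 0.0 if t < 1000 else 1.0          # reward distribution shifts at t = 1000
    r = true_mean + rng.normal(0, 0.1)
    k += 1
    q_sample += (r - q_sample) / k                # sample average: equal weight on old rewards
    q_const += alpha * (r - q_const)              # constant alpha: recent rewards dominate

print(round(q_sample, 2), round(q_const, 2))      # roughly 0.5 vs roughly 1.0
```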