A gentle introduction to the fundamentals of reinforcement learning
There are many ways to estimate \(q_{*}(a)\); one of them is the sample-average method.
Under the sample-average method, the estimate at the \(n\)th time-step is:
\[Q_{n}(a) = \frac{ \sum_{i=1}^{n-1} R_{i} \cdot \mathbb{Y}_{A_{i} =a} \ }{\sum_{i=1}^{n-1}\mathbb{Y}_{A_{i} = a}}\]Here \(\mathbb{Y}_{A_{i}= a} = \begin{cases}1 \ \ \ \text{if} \ \ \ A_{i} = a \\ 0 \ \ \ \text{otherwise} \end{cases}\)
\(R_{i} =\) reward received at time-step \(i\).
If the denominator is zero, we define the default value as \(Q_{n}(a) = 0\).
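As a concrete illustration, here is a minimal sketch of the sample-average estimate in Python, computed directly from a stored history of actions and rewards. The function name `sample_average` and the toy data are illustrative choices, not anything prescribed above.

```python
import numpy as np

def sample_average(actions, rewards, a):
    """Q_n(a): mean of the rewards received on time-steps where action `a` was taken."""
    mask = np.asarray(actions) == a            # the indicator Y_{A_i = a}
    if mask.sum() == 0:
        return 0.0                             # default value when the denominator is zero
    return float(np.asarray(rewards)[mask].sum() / mask.sum())

# Toy history: three actions (0, 1, 2) over five time-steps
actions = [0, 2, 0, 1, 0]
rewards = [1.0, 0.0, 0.5, 2.0, 1.5]
print(sample_average(actions, rewards, 0))     # (1.0 + 0.5 + 1.5) / 3 = 1.0
print(sample_average(actions, rewards, 1))     # 2.0
```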
To exploit what it has learned so far, the agent can simply select the action with the highest estimated value:
\[\underset{a}{argmax} \ \ Q_{n}(a)\]This is called Greedy Action Selection.
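In code, greedy selection is just an argmax over the current estimates. One detail worth noting (added here for illustration, not stated above): `np.argmax` always returns the lowest-indexed action among ties, so the sketch below breaks ties uniformly at random instead.

```python
import numpy as np

def greedy(Q, rng=np.random.default_rng()):
    """Select argmax_a Q(a), breaking ties uniformly at random."""
    best = np.flatnonzero(Q == Q.max())        # all actions whose estimate is maximal
    return int(rng.choice(best))

Q = np.array([0.0, 1.5, 1.5, -0.2])
print(greedy(Q))                               # prints 1 or 2, chosen at random
```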
Alternatively, the agent can choose the greedy action with probability \(1-\epsilon\) and a uniformly random action with probability \(\epsilon\). This is called \(\epsilon\)-greedy action selection.
\[A_t \leftarrow \begin{cases}\underset{a}{argmax} \ Q_{t}(a) & \text{with probability} \ \ 1- \epsilon \\ a \sim Uniform(\{a_1 \dots a_k\}) & \text{with probability} \ \ \epsilon\end{cases}\]
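A minimal sketch of \(\epsilon\)-greedy selection under this rule; the function name `epsilon_greedy` and the test values of `Q` and \(\epsilon\) are illustrative.

```python
import numpy as np

def epsilon_greedy(Q, epsilon, rng=np.random.default_rng()):
    """With probability 1 - epsilon exploit (argmax); with probability epsilon explore uniformly."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))       # a ~ Uniform({a_1 ... a_k})
    return int(np.argmax(Q))                   # greedy action

Q = np.array([0.0, 1.5, 0.3])
choices = [epsilon_greedy(Q, epsilon=0.1) for _ in range(10_000)]
print(np.bincount(choices, minlength=3))       # action 1 dominates; actions 0 and 2 each appear roughly 3% of the time
```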
The sample-average estimate can also be computed incrementally. Since \(a\) is the action taken at time-step \(n\), we already know that \(\mathbb{Y}_{A_{n} = a} = 1\).
Let the number of times the action \(a\) has been taken before time-step \(n+1\), given by \(\sum_{i=1}^{n}\mathbb{Y}_{A_{i} = a}\), be \(k\).
Hence the number of times the action \(a\) has been taken before time-step \(n\), given by \(\sum_{i=1}^{n-1} \mathbb{Y}_{A_{i} = a}\), is \(k-1\).
\[Q_{n+1} = \frac{\sum_{i=1}^{n} R_{i} \cdot \mathbb{Y}_{A_{i} = a}}{\sum_{i=1}^{n}\mathbb{Y}_{A_{i} = a}} = \frac{R_{n} + Q_{n} \cdot (k-1)}{k}\] \[\implies Q_{n+1} = Q_n + \frac{1}{k} (R_{n} - Q_{n})\]If \(\frac{1}{k}\) is taken as the step-size \(\alpha_{n}\), then the equation becomes:
\[Q_{n+1} = Q_{n} + \alpha_{n}(R_{n} - Q_{n})\]
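A short sketch of this incremental update, assuming the step-size \(\alpha_{n} = \frac{1}{k}\) so that it reproduces the sample average exactly; the toy history is the same illustrative data as before.

```python
import numpy as np

def incremental_estimates(actions, rewards, num_actions):
    """Maintain Q incrementally via Q <- Q + (1/k) * (R - Q)."""
    Q = np.zeros(num_actions)                  # default estimate 0 for untried actions
    k = np.zeros(num_actions, dtype=int)       # how many times each action has been taken
    for a, r in zip(actions, rewards):
        k[a] += 1
        Q[a] += (r - Q[a]) / k[a]              # step-size alpha_n = 1/k recovers the sample average
    return Q

actions = [0, 2, 0, 1, 0]
rewards = [1.0, 0.0, 0.5, 2.0, 1.5]
print(incremental_estimates(actions, rewards, 3))   # [1.0, 2.0, 0.0] -- identical to the sample averages
```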
Replacing \(Q_{n}\) with its own incremental update, \(Q_{n} = Q_{n-1} + \alpha_{n-1}(R_{n-1} - Q_{n-1})\), gives
\[Q_{n+1} = Q_{n-1} +\alpha_{n-1}(R_{n-1}-Q_{n-1}) + \alpha_{n}\big(R_{n} - Q_{n-1} -\alpha_{n-1}(R_{n-1}-Q_{n-1})\big)\]If \(\alpha_{n}\) is a fixed value, say \(\alpha\), this simplifies to
\[Q_{n+1} = \alpha R_{n}+ (\alpha -\alpha^{2})R_{n-1}+(1-\alpha)^2Q_{n-1}\] \[=\alpha R_{n}+(1-\alpha)\alpha R_{n-1} + (1-\alpha)^2 Q_{n-1}\]Substituting \(Q_{n-1}\) recursively in the same way yields
\[Q_{n+1} = (1-\alpha)^n Q_{1} + \sum_{i=1}^{n}\alpha(1-\alpha)^{n-i}R_{i}\]This closed form is an exponential recency-weighted average of the rewards: the weight \(\alpha(1-\alpha)^{n-i}\) on \(R_{i}\) shrinks as the reward gets older, so the most recent rewards have the highest influence on the estimate. This makes the constant-step-size rule very useful for nonstationary problems, i.e. problems where the probability of reward for a given action, \(p(r\vert a)\), changes with time.
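As a quick sanity check (with an arbitrary reward sequence and \(\alpha = 0.1\), both chosen purely for illustration), the constant-step-size recursion and the closed-form weighted sum give the same value:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n, Q1 = 0.1, 50, 0.0
R = rng.normal(size=n)                         # arbitrary rewards R_1 ... R_n

# Recursive form: Q_{n+1} = Q_n + alpha * (R_n - Q_n), starting from Q_1
Q = Q1
for r in R:
    Q += alpha * (r - Q)

# Closed form: (1 - alpha)^n * Q_1 + sum_i alpha * (1 - alpha)^(n - i) * R_i
i = np.arange(1, n + 1)
Q_closed = (1 - alpha) ** n * Q1 + np.sum(alpha * (1 - alpha) ** (n - i) * R)

print(np.isclose(Q, Q_closed))                 # True: the two forms agree
```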