Motivations for Analysis and Key Concepts
In a single-armed bandit, we can simplify this to:
because \(Q_{k}(a) = Q_{k}\) [the action argument drops out since there is only one action in a single-armed bandit problem, and \(N_t(a)\) simplifies to just \(k\), the number of times the arm has been pulled]
\[\begin{align*} &Q_1 = R_1 \\ &Q_2 = \frac{R_1 + R_2}{2} \\ &Q_3 = \frac{R_1 + R_2 + R_3}{3} \\ &\vdots\\ &Q_{k+1} = \frac{1}{k} \sum_{i=1}^{k} R_i \end{align*}\] \[\begin{equation*} \begin{split} Q_{k+1} &= \frac{1}{k} \left( R_k + \sum_{i=1}^{k-1}R_i\right)\\ &= \frac{1}{k} \left( R_k + (k-1)Q_k\right)\\ &= \frac{1}{k} \left( R_k + (k-1)Q_k + Q_k - Q_k\right)\\ &= \frac{1}{k} \left( R_k + kQ_k - Q_k\right)\\ &= Q_k + \frac{1}{k} \left( R_k - Q_k\right)\\ \end{split} \end{equation*}\]where only \(Q_k\) and \(R_k\) from the previous step are used to update the Q-value. This is called the incremental update rule.
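The incremental update rule can be checked in a few lines of Python; the reward values below are arbitrary illustrations:

```python
# Incremental update rule: Q_{k+1} = Q_k + (1/k) * (R_k - Q_k).
# Hypothetical reward stream; any numbers would do.
rewards = [1.0, 0.0, 2.0, 1.5, 0.5]

q = 0.0  # when k = 1 the step size is 1, so the first update sets Q_1 = R_1
for k, r in enumerate(rewards, start=1):
    q = q + (r - q) / k  # no need to store past rewards

batch_mean = sum(rewards) / len(rewards)
print(q, batch_mean)  # the incremental estimate equals the sample mean
```

The point of the derivation is exactly this: the running estimate matches the batch average while storing only one number per arm instead of the full reward history.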
Note that a similar equation can be written for the multi-armed bandit problem, and it would look like this:
\[Q_{k+1}(a) = Q_k(a) + \frac{1}{N_k(a)}(R_k - Q_k(a))\]The way to look at it is: \(\text{current value} = \text{previous value} + \text{scaling factor} \cdot [\text{target value} - \text{previous value}]\)
which says that, starting from the previous value, we take a step of size equal to the scaling factor towards the target value.
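A minimal sketch of the per-action update, assuming a hypothetical 3-armed Bernoulli bandit (the arm probabilities are made up for illustration):

```python
import random

random.seed(0)
true_means = [0.2, 0.5, 0.8]   # assumed Bernoulli arm payoff probabilities
Q = [0.0] * 3                  # value estimate per action
N = [0] * 3                    # pull count per action, i.e. N_k(a)

for _ in range(3000):
    a = random.randrange(3)    # pull arms uniformly (pure exploration)
    r = 1.0 if random.random() < true_means[a] else 0.0
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]  # step of size 1/N_k(a) towards the target R_k

print([round(q, 2) for q in Q])  # estimates approach the true means
```

Each arm keeps its own count, so the step size \(1/N_k(a)\) shrinks independently per action as that action accumulates samples.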
Optimistic Initial Values : to encourage exploration, set the initial action-value estimates optimistically high; since every observed reward falls below the initial estimate, a greedy agent ends up trying all actions during its first few interactions with the environment
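A small sketch of this effect, assuming hypothetical Bernoulli arms and an initial value of 5.0 (well above any possible reward):

```python
import random

random.seed(1)
true_means = [0.3, 0.7]   # hypothetical arm payoffs, all rewards in [0, 1]
Q = [5.0, 5.0]            # optimistic initial values
N = [0, 0]

pulled = set()
for t in range(200):
    a = max(range(2), key=lambda i: Q[i])  # purely greedy, no epsilon
    pulled.add(a)
    r = 1.0 if random.random() < true_means[a] else 0.0
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]

# Every reward is below 5, so each pull drags that arm's Q down and the
# greedy choice rotates through the untried arms: optimism drives exploration.
print(pulled)
```

Note that the agent here is entirely greedy; the exploration comes from the initialization alone, which is why this trick only helps early on (and only in stationary problems).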
Exploration-Exploitation Tradeoff : \(\epsilon\)-greedy, softmax/Boltzmann exploration, Upper Confidence Bound (UCB), Thompson sampling
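Of the listed strategies, \(\epsilon\)-greedy is the simplest to sketch; the arm payoffs and \(\epsilon = 0.1\) below are illustrative choices:

```python
import random

random.seed(42)
true_means = [0.1, 0.4, 0.9]  # hypothetical Bernoulli arm payoffs
Q = [0.0] * 3
N = [0] * 3
eps = 0.1

for t in range(5000):
    if random.random() < eps:
        a = random.randrange(3)                # explore: random action
    else:
        a = max(range(3), key=lambda i: Q[i])  # exploit: greedy action
    r = 1.0 if random.random() < true_means[a] else 0.0
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]

print(N)  # the best arm should dominate the pull counts
```

The \(\epsilon\) fraction of random pulls keeps every arm's estimate improving, while the remaining \(1 - \epsilon\) of pulls exploit the current best estimate.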
Asymptotic Convergence :
Bias-Variance Tradeoff :
PAC Optimality/Learning : Probably Approximately Correct learning
Contextual Bandits :
Adversarial Bandits :
Reinforcement Learning :
Non-Stationary Bandits :