Important Ideas from Bandit Algorithms

Motivations for Analysis and key concepts

1. Key Ideas

Incremental Update Rule : Many ideas in bandit algorithms are motivated by the need to simplify the update rules and make them more efficient. This is one such idea. Basically built to avoid storing the trajectory of rewards for computing q values. See sequential decision making for the motivation of \(Q(a)\) given below \(Q_t(a) = \frac{R_1 + R_2 + R_3 + \cdots + R_{N_t(a)}}{N_t(a)}\)

In a single arm bandit, we can simplify this to:

because \(Q_{k}(a) = Q_{k}\) [action dropped as there is only one action in a single bandit problem and \(N_t(a)\) simplifies to just \(n\) as that is the number of times the arm was pulled]

\[\begin{align*} &Q_1 = R_1 \\ &Q_2 = \frac{R_1 + R_2}{2} \\ &Q_3 = \frac{R_1 + R_2 + R_3}{3} \\ &\vdots\\ &Q_{k+1} = \frac{1}{k} \sum_{i=1}^{k} R_i \end{align*}\] \[\begin{equation*} \begin{split} Q_{k+1} &= \frac{1}{k} \left( R_k + \sum_{i=1}^{k-1}R_i\right)\\ &= \frac{1}{k} \left( R_k + (k-1)Q_k\right)\\ &= \frac{1}{k} \left( R_k + (k-1)Q_k + Q_k - Q_k\right)\\ &= \frac{1}{k} \left( R_k + kQ_k - Q_k\right)\\ &= Q_k + \frac{1}{k} \left( R_k - Q_k\right)\\ \end{split} \end{equation*}\]

where only \(Q_k\) and \(R_k\) from the the previous time are used to update the Q value. This is called the incremental update rule.

Note that a similar equation can be written for a multi-arm bandit problem and it would look like this

\[Q_{k+1}(a) = Q_k(a) + \frac{1}{N_k(a)}(R_k - Q_k(a))\]

The way to look at it is: \(\text{current value = previous value + scaling factor . [target value - previous value]}\)

which basically says that we are taking a step of size scaling factor towards the target value given the current value.

Optimistic Initial Values : to encourage exploration- setting an optimistic initial value explores all possible actions during the initial few interactions of the agent with the environment
Exploration-Exploitation Tradeoff :
- \(\epsilon\)-greedy :
- softmax :
- boltzmann exploration :
- upper confidence bound :
- thompson sampling :
Assymptotic Convergence :
Bias-Variance Tradeoff :
PAC Optimality/Learning : Probably Approximately Correct learning
Stochastic Bandits :
- Policy Search and Stochastic Averaging :
  - Learning Automata and Variable Structure Finite State Machines :
- REINFORCE :
- Bayesian Bandits and Thompson Sampling :
Contextual Bandits :
Adversarial Bandits :
Reinforcement Learning :
Non-Stationary Bandits :
Regret Optimality : rate of increase of regret is a measure of the performance of the algorithm. it is generally lower bounded by \(\Omega(\log T)\) where T is the number of time steps