Sequential Decision Making

A gentle introduction to the fundamentals of reinforcement learning

In reinforcement learning, the agent generates its own training data by interacting with the world.

This means there is no fixed desired output for a given input (state, action) before the agent starts interacting with the environment.

Reinforcement learning is a framework for sequential decision making under uncertainty (of the model of the environment, the model of the agent, or both).

The agent also does not have a predetermined action at each state.

It in fact evaluates the set of actions it can take at each state (probabilities of actions given the state) by measuring the reward it gets from the environment after interacting with it (i.e. executing a specific action).

The benefit of an action can be observed by the agent in the short term; this is known as the instantaneous reward.

The benefit can also be observed/evaluated (as an expectation) by the agent at the end of a task or over a large number of time-steps; this is known as the long-term reward.
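To make this concrete, here is a minimal Python sketch of the agent-environment interaction loop through which the agent generates its own data; the `step` function and its reward rule are hypothetical placeholders, not a specific library API.

```python
import random

# Minimal sketch of the interaction loop: the agent picks actions, the
# environment returns rewards, and the agent accumulates its own data.
# step() and its reward rule are hypothetical placeholders.

def step(state, action):
    """Toy environment: returns (next_state, reward)."""
    reward = 1.0 if action == state % 3 else 0.0   # instantaneous reward
    return state + 1, reward

state, total_reward = 0, 0.0
history = []                                       # data generated by interaction

for t in range(100):
    action = random.randrange(3)                   # agent chooses an action
    state, reward = step(state, action)
    total_reward += reward                         # long-term (cumulative) reward
    history.append((state, action, reward))        # this becomes the training data
```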

Definitions:

  1. Action : The control input the agent can choose/apply to interact with its environment. These inputs more often than not affect the state of the agent. The dimension of the action vector is equal to the number of control inputs of the agent.

    \[U \subset \mathbb{R}^n, \quad a \in A, \quad A = \{ u \mid u \in U \}\]

    where \(U\) is the control input subspace and \(A\) is the action set.

  2. Reward : If the agent achieves a desired state, it gets a reward. The reward is a scalar value.

    \[R \subset \mathbb{R}, \quad r \in R\]

    Reward is an immediate indicator of how good an action is.

    The goal of the reinforcement learning agent is to maximize its total reward.

    Note that at each time-step the number of possible rewards is greater than or equal to the number of possible actions, since a single action may lead to more than one possible reward.

    This means the event space of rewards is at least as large as the event space of actions.

    In a scenario where the agent has to track a desired state, the problem may be formulated with intermediate rewards for intermediate states and a maximum final reward for reaching the final state.

  3. (Action) Value : It is the expected reward at the \(t\)-th time-step when ( / after) the action \(A_t = a\) is taken.

\[q_{*}(a) = \mathbb{E}[R_{t} \mid A_{t} = a] = \sum_{r} p(r \vert a)\, r \qquad \forall a \in \{1,\dots,k\}\]

Here, time step \(t\) and action \(a\) are fixed.

For each action \(a \in A\) at time step \(t\) there is a probability \(p(r \vert a)\) that the reward \(r\) is given by the environment. Hence the reward \(R_{t}\) at time-step \(t\) enters as an expectation, i.e. the probability-weighted sum of the individual rewards.

\(R_{t}\) cannot be given directly because it is a random variable whose value depends on the probability of each reward for the action \(a\) at time-step \(t\), which leads to the probability-weighted sum.
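As a small illustration, the following Python sketch computes \(q_{*}(a)\) for a single fixed action as the probability-weighted sum \(\sum_{r} p(r \vert a)\, r\); the reward values and probabilities are made up for the example.

```python
# Toy computation of q*(a) = sum_r p(r|a) * r for one fixed action a.
# The reward values and their probabilities are made up for illustration.
rewards = [0.0, 1.0, 5.0]   # possible rewards for action a
probs   = [0.5, 0.3, 0.2]   # p(r | a), assumed known here

q_star_a = sum(p * r for p, r in zip(probs, rewards))
print(q_star_a)  # 0.5*0.0 + 0.3*1.0 + 0.2*5.0 = 1.3
```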

Note that we do not know \(p(r \vert a)\) with certainty because of our lack of knowledge of the agent and the environment. Our knowledge gets better with each time-step/action.

Because we do not know \(p(r \vert a)\) at each time step, we do not know \(q_{*}(a)\) exactly. We only know its estimate, \(Q(a)\), which we want to be as close to \(q_{*}(a)\) as possible.
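One common way to build this estimate (a sketch of the sample-average method, with a hypothetical stand-in for the environment) is to keep a running mean of the rewards actually observed for each action:

```python
import random

# Sample-average estimate Q(a) for a 3-armed bandit.
# true_reward() is a hypothetical stand-in for the unknown environment.
def true_reward(action):
    means = [0.2, 1.0, 0.5]            # hidden from the agent
    return random.gauss(means[action], 1.0)

k = 3
Q = [0.0] * k                          # estimates Q(a)
N = [0] * k                            # times each action was taken

for t in range(10000):
    a = random.randrange(k)            # explore uniformly for simplicity
    r = true_reward(a)                 # observe a reward from the environment
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]          # incremental update of the running mean

print(Q)                               # approaches the true means [0.2, 1.0, 0.5]
```

With more observed rewards per action, \(Q(a)\) drifts toward \(q_{*}(a)\), which is why the estimate gets better with each time-step/action.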

What is \(Q(a)\)?

\(Q(a)\) is the estimate of \(q_{*}(a)\).

Why is \(q_{*}(a)\) an expectation of \(R_{t}\)?

Because any scalar reward \(r\) from the set of possible rewards can be obtained, each with probability \(p(r \vert a)\).

How to read \(q_{*}(a)\)?

This can be read as: The expectation of the reward \(R_{t}\) that the agent will get at time \(t\) when an action \(A_{t} = a\) is chosen.

What happens if all actions are rewarded equally at a time-step / if all rewards are equally likely at a time-step?

Note how the expectation becomes a simple average of the rewards when each reward is equally likely for the selected action, i.e. \(p(r_{i} \vert a) = \frac{1}{n}\) for each of the \(n\) possible rewards.

\[q_{*}(a) = \frac{1}{n}\sum_{i=1}^{n} r_{i} \ \ \forall r_{i} \in \{r_{1},\dots,r_{n}\}\]
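A quick numeric check of this special case, with toy reward values chosen only for illustration:

```python
# With equally likely rewards, the weighted sum reduces to the simple average.
rewards = [0.0, 1.0, 5.0]
n = len(rewards)

weighted = sum((1 / n) * r for r in rewards)   # sum_r p(r|a) * r with p(r|a) = 1/n
mean     = sum(rewards) / n                    # simple average

print(weighted, mean)                          # both print 2.0
```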