Bellman Backup State Value Equations Derivation

Behold the beauty of understanding the continuum through discretization

The gain (return) \(G_t\) of an agent is defined at a point in time \(t\) as the discounted sum of rewards the agent obtains from its environment after executing an action \(a\) at that time \(t\): it receives a reward \(R_{t+1}\) at time \(t+1\), a reward \(R_{t+2}\) at time \(t+2\), and so on.

\[G_t = \gamma^0 R_{t+1} + \gamma^1 R_{t+2} + \dots + \gamma^{T-t-1} R_{T}\] \[G_t = \sum_{k = t+1}^T \gamma^{k-t-1} R_k\]

We can have either \(T = \infty\) or \(\gamma = 1\), but not both.

because…

If the task is continuing (non-terminating, non-episodic), then we use discounted rewards, i.e. we weight by a factor that is less than one and keeps diminishing; otherwise the sum of weighted rewards can blow up to infinity. If the task is episodic, then the return can also be a simple undiscounted sum.
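
To make the definition concrete, here is a minimal sketch in Python (not from the original text) that computes \(G_t\) for a hypothetical finite reward sequence and checks the recursion \(G_t = R_{t+1} + \gamma G_{t+1}\) that the derivation below relies on. The reward values and \(\gamma\) are made up purely for illustration.

```python
# A minimal sketch of the discounted return
# G_t = sum_{k=t+1}^{T} gamma^(k-t-1) R_k for a finite, episodic task.

def discounted_return(rewards, gamma):
    """rewards = [R_{t+1}, R_{t+2}, ..., R_T]; returns G_t."""
    g = 0.0
    for k, r in enumerate(rewards):      # k = 0 corresponds to R_{t+1}
        g += (gamma ** k) * r
    return g

# Hypothetical reward sequence, for illustration only.
rewards = [1.0, 0.0, 2.0, 5.0]
gamma = 0.9

g_t = discounted_return(rewards, gamma)

# Sanity check of the recursion used in equation (1): G_t = R_{t+1} + gamma * G_{t+1}.
g_t_plus_1 = discounted_return(rewards[1:], gamma)
assert abs(g_t - (rewards[0] + gamma * g_t_plus_1)) < 1e-9

print(g_t)   # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*5.0 = 6.265
```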

Let us assume the agent has a preference (a probability) for certain actions when it is in certain states, meaning it has a policy \(\pi(a|s)\). Then the value with respect to this policy, given a starting state \(s\) at time instant \(t\), i.e. \(S_t = s\), is defined as:

\[\begin{equation}\begin{split} V_{\pi}(s) &\doteq \mathbb{E}_{\pi}[G_t|S_t=s] \\ &= \mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1} | S_t = s] \\ &= \mathbb{E}_\pi[R_{t+1} | S_t = s] + \gamma\mathbb{E}_\pi[G_{t+1} | S_t = s]\end{split}\end{equation}\]

and,

\[\begin{equation}\begin{split}V_{\pi}(s') &\doteq \mathbb{E}_{\pi}[G_{t+1}|S_{t+1}=s'] \end{split}\end{equation}\]

\((1)\) and \((2)\) are the same if \(s = s'\), which means that, given a policy, we have the same value function irrespective of the time step at which a state is visited, i.e. irrespective of the time-dependent state/state transitions.
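
Since the value function is just the expected gain under the policy, it can be estimated by averaging sampled returns. Below is a hedged sketch of such a Monte Carlo estimate on a small, entirely hypothetical two-state MDP; the states `s0`, `s1`, actions `a0`, `a1`, and the tables `p` and `pi` are invented for illustration and are not from the text.

```python
# A sketch of estimating V_pi(s) = E_pi[G_t | S_t = s] by sampling returns.
import random

gamma = 0.9

# Hypothetical dynamics p(s', r | s, a): (s, a) -> list of (s', r, prob).
p = {
    ("s0", "a0"): [("s0", 0.0, 0.5), ("s1", 1.0, 0.5)],
    ("s0", "a1"): [("s1", 1.0, 1.0)],
    ("s1", "a0"): [("s0", 0.0, 1.0)],
    ("s1", "a1"): [("s1", 2.0, 1.0)],
}

# Hypothetical policy pi(a | s).
pi = {"s0": {"a0": 0.5, "a1": 0.5}, "s1": {"a0": 0.2, "a1": 0.8}}

def sample_return(s, horizon=200):
    """Roll out one trajectory from s under pi and accumulate gamma^k * R."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):                      # truncate the infinite sum
        a = random.choices(list(pi[s]), weights=pi[s].values())[0]
        s_next, r, _ = random.choices(p[(s, a)],
                                      weights=[w for *_, w in p[(s, a)]])[0]
        g += discount * r
        discount *= gamma
        s = s_next
    return g

# Monte Carlo estimate of V_pi("s0"): average return over many rollouts.
v_s0 = sum(sample_return("s0") for _ in range(5000)) / 5000
print(v_s0)
```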

We can write the expectation \(\mathbb{E}_{\pi}[R_{t+1} | S_t = s]\) as a weighted sum over the distribution of rewards for that state:

\[\begin{equation}\mathbb{E}_\pi[R_{t+1} | S_t = s] = \sum_{r\in \textbf{R}}rp(r|s)\end{equation}\]

Important consideration from the definition of a Markov decision process:

Note that \(r \doteq r(s)\), which means we have assumed a deterministic reward and a fully observable Markov decision process, and \(p(s'|s,a) = p(s',r|s,a)\) because \(s'\) is associated with a single reward \(r(s')\).

hence:

\[\begin{equation}p(r|s)= \sum_{s'}\sum_ap(a,s',r|s) = \sum_{s'}\sum_a\pi(a|s)p(s',r|s,a)\end{equation}\]

Other key considerations: \(s'\) and \(s\) above are only used to show the significance of the state transition and its contribution to the distribution of rewards \(p(r|s)\); otherwise \(s, s' \in \textbf{S}\). The underlying definition of rewards says that \(p(r|s)\) is time independent for a given MDP, i.e. the reward structure should not change after the MDP is instantiated, and it is defined for every such \(s\). Another underlying assumption of the MDP is that action is central to the dynamics of the agent in its environment.

Putting this together in \(\mathbb{E}_\pi[R_{t+1} | S_t = s]\), we have:

\[\begin{equation}\begin{split}\mathbb{E}_\pi[R_{t+1} | S_t = s] &= \sum_{r\in \textbf{R}}r\sum_{s'}\sum_a\pi(a|s)p(s',r|s,a) \\ &=\sum_{r\in \textbf{R}}\sum_{s'}\sum_ar\pi(a|s)p(s',r|s,a)\end{split}\end{equation}\]
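
As a sanity check on \((5)\), the triple sum can be evaluated directly in code. The sketch below reuses the hypothetical \(\pi(a|s)\) and \(p(s',r|s,a)\) tables from the rollout example above, restricted to state `s0`; the numbers are invented purely to make the marginalization concrete.

```python
# A small numeric sketch of equation (5):
# E_pi[R_{t+1} | S_t = s] = sum_r sum_{s'} sum_a r * pi(a|s) * p(s', r | s, a).

# Hypothetical dynamics p(s', r | s, a): (s, a) -> list of (s', r, prob).
p = {
    ("s0", "a0"): [("s0", 0.0, 0.5), ("s1", 1.0, 0.5)],
    ("s0", "a1"): [("s1", 1.0, 1.0)],
}

# Hypothetical policy pi(a | s).
pi = {"s0": {"a0": 0.5, "a1": 0.5}}

def expected_immediate_reward(s):
    """Marginalize over actions, next states, and rewards, as in (5)."""
    total = 0.0
    for a, pi_a in pi[s].items():
        for s_next, r, prob in p[(s, a)]:
            total += r * pi_a * prob
    return total

# 0.5*(0.5*0.0 + 0.5*1.0) + 0.5*(1.0*1.0) = 0.75
print(expected_immediate_reward("s0"))
```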

Substituting this in \((1)\), we have:

\[\begin{equation}\begin{split}V_{\pi}(s) &= \sum_{r\in \textbf{R}}\sum_{s'}\sum_ar\pi(a|s)p(s',r|s,a) + \gamma\mathbb{E}_\pi[G_{t+1} | S_t = s]\end{split}\end{equation}\]

Similarly, for \(\mathbb{E}_\pi[G_{t+1} | S_t = s]\) we have:

\[\begin{equation}\begin{split}\mathbb{E}_\pi[G_{t+1} | S_t = s] &= \mathbb{E}_{\pi}[R_{t+2}+\gamma G_{t+2} | S_{t+1} = s'] \\ &=\mathbb{E}_{\pi}[R_{t+2} | S_{t+1} = s' ] + \gamma\mathbb{E}_\pi[G_{t+2} | S_{t+1} = s']\end{split}\end{equation}\]

Substituting \((7)\) in \((6)\), we have:

\[\begin{equation}\begin{split}V_{\pi}(s) &= \sum_{r\in \textbf{R}}\sum_{s' \in S}\sum_{a \in A}r\pi(a|s)p(s',r|s,a) + \gamma(\mathbb{E}_{\pi}[R_{t+2} | S_{t+1} = s' ] + \gamma\mathbb{E}_\pi[G_{t+2} | S_{t+1} = s']) \\ &= \sum_{r\in \textbf{R}}\sum_{s' \in S}\sum_{a \in A}r\pi(a|s)p(s',r|s,a) + \gamma\sum_{r'\in \textbf{R}}\sum_{s'' \in S}\sum_{a' \in A}r'\pi(a'|s')p(s'',r'|s',a') + \gamma^2\mathbb{E}_\pi[G_{t+2} | S_{t+1} = s']\end{split}\end{equation}\]

Reward is a function of state, but gain is a function of a trajectory/sequence of states.

The agent executes an action/interacts with the environment and only then gets a state back from the environment, so action is the first thing in the MDP process. This is why the summation over actions can be factored out as a common term:

\[\begin{equation}\begin{split} &=\sum_{a \in A}\pi(a|s)\sum_{r\in \textbf{R}}\sum_{s' \in S}rp(s',r|s,a) + \gamma\sum_{a' \in A}\pi(a'|s')\sum_{r'\in \textbf{R}}\sum_{s'' \in S}r'p(s'',r'|s',a') + \gamma^2\mathbb{E}_\pi[G_{t+2} | S_{t+1} = s'] \\ &= \sum_{a \in A}\pi(a|s)\sum_{r\in \textbf{R}}\sum_{s' \in S}p(s',r|s,a)(r + \gamma(\sum_{a' \in A}\pi(a'|s')\sum_{r'\in \textbf{R}}\sum_{s'' \in S}r'p(s'',r'|s',a') + \gamma\mathbb{E}_\pi[G_{t+2} | S_{t+1} = s'])) \\ &= \sum_{a \in A}\pi(a|s)\sum_{r\in \textbf{R}}\sum_{s' \in S}p(s',r|s,a)(r + \gamma(V_{\pi}(s'))) \end{split}\end{equation}\]

But because \(r \doteq r(s) = r(s')\) \([\)as reward is a consequence of a state (transition)\(]\), summing over rewards is equivalent to summing over next states; hence in \((5)\) and \((9)\) we can safely drop one summation and rewrite each of them as follows:

\[\begin{equation}\begin{split}\mathbb{E}_\pi[R_{t+1} | S_t = s] &=\sum_{s'}\sum_{a}r(s)\pi(a|s)p(s',r|s,a)\end{split}\end{equation}\]

and,

\[\begin{equation}\begin{split}V_{\pi}(s)&= \sum_{a \in A}\pi(a|s)\sum_{s' \in S}p(s',r|s,a)(r(s) + \gamma(V_{\pi}(s'))) \end{split}\end{equation}\]

Here \(p(s',r|s,a)\) is called the dynamics of the agent in the environment and \(\pi(a|s)\) is called the policy.
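
Equation \((9)\) is the Bellman expectation equation for \(V_\pi\), and applying it repeatedly as a backup gives iterative policy evaluation. The sketch below does exactly that on the same hypothetical two-state MDP used earlier; the tables are assumptions for illustration only, and the loop stops when the values stop changing. Up to sampling noise, the value it returns for `s0` should agree with the Monte Carlo estimate sketched above.

```python
# A sketch of iterative policy evaluation using the backup in (9):
# V(s) <- sum_a pi(a|s) sum_{s', r} p(s', r | s, a) * (r + gamma * V(s')).

gamma = 0.9
states = ["s0", "s1"]

# Hypothetical dynamics p(s', r | s, a): (s, a) -> list of (s', r, prob).
p = {
    ("s0", "a0"): [("s0", 0.0, 0.5), ("s1", 1.0, 0.5)],
    ("s0", "a1"): [("s1", 1.0, 1.0)],
    ("s1", "a0"): [("s0", 0.0, 1.0)],
    ("s1", "a1"): [("s1", 2.0, 1.0)],
}

# Hypothetical policy pi(a | s).
pi = {"s0": {"a0": 0.5, "a1": 0.5}, "s1": {"a0": 0.2, "a1": 0.8}}

def policy_evaluation(tol=1e-10):
    """Repeatedly apply the Bellman backup for V_pi until the values stop changing."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            backup = 0.0
            for a, pi_a in pi[s].items():
                for s_next, r, prob in p[(s, a)]:
                    backup += pi_a * prob * (r + gamma * V[s_next])
            delta = max(delta, abs(backup - V[s]))
            V[s] = backup
        if delta < tol:
            return V

print(policy_evaluation())   # fixed point of the Bellman expectation equation
```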