Bellman Backup State Value Equations Derivation

Understanding continuum through discretization

\[G_t = \gamma^0 R_{t+1} + \gamma^1 R_{t+2} + \dots + \gamma^{T-t-1} R_{T}\] \[G_t = \sum_{k = t+1}^T \gamma^{k-t-1} R_k\]

we can have either \(T = \infty\) or \(\gamma = 1\) but not both, otherwise the return above need not be finite.
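as a quick sanity check of the two expressions above, and of the recursion \(G_t = R_{t+1} + \gamma G_{t+1}\) used in \((1)\) below, here is a minimal Python sketch; the reward values and the function name are made up purely for illustration.

```python
# Minimal sketch: finite-horizon discounted return and its one-step recursion.
# The reward sequence and the helper name are illustrative, not from the derivation.

def discounted_return(rewards, gamma):
    """G_t = sum_{k=t+1}^{T} gamma^(k-t-1) R_k, with rewards = [R_{t+1}, ..., R_T]."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0, 3.0]   # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}
gamma = 0.9

G_t = discounted_return(rewards, gamma)          # return from time t
G_t1 = discounted_return(rewards[1:], gamma)     # return from time t+1

# the recursion behind equation (1): G_t = R_{t+1} + gamma * G_{t+1}
assert abs(G_t - (rewards[0] + gamma * G_t1)) < 1e-12
print(G_t)
```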

\[\begin{equation}\begin{split} V_{\pi}(s) &\doteq \mathbb{E}_{\pi}[G_t|S_t=s] \\ &= \mathbb{E}_{\pi}[R_{t+1} + \gamma G_{t+1} | S_t = s] \\ &= \mathbb{E}_\pi[R_{t+1} | S_t = s] + \gamma\mathbb{E}_\pi[G_{t+1} | S_t = s]\end{split}\end{equation}\]

and,

\[\begin{equation}\begin{split}V_{\pi}(s') &\doteq \mathbb{E}_{\pi}[G_{t+1}|S_{t+1}=s'] \end{split}\end{equation}\]

\((1)\) and \((2)\) are the same if \(s=s'\)
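to make the time shift explicit (the derivation implicitly assumes a stationary policy and time-homogeneous dynamics), the value of a state does not depend on the time index at which it is visited:

\[\mathbb{E}_{\pi}[G_{t+1} | S_{t+1}=s'] = \mathbb{E}_{\pi}[G_{t} | S_{t}=s'] = V_{\pi}(s')\]

so \((2)\) is simply \(V_{\pi}\) from \((1)\) evaluated at the successor state \(s'\).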

\[\begin{equation}\mathbb{E}_\pi[R_{t+1} | S_t = s] = \sum_{r\in \textbf{R}}rp(r|s)\end{equation}\]

Important consideration from the definition of a Markov decision process:

note that \(r \doteq r(s')\), which means we have assumed a deterministic reward and a fully observable Markov decision process; consequently \(p(s'|s,a) = p(s',r|s,a)\), because each successor state \(s'\) is associated with a single reward \(r(s') = r(s,a)\).

hence:

\[\begin{equation}p(r|s)= \sum_{s'}\sum_ap(a,s',r|s) = \sum_{s'}\sum_a\pi(a|s)p(s',r|s,a)\end{equation}\] \[\begin{equation}\begin{split}\mathbb{E}_\pi[R_{t+1} | S_t = s] &= \sum_{r\in \textbf{R}}r\sum_{s'}\sum_a\pi(a|s)p(s',r|s,a) \\ &=\sum_{r\in \textbf{R}}\sum_{s'}\sum_ar\pi(a|s)p(s',r|s,a)\end{split}\end{equation}\]
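equations \((3)\) to \((5)\) are just marginalisation under the policy; a minimal numerical sketch in Python/NumPy, using a made-up two-state, two-action, two-reward MDP (none of these numbers come from the text), confirms that both ways of writing \(\mathbb{E}_\pi[R_{t+1} | S_t = s]\) agree:

```python
import numpy as np

# Toy check of (3)-(5): marginalising the dynamics p(s',r|s,a) under a policy pi(a|s).
# The 2 states, 2 actions and 2 reward values below are invented for illustration.

rewards = np.array([0.0, 1.0])               # the reward set R
pi = np.array([[0.5, 0.5],                   # pi[s, a]
               [0.2, 0.8]])
# p[s, a, s', r]: joint next-state/reward distribution given (s, a)
p = np.random.dirichlet(np.ones(2 * 2), size=(2, 2)).reshape(2, 2, 2, 2)

s = 0
# equation (4): p(r|s) = sum_{s'} sum_a pi(a|s) p(s',r|s,a)
p_r_given_s = np.einsum("a,asr->r", pi[s], p[s])
# equation (3): E[R_{t+1}|S_t=s] = sum_r r p(r|s)
lhs = rewards @ p_r_given_s
# equation (5): E[R_{t+1}|S_t=s] = sum_r sum_{s'} sum_a r pi(a|s) p(s',r|s,a)
rhs = np.einsum("r,a,asr->", rewards, pi[s], p[s])
assert np.isclose(lhs, rhs)
print(lhs)
```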

substituting \((5)\) into \((1)\) we have:

\[\begin{equation}\begin{split}V_{\pi}(s) &= \sum_{r\in \textbf{R}}\sum_{s'}\sum_ar\pi(a|s)p(s',r|s,a) + \gamma\mathbb{E}_\pi[G_{t+1} | S_t = s]\end{split}\end{equation}\]

similarly, for \(\mathbb{E}_\pi[G_{t+1} | S_t = s]\), conditioning on the successor state \(S_{t+1} = s'\) we have:

\[\begin{equation}\begin{split}\mathbb{E}_\pi[G_{t+1} | S_t = s] &= \mathbb{E}_{\pi}[R_{t+2}+\gamma G_{t+2} | S_{t+1} = s'] \\ &=\mathbb{E}_{\pi}[R_{t+2} | S_{t+1} = s' ] + \gamma\mathbb{E}_\pi[G_{t+2} | S_{t+1} = s']\end{split}\end{equation}\]

substituting \((7)\) into \((6)\) we have:

\[\begin{equation}\begin{split}V_{\pi}(s) &= \sum_{r\in \textbf{R}}\sum_{s' \in S}\sum_{a \in A}r\pi(a|s)p(s',r|s,a) + \gamma(\mathbb{E}_{\pi}[R_{t+2} | S_{t+1} = s' ] + \gamma\mathbb{E}_\pi[G_{t+2} | S_{t+1} = s']) \\ &= \sum_{r\in \textbf{R}}\sum_{s' \in S}\sum_{a \in A}r\pi(a|s)p(s',r|s,a) + \gamma\sum_{r'\in \textbf{R}}\sum_{s'' \in S}\sum_{a' \in A}r'\pi(a'|s')p(s'',r'|s',a') + \gamma^2\mathbb{E}_\pi[G_{t+2} | S_{t+1} = s']\end{split}\end{equation}\]

\[\begin{equation}\begin{split} &=\sum_{a \in A}\pi(a|s)\sum_{r\in \textbf{R}}\sum_{s' \in S}rp(s',r|s,a) + \gamma\sum_{a' \in A}\pi(a'|s')\sum_{r'\in \textbf{R}}\sum_{s'' \in S}r'p(s'',r'|s',a') + \gamma^2\mathbb{E}_\pi[G_{t+2} | S_{t+1} = s'] \\ &= \sum_{a \in A}\pi(a|s)\sum_{r\in \textbf{R}}\sum_{s' \in S}p(s',r|s,a)\Big(r + \gamma\big(\sum_{a' \in A}\pi(a'|s')\sum_{r'\in \textbf{R}}\sum_{s'' \in S}r'p(s'',r'|s',a') + \gamma\mathbb{E}_\pi[G_{t+2} | S_{t+1} = s']\big)\Big) \\ &= \sum_{a \in A}\pi(a|s)\sum_{r\in \textbf{R}}\sum_{s' \in S}p(s',r|s,a)(r + \gamma V_{\pi}(s')) \end{split}\end{equation}\]

here the term in the inner parentheses is \(\mathbb{E}_{\pi}[R_{t+2} | S_{t+1} = s'] + \gamma\mathbb{E}_\pi[G_{t+2} | S_{t+1} = s'] = V_{\pi}(s')\) by \((2)\), so \((9)\) is the Bellman equation for \(V_{\pi}\).

under the deterministic-reward assumption \(r \doteq r(s')\), the sum over rewards in \((5)\) collapses to:

\[\begin{equation}\begin{split}\mathbb{E}_\pi[R_{t+1} | S_t = s] &=\sum_{s'}\sum_{a}r(s')\pi(a|s)p(s',r|s,a)\end{split}\end{equation}\]

and, applying the same collapse to \((9)\),

\[\begin{equation}\begin{split}V_{\pi}(s)&= \sum_{a \in A}\pi(a|s)\sum_{s' \in S}p(s',r|s,a)(r(s') + \gamma V_{\pi}(s')) \end{split}\end{equation}\]

Here \(p(s',r|s,a)\) is called the dynamics of the environment (the MDP) and \(\pi(a|s)\) is called the policy of the agent.
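The Bellman equation \((9)\) is exactly the update that tabular iterative policy evaluation applies: starting from an arbitrary \(V\), repeatedly replace \(V(s)\) by the right-hand side of \((9)\); for \(\gamma < 1\) the fixed point of this update is \(V_{\pi}\). Below is a minimal Python/NumPy sketch, assuming a tiny made-up MDP (two states, two actions, two reward values); none of the numbers or variable names come from the derivation above.

```python
import numpy as np

# Minimal sketch of iterative policy evaluation using equation (9):
#   V(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a) * (r + gamma * V(s'))
# The 2-state, 2-action MDP below is invented purely for illustration.

n_states, n_actions = 2, 2
rewards = np.array([0.0, 1.0])                    # the reward set R
pi = np.array([[0.5, 0.5],                        # pi[s, a]
               [0.9, 0.1]])
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(n_states * len(rewards)),
                  size=(n_states, n_actions)).reshape(
    n_states, n_actions, n_states, len(rewards))  # p[s, a, s', r]

gamma = 0.9
V = np.zeros(n_states)
for _ in range(1000):                             # sweep until (near) convergence
    V_new = np.empty_like(V)
    for s in range(n_states):
        # right-hand side of equation (9), with the sums written out explicitly
        V_new[s] = sum(
            pi[s, a] * p[s, a, s1, ri] * (rewards[ri] + gamma * V[s1])
            for a in range(n_actions)
            for s1 in range(n_states)
            for ri in range(len(rewards)))
    delta = np.max(np.abs(V_new - V))
    V = V_new
    if delta < 1e-10:
        break
print(V)                                          # approximate V_pi for the toy MDP
```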