Understanding proof of lemma 1 (policy improvement bound) of the "Trust Region Policy Optimization" paper


In the Trust Region Policy Optimization paper, in Lemma 1 of Appendix A, I did not quite understand the transition from (20) to (21). In going from (20) to (21), $A^\pi(s_t, a_t)$ is replaced by its definition, which is given at the very beginning of the proof as $\mathbb{E}_{s' \sim P(s'|s,a)}[r(s) + \gamma V_\pi(s') - V_\pi(s)]$. But after the substitution, I don't see the expectation over $s' \sim P(s'|s,a)$ appearing anywhere. It would be of great help if somebody could shed some light on this.

A Das

Posted 2019-11-21T22:38:18.797

Reputation: 131



Let's assume $\gamma = 1$ to simplify things. By linearity of expectation,
\begin{align}
\mathbb E_{\tau|\pi} \left[\sum_{t = 0}^{\infty} A_\pi(s_t, a_t)\right] &= \mathbb E_{\tau|\pi}[A_\pi(s_0, a_0) + \ldots + A_\pi(s_i, a_i) + \ldots]\\
&= \mathbb E_{a_0 \sim \pi,s_1 \sim P(s_1|s_0, a_0)}[A_\pi(s_0, a_0)] + \ldots + \mathbb E_{a_i \sim \pi,s_{i+1} \sim P(s_{i+1}|s_i, a_i)}[A_\pi(s_i, a_i)] + \ldots
\end{align}
where each term is written for a given $s_i$, and the expectation over the earlier part of the trajectory is left implicit. If we look only at the $i$-th timestep,
\begin{align}
\mathbb E_{a_i \sim \pi,s_{i+1} \sim P(s_{i+1}|s_i, a_i)}[A_\pi(s_i, a_i)] &= \sum_{a'} (\mathbb E_{s_{i+1} \sim P(s_{i+1}|s_i, a')}[A_\pi(s_i, a')]) \pi(a'|s_i)\\
&= \sum_{a'} (\mathbb E_{s_{i+1} \sim P(s_{i+1}|s_i, a')}[\mathbb E_{s_{i+1} \sim P(s_{i+1}|s_i, a')}[r(s_i) + V_\pi(s_{i+1}) - V_\pi(s_i)]]) \pi(a'|s_i)
\end{align}

The inner expectation is already averaged over $s_{i+1}$, so it no longer depends on $s_{i+1}$ and the outer expectation leaves it unchanged:
\begin{equation} \mathbb E[\mathbb E[f]] = \mathbb E[f] \end{equation}

Hence
\begin{align}
\mathbb E_{a_i \sim \pi,s_{i+1} \sim P(s_{i+1}|s_i, a_i)}[A_\pi(s_i, a_i)] &= \sum_{a'} (\mathbb E_{s_{i+1} \sim P(s_{i+1}|s_i, a')}[r(s_i) + V_\pi(s_{i+1}) - V_\pi(s_i)]) \pi(a'|s_i)\\
&= \mathbb E_{a_i \sim \pi,s_{i+1} \sim P(s_{i+1}|s_i, a_i)}[r(s_i) + V_\pi(s_{i+1}) - V_\pi(s_i)]
\end{align}

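Now sum this over all timesteps. With the discount restored, the $t$-th term carries a factor $\gamma^t$ and the next-state value a factor $\gamma$, which gives (21). So the expectation over $s'$ has not disappeared: it has been absorbed into the expectation over the trajectory $\tau$, which already samples $s_{t+1} \sim P(s_{t+1}|s_t, a_t)$.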
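If it helps to see this numerically, here is a minimal sketch on a made-up two-state, two-action MDP. All the numbers and names (`P`, `R`, `PI`, `PI_NEW`, `td_residual`, and so on) are assumptions chosen for illustration and are not from the paper. It computes $V_\pi$ by iterative policy evaluation and then compares the expected advantage, computed with the inner expectation over $s'$, against the expected one-step residual $r(s) + \gamma V_\pi(s') - V_\pi(s)$ averaged jointly over $(a, s')$.

```python
import random

GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]
R = [1.0, 0.0]  # state-only reward r(s), an arbitrary choice
# P[s][a][s2] = P(s2 | s, a); illustrative numbers, each row sums to 1
P = [
    [[0.8, 0.2], [0.1, 0.9]],
    [[0.5, 0.5], [0.3, 0.7]],
]
PI = [[0.6, 0.4], [0.2, 0.8]]      # pi(a|s): the policy that defines V and A
PI_NEW = [[0.1, 0.9], [0.7, 0.3]]  # a different policy that picks the actions


def evaluate(policy, sweeps=2000):
    """Iterative policy evaluation: V(s) = E[r(s) + gamma * V(s')]."""
    v = [0.0 for _ in STATES]
    for _ in range(sweeps):
        v = [
            sum(
                policy[s][a]
                * sum(P[s][a][s2] * (R[s] + GAMMA * v[s2]) for s2 in STATES)
                for a in ACTIONS
            )
            for s in STATES
        ]
    return v


V = evaluate(PI)


def td_residual(s, s2):
    """The quantity inside the expectation: r(s) + gamma V(s') - V(s)."""
    return R[s] + GAMMA * V[s2] - V[s]


def advantage(s, a):
    """A(s, a), defined with an explicit expectation over s'."""
    return sum(P[s][a][s2] * td_residual(s, s2) for s2 in STATES)


def expected_advantage(s, policy):
    """E_a[A(s, a)]: the expectation over s' is hidden inside A."""
    return sum(policy[s][a] * advantage(s, a) for a in ACTIONS)


def expected_td_residual(s, policy):
    """E_{a, s'}[r(s) + gamma V(s') - V(s)]: one joint expectation, no A."""
    return sum(
        policy[s][a] * P[s][a][s2] * td_residual(s, s2)
        for a in ACTIONS
        for s2 in STATES
    )


def sampled_td_residual(s, policy, n=50_000, seed=0):
    """Monte Carlo estimate of the same quantity from sampled (a, s')."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        a = rng.choices(ACTIONS, weights=policy[s])[0]
        s2 = rng.choices(STATES, weights=P[s][a])[0]
        total += td_residual(s, s2)
    return total / n


if __name__ == "__main__":
    for s in STATES:
        print(
            s,
            expected_advantage(s, PI_NEW),
            expected_td_residual(s, PI_NEW),
            sampled_td_residual(s, PI_NEW),
        )
```

The first two printed columns should agree up to floating-point rounding, and the sampled estimate should be close to them. That is the step from (20) to (21) for a single timestep: sampling $s'$ along the trajectory plays the role of the expectation inside $A_\pi$.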


Posted 2019-11-21T22:38:18.797

Reputation: 1 664