Why does the "reward to go" trick in policy gradient methods work?



In policy gradient methods, there's a trick to reduce the variance of the policy gradient estimate. We use causality and drop part of the sum over rewards, so that each action is weighted only by the rewards obtained from that time step onward (see http://rail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-5.pdf, slide 18).

Why does it work? I understand the intuitive explanation, but what's the rigorous proof of it? Can you point me to some papers?

Konstantin Solomatov

Posted 2018-12-20T01:00:04.310

Reputation: 258



An important thing we're going to need is what is called the "Expected Grad-Log-Prob Lemma" (OpenAI's Spinning Up documentation, where the lemma goes by this name, includes a proof), which says that (for any $t$):

$$\mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \right] = 0.$$
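For completeness, here is a sketch of why the lemma holds. It assumes a continuous action space and enough regularity to swap the gradient and the integral (for discrete actions, replace the integral with a sum). The symbol $p_{\theta}(s_t)$, the marginal distribution of $s_t$ under $\pi_{\theta}$, is notation introduced here and is not taken from the slides:

$$\begin{aligned}
\mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \right]
&= \mathbb{E}_{s_t \sim p_{\theta}(s_t)} \left[ \int \pi_{\theta}(a_t \mid s_t) \, \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \, \mathrm{d}a_t \right] \\
&= \mathbb{E}_{s_t \sim p_{\theta}(s_t)} \left[ \int \nabla_{\theta} \pi_{\theta}(a_t \mid s_t) \, \mathrm{d}a_t \right] \\
&= \mathbb{E}_{s_t \sim p_{\theta}(s_t)} \left[ \nabla_{\theta} \int \pi_{\theta}(a_t \mid s_t) \, \mathrm{d}a_t \right] \\
&= \mathbb{E}_{s_t \sim p_{\theta}(s_t)} \left[ \nabla_{\theta} 1 \right] = 0.
\end{aligned}$$

Note that the inner integral vanishes for every fixed $s_t$, not just on average over $s_t$.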

Taking the analytical expression of the gradient (from, for example, slide 9) as a starting point:

$$\begin{aligned}
\nabla_{\theta} J(\theta) &= \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \left( \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \right) \left( \sum_{t=1}^T r(s_t, a_t) \right) \right] \\
&= \sum_{t=1}^T \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \sum_{t'=1}^T r(s_{t'}, a_{t'}) \right] \\
&= \sum_{t=1}^T \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \sum_{t'=1}^{t-1} r(s_{t'}, a_{t'}) + \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \sum_{t'=t}^T r(s_{t'}, a_{t'}) \right] \\
&= \sum_{t=1}^T \Biggl( \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \sum_{t'=1}^{t-1} r(s_{t'}, a_{t'}) \right] \\
&\qquad\qquad + \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \sum_{t'=t}^T r(s_{t'}, a_{t'}) \right] \Biggr)
\end{aligned}$$

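At the $t^{th}$ "iteration" of the outer sum, the sum of past rewards $\sum_{t'=1}^{t-1} r(s_{t'}, a_{t'})$ is fully determined by the part of the trajectory that precedes $a_t$, and, by the Markov property, $a_t$ depends on that history only through $s_t$. Conditioned on the history up to $s_t$, the sum therefore behaves like a constant, which (informally) means we're allowed to pull that expression out of the expectation: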

$$\begin{aligned}
\nabla_{\theta} J(\theta) = \sum_{t=1}^T \Biggl( & \sum_{t'=1}^{t-1} r(s_{t'}, a_{t'}) \, \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \right] \\
& + \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \sum_{t'=t}^T r(s_{t'}, a_{t'}) \right] \Biggr)
\end{aligned}$$

The first expectation can now be replaced by $0$ due to the lemma mentioned at the top of the post:

$$\begin{aligned}
\nabla_{\theta} J(\theta) &= \sum_{t=1}^T \left( \sum_{t'=1}^{t-1} r(s_{t'}, a_{t'}) \times 0 + \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \sum_{t'=t}^T r(s_{t'}, a_{t'}) \right] \right) \\
&= \sum_{t=1}^T \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \sum_{t'=t}^T r(s_{t'}, a_{t'}) \right] \\
&= \mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \left( \sum_{t'=t}^T r(s_{t'}, a_{t'}) \right) \right].
\end{aligned}$$

The expression on slide 18 of the linked slides is an unbiased, sample-based estimator of this gradient:

$$\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \sum_{t=1}^T \nabla_{\theta} \log \pi_{\theta} (a_{i, t} \mid s_{i, t}) \left( \sum_{t'=t}^T r(s_{i, t'}, a_{i, t'}) \right)$$
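To see the effect numerically, here is a minimal NumPy sketch that compares the full-return estimator with the reward-to-go estimator. The setup is entirely hypothetical and chosen only for illustration: a single state, two actions, a softmax policy with equal logits, and an assumed reward table `[1.0, 2.0]`. The function names `reward_to_go` and `gradient_samples` are made up for this example.

```python
import numpy as np


def reward_to_go(rewards):
    """Return sum_{t' >= t} r_{t'} for every t along the last axis."""
    # Reverse time, take a running sum, then reverse back.
    return np.cumsum(rewards[..., ::-1], axis=-1)[..., ::-1]


def gradient_samples(n_traj=20000, horizon=5, seed=0):
    """Per-trajectory gradient samples for a toy one-state, two-action task."""
    rng = np.random.default_rng(seed)
    probs = np.array([0.5, 0.5])         # softmax policy with equal logits (assumed)
    reward_table = np.array([1.0, 2.0])  # r(a), assumed values for illustration

    actions = rng.choice(2, size=(n_traj, horizon), p=probs)  # a_{i,t}
    rewards = reward_table[actions]                           # r(a_{i,t})

    # For a softmax policy over logits theta:
    # grad_theta log pi(a) = one_hot(a) - probs.
    grad_logp = np.eye(2)[actions] - probs                    # shape (N, T, 2)

    # Full-return estimator: every action is weighted by the whole return.
    total = rewards.sum(axis=1, keepdims=True)                # shape (N, 1)
    g_full = (grad_logp * total[..., None]).sum(axis=1)       # shape (N, 2)

    # Reward-to-go estimator: action t is weighted by rewards from t onward.
    rtg = reward_to_go(rewards)                               # shape (N, T)
    g_rtg = (grad_logp * rtg[..., None]).sum(axis=1)          # shape (N, 2)
    return g_full, g_rtg


g_full, g_rtg = gradient_samples()
print("mean (full return):     ", g_full.mean(axis=0))
print("mean (reward to go):    ", g_rtg.mean(axis=0))
print("variance (full return): ", g_full.var(axis=0))
print("variance (reward to go):", g_rtg.var(axis=0))
```

Under these assumptions the two sample means should agree up to sampling noise, since both estimators are unbiased, while the reward-to-go samples should show a clearly smaller variance. This is only a toy check of the derivation, not a general proof of variance reduction.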

For a more formal treatment of the claim that we can pull $\sum_{t'=1}^{t-1} r(s_{t'}, a_{t'})$ out of an expectation due to the Markov property, see this page: https://spinningup.openai.com/en/latest/spinningup/extra_pg_proof1.html
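A sketch of that more formal argument, in the notation of this answer: write $h_t = (s_1, a_1, \dots, s_{t-1}, a_{t-1}, s_t)$ for the history up to and including $s_t$ (a symbol introduced here for illustration). The past rewards are a function of $h_t$, and $a_t$ is drawn from $\pi_{\theta}(\cdot \mid s_t)$ regardless of the rest of $h_t$, so the tower rule gives:

$$\begin{aligned}
\mathbb{E}_{\tau \sim \pi_{\theta}(\tau)} \left[ \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \sum_{t'=1}^{t-1} r(s_{t'}, a_{t'}) \right]
&= \mathbb{E}_{h_t} \left[ \sum_{t'=1}^{t-1} r(s_{t'}, a_{t'}) \; \mathbb{E}_{a_t \sim \pi_{\theta}(\cdot \mid s_t)} \left[ \nabla_{\theta} \log \pi_{\theta} (a_t \mid s_t) \right] \right] \\
&= \mathbb{E}_{h_t} \left[ \sum_{t'=1}^{t-1} r(s_{t'}, a_{t'}) \times 0 \right] = 0.
\end{aligned}$$

The inner expectation is zero for every fixed $s_t$ by the same argument that proves the Expected Grad-Log-Prob Lemma, so the past rewards never actually need to leave the outer expectation.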

Dennis Soemers

Posted 2018-12-20T01:00:04.310

Reputation: 7 644