I am following OpenAI's Spinning Up tutorial, Part 3: Intro to Policy Optimization. It mentions that using the reward-to-go reduces the variance of the policy gradient estimate. While I understand the intuition behind this, I am struggling to find a proof in the literature.
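To make the claim concrete, here is a small empirical check on a toy problem. It is a sketch under stated assumptions, not a proof, and it is not taken from the Spinning Up tutorial. The setup is hypothetical: a stateless episode of `T` steps, a Bernoulli policy with a single logit parameter so that the score is `a - p`, and a reward equal to the action taken. The function name `compare_estimators` and all parameter names are made up for illustration.

```python
import numpy as np


def compare_estimators(T=10, p=0.5, n_episodes=200_000, seed=0):
    """Compare full-return and reward-to-go policy gradient estimates.

    Toy assumptions (not from the tutorial): no state, actions are
    Bernoulli(p) with p = sigmoid(theta), and reward r_t = a_t.
    """
    rng = np.random.default_rng(seed)

    # Sample actions for every episode and time step.
    actions = (rng.random((n_episodes, T)) < p).astype(float)
    rewards = actions  # assumed toy reward: 1 for action 1, else 0

    # d/dtheta log pi(a_t) for a Bernoulli policy parameterised by a logit.
    score = actions - p

    # Full return R(tau): the same weight for every time step.
    full_return = rewards.sum(axis=1, keepdims=True)

    # Reward-to-go: sum of rewards from step t to the end of the episode.
    reward_to_go = np.cumsum(rewards[:, ::-1], axis=1)[:, ::-1]

    # One gradient sample per episode for each estimator.
    g_full = (score * full_return).sum(axis=1)
    g_rtg = (score * reward_to_go).sum(axis=1)

    return {
        "mean_full": g_full.mean(),
        "mean_rtg": g_rtg.mean(),
        "var_full": g_full.var(),
        "var_rtg": g_rtg.var(),
        "true_grad": T * p * (1.0 - p),  # d/dtheta of E[return] = T * p
    }


if __name__ == "__main__":
    for name, value in compare_estimators().items():
        print(f"{name}: {value:.3f}")
```

In this toy setting both sample means should land near the analytic gradient `T * p * (1 - p)`, while the reward-to-go samples show a noticeably smaller variance. That only illustrates the behaviour for one example; it says nothing about the general case, which is what the question asks about.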

Does the answer to this question answer yours as well? – user5093249 – 2020-06-10T13:55:33.057

No, the linked question only proves that the reward-to-go does not introduce any bias into the gradient estimate. – sirKris van Dela – 2020-06-10T14:14:26.923

This is nontrivial to prove; in fact, anything involving stochastic function approximation is nontrivial. You can search the research papers, but you won't find it in any book right now. – FourierFlux – 2020-06-10T14:33:49.653