What is the proof that the variance of the gradient estimate in Actor-Critic is smaller than in REINFORCE?



When actor-critic algorithms are introduced, the usual intuition is that their gradient estimates have lower variance than those of REINFORCE, as discussed, e.g., here. This intuition makes sense for the reasons outlined in the linked lecture.
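To make the intuition concrete, here is a minimal numerical sketch. It assumes a hypothetical single-step bandit with a softmax policy over three actions, Gaussian reward noise, and an *exact* critic (Q(a) equals the true mean reward `mean_rewards[a]`). All names (`sample_estimates`, `total_variance`, etc.) are made up for illustration. It compares the REINFORCE estimate `grad log pi(a) * return` with the Q actor-critic estimate `grad log pi(a) * Q(a)` on the same sampled actions. It only illustrates the idealized case and is not a proof; in particular, it says nothing about a learned, biased critic.

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_softmax(probs, action):
    # d/d theta_k log pi(action) = 1[k == action] - pi(k) for a softmax policy.
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]

def sample_estimates(theta, mean_rewards, noise_std, n, rng):
    # Draw n one-step episodes; return per-sample gradient estimates
    # for REINFORCE (sampled return) and Q actor-critic (exact critic, assumed).
    probs = softmax(theta)
    actions = list(range(len(theta)))
    reinforce, q_ac = [], []
    for _ in range(n):
        a = rng.choices(actions, weights=probs)[0]
        g = grad_log_softmax(probs, a)
        ret = mean_rewards[a] + rng.gauss(0.0, noise_std)  # noisy sampled return
        q = mean_rewards[a]  # hypothetical exact critic: Q(a) = E[return | a]
        reinforce.append([gi * ret for gi in g])
        q_ac.append([gi * q for gi in g])
    return reinforce, q_ac

def total_variance(samples):
    # Sum of per-component sample variances (trace of the covariance matrix).
    n = len(samples)
    total = 0.0
    for k in range(len(samples[0])):
        col = [s[k] for s in samples]
        mean = sum(col) / n
        total += sum((x - mean) ** 2 for x in col) / (n - 1)
    return total

rng = random.Random(0)
theta = [0.0, 0.5, -0.5]
mean_rewards = [1.0, 2.0, 0.0]
reinforce, q_ac = sample_estimates(theta, mean_rewards, noise_std=1.0, n=20000, rng=rng)
print("REINFORCE total variance:", total_variance(reinforce))
print("Q actor-critic total variance:", total_variance(q_ac))
```

In this toy setting both estimators are unbiased. Because the return equals Q(a) plus zero-mean noise of standard deviation sigma, the REINFORCE total variance exceeds the Q actor-critic one by exactly sigma^2 * E[||grad log pi(a)||^2], so the printed REINFORCE value should come out larger.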

Is there a paper or lecture that gives a formal proof of this claim for any type of actor-critic algorithm (e.g., Q Actor-Critic)?
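For reference, the two per-time-step estimators being compared can be written as follows. The notation is assumed here, not taken from the linked lecture: $G_t$ is the sampled return from step $t$, and $Q_w$ is the critic. Under the idealized assumption that the critic is exact ($Q_w = Q^{\pi}$), the law of total variance gives a per-term inequality. This covers only a single time step, not the full sum over a trajectory, and it does not cover learned (biased) critics.

```latex
\hat g^{\mathrm{RF}}_t = \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t,
\qquad
\hat g^{\mathrm{QAC}}_t = \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q_w(s_t, a_t)

% If Q_w = Q^{\pi}, then E[G_t | s_t, a_t] = Q^{\pi}(s_t, a_t), so
\mathbb{E}\!\left[\hat g^{\mathrm{RF}}_t \,\middle|\, s_t, a_t\right] = \hat g^{\mathrm{QAC}}_t

% and the law of total variance (applied componentwise) gives
\operatorname{Var}\!\left[\hat g^{\mathrm{RF}}_t\right]
= \mathbb{E}\!\left[\operatorname{Var}\!\left(\hat g^{\mathrm{RF}}_t \mid s_t, a_t\right)\right]
+ \operatorname{Var}\!\left[\hat g^{\mathrm{QAC}}_t\right]
\;\ge\; \operatorname{Var}\!\left[\hat g^{\mathrm{QAC}}_t\right]
```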


Posted 2020-06-28T22:51:35.317
