The intuition usually given when introducing actor-critic algorithms is that their gradient estimates have lower variance than those of REINFORCE, as discussed, e.g., here. This intuition makes sense for the reasons outlined in the linked lecture.
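
To make the claim concrete, here is a sketch of the two estimators being compared. The notation is my own assumption and is not taken from the linked lecture: $\pi_\theta$ is the policy, $G_t$ the sampled discounted return from step $t$, and $Q_w$ a learned critic.

```latex
% REINFORCE: weight the score function by the sampled Monte Carlo return G_t
\hat{g}_{\text{REINFORCE}}
  = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t,
\qquad
G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t}\, r_k

% Q Actor-Critic: replace G_t by the critic's estimate Q_w(s_t, a_t)
\hat{g}_{\text{QAC}}
  = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q_w(s_t, a_t)

% The claim in question, stated informally:
% the variance of \hat{g}_{\text{QAC}} is smaller than that of \hat{g}_{\text{REINFORCE}}
```

What I am after is a precise statement of that last line (in which sense the variance is compared, and under which assumptions on $Q_w$) together with a proof.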

Is there a paper or lecture that gives a formal proof of this claim for at least one type of actor-critic algorithm (e.g. the Q Actor-Critic)?