Why can't Policy Gradient Algorithm be seen as an Actor-Critic Method?


In the derivation of the policy gradient algorithm (e.g., REINFORCE), we are actually maximizing the expectation of the total reward.


Can't it be seen as an Actor-Critic method, since we are using $V(s)$ as a Critic to guide the update of the Actor $\pi$? (Here we have already introduced an approximation.) $$\nabla \overline{R}_\theta \approx \frac{1}{N} \sum_{n=1}^N R(\tau^{(n)}) \nabla_\theta \log p_\theta(\tau^{(n)})$$ If not, what is the precise definition of Actor and Critic in an Actor-Critic algorithm?
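For reference, this estimator follows from applying the log-derivative trick to the expected return and then replacing the expectation with a Monte-Carlo average over $N$ sampled trajectories:

$$\nabla_\theta \overline{R}_\theta = \nabla_\theta \, \mathbb{E}_{\tau \sim p_\theta}\left[R(\tau)\right] = \mathbb{E}_{\tau \sim p_\theta}\left[R(\tau)\, \nabla_\theta \log p_\theta(\tau)\right] \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^{(n)})\, \nabla_\theta \log p_\theta(\tau^{(n)})$$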


Posted 2019-07-22T09:54:45.860

Reputation: 11



In RL we have:

  • Actor-only methods such as REINFORCE, in which the output is a probability distribution over actions. REINFORCE is a policy gradient method but does not use a critic.
  • Critic-only methods such as Q-learning, in which the output is the expected return for every available action ($Q(s,a)$ $\forall a \in A$).
  • Actor-Critic methods, which involve both an Actor and a Critic, for example the popular DDPG and A3C algorithms. Both are policy gradient methods. Reading those papers will give you a sense of why plain REINFORCE suffers from high variance in its gradient estimates and how a critic can reduce it.
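To make the variance point concrete, here is a minimal sketch on a toy one-state, two-action bandit (all names and constants are illustrative, not from any of the papers above). It compares the plain REINFORCE estimator with one that subtracts a critic-style baseline $b = V(s)$; both are unbiased, but the baseline version has much lower variance:

```python
import math
import random

random.seed(0)

theta = [0.0, 0.0]        # actor: logits for two actions in a single state
true_reward = [1.0, 0.0]  # action 0 is the better one

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(probs, a):
    # d/d theta_i of log pi(a|s) for a softmax policy: 1{i == a} - pi(i)
    return [(1.0 if i == a else 0.0) - probs[i] for i in range(len(probs))]

def pg_estimate(n_samples, baseline):
    # REINFORCE with baseline b: (1/N) * sum_n (R_n - b) * grad log pi(a_n)
    probs = softmax(theta)
    g = [0.0, 0.0]
    for _ in range(n_samples):
        a = random.choices(range(2), weights=probs)[0]
        r = true_reward[a]
        for i, gl in enumerate(grad_log_pi(probs, a)):
            g[i] += (r - baseline) * gl / n_samples
    return g

def estimator_variance(n_runs, n_samples, baseline):
    # variance of the first gradient component across repeated estimates
    est = [pg_estimate(n_samples, baseline)[0] for _ in range(n_runs)]
    mean = sum(est) / n_runs
    return sum((e - mean) ** 2 for e in est) / n_runs

# b = 0 is plain REINFORCE; b = 0.5 equals V(s) under the initial uniform policy.
var_plain = estimator_variance(200, 50, baseline=0.0)
var_baseline = estimator_variance(200, 50, baseline=0.5)
print(var_plain, var_baseline)  # the baseline drastically reduces variance
```

In this toy problem the baseline makes the per-sample term nearly constant, so the variance collapses; in real environments the reduction is less extreme but the mechanism is the same.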

Policy Gradient methods are based on the Policy Gradient theorem. A standard implementation is an Actor-Critic algorithm, which uses both an Actor (a probability distribution over actions) and a Critic (a value function) to trade off bias and variance in the gradient estimates.
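To illustrate that trade-off, a minimal one-step actor-critic loop on the same kind of toy two-action bandit might look like this (a sketch under my own assumptions; the names, learning rates, and reward values are illustrative, not taken from DDPG or A3C):

```python
import math
import random

random.seed(1)

theta = [0.0, 0.0]   # actor parameters: logits for two actions
v = 0.0              # critic: estimate of V(s) for the single state
alpha, beta = 0.1, 0.1  # actor and critic learning rates

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

for _ in range(2000):
    probs = softmax(theta)
    a = random.choices(range(2), weights=probs)[0]
    r = 1.0 if a == 0 else 0.0      # action 0 pays off
    td_error = r - v                # one-step episodic TD error (target is r)
    v += beta * td_error            # critic update toward the observed return
    for i in range(2):              # actor update, weighted by the TD error
        theta[i] += alpha * td_error * ((1.0 if i == a else 0.0) - probs[i])

probs = softmax(theta)
print(probs)  # the policy should now strongly prefer action 0
```

The critic's TD error plays the role that the full sampled return $R(\tau)$ plays in REINFORCE: it is a lower-variance (but biased, while the critic is still learning) learning signal for the actor.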


Posted 2019-07-22T09:54:45.860

Reputation: 1 531