## Why can't Policy Gradient Algorithm be seen as an Actor-Critic Method?


In the derivation of the policy gradient algorithm (e.g., REINFORCE), we work with the expected total reward, which we try to maximize:

$$\overline{R_\theta}=E_{\tau\sim\pi_\theta}[R(\tau)]$$

Can't this be seen as an Actor-Critic method, since we are using $V(s)$ as a critic to guide the update of the actor $\pi$? (Here we have already introduced the sample approximation:) $$\nabla \overline{R_\theta} \approx \frac{1}{N}\sum_{n=1}^N R(\tau^{(n)}) \nabla \log p_\theta(\tau^{(n)})$$ If not, what is the precise definition of the Actor and the Critic in an Actor-Critic algorithm?
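To make the estimator concrete, here is a minimal sketch of the Monte Carlo policy gradient on a toy problem. The two-armed bandit, the learning rate, and names like `reinforce_gradient` are illustrative assumptions, not part of the question; the point is only that the update uses sampled returns $R(\tau^{(n)})$ directly, with no learned value function anywhere.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-armed bandit: action 0 gives reward 0, action 1 gives reward 1.
# A "trajectory" here is a single action, so R(tau) is just that action's reward.
rewards = np.array([0.0, 1.0])

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def reinforce_gradient(theta, n_samples=100):
    """Monte Carlo estimate of grad E[R] = (1/N) sum_n R(tau_n) grad log p_theta(tau_n)."""
    probs = softmax(theta)
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        a = rng.choice(2, p=probs)       # sample a trajectory (one action)
        grad_log_p = -probs.copy()       # grad log pi(a) for softmax: one_hot(a) - probs
        grad_log_p[a] += 1.0
        grad += rewards[a] * grad_log_p  # weight the score by the sampled return
    return grad / n_samples

theta = np.zeros(2)
for _ in range(200):
    theta += 0.5 * reinforce_gradient(theta)

print(softmax(theta))  # probability mass shifts toward the rewarding action
```

Note that the weight on each score term is the raw sampled return, not an estimate $V(s)$ produced by a second learned model; that distinction is what the question is probing.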

• Critic-only methods, such as Q-learning, in which the output is the expected reward for every available action ($Q(s,a)\ \forall a\in A$)