In actor-critic, the loss is made up of two terms: an

*actor* loss (parameterized by $\theta$)

$$\log[\pi_\theta(s_t,a_t)]\,Q_w(s_t,a_t)$$

and a *critic* loss (parameterized by $w$)

$$r(s_t,a_t) + \gamma Q_w(s_{t+1}, a_{t+1}) - Q_w(s_{t}, a_t).$$
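To make the two expressions concrete, here is a minimal sketch of how I read them for a single transition. It assumes a tabular critic `q[s][a]` and a softmax policy over per-state logits; the function names, the tables, and the numbers are my own illustration and do not come from any library.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one state's action logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def actor_term(logits, q, s, a):
    # log pi_theta(s_t, a_t) * Q_w(s_t, a_t), with pi_theta a softmax (assumption).
    pi = softmax(logits[s])
    return math.log(pi[a]) * q[s][a]

def critic_td_error(q, s, a, r, s_next, a_next, gamma):
    # r(s_t, a_t) + gamma * Q_w(s_{t+1}, a_{t+1}) - Q_w(s_t, a_t)
    return r + gamma * q[s_next][a_next] - q[s][a]

# Toy tables: 2 states, 2 actions (made-up numbers).
logits = [[0.0, 0.0], [0.0, 0.0]]   # uniform policy
q = [[1.0, 2.0], [3.0, 4.0]]

print(actor_term(logits, q, s=0, a=1))
print(critic_td_error(q, s=0, a=1, r=1.0, s_next=1, a_next=0, gamma=0.5))  # 0.5
```

Note that the critic expression needs an actual $a_{t+1}$, which is the part I am unsure how to obtain.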

This is the bootstrapped loss used with experience replay (the DQN loss):

$$ L_i(\theta_i) = \mathbb{E}_{(s, a, r, s') \sim U(D)} \left[ \left(r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i)\right)^2 \right] $$

The bootstrapped loss is clearly comparable to the critic loss, except that the critic lacks the $\max$ operation.
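As a sanity check on that comparison, here is a small sketch of the two bootstrapped targets side by side. It assumes a tabular `q[s][a]` with made-up values; `q_learning_target` and `sampled_action_target` are names I chose for illustration.

```python
def q_learning_target(q, r, s_next, gamma):
    # Target with the max over next actions, as in the replay loss.
    return r + gamma * max(q[s_next])

def sampled_action_target(q, r, s_next, a_next, gamma):
    # Target that uses the value of one specific next action,
    # as in the critic expression (no max).
    return r + gamma * q[s_next][a_next]

q = [[1.0, 2.0], [3.0, 4.0]]  # toy table: 2 states, 2 actions

print(q_learning_target(q, r=1.0, s_next=1, gamma=0.5))                # 3.0
print(sampled_action_target(q, r=1.0, s_next=1, a_next=0, gamma=0.5))  # 2.5
```

The two targets agree only when the chosen $a_{t+1}$ happens to be the greedy action in $s_{t+1}$.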

**As I see it (correct me if I'm wrong):**

$Q(s_t,a_t) = V(s_{t+1}) + r_t$, where $a_t$ is the action that was actually taken.

The critic, as I understand it, estimates $V(s)$.

**My question:**

**What exactly is the critic calculating?**

**What in actor-critic outputs $Q(s_{t+1},a_{t+1})$?**

It seems to me that the critic calculates the average value of the next state $s_{t+1}$ over all possible actions, weighted by their probabilities, yielding

$Q(s_t, a_t) = r_t + \sum_{a_{t+1} \in A}P(a_{t+1}|s_t)V(s_{t+1})$

This would mean that, to get $Q(s_{t+1}, a_{t+1})$ for the above formula, I would need to calculate

$Q(s_{t+1}, a_{t+1}) = r_{t+1} + \sum_{a_{t+2} \in A}P(a_{t+2}|s_{t+1})V(s_{t+2})$

where $V(s_{t+2})$ is the critic output on $s_{t+2}$, a state reached by taking action $a_{t+1}$ from state $s_{t+1}$. However, I am not sure that this is indeed the meaning of the critic output, and it is still unclear to me how to get $Q(s_{t+1}, a_{t+1})$ from actor-critic.

If indeed that is what's being calculated, then why is it mathematically true that an improvement is being made? Or why does it make sense (even if not mathematically always true)?
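For reference, these are the standard textbook identities relating $Q$ and $V$ under a fixed policy $\pi$, which I am trying to reconcile my formulas with. The discount $\gamma$ in the second identity and the transition distribution $P(\cdot \mid s_t, a_t)$ are my additions to the notation used above:

$$V^\pi(s_t) = \sum_{a_t \in A} \pi(a_t \mid s_t)\, Q^\pi(s_t, a_t)$$

$$Q^\pi(s_t, a_t) = \mathbb{E}_{s_{t+1} \sim P(\cdot \mid s_t, a_t)}\left[ r(s_t, a_t) + \gamma V^\pi(s_{t+1}) \right]$$

If these apply here, then sampling $a_{t+1} \sim \pi(\cdot \mid s_{t+1})$ and evaluating $Q_w(s_{t+1}, a_{t+1})$ would give a one-sample estimate of $V^\pi(s_{t+1})$, but I am not sure that this is how the critic is meant to be read.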

Practical use:

I want to use actor-critic with experience replay in an environment with a large (possibly continuous) action space. Therefore, I cannot use the $\max$ term. I need to understand the correct equation for the critic loss, and why it works.
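To show what I have in mind, here is a sketch of one option: replace the $\max$ with the critic's value at a next action sampled from the current policy, and apply a semi-gradient update on replayed transitions. I am not certain this is the correct loss. It assumes a linear critic $Q_w(s,a) = w^\top \phi(s,a)$ and scalar states and actions; `features`, `sample_action`, and `critic_replay_step` are hypothetical names of my own.

```python
import random

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def critic_replay_step(weights, buffer, features, sample_action,
                       gamma=0.99, lr=0.01, batch_size=32, rng=random):
    """One pass of semi-gradient TD updates on a uniformly sampled batch.

    buffer holds (s, a, r, s_next, done) tuples; the next action is NOT
    stored, it is drawn from the current policy via sample_action(s_next).
    """
    batch = rng.sample(buffer, min(batch_size, len(buffer)))
    new_w = list(weights)
    for s, a, r, s_next, done in batch:
        phi = features(s, a)
        q_sa = dot(new_w, phi)
        if done:
            target = r  # no bootstrapping past a terminal state
        else:
            a_next = sample_action(s_next)  # stands in for the max over a'
            target = r + gamma * dot(new_w, features(s_next, a_next))
        td_error = target - q_sa
        # Semi-gradient step: the target is treated as a constant.
        new_w = [w + lr * td_error * p for w, p in zip(new_w, phi)]
    return new_w

# Toy usage with a made-up Gaussian policy and made-up transitions.
features = lambda s, a: [s, a, 1.0]
policy = lambda s: random.gauss(0.5 * s, 0.1)
replay = [(1.0, 0.4, 1.0, 0.5, False), (0.5, 0.2, 0.0, 0.0, True)]
w = critic_replay_step([0.0, 0.0, 0.0], replay, features, policy, batch_size=2)
```

Nothing here enumerates the action space, so the same step would apply to continuous actions as long as the policy can be sampled.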

Your "As I see it (correct me if I'm wrong)" part is not very clear. Can you be more precise with your notation? For example, $Q(s, a)$ does not appear anywhere in the previous equations. Also, why are you trying to look at $Q(s, a)$ as $Q(s)$? I've never seen $Q(s)$, only $V(s)$. – nbro – 2019-02-06T15:48:43.263

Furthermore, what is the relation between your "As I see it (correct me if I'm wrong)" part and your actual question, "What exactly is the critic calculating?" And what does experience replay have to do with this? I think you should ask just one question per post; otherwise it is even more confusing. – nbro – 2019-02-06T15:51:15.820

@nbro I made a major edit. Please check again to see if it is clearer now. Also, I am not confident enough about any of my understanding, so even asking the correct questions is difficult. – Gulzar – 2019-02-06T16:46:11.023