Why does the distribution of states not depend on the policy parameters that induce it?


I came across the following proof of what's commonly referred to as the log-derivative trick in policy-gradient algorithms, and I have a question:

[Image: slide showing a two-line proof of the log-derivative trick for the policy gradient]

While transitioning from the first line to the second, the gradient with respect to the policy parameters $\theta$ was pushed into the summation. What bothers me is how it skipped over $\mu(s)$, the distribution of states, which, as I understand it, is induced by the policy $\pi_\theta$ itself! Why then does it not depend on $\theta$?
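For concreteness, here is a toy sanity check (my own construction, not from the slides; all MDP numbers are invented) showing that the stationary state distribution really does move when $\theta$ moves:

```python
import numpy as np

# Hypothetical 2-state, 2-action chain: perturbing the policy logits theta
# changes the stationary state distribution mu induced by the policy.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']: transition probs
              [[0.5, 0.5], [0.3, 0.7]]])

def mu_of(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    pi = e / e.sum(axis=1, keepdims=True)         # softmax policy pi(a|s)
    P_pi = np.einsum("sa,sab->sb", pi, P)         # state transitions under pi
    mu = np.full(2, 0.5)
    for _ in range(10_000):                       # power iteration
        mu = mu @ P_pi
    return mu

theta = np.zeros((2, 2))
bumped = theta.copy(); bumped[0, 0] += 0.5
print(mu_of(theta), mu_of(bumped))                # the two mu's differ
```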

Let me know what's going wrong! Thank you!


Posted 2020-08-27T10:36:32.770

Reputation: 602

From a mathematical point of view, if the expression for the expected reward is correct, then that is the correct thing to do. As written, $\mu$ is a function of the state $s$ only, so a derivative w.r.t. $\theta$ does not affect it. If what you are claiming is true, then $\mu$ would also be a function of $\theta$. So you just need to check whether the expression for the expected reward is correct or not. – DuttaA – 2020-08-27T11:20:38.093



The proof given in the above post is not wrong; it just skips some of the steps and writes the final answer directly. Let me go through those steps:

I will simplify a few things to avoid complication, but the generality remains the same. For instance, I will treat the reward as depending only on the current state, $s$, and current action, $a$. So, $r = r(s,a)$.

First, we define the average reward as: $$r(\pi) = \sum_s \mu(s)\sum_a \pi(a|s)\sum_{s^{\prime}} P_{ss'}^{a} r $$ We can simplify the average reward further as: $$r(\pi) = \sum_s \mu(s)\sum_a \pi(a|s)r(s,a) $$ My notation may differ slightly from the aforementioned slides, since I'm following Sutton's book on RL. Our objective function is: $$ J(\theta) = r(\pi) $$ We want to prove that: $$ \nabla_{\theta} J(\theta) = \nabla_{\theta}r(\pi) = \sum_s \mu(s) \sum_a \nabla_{\theta}\pi(a|s) Q(s,a)$$
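(Not part of the original answer.) This definition can be evaluated exactly on a small made-up MDP; everything below (the transition tensor, rewards, and the helper names) is invented for illustration, with $\mu$ obtained by power iteration on the policy-induced state-transition matrix:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP used to evaluate r(pi) exactly.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']: transition probs
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],                 # R[s, a]: expected reward r(s, a)
              [0.0, 2.0]])

def softmax_policy(theta):
    """pi[s, a] = softmax over actions of the logits theta[s, :]."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def stationary_dist(pi):
    """mu solving mu = mu @ P_pi, where P_pi[s, s'] = sum_a pi[s,a] P[s,a,s']."""
    P_pi = np.einsum("sa,sab->sb", pi, P)
    mu = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(10_000):                # power iteration
        mu = mu @ P_pi
    return mu

def average_reward(theta):
    """r(pi) = sum_s mu(s) sum_a pi(a|s) r(s, a)."""
    pi = softmax_policy(theta)
    mu = stationary_dist(pi)
    return float(np.sum(mu[:, None] * pi * R))

theta = np.array([[0.3, -0.2], [0.1, 0.5]])
print(average_reward(theta))               # a value between 0 and 2 here
```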

Now let's start the proof: $$\nabla_{\theta}V(s) = \nabla_{\theta} \sum_{a} \pi(a|s) Q(s,a)$$ $$\nabla_{\theta}V(s) = \sum_{a} [Q(s,a) \nabla_{\theta} \pi(a|s) + \pi(a|s) \nabla_{\theta}Q(s,a)]$$ $$\nabla_{\theta}V(s) = \sum_{a} [Q(s,a) \nabla_{\theta} \pi(a|s) + \pi(a|s) \nabla_{\theta}[R(s,a) - r(\pi) + \sum_{s^{\prime}}P_{ss^{\prime}}^{a}V(s^{\prime})]]$$ $$\nabla_{\theta}V(s) = \sum_{a} [Q(s,a) \nabla_{\theta} \pi(a|s) + \pi(a|s) [- \nabla_{\theta}r(\pi) + \sum_{s^{\prime}}P_{ss^{\prime}}^{a}\nabla_{\theta}V(s^{\prime})]]$$ $$\nabla_{\theta}V(s) = \sum_{a} [Q(s,a) \nabla_{\theta} \pi(a|s) + \pi(a|s) \sum_{s^{\prime}}P_{ss^{\prime}}^{a}\nabla_{\theta}V(s^{\prime})] - \nabla_{\theta}r(\pi)\sum_{a}\pi(a|s)$$ Now we rearrange this (using $\sum_a \pi(a|s) = 1$): $$\nabla_{\theta}r(\pi) = \sum_{a} [Q(s,a) \nabla_{\theta} \pi(a|s) + \pi(a|s) \sum_{s^{\prime}}P_{ss^{\prime}}^{a}\nabla_{\theta}V(s^{\prime})] - \nabla_{\theta}V(s)$$ Multiplying both sides by $\mu(s)$ and summing over $s$: $$\nabla_{\theta}r(\pi) \sum_{s}\mu(s)= \sum_{s}\mu(s) \sum_{a} Q(s,a) \nabla_{\theta} \pi(a|s) + \sum_{s}\mu(s) \sum_a \pi(a|s) \sum_{s^{\prime}}P_{ss^{\prime}}^{a}\nabla_{\theta}V(s^{\prime}) - \sum_{s}\mu(s) \nabla_{\theta}V(s)$$ Since $\mu$ is the stationary distribution, $\sum_{s}\mu(s)\sum_a \pi(a|s) P_{ss^{\prime}}^{a} = \mu(s^{\prime})$, and $\sum_s \mu(s) = 1$ on the left-hand side, so: $$\nabla_{\theta}r(\pi) = \sum_{s}\mu(s) \sum_{a} Q(s,a) \nabla_{\theta} \pi(a|s) + \sum_{s^{\prime}}\mu(s^{\prime})\nabla_{\theta}V(s^{\prime}) - \sum_{s}\mu(s) \nabla_{\theta}V(s)$$ The last two terms cancel, and we are there: $$\nabla_{\theta}r(\pi) = \sum_{s}\mu(s) \sum_{a} Q(s,a) \nabla_{\theta} \pi(a|s)$$ This is the policy gradient theorem for the average-reward formulation (ref. Policy gradient).
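(Not part of the original answer.) The final identity can be checked numerically against a finite-difference gradient of $r(\pi)$ on a made-up 2-state MDP. All numbers and helper names below are illustrative; the differential value function $V$ is obtained by solving the linear Bellman system $(I - P_\pi)V = r_\pi - r(\pi)$ pinned by $\mu^\top V = 0$:

```python
import numpy as np

# Finite-difference check of the average-reward policy gradient theorem.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] (illustrative numbers)
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])    # r(s, a)
nS, nA = R.shape

def softmax(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mdp_quantities(theta):
    pi = softmax(theta)
    P_pi = np.einsum("sa,sab->sb", pi, P)          # state transitions under pi
    r_pi = (pi * R).sum(axis=1)                    # expected reward per state
    mu = np.full(nS, 1.0 / nS)
    for _ in range(10_000):                        # stationary distribution
        mu = mu @ P_pi
    rho = mu @ r_pi                                # average reward r(pi)
    # Differential value function: (I - P_pi) V = r_pi - rho, pinned by mu@V=0.
    A = np.vstack([np.eye(nS) - P_pi, mu])
    b = np.append(r_pi - rho, 0.0)
    V = np.linalg.lstsq(A, b, rcond=None)[0]
    Q = R - rho + np.einsum("sab,b->sa", P, V)     # Q(s,a) = r - rho + sum P V
    return pi, mu, rho, Q

def grad_theorem(theta):
    """Theorem: d rho / d theta[s,a'] = mu(s) pi(a'|s) (Q(s,a') - sum_a pi Q)."""
    pi, mu, _, Q = mdp_quantities(theta)
    baseline = (pi * Q).sum(axis=1, keepdims=True)
    return mu[:, None] * pi * (Q - baseline)

def grad_numeric(theta, eps=1e-6):
    """Central finite differences of rho w.r.t. each softmax logit."""
    g = np.zeros_like(theta)
    for idx in np.ndindex(*theta.shape):
        tp, tm = theta.copy(), theta.copy()
        tp[idx] += eps; tm[idx] -= eps
        g[idx] = (mdp_quantities(tp)[2] - mdp_quantities(tm)[2]) / (2 * eps)
    return g

theta = np.array([[0.3, -0.2], [0.1, 0.5]])
print(np.max(np.abs(grad_theorem(theta) - grad_numeric(theta))))  # near zero
```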

Swakshar Deb


Reputation: 432

Do you mean $\pi (a|s)$ instead of $\pi (s|a)$? Correct me if I'm misinterpreting! – cogito_ai – 2020-08-29T06:13:41.047

Also a little confused with the notation - why is $Q(s,a) = R(s,a) - r(\pi) + \sum_{s^{\prime}}P_{ss^{\prime}}^{a}V(s^{\prime})$ ? – cogito_ai – 2020-08-29T06:24:43.610

@cogito_ai I followed Sutton's notation, so I denoted the policy as $\pi(a|s)$. For your case, read $\pi(s|a)$ as $\pi(a|s)$. But I think this is a mistake in the slide; it should be $\pi(a|s)$. – Swakshar Deb – 2020-08-29T06:30:19.887

I think you've denoted the policy by $\pi(s|a)$ in your answer in multiple places - or is that something else? The slide shows $a|s$ – cogito_ai – 2020-08-29T06:31:35.413

For the average reward setting, $Q(s,a) = r(s,a) - r(\pi) + \sum P_{ss'}v(s')$. See the average-reward section of the policy function approximation chapter in Sutton's book. – Swakshar Deb – 2020-08-29T06:32:38.123

Sorry, it should be $\pi(a|s)$. It was my mistake. – Swakshar Deb – 2020-08-29T06:33:36.660

How is $\sum_{s^{\prime}}\mu(s^{\prime})\nabla_{\theta}V(s^{\prime}) = \sum_{s}\mu(s) \sum_a \pi(s,a) \sum_{s^{\prime}}P_{ss^{\prime}}^{a}\nabla_{\theta}V(s^{\prime})$? Also which edition of Sutton's book are you using? – cogito_ai – 2020-08-29T06:35:14.483

Because of the law of total probability combined with the stationarity of $\mu$. This is not in Sutton's book; I gave the reference, check it for more details. – Swakshar Deb – 2020-08-29T06:37:01.877

If you expand the summation, you can see that it is exactly the probability of the agent moving to state $s'$ from state $s$. – Swakshar Deb – 2020-08-29T06:39:33.947

Looks good! Lastly, in places where you've written $\pi(a,s)$, did you mean $\pi(a|s)$ there too? – cogito_ai – 2020-08-29T06:41:53.990

Yes, I made a mistake; it should be $\pi(a|s)$. – Swakshar Deb – 2020-08-29T06:42:33.997

Your formulation ends with $Q(s,a)$, while the slide has $R(s,a)$ - why the difference? I understand the rest! It'd be great if you could elaborate on this! – cogito_ai – 2020-08-29T06:44:10.477

I did not follow this lecture for RL, so I cannot say right now; I would first have to watch the Deep RL Bootcamp lecture. But if you think about it, you may find the reason. – Swakshar Deb – 2020-08-29T06:50:31.463

Okay, I watched their lecture. At 40:55 they replace $\sum r(s,a)$ with $Q(s,a)$. – Swakshar Deb – 2020-08-29T07:11:03.917

$Q(s,a)$ is an unbiased estimate of $R(s,a)$, i.e., they are the same in expectation. @cogito_ai – David Ireland – 2020-08-29T08:58:11.243


The reason you are confused is that this is not the full derivation of the Policy Gradient Theorem. You are correct in thinking that $\mu(s)$ depends on the policy $\pi$, which in turn depends on the policy parameters $\theta$, so there should seemingly be a derivative of $\mu$ w.r.t. $\theta$; however, the Policy Gradient Theorem doesn't require you to take this derivative.

In fact, the great thing about the Policy Gradient Theorem is that the final result does not require you to take a derivative of the state distribution with respect to the policy parameters. I would encourage you to read and go through the derivation of the Policy Gradient Theorem from e.g. Sutton and Barto to see why you don't need to take the derivative.

[Image: the Policy Gradient Theorem proof from the Sutton and Barto book]

Above is an image of the Policy Gradient Theorem proof from the Sutton and Barto book. If you carefully go through this line by line you will see that you are not required to take a derivative of the state distribution anywhere in the proof.
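As an aside (my own sketch, not from Sutton and Barto), the log-derivative trick itself is easy to verify at a single fixed state: for a softmax distribution over actions, the exact gradient of $\mathbb{E}_{a\sim\pi_\theta}[f(a)]$ coincides with $\sum_a \pi_\theta(a) f(a)\, \nabla_\theta \log \pi_\theta(a)$, and no derivative of any state distribution appears. All names and numbers below are illustrative:

```python
import numpy as np

# The log-derivative trick on one categorical distribution (a toy stand-in
# for pi_theta(.|s) at a fixed state s).
f = np.array([1.0, 3.0, -2.0])            # arbitrary per-action payoffs

def pi(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def grad_expectation(theta):
    """Direct gradient of E_{a~pi}[f(a)] = sum_a pi(a) f(a) for softmax pi."""
    p = pi(theta)
    m = p @ f
    return p * (f - m)                    # d/dtheta_k of sum_a pi(a) f(a)

def grad_score_function(theta):
    """Same gradient via sum_a pi(a) f(a) grad_theta log pi(a)."""
    p = pi(theta)
    # For softmax, grad_theta log pi(a) = e_a - p (row a below).
    grad_log_pi = np.eye(len(p)) - p
    return np.einsum("a,a,ak->k", p, f, grad_log_pi)

theta = np.array([0.2, -0.5, 1.0])
print(np.allclose(grad_expectation(theta), grad_score_function(theta)))  # True
```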

David Ireland


Reputation: 1 942

Could you please elaborate and possibly provide mathematical details? I've gone through the derivation in Sutton and Barto and I'm still unclear why the derivation above works. – cogito_ai – 2020-08-28T03:22:55.830

There’s not much to elaborate on. If you followed the Sutton and Barto proof you would see that you don’t need to take a derivative of the state distribution. I can’t really explain any clearer than just writing out the full proof which I am not going to do since that is not the point of this site. If you have a particular question about the proof that you find unclear then you should post that and I’d be happy to answer it. – David Ireland – 2020-08-28T08:53:37.853

Also, the above isn't a derivation. It is an extremely poor example, especially if they have not previously gone through the policy gradient theorem proof. What they have written (first two lines) gets the right answer but goes about it the wrong way. – David Ireland – 2020-08-28T09:21:41.650

I understand the proof written in Sutton and Barto's text. My question was - why does the wrong proof (in my post) work? I apologize if you've already explained this, but I still don't see why the wrong proof works. Sutton and Barto's proof doesn't require us to take the derivative of the state distribution anywhere, but that is no reason for me to overlook the same in the proof I've attached in my post. Please explain. Thanks! – cogito_ai – 2020-08-28T13:37:44.950

What you have posted is not a proof. They have started with the final result of the Policy Gradient Theorem, and then they have derived the update that is used in REINFORCE. As I say, the source you are using does not look like something I would recommend. The equality on the first line should be replaced directly with the equality on the second line to avoid confusion. Regardless, if you want good source material for Reinforcement Learning then I would recommend you read Sutton and Barto to prepare you for more advanced topics you will find in papers. – David Ireland – 2020-08-28T13:40:21.410

@cogito_ai to answer your edit, what you have posted is not a proof of the policy gradient theorem. The author has started with the final result in the policy gradient theorem, and has derived something entirely separate. – David Ireland – 2020-08-28T13:44:12.377