## Why is the stationary distribution independent of the initial state in the proof of the policy gradient theorem?


I was going through the proof of the policy gradient theorem here: https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html#svpg

In the section "Proof of Policy Gradient Theorem" in the block of equations just under the sentence "The nice rewriting above allows us to exclude the derivative of Q-value function..." they set $$\eta (s) = \sum^\infty_{k=0} \rho^\pi(s_0 \rightarrow s, k)$$ and $$\sum_s \eta (s) = const$$ Thus, they basically assume, that the stationary distribution is not dependent on the initial state. But how can we justify this? If the MDP is described by a block diagonal transition matrix, in my mind this should not hold.
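For context, the step in question can be written out as follows (in Weng's notation; $\phi(s) = \sum_a \nabla_\theta \pi_\theta(a \vert s) Q^\pi(s,a)$ is a shorthand introduced here for the bracketed term):

$$\nabla_\theta J(\theta) = \nabla_\theta V^\pi(s_0) = \sum_s \sum^\infty_{k=0} \rho^\pi(s_0 \rightarrow s, k)\,\phi(s) = \sum_s \eta(s)\,\phi(s)$$

$$= \Big(\sum_{s'} \eta(s')\Big) \sum_s \frac{\eta(s)}{\sum_{s'} \eta(s')}\,\phi(s) \;\propto\; \sum_s d^\pi(s)\,\phi(s),$$

where $d^\pi(s) = \eta(s) / \sum_{s'} \eta(s')$. Dropping the factor $\sum_{s'} \eta(s')$ as a constant is exactly the point where independence from $s_0$ is implicitly used.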


I think your doubt is completely reasonable. There is probably an additional assumption that both Lilian Weng and Rich Sutton (p. 269) leave implicit in the proof: the MDP is not only stationary, but also ergodic. A particular property of such systems is that the probability of eventually reaching a state $$s$$ from any starting point $$s_0$$ is 1. In that case it is clear that $$\eta(s)$$ exists and is independent of the chosen $$s_0$$.

Clearly, an MDP with a block-diagonal transition matrix does not satisfy this assumption, since the starting point completely restricts which states you can reach, even in infinite time.
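This contrast is easy to check numerically. Below is a minimal numpy sketch (the transition matrices are made up for illustration): an ergodic chain reaches the same limiting distribution from every start state, while a block-diagonal chain does not.

```python
import numpy as np

# Ergodic chain: irreducible (every state reachable from every other)
# and aperiodic (self-loops), so it has a unique stationary distribution.
P_ergodic = np.array([[0.5, 0.5, 0.0],
                      [0.2, 0.3, 0.5],
                      [0.4, 0.1, 0.5]])

# Block-diagonal chain: states {0, 1} and {2, 3} never communicate.
P_block = np.array([[0.9, 0.1, 0.0, 0.0],
                    [0.5, 0.5, 0.0, 0.0],
                    [0.0, 0.0, 0.3, 0.7],
                    [0.0, 0.0, 0.6, 0.4]])

def limiting_distribution(P, s0, n=2000):
    """State distribution after n steps when starting deterministically in s0."""
    d = np.zeros(P.shape[0])
    d[s0] = 1.0
    return d @ np.linalg.matrix_power(P, n)

# Ergodic: the limit is the same regardless of the start state.
print(limiting_distribution(P_ergodic, 0))  # both ≈ [0.375, 0.3125, 0.3125]
print(limiting_distribution(P_ergodic, 2))

# Block-diagonal: the limit depends on which block s0 lies in.
print(limiting_distribution(P_block, 0))  # mass only on states {0, 1}
print(limiting_distribution(P_block, 2))  # mass only on states {2, 3}
```

In the block-diagonal case there is no single stationary distribution shared by all start states, so a quantity like $\eta(s)$ necessarily carries a dependence on $s_0$.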

What I do not understand is why Rich Sutton mentions ergodicity as a necessary condition only in the case of a "continuing task", as opposed to "episodic tasks" (p. 275). To me, his proof requires this condition in both cases.

As an additional note, I also think that Lilian Weng does not really explain why, starting from the reasonable initial definition $$J(\theta)=\sum_{s\in\mathcal S} d^{\pi_\theta}(s)V^{\pi_\theta}(s),$$ we should accept the much simpler one $$J(\theta)=V^{\pi_\theta}(s_0).$$ I guess the only reason is that the gradient of the initial expression requires knowing the gradient of $$d^{\pi_\theta}(s)$$, so you would be accepting the approximation:

$$\nabla_\theta J(\theta)=\nabla_\theta\left(\sum_{s\in\mathcal S} d^{\pi_\theta}(s)V^{\pi_\theta}(s)\right)\approx\sum_{s\in\mathcal S} d^{\pi_\theta}(s)\nabla_\theta V^{\pi_\theta}(s),$$

where the last term is just $$\nabla_\theta V^{\pi_\theta}(s_0)$$ under the ergodicity assumption.

Yes, the second part you mentioned also left me a bit confused. I don't quite see why $\sum d^\pi (s) V^\pi (s) = V^\pi (s_0)$ should be true, even if the system is ergodic. Generally, $V^\pi(s_0)$ makes more sense to me, since in practice we always start from $s_0$ and don't sample $s_0$ from $d^\pi (s)$. – Luca Thiede – 2019-12-04T23:15:31.457

I'm not really saying that the expected value of $V^{\pi}(s)$ is $V^\pi(s_0)$. What is true (if the system is ergodic) is that the expected value of $\nabla V^{\pi}(s)$ is $\nabla V^\pi(s_0)$, given that the gradient is the same for all states since $\eta(s)$ does not depend on $s_0$. – Diego Gomez – 2019-12-04T23:31:57.130

About which performance metric is better: I consider that usually you want your policy to be robust, so you don't want to consider only a single $s_0$. What happens if, by some perturbation, your agent is forced to start in a different state? At the very least you should consider an initial distribution $\rho(s_0)$. This is done in the TRPO paper, for example. – Diego Gomez – 2019-12-04T23:32:49.063