Understanding lemma 2 of the "Trust Region Policy Optimization" paper



In the Trust Region Policy Optimization paper, in Lemma 2 of Appendix A, I do not quite understand how inequality (31) is derived from equation (30). The two are:

$$\bar{A}(s) = P(a \neq \tilde{a} | s) \mathbb{E}_{(a, \tilde{a}) \sim (\pi, \tilde{\pi})|a \neq \tilde{a}} \left[ A_{\pi}(s, \tilde{a}) - A_{\pi}(s,a) \right]$$

$$|\bar{A}(s)| \le \alpha \cdot 2 \max_{s,a} |A_{\pi}(s,a)|$$

Could you explain how the inequality is derived?

Afshin Oroojlooy

Posted 2018-11-27T16:52:12.273

Reputation: 143



We can start with equation (30):

$$ \bar{A}(s) = P(a \neq \tilde{a}) \mathbb{E}_{(a,\tilde{a})\sim(\pi,\tilde{\pi}|a\neq\tilde{a})} [A_\pi(s, \tilde{a}) - A_\pi(s, a)] $$

Taking the absolute value of both sides preserves the equality. Because the probability term is nonnegative, we can pull it out of the absolute value.

$$ |\bar{A}(s)| = P(a \neq \tilde{a}) |\mathbb{E}_{(a,\tilde{a})\sim(\pi,\tilde{\pi}|a\neq\tilde{a})} [A_\pi(s, \tilde{a}) - A_\pi(s, a)]| $$

By Definition 1, $P(a \neq \tilde{a}) \leq \alpha$ in every state $s$. Applying this bound, we get:

$$ |\bar{A}(s)| \leq \alpha \cdot |\mathbb{E}_{(a,\tilde{a})\sim(\pi,\tilde{\pi}|a\neq\tilde{a})} [A_\pi(s, \tilde{a}) - A_\pi(s, a)]| $$

The absolute value is a convex function, so by Jensen's inequality we can move it inside the expectation, which can only increase the right-hand side.

$$ |\bar{A}(s)| \leq \alpha \cdot \mathbb{E}_{(a,\tilde{a})\sim(\pi,\tilde{\pi}|a\neq\tilde{a})} [|A_\pi(s, \tilde{a}) - A_\pi(s, a)|] $$
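For a discrete distribution, this Jensen step reduces to the ordinary triangle inequality applied to a weighted sum. As a small illustration, with $p_i$ and $x_i$ as generic placeholder symbols (they are not notation from the paper) for the probabilities and values of a random variable $X$:

$$ |\mathbb{E}[X]| = \left| \sum_i p_i x_i \right| \leq \sum_i p_i |x_i| = \mathbb{E}[|X|] $$

Here $X$ plays the role of $A_\pi(s, \tilde{a}) - A_\pi(s, a)$ under the conditional distribution given $a \neq \tilde{a}$.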

The expectation of a random variable is never larger than the maximum value that variable can take.

$$ |\bar{A}(s)| \leq \alpha \cdot \max_{a,\tilde{a}|a\neq\tilde{a}} |A_\pi(s, \tilde{a}) - A_\pi(s, a)| $$

The next step is the triangle inequality. I'm not sure this is exactly the logic the authors followed, but it is valid: for any real numbers $x, y$, we have $|x - y| \leq |x| + |y|$.

$$ |\bar{A}(s)| \leq \alpha \cdot \max_{a,\tilde{a}|a\neq\tilde{a}} (|A_\pi(s, \tilde{a})| + |A_\pi(s, a)|) $$

Each of the two terms is at most $\max_{a} |A_\pi(s, a)|$, regardless of which of the two actions has the larger absolute advantage. Their sum is therefore at most twice that maximum, and the constraint $a \neq \tilde{a}$ can be dropped.

$$ |\bar{A}(s)| \leq \alpha \cdot 2 \max_{a} |A_\pi(s, a)| $$

Finally, if we replace the max over $a$ at the fixed state $s$ with the max over all pairs $(s', a)$, the inequality still holds, since the larger set includes $s' = s$. Making this substitution gives us (31).

$$ |\bar{A}(s)| \leq \alpha \cdot 2 \max_{a,s'} |A_\pi(s', a)| $$
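The chain of inequalities can also be checked numerically for a single state. The sketch below is only an illustration under stated assumptions, not anything from the paper: the helper names `maximal_coupling` and `check_bound` are hypothetical, the policies and Q-values are random, and the joint distribution over $(a, \tilde{a})$ is one particular coupling (a maximal coupling) chosen for convenience. The advantages are centered so that $\mathbb{E}_{a \sim \pi}[A_\pi(s, a)] = 0$, as an advantage function requires.

```python
import numpy as np

def maximal_coupling(pi, pi_tilde):
    """Joint distribution over (a, a_tilde) with marginals pi and pi_tilde.

    Hypothetical helper: puts as much mass as possible on a == a_tilde,
    then spreads the leftover mass over the off-diagonal entries.
    """
    overlap = np.minimum(pi, pi_tilde)
    tv = 1.0 - overlap.sum()  # probability that a != a_tilde
    joint = np.diag(overlap)
    if tv > 1e-12:
        # The two residuals have disjoint supports, so this term
        # contributes nothing to the diagonal.
        joint = joint + np.outer(pi - overlap, pi_tilde - overlap) / tv
    return joint

def check_bound(n_actions=5, seed=0):
    """Return (A_bar, alpha, bound) for one randomly generated state."""
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.ones(n_actions))
    pi_tilde = rng.dirichlet(np.ones(n_actions))
    q = rng.normal(size=n_actions)  # arbitrary Q-values (assumption)
    adv = q - pi @ q                # advantage: mean zero under pi
    joint = maximal_coupling(pi, pi_tilde)  # joint[i, j] = P(a=i, a_tilde=j)
    alpha = 1.0 - np.trace(joint)           # P(a != a_tilde)
    diff = adv[None, :] - adv[:, None]      # diff[i, j] = A(a_tilde=j) - A(a=i)
    a_bar = float((joint * diff).sum())     # diagonal terms are zero
    bound = alpha * 2 * np.abs(adv).max()   # per-state bound, before max over s'
    return a_bar, alpha, bound

if __name__ == "__main__":
    for seed in range(5):
        a_bar, alpha, bound = check_bound(seed=seed)
        print(f"seed={seed}  |A_bar|={abs(a_bar):.4f}  bound={bound:.4f}")
```

In every run `abs(a_bar)` should come out no larger than `bound`. The bound checked here uses the max over actions at one state only, so it is at least as tight as (31), which additionally maximizes over states.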

Nishant Desai

Posted 2018-11-27T16:52:12.273

Reputation: 91