## Can the proof that subtracting a baseline doesn't change the gradient also be used to show that the gradient is zero?

I am using David Silver's RL course to help me write my thesis. However, I am baffled by the proof given in lecture 7, slide 29:

\begin{align}
\mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) B(s)] &= \sum_{s \in \mathcal{S}} d^{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta(s,a) B(s)\\
&= \sum_{s \in \mathcal{S}} d^{\pi_\theta}(s) B(s) \nabla_\theta \sum_{a \in \mathcal{A}} \pi_\theta(s,a)\\
&= 0
\end{align}

Consider replacing $$B(s)$$ in this proof with the critic's quality estimate $$Q_w(s,a)$$ (see the previous slides). Why does the same argument not show that the gradient of the objective function, $$\nabla_\theta J(\theta)$$, is also $$0$$? Does this have to do with the second summation changing from being over $$a$$ to being over $$a \in \mathcal{A}$$?

Thank you.

After thinking about this, I've realized that $$Q(s,a)$$ relies on the action and thus cannot be pulled out of the sum in the same way $$B(s)$$ can. I'm leaving this up for anyone interested in the same thing.
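
Spelled out for a fixed state $$s$$, as a sketch in the slide's notation and assuming $$Q_w$$ does not depend on $$\theta$$: the baseline is constant in $$a$$ and factors out, leaving the gradient of a sum that is identically $$1$$, whereas $$Q_w(s,a)$$ has to stay inside the sum, which leaves the gradient of a weighted average that is not zero in general.

\begin{align}
\sum_{a \in \mathcal{A}} \nabla_\theta \pi_\theta(s,a) B(s) &= B(s) \nabla_\theta \sum_{a \in \mathcal{A}} \pi_\theta(s,a) = B(s) \nabla_\theta 1 = 0\\
\sum_{a \in \mathcal{A}} \nabla_\theta \pi_\theta(s,a) Q_w(s,a) &= \nabla_\theta \sum_{a \in \mathcal{A}} \pi_\theta(s,a) Q_w(s,a)
\end{align}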
The crucial point here is that the baseline depends only on the state, hence the notation $$B(s)$$. If you use the estimate $$Q_w(s, a)$$ instead, you get a "baseline" that depends on both the state and the action, basically $$B(s, a)$$, and it can no longer be pulled out of the sum over actions.
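
The difference can also be checked numerically. The following is only a minimal sketch under assumptions that are not in the slides: a single state, a softmax policy with one parameter per action, and randomly drawn stand-ins for $$Q_w(s,a)$$ and $$B(s)$$. All names in it (`softmax`, `policy_jacobian`, `theta`, `q`, `b`) are hypothetical.

```python
import numpy as np

def softmax(theta):
    # Numerically stable softmax: pi_theta(s, .) for a single fixed state.
    z = np.exp(theta - theta.max())
    return z / z.sum()

def policy_jacobian(theta):
    # jac[a, k] = d pi(a) / d theta_k for the softmax policy,
    # which equals pi(a) * (delta_ak - pi(k)).
    pi = softmax(theta)
    return np.diag(pi) - np.outer(pi, pi)

rng = np.random.default_rng(0)
theta = rng.normal(size=4)   # assumed policy parameters, one per action
q = rng.normal(size=4)       # stand-in for Q_w(s, a), varies with the action
b = 2.5                      # stand-in for B(s), constant across actions

jac = policy_jacobian(theta)

# sum_a grad pi(s, a) * B(s): the constant factors out of the sum.
baseline_term = (jac * b).sum(axis=0)

# sum_a grad pi(s, a) * Q(s, a): the action-dependent weight stays inside.
q_term = (jac * q[:, None]).sum(axis=0)

print("baseline term:", baseline_term)
print("Q term:", q_term)
```

Under these assumptions the baseline term vanishes up to floating-point error, while the $$Q$$ term works out to $$\pi_\theta(s,k)\,(Q(s,k) - \sum_a \pi_\theta(s,a) Q(s,a))$$ for each parameter $$k$$, which is zero only if $$Q$$ is the same for every action.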