I am using David Silver's RL course to help me write my thesis. However, I am baffled by the proof given in lecture 7, slide 29:

\begin{align} \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) B(s)] &= \sum_{s \in \mathcal{S}} d^{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta(s,a) B(s)\\ &= \sum_{s \in \mathcal{S}} d^{\pi_\theta}(s) B(s) \nabla_\theta \sum_{a \in \mathcal{A}} \pi_\theta(s,a)\\ &= 0 \end{align}

Consider replacing the baseline $B(s)$ in this proof with the critic's action-value estimate $Q_w(s,a)$ (see the previous slides). Why does the same argument not show that the gradient of the objective function, $\nabla_\theta J(\theta)$, is also $0$? Does it have to do with the second summation changing from being over $a$ to being over $a \in \mathcal{A}$?
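
To make the substitution concrete, this is the first step as I would write it in the slide's notation. It is only a sketch of my own attempt, assuming $Q_w(s,a)$ is the critic from the previous slides and $d^{\pi_\theta}$ is the same state distribution as above:

\begin{align} \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta(s,a) Q_w(s,a)] &= \sum_{s \in \mathcal{S}} d^{\pi_\theta}(s) \sum_{a \in \mathcal{A}} \nabla_\theta \pi_\theta(s,a) Q_w(s,a) \end{align}

As far as I can tell, the next step of the proof moves $B(s)$ in front of the sum over actions. Unlike $B(s)$, the factor $Q_w(s,a)$ depends on $a$, so I am not sure the same move is allowed here.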

Thank you.