Monte Carlo epsilon-greedy Policy Iteration: monotonic improvement for all cases or for the expected value?


I was going through university slides, and this particular slide tries to prove that, in a Monte Carlo policy iteration algorithm using an epsilon-greedy policy, the state values (V-values) improve monotonically.

[image: slide with the proof]

My question is about the first line of computation.

[image: the first line of the computation]

Isn't this actually the formula for the expected value of Q? It multiplies the probability of each action under the policy by the corresponding Q-value, and then sums over actions.

If that is the case, could you help me understand the relationship between the expected value of Q and the expected value of V?

Also, if the above is true, then in a real-world scenario, depending on how many episodes we sample and on stochasticity, could the V-values of the new policy turn out worse than the V-values of the old policy?
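To make the concern concrete, here is a toy simulation (the two-action bandit, its Q-values, and the noise level are all invented for illustration) showing that Monte Carlo estimates from a few episodes can rank actions wrongly, so a policy made greedy with respect to those estimates can have a worse true value, even though the improvement holds in expectation:

```python
import random

random.seed(0)

# Hypothetical one-state problem with two actions and known true Q-values.
true_q = {"a": 1.0, "b": 0.9}  # action "a" is truly better

def mc_estimate(action, n_episodes):
    # Monte Carlo estimate: average of noisy sampled returns.
    return sum(true_q[action] + random.gauss(0, 1.0)
               for _ in range(n_episodes)) / n_episodes

# With few episodes, the estimate can rank the actions wrongly,
# so the greedy policy built from it picks the worse action.
flips = 0
trials = 1000
for _ in range(trials):
    est = {a: mc_estimate(a, n_episodes=5) for a in true_q}
    greedy = max(est, key=est.get)
    if true_q[greedy] < max(true_q.values()):
        flips += 1

print(f"greedy pick was suboptimal in {flips}/{trials} trials")
```

The monotonic-improvement proof on the slide concerns the true (expected) values, not these noisy sampled estimates.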

devidduma

Posted 2020-04-25T20:06:16.880

Reputation: 47

Answers


I think this equation answers your question: $$ q_{\pi^{i}}(s,\pi^{i+1}(s)) = \mathbf{E}[q_{\pi^{i}}(s,\pi^{i+1}(s))] = \sum_{a \in A}\pi^{i+1}(a|s)q_{\pi^{i}}(s,a)$$

The Q-value when taking the first action from policy $\pi^{i+1}$ and thereafter following policy $\pi^{i}$ equals the expected Q-value under $\pi^{i+1}$. Since $\pi^{i+1}$ is stochastic (epsilon-greedy), the shorthand $q_{\pi^{i}}(s,\pi^{i+1}(s))$ already denotes that expectation over actions, which is exactly the summation you noticed. And for the second part of your question the answer is:

$$ V_{\pi^{i}}(s) = q_{\pi^{i}}(s,\pi^{i}(s))$$

The state-value function under policy $\pi^{i}$ is the same as the action-value function when the action is chosen by $\pi^{i}$ and $\pi^{i}$ is followed thereafter; for a stochastic $\pi^{i}$ this again reads as the expectation $V_{\pi^{i}}(s) = \sum_{a \in A}\pi^{i}(a|s)q_{\pi^{i}}(s,a)$.
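The two identities can be checked numerically. Below is a minimal sketch (the three actions, their Q-values, and epsilon are made up) that builds an epsilon-greedy distribution and verifies that the state value is the probability-weighted sum of Q-values:

```python
# Hypothetical Q-values for a single state with three actions.
q = {"left": 0.2, "right": 0.7, "stay": 0.5}
eps = 0.1
n = len(q)

def eps_greedy_probs(q_vals, eps):
    # epsilon-greedy: every action gets probability eps/n,
    # and the greedy action gets an extra 1 - eps.
    best = max(q_vals, key=q_vals.get)
    return {a: eps / n + (1 - eps) * (a == best) for a in q_vals}

pi = eps_greedy_probs(q, eps)

# V(s) = E_pi[q(s, A)] = sum_a pi(a|s) * q(s, a)
v = sum(pi[a] * q[a] for a in q)
print(f"pi = {pi}")
print(f"V(s) = {v:.4f}")
```

Here the greedy action `right` receives probability `1 - eps + eps/n`, so the resulting V is pulled close to the best Q-value while every action keeps nonzero probability.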

Swakshar Deb

Posted 2020-04-25T20:06:16.880

Reputation: 432