1

I was going through university slides and this particular slide is trying to prove that in a Monte Carlo Policy Iteration algorithm using an epsilon-greedy policy, the state Values (V-Values) are monotonically improving.

My question is about the first line of computation.

Isn't this actually the formula for the expected value of Q? It is calculating a probability of occurrence following the policy times actual Q values, then doing the summation.

If that is the case, could you help me understand the relationship between the expected value of Q and the expected value of V ?

Also, if above is true, in a real world scenario, depending on how many episodes we sample and on stochasticity, does it mean that the V values of the new policy could be worse than the V values of the old policy ?