Policy Iteration is essentially a two-step process:

- Evaluate the current policy by calculating $v(s)$ for every non-terminal state.
  - Internally this requires multiple passes over all the states, until the value calculations are accurate enough.
- Improve the policy by choosing the best action $\pi(s)$ for every non-terminal state.

The overall process terminates when the improvement step cannot change the policy. When that happens, the value function and policy must be optimal - this follows from the Bellman equations for the optimal value function.

In detail for your example, it might work like this:

Initialise our value function for $[S_0, S_1, S_2]$ as $[0, 0, 0]$. Note that by definition $v(S_2) = 0$ for the terminal state, so we never need to recalculate it. This is also consistent with the diagram, which shows the "absorbing" actions, so we *could* calculate it if we wanted to, but it would be a waste of time.

Choose a value for the accuracy limit $\theta$. To avoid going on forever in the example, let's pick a high value, say $0.01$. However, if this were computer code, I might set it lower, e.g. $10^{-6}$.

Set our discount factor, $\gamma = 0.1$. Note this is quite low, heavily emphasising immediate rewards over longer-term ones.

Set our current policy $\pi = [B, R]$. We're ready to start.

First we have to do policy evaluation, which is multiple passes through all the states, updating $v(s) = \sum_{s',r} p(s',r|s,\pi(s))(r + \gamma v(s'))$. There are other, essentially equivalent, ways of writing this formula; the one here is from Sutton & Barto, 2nd edition. We could also do a "pure" iteration, calculating $v_{k+1}(s)$ from $v_{k}(s)$ in a separate array, but I will do the update in place in a single $v$ array, because that is easier and usually converges faster too.

Note each sum here is over the two possible $(s', r)$ results from each action.
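If you want to check a pass by machine, a single in-place sweep can be sketched in Python. Note the transition table below is not given explicitly anywhere; it is my reading of the $(s', r)$ terms in the worked sums, so treat it as an assumption:

```python
# One in-place evaluation sweep for the example MDP under policy [B, R].
# Each entry of P is a list of (probability, reward, next state) outcomes,
# read off from the sums in the worked passes (an assumption on my part).
GAMMA = 0.1
P = {
    (0, 'B'): [(0.5, -5, 1), (0.5, -2, 0)],
    (0, 'R'): [(0.9, -2, 0), (0.1,  0, 2)],
    (1, 'B'): [(0.8, -2, 0), (0.2, -5, 1)],
    (1, 'R'): [(0.6, -5, 1), (0.4,  0, 2)],
}

def sweep(v, policy):
    """Update v in place for the non-terminal states; return max delta."""
    delta = 0.0
    for s in (0, 1):
        old = v[s]
        v[s] = sum(p * (r + GAMMA * v[s2]) for p, r, s2 in P[(s, policy[s])])
        delta = max(delta, abs(v[s] - old))
    return delta

v = [0.0, 0.0, 0.0]
delta = sweep(v, ['B', 'R'])
print(v, delta)   # prints: [-3.5, -3.0, 0.0] 3.5
```

Because the update is in place, $v(S_1)$ on the right-hand side is whatever is currently in the array, which matches the hand calculation below.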

Pass 1 in detail, iterating through states $[S_0, S_1]$:

- Set max delta ($\Delta = 0$) to track our accuracy
- $v(S_0) = \sum_{s',r} p(s',r|S_0,B)(r + \gamma v(s')) = 0.5(-5 + 0.1 v(S_1)) + 0.5 (-2 + 0.1 v(S_0)) = -3.5$
- This sets max delta to 3.5
- $v(S_1) = \sum_{s',r} p(s',r|S_1,R)(r + \gamma v(s')) = 0.6(-5 + 0.1 v(S_1)) + 0.4 (0 + 0.1 v(S_2)) = -3$
- Max delta is still 3.5
- The $v$ table is now $[-3.5, -3, 0]$

- Pass 2, because $\Delta \gt 0.01$:
- Reset $\Delta = 0$
- $v(S_0) = 0.5(-5 + 0.1 v(S_1)) + 0.5 (-2 + 0.1 v(S_0)) = -3.825$
- $\Delta = 0.325$
- $v(S_1) = 0.6(-5 + 0.1 v(S_1)) + 0.4 (0 + 0.1 v(S_2)) = -3.180$
- $\Delta = 0.325 \gt 0.01$
- The $v$ table is now $[-3.825, -3.180, 0]$

- Pass 3, $v = [-3.850, -3.191, 0]$
- Pass 4, $v = [-3.852, -3.191, 0]$. The change is now small enough ($\Delta \lt \theta$) that we'll say evaluation has converged.
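The four evaluation passes can be reproduced with a small loop. Again, the transition table is my reading of the worked sums, not something stated explicitly:

```python
# Repeated in-place sweeps under policy [B, R] until the largest update
# falls below theta = 0.01. The (probability, reward, next state) table
# is read off the sums in the worked passes (an assumption on my part).
GAMMA, THETA = 0.1, 0.01
P = {
    (0, 'B'): [(0.5, -5, 1), (0.5, -2, 0)],
    (0, 'R'): [(0.9, -2, 0), (0.1,  0, 2)],
    (1, 'B'): [(0.8, -2, 0), (0.2, -5, 1)],
    (1, 'R'): [(0.6, -5, 1), (0.4,  0, 2)],
}

v = [0.0, 0.0, 0.0]
policy = ['B', 'R']
passes = 0
while True:
    passes += 1
    delta = 0.0
    for s in (0, 1):
        old = v[s]
        v[s] = sum(p * (r + GAMMA * v[s2]) for p, r, s2 in P[(s, policy[s])])
        delta = max(delta, abs(v[s] - old))
    if delta < THETA:
        break

print(passes, [round(x, 3) for x in v])   # prints: 4 [-3.852, -3.191, 0.0]
```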

- Now we need to check whether the policy should be changed. For each non-terminal state we need to work through all possible actions and find the action with the highest expected return, $\pi(s) = \text{argmax}_a \sum_{s',r} p(s',r|s,a)(r + \gamma v(s'))$
- For $S_0$:
- Action B scores $0.5(-5 + 0.1 v(S_1)) + 0.5 (-2 + 0.1 v(S_0)) = -3.852$ (technically, we already know that because it is the current policy's action)
- Action R scores $0.9(-2 + 0.1 v(S_0)) + 0.1 (0 + 0.1 v(S_2)) = -2.147$
- R is the best action here, so change $\pi(S_0) = R$

- For $S_1$:
- Action B scores $0.8(-2 + 0.1 v(S_0)) + 0.2 (-5 + 0.1 v(S_1)) = -2.972$
- Action R scores -3.191
- B is the best action here, so change $\pi(S_1) = B$

- Our policy has changed, so we go around both major stages again, starting with policy evaluation and the current $v = [-3.852, -3.191, 0]$, using the new policy $[R, B]$
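The improvement step above can also be sketched in a few lines. The starting values are the converged ones from evaluation, and the transition table is my reading of the worked sums:

```python
# Greedy policy improvement against the converged value estimates.
# The (probability, reward, next state) table is read off the sums in
# the worked passes above (an assumption on my part).
GAMMA = 0.1
P = {
    (0, 'B'): [(0.5, -5, 1), (0.5, -2, 0)],
    (0, 'R'): [(0.9, -2, 0), (0.1,  0, 2)],
    (1, 'B'): [(0.8, -2, 0), (0.2, -5, 1)],
    (1, 'R'): [(0.6, -5, 1), (0.4,  0, 2)],
}
v = [-3.852, -3.191, 0.0]   # converged values from the evaluation passes

def q(s, a):
    """Expected return of taking action a in state s, then following v."""
    return sum(p * (r + GAMMA * v[s2]) for p, r, s2 in P[(s, a)])

policy = [max(('B', 'R'), key=lambda a: q(s, a)) for s in (0, 1)]
print(policy)                                     # prints: ['R', 'B']
print(round(q(0, 'R'), 3), round(q(1, 'B'), 3))   # prints: -2.147 -2.972
```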

I leave it up to you to do the second pass through evaluation and improvement. I suspect it will show that the new policy is optimal, and the policy will remain $[R, B]$ (although $[R,R]$ seems possible without doing the calculation). If the policy stays the same on a complete pass through evaluation and improvement, then you are done.
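If you want to check the rest of your manual work afterwards, the whole procedure can be run end to end. This is a sketch under the same assumption as before, i.e. that the transition table below matches your diagram (I read it off the worked sums):

```python
# Full policy iteration on the example MDP, to check the manual work.
# The (probability, reward, next state) table is my reading of the sums
# in the worked passes, so treat it as an assumption.
GAMMA, THETA = 0.1, 0.01
P = {
    (0, 'B'): [(0.5, -5, 1), (0.5, -2, 0)],
    (0, 'R'): [(0.9, -2, 0), (0.1,  0, 2)],
    (1, 'B'): [(0.8, -2, 0), (0.2, -5, 1)],
    (1, 'R'): [(0.6, -5, 1), (0.4,  0, 2)],
}

def evaluate(v, policy):
    """In-place iterative policy evaluation until the max update < THETA."""
    while True:
        delta = 0.0
        for s in (0, 1):
            old = v[s]
            v[s] = sum(p * (r + GAMMA * v[s2]) for p, r, s2 in P[(s, policy[s])])
            delta = max(delta, abs(v[s] - old))
        if delta < THETA:
            return

def improve(v):
    """Greedy policy with respect to the current value estimates."""
    return [max(('B', 'R'),
                key=lambda a: sum(p * (r + GAMMA * v[s2])
                                  for p, r, s2 in P[(s, a)]))
            for s in (0, 1)]

v, policy = [0.0, 0.0, 0.0], ['B', 'R']
while True:
    evaluate(v, policy)
    new_policy = improve(v)
    if new_policy == policy:
        break          # policy stable, so it is optimal (to within THETA)
    policy = new_policy

print(policy)          # prints: ['R', 'B']
print([round(x, 3) for x in v])
```

If my reading of the dynamics is right, this stops after the second improvement step with policy $[R, B]$ and values around $[-1.98, -2.81, 0]$ (the exact figures depend on $\theta$).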

Comments:

- "OK, so the trouble is that any answer here will necessarily be an explanation. Could you maybe link an explanation that you are struggling with, and show where it is that you lose understanding? Also, what form is your policy iteration taking? A manual work-through of the process, or some code?" – Neil Slater, 2018-06-05
- "Another question. In your diagram, are the rewards for arriving in, or leaving, the state that you have labelled? E.g. does the label $S_0, -2$ mean that the agent receives a -2 reward for arriving in $S_0$ or for leaving it? Both are valid MDPs, but the value functions will be different." – Neil Slater, 2018-06-05
- "It's the reward for arriving. For your first question, it's a manual work-through." – The_C, 2018-06-06