## Using Policy Iteration on an automaton


I've read many explanations of how to do policy iteration, but I can't find a worked example, so I'm stuck right now trying to figure out policy iteration.

The numbers next to each state show the reward received for arriving in that state. For example, if the agent started in $S_0$ and took the $Blue$ action, and ended up in state $S_1$, the immediate reward would be $-5$.

The discount value is $0.1$, and the initial policy is $\pi(S_0)=Blue$ and $\pi(S_1)=Red$.

The $S_2$ state is the terminal state (game over). The two possible actions are Blue and Red, as can be seen in the image.

I just need something to help me get started, because no explanation has really made me understand how to run policy iteration through to convergence.

OK, so the trouble is that any answer here will necessarily be an explanation. Could you maybe link an explanation that you are struggling with, and show where it is that you lose understanding? Also, what form is your policy iteration taking? A manual work-through of the process, or some code? – Neil Slater – 2018-06-05T15:09:14.277

Another question. In your diagram, are the rewards for arriving in, or leaving the state that you have labelled? E.g. does the label $S_0, -2$ mean that the agent receives a -2 reward for arriving in $S_0$ or for leaving it? Both are valid MDPs, but the value functions will be different. – Neil Slater – 2018-06-05T15:14:25.767

It's the reward for arriving. For your first question, it's a manual work-through. – The_C – 2018-06-06T15:14:53.687


Policy iteration is essentially a two-step process:

1. Evaluate the current policy by calculating $v(s)$ for every non-terminal state.
• Internally this requires multiple loops over all the states until the value calculations are accurate enough
2. Improve the policy by choosing the best action $\pi(s)$ for every non-terminal state

The overall process terminates when it is not possible to improve the policy in the second part. When that happens, then - from the Bellman equations for the optimal value function - the value function and policy must be optimal.
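As a rough sketch, that two-step loop might look like this in Python. The transition probabilities and rewards are read off the diagram in your question; the names `backup`, `evaluate` and `improve` are just my own labels, not anything standard:

```python
GAMMA, THETA = 0.1, 0.01
STATES = ['S0', 'S1']                 # non-terminal states; v(S2) stays 0

# Transition model read off the diagram:
# P[(state, action)] -> list of (probability, next_state, reward) triples,
# where the reward is for *arriving* in next_state.
P = {
    ('S0', 'B'): [(0.5, 'S1', -5), (0.5, 'S0', -2)],
    ('S0', 'R'): [(0.9, 'S0', -2), (0.1, 'S2',  0)],
    ('S1', 'B'): [(0.8, 'S0', -2), (0.2, 'S1', -5)],
    ('S1', 'R'): [(0.6, 'S1', -5), (0.4, 'S2',  0)],
}

def backup(s, a, v):
    """Expected return of taking action a in state s, given values v."""
    return sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[(s, a)])

def evaluate(pi, v):
    """Step 1: sweep the states in place until the largest change < THETA."""
    while True:
        delta = 0.0
        for s in STATES:
            old = v[s]
            v[s] = backup(s, pi[s], v)
            delta = max(delta, abs(v[s] - old))
        if delta < THETA:
            return v

def improve(v):
    """Step 2: pick the greedy action in each non-terminal state."""
    return {s: max('BR', key=lambda a: backup(s, a, v)) for s in STATES}

pi = {'S0': 'B', 'S1': 'R'}
v = {'S0': 0.0, 'S1': 0.0, 'S2': 0.0}
while True:
    v = evaluate(pi, v)
    new_pi = improve(v)
    if new_pi == pi:                  # no improvement possible => optimal
        break
    pi = new_pi
print(pi)                             # {'S0': 'R', 'S1': 'B'}
```

Note that the outer loop stops only when `improve` returns the same policy it was given, which is exactly the termination condition described above.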

In detail for your example, it might work like this:

• Initialise our value function for $[S_0, S_1, S_2]$ as $[0,0,0]$. Note that by definition $v(S_2)=0$ for the terminal state, so we never need to recalculate it (this is also consistent with the diagram, which shows the "absorbing" actions, so we could calculate it if we wanted, but it would be a waste of time).

• Choose a value for the accuracy limit $\theta$. To avoid going on forever in the example, let's pick a high value, say $0.01$. However, if this were computer code, I might set it lower, e.g. $10^{-6}$.

• Set our discount factor, $\gamma = 0.1$. Note this is quite low, heavily emphasising immediate rewards over longer-term ones.

• Set our current policy $\pi = [B, R]$. We're ready to start.

• First we have to do policy evaluation, which is multiple passes through all the states, updating $v(s) = \sum_{s',r} p(s',r|s,\pi(s))(r + \gamma v(s'))$. There are other ways of writing this formula, all basically equivalent; the one here is from Sutton & Barto, 2nd edition. We could also do a "pure" iteration, calculating $v_{k+1}(s)$ from $v_{k}(s)$, but I will update the same $v$ array in place, because that is easier and usually faster too.

• Note each sum here is over the two possible $(s', r)$ results from each action.

• Pass 1 in detail, iterating through states $[S_0, S_1]$:

• Set max delta ($\Delta = 0$) to track our accuracy
• $v(S_0) = \sum_{s',r} p(s',r|S_0,B)(r + \gamma v(s')) = 0.5(-5 + 0.1 v(S_1)) + 0.5 (-2 + 0.1 v(S_0)) = -3.5$
• This sets max delta to 3.5
• $v(S_1) = \sum_{s',r} p(s',r|S_1,R)(r + \gamma v(s')) = 0.6(-5 + 0.1 v(S_1)) + 0.4 (0 + 0.1 v(S_2)) = -3$
• Max delta is still 3.5
• The $v$ table is now $[-3.5, -3, 0]$
• Pass 2, because $\Delta \gt 0.01$:
• Reset $\Delta = 0$
• $v(S_0) = 0.5(-5 + 0.1 v(S_1)) + 0.5 (-2 + 0.1 v(S_0)) = -3.825$
• $\Delta = 0.325$
• $v(S_1) = 0.6(-5 + 0.1 v(S_1)) + 0.4 (0 + 0.1 v(S_2)) = -3.180$
• $\Delta = 0.325 \gt 0.01$
• The $v$ table is now $[-3.825, -3.180, 0]$
• Pass 3, $v = [-3.850, -3.191, 0]$
• Pass 4, $v = [-3.852, -3.191, 0]$. This is a small enough change ($\Delta \approx 0.002 \lt 0.01$) that we'll say evaluation has converged.
• Now we need to check whether the policy should be changed. For each non-terminal state we need to work through all possible actions and find the action with the highest expected return, $\pi(s) = \text{argmax}_a \sum_{s',r} p(s',r|s,a)(r + \gamma v(s'))$
• For $S_0$:
• Action B scores $0.5(-5 + 0.1 v(S_1)) + 0.5 (-2 + 0.1 v(S_0)) = -3.852$ (technically, we already know that because it is the current policy's action)
• Action R scores $0.9(-2 + 0.1 v(S_0)) + 0.1 (0 + 0.1 v(S_2)) = -2.147$
• R is the best action here, so change $\pi(S_0) = R$
• For $S_1$:
• Action B scores $0.8(-2 + 0.1 v(S_0)) + 0.2 (-5 + 0.1 v(S_1)) = -2.972$
• Action R scores -3.191
• B is the best action here, so change $\pi(S_1) = B$
• Our policy has changed, so we go around both major stages again, starting with policy evaluation and the current $v = [-3.852, -3.191, 0]$, using the new policy $[R, B]$
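If you want to check the arithmetic, a few lines of Python reproduce the pass-by-pass numbers above (the transition model is read off the diagram; values rounded to 3 d.p. to match the hand calculation):

```python
GAMMA = 0.1

# Transition model read off the diagram:
# P[(state, action)] -> list of (probability, next_state, reward) triples,
# where the reward is for *arriving* in next_state.
P = {
    ('S0', 'B'): [(0.5, 'S1', -5), (0.5, 'S0', -2)],
    ('S0', 'R'): [(0.9, 'S0', -2), (0.1, 'S2',  0)],
    ('S1', 'B'): [(0.8, 'S0', -2), (0.2, 'S1', -5)],
    ('S1', 'R'): [(0.6, 'S1', -5), (0.4, 'S2',  0)],
}

def q(s, a, v):
    """One-step backup: expected return of action a in state s."""
    return sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[(s, a)])

pi = {'S0': 'B', 'S1': 'R'}                 # initial policy
v = {'S0': 0.0, 'S1': 0.0, 'S2': 0.0}

history = []
for sweep in range(4):                      # the four evaluation passes above
    for s in ('S0', 'S1'):                  # in-place updates, same order as the text
        v[s] = q(s, pi[s], v)
    history.append((round(v['S0'], 3), round(v['S1'], 3)))
print(history)   # [(-3.5, -3.0), (-3.825, -3.18), (-3.85, -3.191), (-3.852, -3.191)]

# Greedy improvement from the converged values:
new_pi = {s: max('BR', key=lambda a: q(s, a, v)) for s in ('S0', 'S1')}
print(new_pi)    # {'S0': 'R', 'S1': 'B'}
```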

I leave it up to you to do the second pass through evaluation and improvement. I suspect it will show that the new policy is optimal and will remain $[R, B]$ (although $[R,R]$ seems possible without doing the calculation). If the policy stays the same over a complete pass of evaluation and improvement, then you are done.
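In case you want to check your own numbers afterwards, here is a short script for that second round, reusing the same transition model from the diagram and carrying over $v = [-3.852, -3.191, 0]$. It suggests evaluation settles near $v \approx [-1.978, -2.815, 0]$ and that the greedy policy stays $[R, B]$, i.e. the policy is stable:

```python
GAMMA, THETA = 0.1, 0.01

# Same transition model as in the worked example above.
P = {
    ('S0', 'B'): [(0.5, 'S1', -5), (0.5, 'S0', -2)],
    ('S0', 'R'): [(0.9, 'S0', -2), (0.1, 'S2',  0)],
    ('S1', 'B'): [(0.8, 'S0', -2), (0.2, 'S1', -5)],
    ('S1', 'R'): [(0.6, 'S1', -5), (0.4, 'S2',  0)],
}

def q(s, a, v):
    """One-step backup: expected return of action a in state s."""
    return sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[(s, a)])

pi = {'S0': 'R', 'S1': 'B'}                  # improved policy from round 1
v = {'S0': -3.852, 'S1': -3.191, 'S2': 0.0}  # values carried over

while True:                                  # in-place evaluation sweeps
    delta = 0.0
    for s in ('S0', 'S1'):
        old = v[s]
        v[s] = q(s, pi[s], v)
        delta = max(delta, abs(v[s] - old))
    if delta < THETA:
        break

# Greedy check: does any action beat the current policy?
new_pi = {s: max('BR', key=lambda a: q(s, a, v)) for s in ('S0', 'S1')}
print(round(v['S0'], 3), round(v['S1'], 3), new_pi == pi)
# -1.978 -2.815 True  => the policy is unchanged, so [R, B] is optimal
```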