Reinforcement Learning - How are these state values in MRP calculated?


This question is about Example 6.2 on page 125 of the book Reinforcement Learning: An Introduction (Sutton and Barto).

The example compares the prediction abilities of TD(0) and constant-$\alpha$ MC when applied to the Markov reward process below (the image is copied from the book):

[Image: the random-walk MRP with non-terminal states A, B, C, D, E in a row and a terminal state at each end]

In the above MRP, all episodes start in state C and then move either left or right by one state on each step, with equal probability.

Episodes terminate either on the extreme left or the extreme right. When an episode terminates on the right, a reward of +1 occurs; all other rewards are zero. For example, a typical episode might consist of the following state-and-reward sequence: C, 0, B, 0, C, 0, D, 0, E, 1. Because this task is undiscounted, the true value of each state is the probability of terminating on the right if starting from that state.

The book gives V(A)=1/6, V(B)=2/6, V(C)=3/6, V(D)=4/6 and V(E)=5/6. Could you please help me understand how these values were calculated? Thanks.

Melanie A

Posted 2018-11-08T05:18:25.990

Reputation: 37



As it is such a small MRP, it is possible to solve it quickly and analytically using simultaneous equations based on the Bellman equation:

$$v(s) = \sum_{r,s'} p(r,s'|s)(r + v(s'))$$

and substituting in each state in turn. Using the variables $a,b,c,d,e$ to represent $v(A), v(B), v(C), v(D), v(E)$ makes this easier to read:

  • $a = \frac{1}{2}(0 + b)$

  • $b = \frac{1}{2}(a+c)$

  • $c = \frac{1}{2}(b+d)$

  • $d = \frac{1}{2}(c+e)$

  • $e = \frac{1}{2}(d+1)$
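As a cross-check, the five equations above can also be handed to a linear solver. This is only a sketch, assuming NumPy is available; the matrix `A`, the vector `rhs` and the state ordering are my own rearrangement of the equations, not something taken from the book.

```python
import numpy as np

# Unknowns ordered as [a, b, c, d, e] = [v(A), v(B), v(C), v(D), v(E)].
# Each equation is moved into the form A @ v = rhs, for example:
#   a = (0 + b) / 2   ->   a - b/2 = 0
#   e = (d + 1) / 2   ->   e - d/2 = 1/2
A = np.array([
    [ 1.0, -0.5,  0.0,  0.0,  0.0],
    [-0.5,  1.0, -0.5,  0.0,  0.0],
    [ 0.0, -0.5,  1.0, -0.5,  0.0],
    [ 0.0,  0.0, -0.5,  1.0, -0.5],
    [ 0.0,  0.0,  0.0, -0.5,  1.0],
])
rhs = np.array([0.0, 0.0, 0.0, 0.0, 0.5])

# Solve the 5x5 system; values[i] is the value of the i-th state.
values = np.linalg.solve(A, rhs)
for name, value in zip("ABCDE", values):
    print(f"v({name}) = {value:.4f}")
```

The printed values should agree with 1/6, 2/6, 3/6, 4/6 and 5/6 up to floating-point rounding.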

Solving by hand takes several substitutions before any individual value is resolved. One way is to express each variable as a multiple of $a$, taking the equations in order:

$a = \frac{1}{2}(0 + b) \therefore b = 2a$

$b = \frac{1}{2}(a+c) \rightarrow 2a = \frac{1}{2}(a + c) \therefore c = 3a$

$c = \frac{1}{2}(b+d) \rightarrow 3a = \frac{1}{2}(2a + d) \therefore d = 4a$

$d = \frac{1}{2}(c+e) \rightarrow 4a = \frac{1}{2}(3a + e) \therefore e = 5a$

$e = \frac{1}{2}(d+1) \rightarrow 5a = \frac{1}{2}(4a+1) \therefore a = \frac{1}{6}$

That last equation resolves everything: $a = \frac{1}{6}$, so $b = \frac{2}{6}$, $c = \frac{3}{6}$, $d = \frac{4}{6}$ and $e = \frac{5}{6}$, which is the expected answer. You can also see a very strong pattern here, which should intuitively hold for any length of one-dimensional random walk with equal $p=0.5$. There is a proof for that more general case that would avoid the need to work through the substitutions in detail for each length, but it is more complex.
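The pattern suggests that a walk with $n$ non-terminal states has values $\frac{i}{n+1}$ for the $i$-th state from the left. That is a conjecture read off the substitutions, not a proof, but it is easy to check exactly for particular lengths. A minimal sketch, where `candidate_values` and `satisfies_bellman` are hypothetical helper names:

```python
from fractions import Fraction

def candidate_values(n):
    """Conjectured values i/(n+1) for positions 0..n+1 of an n-state walk."""
    # Position 0 is the left terminal (value 0). Position n+1 stands in
    # for the +1 reward received when exiting on the right.
    return [Fraction(i, n + 1) for i in range(n + 2)]

def satisfies_bellman(n):
    """True if every non-terminal value is the mean of its two neighbours."""
    v = candidate_values(n)
    return all(v[i] == (v[i - 1] + v[i + 1]) / 2 for i in range(1, n + 1))

# Exact rational arithmetic, so there is no rounding to worry about.
for n in (1, 5, 19):
    print(n, satisfies_bellman(n))
```

Using `Fraction` keeps the comparison exact, so a `True` here means the candidate values solve the equations for that length rather than merely approximating a solution.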

One other simple intuition: each state value is the mean of the two values either side of it, without exception (counting the left terminal as 0 and the right exit as 1). That means any three adjacent values must be collinear, with the middle value being the mean of the lower and higher ones. Because this holds for every overlapping triple, plotting the values in order places all the points on the same line. That in turn means there is a simple linear relationship between position in the random walk and the state value.
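Since each true value is the probability of terminating on the right, the linear relationship can also be checked empirically by simulating the walk. The following is a rough sketch, not the book's experiment; the function name `estimate_values`, the episode count and the seed are arbitrary choices of mine.

```python
import random

def estimate_values(n_states=5, episodes=20000, seed=0):
    """Estimate P(exit right) from each start state by plain simulation."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    estimates = []
    for start in range(1, n_states + 1):
        right_exits = 0
        for _ in range(episodes):
            pos = start
            # Positions 0 and n_states + 1 are the two terminals.
            while 0 < pos < n_states + 1:
                pos += 1 if rng.random() < 0.5 else -1
            if pos == n_states + 1:
                right_exits += 1
        estimates.append(right_exits / episodes)
    return estimates

estimates = estimate_values()
for name, est, i in zip("ABCDE", estimates, range(1, 6)):
    print(f"v({name}): simulated {est:.3f}, linear prediction {i / 6:.3f}")
```

The simulated frequencies should sit close to the straight line $\frac{i}{6}$, with the gap shrinking as the number of episodes grows.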

Neil Slater

Posted 2018-11-08T05:18:25.990

Reputation: 24 613