Reward dependent on (state, action) versus (state, action, successor state)



I am studying reinforcement learning and I am working methodically through Sutton and Barto's book plus David Silver's lectures.

I have noticed a minor difference in how Markov Decision Processes (MDPs) are defined in those two sources, which affects the formulation of the Bellman equations, and I wonder about the reasoning behind the difference and when I might choose one formulation over the other.

In Sutton and Barto, the expected reward function is written $R^a_{ss'}$, whilst in David Silver's lectures it is written $R^a_{s}$. In turn this leads to slightly different formulations of all the Bellman equations. For instance, in Sutton and Barto, the equation for policy evaluation is given by:

\begin{align} v_{\pi}(s) = \sum_a \pi(a|s) \sum_{s'} P_{ss'}^a(R_{ss'}^a + \gamma v_{\pi}(s')) \end{align}

Whilst David Silver's lectures show:

\begin{align} v_{\pi}(s) = \sum_a \pi(a|s) \left(R_{s}^a + \gamma \sum_{s'} P_{ss'}^a v_{\pi}(s') \right) \end{align}

In both cases:

  • $\pi(a|s)$ is the policy function - the probability of choosing action $a$ given state $s$.
  • $\gamma$ is the discount factor.
  • $P_{ss'}^a$ is the transition function - the probability of the state changing to $s'$ given $s$ and $a$.

I understand that $R_{ss'}^a$ and $R_{s}^a$ are related (via $P_{ss'}^a$), so that these two sources are explaining the exact same thing. Note that the first equation can also be written as

\begin{align} v_{\pi}(s) &= \sum_a \pi(a|s) \sum_{s'} (P_{ss'}^aR_{ss'}^a + \gamma P_{ss'}^av_{\pi}(s'))\\ &= \sum_a \pi(a|s) \left( \sum_{s'} P_{ss'}^aR_{ss'}^a + \sum_{s'} \gamma P_{ss'}^av_{\pi}(s') \right) \\ &= \sum_a \pi(a|s) \left( \sum_{s'} P_{ss'}^aR_{ss'}^a + \gamma \sum_{s'} P_{ss'}^av_{\pi}(s') \right) \end{align}

Hence, it must be true that $R_{s}^a = \sum_{s'} P_{ss'}^a R_{ss'}^a$.
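This identity is easy to check numerically. Below is a minimal sketch on a toy MDP with made-up random dynamics (all array names and dimensions are my own choices, not from either source): both Bellman backups are computed for the same arbitrary value estimate and compared.

```python
import numpy as np

# Toy MDP with hypothetical random dynamics, just to check the identity.
rng = np.random.default_rng(0)
nS, nA = 4, 2

P = rng.random((nA, nS, nS))
P /= P.sum(axis=2, keepdims=True)      # P[a, s, s'] sums to 1 over s'
R_sas = rng.random((nA, nS, nS))       # R_{ss'}^a
pi = rng.random((nS, nA))
pi /= pi.sum(axis=1, keepdims=True)    # pi(a|s)
gamma = 0.9
v = rng.random(nS)                     # an arbitrary value estimate

# Sutton & Barto form: sum_a pi(a|s) sum_{s'} P_{ss'}^a (R_{ss'}^a + gamma v(s'))
sb = np.einsum('sa,ast,ast->s', pi, P, R_sas + gamma * v)

# Silver form, using R_s^a = sum_{s'} P_{ss'}^a R_{ss'}^a
R_sa = np.einsum('ast,ast->as', P, R_sas)
silver = np.einsum('sa,as->s', pi, R_sa + gamma * np.einsum('ast,t->as', P, v))

assert np.allclose(sb, silver)         # the two backups are identical
```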

My question is whether there is any reason I should prefer to use one or the other notation?

I started with Sutton and Barto, and I find that notation more intuitive - the reward may depend on the successor state, and this is explicit in the equations. However, it looks like, in practice, the notation used in the video lectures leads to more efficient calculations (essentially $R_{s}^a = \sum_{s'} P_{ss'}^a R_{ss'}^a$ is cached if the formula is translated directly to code). Is that all there is to it?
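To illustrate the caching point, here is a minimal sketch of iterative policy evaluation (a hypothetical tabular implementation; the function name and array layout are my own): $R_s^a$ is precomputed once outside the loop, so the per-sweep cost involves only the $O(|S|^2|A|)$ term for propagating $v$, not a second sum over $s'$ for the reward.

```python
import numpy as np

def policy_evaluation(P, R_sas, pi, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation, caching R_s^a = sum_{s'} P_{ss'}^a R_{ss'}^a.

    P:     (A, S, S) transition probabilities P_{ss'}^a
    R_sas: (A, S, S) expected rewards R_{ss'}^a
    pi:    (S, A)    policy pi(a|s)
    """
    # Cache the (state, action) expected reward once, outside the loop.
    R_sa = np.einsum('ast,ast->as', P, R_sas)
    v = np.zeros(P.shape[1])
    while True:
        # Silver's form: v(s) = sum_a pi(a|s) (R_s^a + gamma sum_{s'} P_{ss'}^a v(s'))
        v_new = np.einsum('sa,as->s', pi,
                          R_sa + gamma * np.einsum('ast,t->as', P, v))
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```

The returned $v$ is a fixed point of the backup, so it also satisfies the Sutton and Barto form of the equation.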

Neil Slater

Posted 2017-03-25T08:39:13.420

Reputation: 24 613



Your intuition is correct. In the most general case (Sutton and Barto's definition), the model of the environment consists of the state transition distribution and the reward distribution. The latter is rarely considered explicitly, because the reward is often assigned by the modeller, depends only on the action selected from the current state, and is deterministic. As you mention, this simplifies the coding implementation considerably.
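To make that concrete, here is a minimal sketch (with made-up numbers) of such a modeller-assigned reward: a fixed table indexed only by (state, action), which slots directly into the $R_s^a$ term of the second Bellman equation with no marginalisation over successor states.

```python
import numpy as np

# Deterministic, designer-assigned reward: depends only on (s, a).
R_sa = np.array([[0.0, -1.0],
                 [1.0,  0.0],
                 [2.0, -0.5]])              # shape (S, A) = (3, 2)

rng = np.random.default_rng(0)
P = rng.random((2, 3, 3))
P /= P.sum(axis=2, keepdims=True)            # P[a, s, s']
pi = np.full((3, 2), 0.5)                    # uniform policy
gamma, v = 0.9, np.zeros(3)

# One backup in Silver's form: R_s^a is used as-is, no sum over s' needed
# for the reward term (only for propagating v).
v = np.einsum('sa,as->s', pi, R_sa.T + gamma * np.einsum('ast,t->as', P, v))
```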


Posted 2017-03-25T08:39:13.420

Reputation: 1 531