Understanding the notation in the definition of the expected reward



I am new to RL and am working through the book Reinforcement Learning: An Introduction (Sutton & Barto, 2018). In chapter 3 on Finite Markov Decision Processes, the authors write the expected reward as

$$r(s,a) = \mathbb{E}\left[R_t|S_{t-1}=s,A_{t-1}=a\right]=\sum_{r\in \mathcal{R}}r\sum_{s'\in \mathcal{S}}p(s',r|s,a)$$

I am not sure if the authors mean

$$r(s,a) = \mathbb{E}\left[R_t|S_{t-1}=s,A_{t-1}=a\right]=\sum_{r\in \mathcal{R}}\left[r\sum_{s'\in \mathcal{S}}p(s',r|s,a)\right]$$

or


$$r(s,a) = \mathbb{E}\left[R_t|S_{t-1}=s,A_{t-1}=a\right]=\left[\sum_{r\in \mathcal{R}}r\right]\cdot\left[\sum_{s'\in \mathcal{S}}p(s',r|s,a)\right].$$

If the authors mean the first, is there any reason why it is not written as follows?

$$r(s,a) = \mathbb{E}\left[R_t|S_{t-1}=s,A_{t-1}=a\right]=\sum_{r\in \mathcal{R}}\sum_{s'\in \mathcal{S}}\left[r\,p(s',r|s,a)\right]$$


Posted 2018-10-18T10:38:39.577




Your first option is correct:

$$r(s,a) = \mathbb{E}\left[R_t|S_{t-1}=s,A_{t-1}=a\right]=\sum_{r\in \mathcal{R}}\left[r\sum_{s'\in \mathcal{S}}p(s',r|s,a)\right]$$

It's partly a matter of taste, but I prefer not moving the $r$ into the double sum, because its value does not change in the "inner loop". There is a small amount of intuition to be had that way around, especially when it comes to implementation: it is one multiplication after the inner sum, as opposed to many multiplications within it.
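As a quick sanity check, here is a minimal sketch (using a made-up tabular MDP, not anything from the book) showing that the two groupings compute the same number, with the "one multiplication after the sum" version written first:

```python
from itertools import product

# Hypothetical dynamics for a fixed (s, a):
# p[(s_next, r)] = p(s', r | s, a); probabilities sum to 1.
states = ["s0", "s1"]
rewards = [0.0, 1.0]
p = {
    ("s0", 0.0): 0.5,
    ("s0", 1.0): 0.1,
    ("s1", 0.0): 0.2,
    ("s1", 1.0): 0.2,
}

# As written in the book: for each r, sum over s' first,
# then multiply by r once.
r_sa_outer = sum(
    r * sum(p.get((s_next, r), 0.0) for s_next in states)
    for r in rewards
)

# Equivalent double sum: r multiplied inside the inner loop.
r_sa_double = sum(
    r * p.get((s_next, r), 0.0)
    for r, s_next in product(rewards, states)
)

print(r_sa_outer, r_sa_double)  # both print 0.3
```

Both loops give $r(s,a) = 0.3$ here, since only the transitions with reward $1$ contribute.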

There are a lot of nested sums in Sutton & Barto, and the book mainly follows the convention of not using parentheses or brackets to show explicitly that one sum contains the other.

In this case, the formulae help link to other treatments of RL, which work with the expected reward functions $r(s,a)$ or $r(s,a,s')$, or reward matrices $R_s^a$, $R_{ss'}^a$, as in the first edition of Sutton & Barto's book. The second edition uses $p(s', r|s, a)$ almost everywhere, though, and you won't see $r(s,a)$ mentioned much again. So it's not worth getting too concerned about how it is presented or what the authors might be saying with the presentation.

Generally you don't need to know the reward distribution, just its expectation (and how that depends on $s, a, s'$), in order to derive and explain most of the results in RL. So using $r(s,a)$ and similar functions is fine in places like the Bellman equations. However, $p(s', r|s, a)$ is fully general and avoids introducing further functions to describe the MDP.

Neil Slater
