What is the difference between return and expected return?

3

At a time step $t$, for a state $S_{t}$, the return is defined as the discounted cumulative reward from that time step $t$.

If an agent is following a policy (which in itself is a probability distribution of choosing a next state $S_{t+1}$ from $S_{t}$), the agent wants to find the value at $S_{t}$ by calculating a sort of "weighted average" of all the returns from $S_{t}$. This is called the expected return.

Is my understanding correct?

digi philos

Posted 2019-06-30T15:12:14.500

Reputation: 31

Answers

0

Formally, the return (also known as the cumulative future discounted reward) can be defined as

$$ G_t = \sum_{k=0}^\infty \gamma^k R_{t+k+1}, $$

where $0 \leq \gamma \leq 1$ is the discount factor and $R_{i}$ is the reward at time step $i$. Here $G_t$ and $R_i$ are considered random variables (random variables are usually denoted with capital letters, which is why I am following the notation of the book Reinforcement Learning: An Introduction, 2nd edition).
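For concreteness, here is a minimal sketch (not part of the definition itself) of how you could compute such a return for a finite episode, assuming `rewards[k]` plays the role of $R_{t+k+1}$ and `gamma` is the discount factor $\gamma$:

```python
# Minimal sketch: computing the (discounted) return G_t for a finite episode.
# rewards[k] plays the role of R_{t+k+1}; gamma is the discount factor.
def discounted_return(rewards, gamma):
    g = 0.0
    # Iterate backwards so that G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62
```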

The expected return is defined as

\begin{align} v^\pi(s) &= \mathbb{E}\left[G_t \mid S_t = s \right] \\ &= \mathbb{E}\left[\sum_{k=0}^\infty \gamma^k R_{t+k+1} \bigm\vert S_t = s \right] \end{align}

In other words, the value of a state $s$ (associated with a policy $\pi$) is equal to the expectation of the return $G_t$ given that $S_t = s$, so $v^\pi(s)$ is defined as a conditional expectation. Note that an expected value is defined with respect to a random variable, which here is the return $G_t$. Note also that $S_t$ is a random variable, while $s$ is a realization (a specific value) of that random variable.

A policy is not a probability distribution over the next state. A stochastic policy is a family of conditional probability distributions over actions given states, i.e. $\pi(a \mid s)$. There are also deterministic policies, which map each state to a single action. Have a look at the question What is the difference between a stochastic and a deterministic policy? for more details about the definitions of stochastic and deterministic policies.
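To make the distinction concrete, here is a small illustrative sketch (the state and action names are made up for this example): a stochastic policy is represented as a distribution $\pi(a \mid s)$ over actions for each state, while a deterministic policy simply maps each state to one action.

```python
import random

# Illustrative sketch: a stochastic policy as a conditional distribution over
# actions given a state, vs. a deterministic policy that maps states to actions.
# The states ("s0", "s1") and actions ("left", "right") are made-up examples.
stochastic_policy = {
    "s0": {"left": 0.3, "right": 0.7},   # pi(a | s0)
    "s1": {"left": 0.9, "right": 0.1},   # pi(a | s1)
}

def sample_action(policy, state):
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs)[0]

deterministic_policy = {"s0": "right", "s1": "left"}  # one fixed action per state

print(sample_action(stochastic_policy, "s0"))  # "left" or "right", at random
print(deterministic_policy["s0"])              # always "right"
```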

Regarding your statement: "If an agent is following a policy, the agent wants to find the value at $S_{t}$ by calculating a sort of 'weighted average' of all the returns from $S_{t}$. This is called the expected return."

In the case of Monte Carlo prediction, the value of a state under a specific policy, that is, the expected value of the return given that state, is approximated with a finite (weighted) average of sampled returns. See e.g. What is the difference between First-Visit Monte-Carlo and Every-Visit Monte-Carlo Policy Evaluation?. Furthermore, note that the expectation of a discrete random variable is defined as a weighted average. The return, however, is not necessarily a discrete random variable; in general, it can be a continuous one.
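As a rough illustration of this finite-average approximation, here is a sketch in which `sample_return` is a hypothetical function that runs the policy from state $s$ for one episode and returns the observed return $G_t$; the toy lambda at the end stands in for a real environment and policy.

```python
import random

# Sketch of the finite-average approximation used in Monte Carlo prediction:
# v^pi(s) is approximated by the mean of sampled returns G_t observed from s.
def mc_value_estimate(sample_return, s, num_episodes=1000):
    # sample_return(s) is a hypothetical function: one episode, one return G_t.
    total = 0.0
    for _ in range(num_episodes):
        total += sample_return(s)      # one sampled return from state s
    return total / num_episodes        # empirical average approximates E[G_t | S_t = s]

# Toy stand-in for an environment + policy: returns are noisy around 2.62.
print(mc_value_estimate(lambda s: 2.62 + random.gauss(0.0, 0.5), "s0"))
```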

nbro

Posted 2019-06-30T15:12:14.500

Reputation: 19 783

It would help the OP, I think, if you more carefully differentiated and explained the difference between random variables (usually denoted with capital letters, e.g. $R_t$) and data/observations (usually denoted with lower-case letters, e.g. $r_t$). The notation you are currently using in this answer is very loose in that regard – Neil Slater – 2019-06-30T16:33:06.557

0

You're correct: the return is the discounted future reward from a single iteration (episode), while the expected return is the average over many such iterations.

lzl

Posted 2019-06-30T15:12:14.500

Reputation: 101