
The first-visit and every-visit Monte-Carlo (MC) algorithms are both used to solve the **prediction problem** (also called the "evaluation problem"), that is, the problem of estimating the value function associated with a given fixed policy $\pi$ (given as input to the algorithms, and unchanged during their execution). In general, even if we are given the policy $\pi$, we are not necessarily able to compute the exact corresponding value function analytically, so these two algorithms are used to estimate it from sampled experience.

Intuitively, we care about the value function associated with $\pi$ because we might want or need to know "how good it is to be in a certain state", if the agent behaves in the environment according to the policy $\pi$.

For simplicity, assume that the value function is the state value function (but it could also be e.g. the state-action value function), denoted by $v_\pi(s)$, where $v_\pi(s)$ is the *expected return* (or, in other words, *expected cumulative future discounted reward*), starting from state $s$ (at some time step $t$) and then following (after time step $t$) the given policy $\pi$. Formally, $v_\pi(s) = \mathbb{E}_\pi [ G_t \mid S_t = s ]$, where $G_t = \sum_{k=0}^\infty \gamma^k R_{t+k+1}$ is the **return** (after time step $t$).

In the case of MC algorithms, $G_t$ is often defined as $\sum_{k=0}^{T-t-1} R_{t+k+1} = R_{t+1} + R_{t+2} + \dots + R_T$, where $T \in \mathbb{N}^+$ is the last time step of the episode, that is, the sum goes up to the final time step of the episode, $T$. This is because MC algorithms, in this context, assume that the problem can be naturally split into **episodes**, and that each episode proceeds in a discrete number of **time steps** (from $t=0$ to $t=T$).
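To make the definition of the return concrete, here is a small sketch that computes $G_t$ for a hypothetical episode (the reward values are made up for illustration):

```python
# Hypothetical episode rewards R_1, ..., R_T (here T = 4), made up for illustration.
rewards = [0.0, 0.0, 1.0, 5.0]

def episode_return(rewards, t, gamma=1.0):
    """Compute G_t = sum_{k=0}^{T-t-1} gamma^k * R_{t+k+1} for one episode.

    With gamma = 1 this is the undiscounted return used in the episodic
    MC setting described above.
    """
    return sum(gamma**k * r for k, r in enumerate(rewards[t:]))

print(episode_return(rewards, 0))       # undiscounted G_0 = 0 + 0 + 1 + 5 = 6.0
print(episode_return(rewards, 0, 0.5))  # discounted G_0 = 0.25 * 1 + 0.125 * 5 = 0.875
```

Note that the return of a later time step simply drops the earlier rewards: `episode_return(rewards, 2)` sums only $R_3$ and $R_4$.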

As defined here, the return, in the case of MC algorithms, is associated with a single episode (that is, it is the return observed in one episode). The actual return can, of course, differ from one episode to another, but, for simplicity, we assume that the expected return (of each state) is the same for all episodes (that is, the environment is stationary).

To recapitulate, the first-visit and every-visit MC (prediction) algorithms are used to estimate $v_\pi(s)$, for all states $s \in \mathcal{S}$. To do that, at every episode, these two algorithms use $\pi$ to behave in the environment, in order to obtain some knowledge of the environment in the form of sequences of states, actions and rewards. This knowledge is then used to estimate $v_\pi(s)$. *How is this knowledge used in order to estimate $v_\pi$?* Let us have a look at the pseudocode of these two algorithms.

$N(s)$ is a "counter" variable that counts the number of times we visit state $s$ throughout the entire algorithm (i.e. from episode one to $num\_episodes$). $\text{Returns}(s)$ is a list of returns for state $s$ (undiscounted in the simplest case).

I think it is more useful for you to read the pseudocode (which should be easily translatable to actual code) and understand what it does than for me to explain it in words. Anyway, the basic idea (of both algorithms) is to generate trajectories (of states, actions and rewards) at each episode, keep track of the returns (for each state) and the number of visits (of each state), and then, at the end of all episodes, average these returns (for each state). This average of returns should be an approximation of the expected return (which is what we wanted to estimate).
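In case the pseudocode is not visible, here is a minimal Python sketch of both algorithms. It assumes a hypothetical `generate_episode` function (not part of the original answer) that follows $\pi$ in the environment and returns a trajectory as a list of `(state, reward)` pairs, i.e. $[(S_0, R_1), (S_1, R_2), \dots, (S_{T-1}, R_T)]$:

```python
from collections import defaultdict

def mc_prediction(generate_episode, num_episodes, gamma=1.0, first_visit=True):
    """Sketch of first-visit (first_visit=True) and every-visit MC prediction.

    `generate_episode` is assumed to produce one trajectory
    [(S_0, R_1), (S_1, R_2), ..., (S_{T-1}, R_T)] obtained by following pi.
    """
    returns_sum = defaultdict(float)  # sum of recorded returns per state
    N = defaultdict(int)              # visit counter N(s) per state

    for _ in range(num_episodes):
        episode = generate_episode()
        states = [s for s, _ in episode]
        G = 0.0
        # Process the episode backwards to compute returns iteratively.
        for t in range(len(episode) - 1, -1, -1):
            S_t, R_next = episode[t]           # S_t and R_{t+1}
            G = gamma * G + R_next             # G_t from G_{t+1}
            # First-visit: record G only if S_t does not occur earlier
            # in this episode (this check is the difference between the
            # two algorithms, the part highlighted in red in the book).
            if first_visit and S_t in states[:t]:
                continue
            returns_sum[S_t] += G
            N[S_t] += 1

    # v_pi(s) is estimated as the average of the recorded returns for s.
    return {s: returns_sum[s] / N[s] for s in returns_sum}
```

For example, `mc_prediction(my_episode_generator, num_episodes=1000)` would return a dictionary mapping each visited state to its estimated value under $\pi$.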

The differences between the two algorithms are highlighted in $\color{red}{\text{red}}$. The part "*If state $S_t$ is not in the sequence $S_0, S_1, \dots, S_{t-1}$*" means that the associated block of code will be executed only if $S_t$ does not appear earlier in the same episode, that is, only on the *first* visit to $S_t$ within that episode.

Do not get confused by the fact that, within each episode, we proceed from time step $T-1$ to time step $t = 0$, that is, we process the "episode sequence" backwards. We are doing that only to more conveniently compute the returns (given that the returns can be *iteratively* computed backwards as $G \leftarrow \gamma G + R_{t+1}$, which reduces to $G \leftarrow G + R_{t+1}$ in the undiscounted case).
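The following small sketch (with made-up rewards) shows why the backward pass is convenient: starting from $G_T = 0$, each step reuses the previously computed return, and the result agrees with the direct definition $G_t = \sum_k \gamma^k R_{t+k+1}$:

```python
# Hypothetical episode rewards R_1, ..., R_T, made up for illustration.
rewards = [1.0, 0.0, 2.0, 4.0]
gamma = 0.9

# Backward pass: G_T = 0, then G_t = gamma * G_{t+1} + R_{t+1}.
G = 0.0
returns_backward = [0.0] * len(rewards)
for t in range(len(rewards) - 1, -1, -1):
    G = gamma * G + rewards[t]
    returns_backward[t] = G

# Direct (forward) definition, for comparison: G_t = sum_k gamma^k * R_{t+k+1}.
returns_direct = [sum(gamma**k * r for k, r in enumerate(rewards[t:]))
                  for t in range(len(rewards))]

# Both computations agree (up to floating-point error), but the backward
# pass touches each reward only once instead of once per time step.
assert all(abs(a - b) < 1e-12 for a, b in zip(returns_backward, returns_direct))
```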

So, intuitively, in the first-visit MC, we only update the $\text{Returns}(S_t)$ (that is, the list of returns for state $S_t$, that is, the state of the episode at time step $t$) the first time we encounter $S_t$ in that same episode (or trajectory). In the every-visit MC, we update the list of returns for the state $S_t$ every time we encounter $S_t$ in that same episode.
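This difference can be illustrated on a single hypothetical episode in which one state is visited twice. The sketch below records, for each state, the time steps whose returns would be appended to $\text{Returns}(s)$ under each variant:

```python
# States S_0, ..., S_3 of one hypothetical episode; "a" is visited twice.
episode_states = ["a", "b", "a", "c"]

# Time steps whose return is recorded for each state, per variant.
first_visit = {}
every_visit = {}
for t, s in enumerate(episode_states):
    every_visit.setdefault(s, []).append(t)  # every occurrence counts
    if s not in first_visit:                 # only the first occurrence counts
        first_visit[s] = [t]

print(first_visit)  # {'a': [0], 'b': [1], 'c': [3]}
print(every_visit)  # {'a': [0, 2], 'b': [1], 'c': [3]}
```

So first-visit MC would append only $G_0$ to $\text{Returns}(a)$ for this episode, while every-visit MC would append both $G_0$ and $G_2$.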

For more info regarding these two algorithms (for example, their convergence properties), have a look at section 5.1 (on page 92) of the book "Reinforcement Learning: An Introduction" (2nd edition), by Richard S. Sutton and Andrew G. Barto.


For anyone coming across this question who wants a very intuitive understanding of first-visit and every-visit Monte Carlo, look at the answer given in the link provided here.

After looking at that intuition, you can come back and look at nbro's answer provided above.

Hope this helps anyone struggling with this idea.

Hi. Rather than just providing a link to an external source, can you formulate in your own words that intuition here? Please, edit your post to do so. Btw, have a look at https://ai.stackexchange.com/help/on-topic.

– nbro – 2020-06-28T11:20:14.340
Comments are not for extended discussion; this conversation has been moved to chat.

– nbro – 2020-03-06T03:46:42.310