# What is reinforcement learning?

In reinforcement learning (RL), you typically imagine that there's an agent that interacts, in time steps, with an environment by taking actions. On each time step $t$, the agent takes the action $a_t \in \mathcal{A}$ in the state $s_t \in \mathcal{S}$, receives a reward (or reinforcement) signal $r_t \in \mathbb{R}$ from the environment, and the agent and the environment then move to the next state $s_{t+1} \in \mathcal{S}$. Here, $\mathcal{A}$ is the action space and $\mathcal{S}$ is the state space of the environment, which is typically assumed to be a Markov decision process (MDP).
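This interaction loop can be sketched in a few lines of Python. The two-state environment below is entirely made up for illustration (it is not a standard benchmark): state `0` is a "start" state, state `1` is a "goal" state, and action `1` moves toward the goal.

```python
# A toy, hypothetical two-state environment (for illustration only).
def step(state, action):
    """Return (next_state, reward) for taking `action` in `state`."""
    if state == 0 and action == 1:
        return 1, 1.0   # reached the goal state, reward 1
    return 0, 0.0       # otherwise, back to the start state, reward 0

# The interaction loop described above: at each time step t, the agent
# picks an action a_t, receives a reward r_t, and the environment
# moves to the next state s_{t+1}.
def run_episode(policy, horizon=5):
    state, trajectory = 0, []
    for t in range(horizon):
        action = policy(state)
        next_state, reward = step(state, action)
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory

trajectory = run_episode(lambda s: 1)  # an agent that always takes action 1
```

Each element of `trajectory` is one $(s_t, a_t, r_t)$ tuple of the interaction.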

# What is the goal in RL?

The goal is to find a policy that maximizes the **expected return**
(i.e. the expected sum of future rewards, starting from the current time step). The policy that maximizes the expected return is called the **optimal policy**.

## Policies

A policy is a function that maps states to actions. Intuitively, the policy is the strategy that implements the behavior of the RL agent while interacting with the environment.

A policy can be deterministic or stochastic. A deterministic policy can be denoted as $\pi : \mathcal{S} \rightarrow \mathcal{A}$. So, a deterministic policy maps a state $s$ to an action $a$ with probability $1$. A stochastic policy maps states to a probability distribution over actions. A stochastic policy can thus be denoted as $\pi(a \mid s)$ to indicate that it is a conditional probability distribution of an action $a$ given that the agent is in the state $s$.
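The distinction is easy to make concrete in code. In this sketch (the state/action spaces and probabilities are made-up examples), a deterministic policy is a plain mapping from states to actions, while a stochastic policy assigns each state a probability distribution over actions:

```python
import random

# A deterministic policy pi : S -> A, here as a dict over a tiny
# hypothetical state space {0, 1} and action space {0, 1}.
deterministic_policy = {0: 1, 1: 0}

# A stochastic policy pi(a | s): for each state, a probability
# distribution over the two actions (illustrative numbers).
stochastic_policy = {0: [0.2, 0.8], 1: [0.5, 0.5]}

def sample_action(policy, state):
    """Sample an action from the distribution pi(. | s)."""
    probs = policy[state]
    return random.choices([0, 1], weights=probs)[0]
```

Calling `sample_action(stochastic_policy, 0)` repeatedly returns action `1` about 80% of the time, while `deterministic_policy[0]` always returns the same action.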

## Expected return

The expected return can be formally written as

$$\mathbb{E}\left[ G_t \right] = \mathbb{E}\left[ \sum_{i=t+1}^\infty R_i \right]$$

where $t$ is the current time step (so we don't care about the past), $R_i$ is a random variable that represents the reward at time step $i$, and $G_t = \sum_{i=t+1}^\infty R_i$ is the so-called *return* (i.e. the sum of future rewards, in this case, received after time step $t$), which is also a random variable.
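A single episode gives one *realization* of the return. As a sketch (with a finite horizon instead of the infinite sum, and made-up reward values), computing that realization is just a partial sum:

```python
# One realization of the return G_t: the sum of the rewards received
# after time step t (undiscounted, finite horizon).
def empirical_return(rewards, t):
    """sum_{i=t+1} r_i for a finite list of observed rewards."""
    return sum(rewards[t + 1:])

rewards = [0.0, 1.0, 0.0, 2.0, 1.0]  # r_0 .. r_4 from one episode
g0 = empirical_return(rewards, 0)    # 1.0 + 0.0 + 2.0 + 1.0 = 4.0
```

The expected return $\mathbb{E}[G_t]$ would then be the average of such realizations over many episodes.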

## Reward function

In this context, the most important job of the human programmer is to define a function $\mathcal{R}: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$, the reward function, which provides the *reinforcement* (or reward) signal to the RL agent while interacting with the environment. $\mathcal{R}$ will deterministically or stochastically determine the reward that the agent receives every time it takes action $a$ in the state $s$. The reward function $\mathcal{R}$ is also part of the environment (i.e. the MDP).

Note that $\mathcal{R}$, the reward function, is different from $R_i$, which is a random variable that represents the reward at time step $i$. However, clearly, the two are very related. In fact, the reward function will determine the actual *realizations* of the random variables $R_i$ and thus of the return $G_t$.
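The deterministic/stochastic distinction for $\mathcal{R}$ can also be sketched directly. The state/action pairs and reward values below are hypothetical, chosen only to make the signature $\mathcal{R}: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ concrete:

```python
import random

# A deterministic reward function R(s, a): the same (s, a) pair
# always produces the same reward.
def reward_deterministic(state, action):
    return 1.0 if (state, action) == (0, 1) else 0.0

# A stochastic reward function: for (0, 1) the reward is now a random
# variable, so repeated calls give different realizations of R_i.
def reward_stochastic(state, action):
    if (state, action) == (0, 1):
        return random.gauss(1.0, 0.1)  # mean 1, with Gaussian noise
    return 0.0
```

In the stochastic case, $\mathcal{R}$ defines the *distribution* of each $R_i$, and each call realizes one sample from it.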

## How to estimate the optimal policy?

To estimate the optimal policy, you typically use numerical and iterative optimization algorithms.

### Q-learning

The most famous RL algorithm is probably Q-learning, which is also a numerical and iterative algorithm. Q-learning implements the interaction between an RL agent and the environment (described above). More concretely, it attempts to estimate a function that is closely related to the policy and from which the policy can be derived. This function is called the **value function**, and, in the case of Q-learning, it's a function of the form $Q : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$. The name $Q$-learning derives from this function, which is often denoted as $Q$.

Q-learning doesn't necessarily find the optimal policy, but, under certain conditions, it is guaranteed to converge to the optimal policy (I won't dive into the details here).

Of course, I cannot describe all the details of Q-learning in this answer. Just keep in mind that, to estimate a policy, in RL, you will typically use a numerical and iterative optimization algorithm (e.g. Q-learning).
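To give at least a flavor of it, here is a minimal tabular Q-learning sketch on a made-up two-state, two-action MDP (the dynamics, hyperparameters, and episode counts are all illustrative assumptions, not a definitive implementation):

```python
import random

random.seed(0)  # make this sketch reproducible

# Toy, hypothetical dynamics: from state 0, action 1 leads to state 1
# with reward 1; every other (state, action) leads back to state 0
# with reward 0.
def step(state, action):
    if state == 0 and action == 1:
        return 1, 1.0
    return 0, 0.0

def q_learning(episodes=500, horizon=10, alpha=0.1, gamma=0.9, epsilon=0.1):
    # tabular value function Q(s, a), initialized to zero
    Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
    for _ in range(episodes):
        state = 0
        for _ in range(horizon):
            # epsilon-greedy behavior: explore with probability epsilon
            if random.random() < epsilon:
                action = random.choice((0, 1))
            else:
                action = max((0, 1), key=lambda a: Q[(state, a)])
            next_state, reward = step(state, action)
            # the Q-learning update: bootstrap from max_a' Q(s', a')
            best_next = max(Q[(next_state, a)] for a in (0, 1))
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q

Q = q_learning()
# derive a (greedy) policy from the learned value function Q
policy = {s: max((0, 1), key=lambda a: Q[(s, a)]) for s in (0, 1)}
```

Note how the policy is *derived* from the estimated $Q$ function by acting greedily with respect to it, which is exactly the relationship between the value function and the policy described above.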

## What is training in RL?

In RL, training (also known as *learning*) generally refers to the use of RL algorithms, such as Q-learning, to estimate the optimal policy (or a value function).

Of course, as in any other machine learning problem (such as supervised learning), there are many practical considerations related to the implementation of these RL algorithms, such as

- Which RL algorithm to use?
- Which programming language, library, or framework to use?

These and other details (which, of course, I cannot list exhaustively) can actually affect the policy that you obtain. However, the basic goal during the learning or training phase in RL is to find a good policy (ideally the optimal one, though in practice you almost never find it).

## What is evaluation (or testing) in RL?

During learning (or training), you may not be able to find the optimal policy, so, when you want to use your learned policy to solve the actual real-world problem, should you strictly follow the action with the highest estimated value in every state, or should you still decide between actions stochastically? These are questions that you need to answer before deploying your RL algorithm.

Section 12.6, *Evaluating Reinforcement Learning Algorithms*, of the book *Artificial Intelligence: Foundations of Computational Agents* (2017) by David Poole and Alan Mackworth is completely dedicated to the evaluation of reinforcement learning algorithms.

The evaluation phase of an RL algorithm is the assessment of the quality of the learned policy and of how much reward the agent obtains if it follows that policy. A typical way to assess the quality of a policy is to plot the sum of all rewards received so far as a function of the number of steps. One RL algorithm dominates another if its plot is consistently above the other's.
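As a small sketch of this metric (the reward sequences below are invented for illustration), the curve is just the cumulative sum of the rewards collected while following each policy:

```python
# Cumulative reward as a function of the number of steps, for two
# hypothetical policies A and B.
def cumulative_rewards(rewards):
    total, curve = 0, []
    for r in rewards:
        total += r
        curve.append(total)
    return curve

curve_a = cumulative_rewards([1, 0, 1, 1])  # rewards collected by policy A
curve_b = cumulative_rewards([0, 0, 1, 1])  # rewards collected by policy B
# A dominates B if A's curve is consistently above (or equal to) B's
a_dominates_b = all(a >= b for a, b in zip(curve_a, curve_b))
```

In practice you would average such curves over many runs before comparing two algorithms, since single runs can be noisy.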

You should note that the evaluation phase can actually occur during the training phase too.

The linked section provides more details, so I suggest you read it!

## What is the difference between training and evaluation?

During training, you want to find a good policy. During evaluation, you want to assess the quality of the learned policy. You can perform evaluation even during training.
