In reinforcement learning (RL), an *agent* interacts with an *environment* in discrete time steps. At each time step, the agent decides on and executes an *action*, $a$; the environment responds by moving from its current *state*, $s$, to the next state, $s'$, and by emitting a scalar signal, called the *reward*, $r$. In principle, this interaction can continue forever, or until some terminal event occurs (e.g. the agent dies).
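This interaction loop can be sketched in a few lines of Python. The two-state environment and its `step` function below are purely hypothetical toy choices, just to show the shape of the loop:

```python
import random

# A toy, hypothetical environment: two states (0 and 1).
# Action 0 stays in the current state, action 1 switches state.
def step(state, action):
    next_state = state if action == 0 else 1 - state
    reward = 1.0 if next_state == 1 else 0.0  # reward r emitted by the environment
    return next_state, reward

state = 0
total_reward = 0.0
for t in range(10):                      # interaction over time steps
    action = random.choice([0, 1])       # the agent decides an action a
    state, reward = step(state, action)  # environment returns s' and r
    total_reward += reward
print(total_reward)
```

Here the agent acts at random; the rest of the answer is about how it can act better than that.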

The agent's main goal is to collect the largest amount of reward "in the long run". To do that, it needs to find an *optimal policy* (roughly, the optimal strategy for behaving in the environment). In general, a policy is a function which, given the current state of the environment, outputs an action to execute (or a probability distribution over actions, if the policy is *stochastic*). A policy can thus be thought of as the "strategy" the agent uses to behave in the environment. An optimal policy (for a given environment) is a policy which, if followed, makes the agent collect the largest amount of reward in the long run. In RL, we are thus interested in finding optimal policies.

The environment can be *deterministic* (roughly, the same action in the same state leads to the same next state, at every time step) or *stochastic* (non-deterministic): if the agent takes an action in a certain state, the resulting next state is not necessarily always the same, but is drawn from a probability distribution over possible next states. Of course, this uncertainty makes the task of finding the optimal policy harder.

In RL, the problem is often mathematically formulated as a Markov decision process (MDP). An MDP is a way of representing the "dynamics" of the environment, that is, the way the environment will react to the possible actions the agent might take in a given state. More precisely, an MDP is equipped with a **transition function** (or "transition model"): given the current state of the environment and an action (that the agent might take), it outputs the probability of moving to each of the possible next states. A **reward function** is also associated with an MDP. Intuitively, the reward function outputs a reward given the current state of the environment (and, possibly, the action taken by the agent and the next state of the environment). Collectively, the transition and reward functions are often called the **model** of the environment. To summarise: the MDP is the problem, a policy is a solution to that problem, and the "dynamics" of the environment are governed by the transition and reward functions (that is, the "model").
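For a small finite MDP, the model can be written out literally as tables. The states, actions, and numbers below are an invented toy example, just to make "transition function" and "reward function" concrete:

```python
# A tiny, hypothetical MDP. States: 0, 1; actions: "stay", "go".
# P[s][a] is the transition function: a distribution over next states.
# R[s][a] is the (expected) reward for taking a in s.
P = {
    0: {"stay": {0: 1.0}, "go": {0: 0.2, 1: 0.8}},
    1: {"stay": {1: 1.0}, "go": {0: 0.8, 1: 0.2}},
}
R = {
    0: {"stay": 0.0, "go": 1.0},
    1: {"stay": 0.5, "go": 0.0},
}

# Sanity check: each transition distribution sums to 1.
for s in P:
    for a in P[s]:
        assert abs(sum(P[s][a].values()) - 1.0) < 1e-9
```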

However, we often do not have the MDP, that is, we do not have the transition and reward functions of the MDP associated with the environment. Hence, we cannot derive a policy from the MDP, because it is unknown. Note that, in general, if we had the transition and reward functions of the MDP associated with the environment, we could exploit them and retrieve an optimal policy (using dynamic programming algorithms).
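To make the "exploit them" claim concrete, here is a minimal sketch of one such dynamic programming algorithm, value iteration, on a hypothetical two-state MDP (all names and numbers are illustrative). Given the full model, it recovers an optimal policy without any interaction with the environment:

```python
# A hypothetical deterministic two-state MDP (same table format as before).
P = {0: {"stay": {0: 1.0}, "go": {1: 1.0}},
     1: {"stay": {1: 1.0}, "go": {0: 1.0}}}
R = {0: {"stay": 0.0, "go": 1.0},
     1: {"stay": 1.0, "go": 0.0}}

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Dynamic programming with a known model: no environment interaction."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup, using the model P and R directly.
            v = max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                    for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            break
    # Greedy policy with respect to the converged values.
    policy = {s: max(P[s], key=lambda a: R[s][a] + gamma *
                     sum(p * V[s2] for s2, p in P[s][a].items()))
              for s in P}
    return V, policy

V, policy = value_iteration(P, R)
print(policy)  # state 0 -> "go", state 1 -> "stay"
```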

In the absence of these functions (that is, when the MDP is unknown), to estimate the optimal policy the agent needs to interact with the environment and observe its responses. This is often referred to as the "reinforcement learning problem", because the agent needs to estimate a policy by *reinforcing* its beliefs about the dynamics of the environment. Over time, the agent learns how the environment responds to its actions, and it can thus start to estimate the optimal policy. Thus, in the RL problem, the agent estimates the optimal policy for behaving in an unknown (or partially known) environment by interacting with it (using a "trial-and-error" approach).

In this context, a **model-based** algorithm is an algorithm that uses the transition function (and the reward function) in order to estimate the optimal policy. The agent might have access only to an approximation of the transition and reward functions, which can be learned by the agent while it interacts with the environment, or which can be given to the agent (e.g. by another agent). In general, in a model-based algorithm, the agent can potentially predict the dynamics of the environment (during or after the learning phase), because it has an estimate of the transition function (and reward function). However, note that the transition and reward functions that the agent uses in order to improve its estimate of the optimal policy might just be approximations of the "true" functions. Hence, the optimal policy might never be found (because of these approximations).
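One simple way such an approximate model can be learned from interaction is by counting observed transitions and normalising. This is a sketch only; the experience tuples below are made-up data:

```python
from collections import defaultdict

# Hypothetical experience: a list of observed (s, a, s') transitions.
experience = [(0, "go", 1), (0, "go", 1), (0, "go", 0), (1, "stay", 1)]

# Count how often each (s, a) led to each next state.
counts = defaultdict(lambda: defaultdict(int))
for s, a, s_next in experience:
    counts[(s, a)][s_next] += 1

# Normalise counts into an estimated transition function P_hat.
P_hat = {sa: {s2: n / sum(nexts.values()) for s2, n in nexts.items()}
         for sa, nexts in counts.items()}
print(P_hat[(0, "go")])  # estimated distribution over next states
```

The estimated `P_hat` can then be plugged into a planning algorithm such as value iteration, which is exactly the sense in which the found policy is only as good as the approximate model.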

A **model-free** algorithm is an algorithm that estimates the optimal policy without using or estimating the dynamics (transition and reward functions) of the environment. In practice, a model-free algorithm either estimates a "value function" or the "policy" directly from experience (that is, the interaction between the agent and the environment), without using either the transition function or the reward function. A value function can be thought of as a function which evaluates a state (or an action taken in a state), for all states. From this value function, a policy can then be derived.
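Deriving a policy from an action-value function is as simple as taking the greedy action in each state. The `Q` table below is a hypothetical learned value function; note that no model appears anywhere:

```python
# A hypothetical action-value table Q(s, a) learned from experience.
Q = {
    (0, "left"): 0.2, (0, "right"): 0.7,
    (1, "left"): 0.9, (1, "right"): 0.1,
}

def greedy_policy(Q, state, actions=("left", "right")):
    """Derive a policy from Q alone: pick the highest-valued action."""
    return max(actions, key=lambda a: Q[(state, a)])

print(greedy_policy(Q, 0))  # "right"
print(greedy_policy(Q, 1))  # "left"
```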

In practice, one way to distinguish between model-based and model-free algorithms is to look at their update rules and see whether they use the transition or reward function.

For instance, let's look at the main update rule in the *Q-learning algorithm*:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha (R_{t+1} + \gamma \max_{a}Q(S_{t+1}, a) - Q(S_t, A_t))$$

As we can see, this update rule does not use any probabilities defined by the MDP. Note that $R_{t+1}$ is just the reward obtained at the next time step (after taking the action $A_t$); it is sampled from the environment, not computed from a known reward function. So, Q-learning is a model-free algorithm.
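The update rule translates almost literally into code. Only a single sampled transition $(s, a, r, s')$ is consumed, and no transition probabilities appear (the numbers in the usage example are arbitrary):

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """The Q-learning update: uses one sampled transition, not the model."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

Q = defaultdict(float)           # Q initialised to 0 for all (s, a)
actions = [0, 1]
q_update(Q, s=0, a=1, r=1.0, s_next=1, actions=actions)
print(Q[(0, 1)])  # 0.1 * (1.0 + 0.9*0 - 0) = 0.1
```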

Now, let's look at the main update rule of the *policy improvement* algorithm:

$$Q(s,a) \leftarrow \sum_{s' \in \mathcal{S}, r\in\mathcal{R}}p(s',r|s,a)(r+\gamma V(s'))$$

We can immediately observe that it uses $p(s',r|s,a)$, a probability defined by the MDP model. So, *policy iteration* (a dynamic programming algorithm), which uses this policy improvement step, is a model-based algorithm.
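Written as code, the contrast with the Q-learning update is stark: this backup must enumerate the whole distribution $p(s',r|s,a)$ rather than consume one sampled transition. The values and state space below are hypothetical:

```python
# Hypothetical model and current value estimates (toy numbers).
gamma = 0.9
V = {0: 1.0, 1: 2.0}
# p[(s, a)] lists (next_state, reward, probability) triples: p(s', r | s, a).
p = {(0, "go"): [(1, 1.0, 0.8), (0, 0.0, 0.2)]}

def backup(s, a):
    """The policy-improvement backup: requires the full MDP model p."""
    return sum(prob * (r + gamma * V[s2]) for s2, r, prob in p[(s, a)])

print(backup(0, "go"))  # 0.8*(1 + 0.9*2) + 0.2*(0 + 0.9*1)
```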

See also the Quora post *What is the difference between model-based and model-free reinforcement learning?*.

– nbro – 2018-12-09T16:25:06.363