Why are Q values updated according to the greedy policy?



Apparently, in the Q-learning algorithm, the Q values are not updated according to the "current policy", but according to a "greedy policy". Why is that the case? I think this is related to the fact that Q-learning is off-policy, but I am also not familiar with this concept.

Shifat E Arman

Posted 2018-11-17T16:23:25.537

Reputation: 53



The Q values are updated using a "greedy policy" because, in the Q-learning algorithm, the $\max$ operator is used to determine the "target", which is denoted by

$$\color{green}{R_{t+1}} + \gamma \color{blue}{\max_{a}Q(S_{t+1}, a)}$$

Intuitively, the $\max$ operator is used to take the "greedy" action, that is, the action associated with the highest Q value: $\color{blue}{\max_{a}Q(S_{t+1}, a)}$ means that we are selecting the $Q$ value, associated with the (next) state $S_{t+1}$, which corresponds to the action $a$, such that $Q(S_{t+1}, a)$ is the highest (with respect to other possible actions from $S_{t+1}$).

Note that the $Q$ function receives as input a state and an action. So, for each state $s$, there is an action $a^*$ (among all the actions $a_1, a_2, \dots$ that can be taken from $s$) such that $Q(s, a^*) \geq Q(s, a_1)$, $Q(s, a^*) \geq Q(s, a_2)$, etc. (the inequalities are not strict, because several actions can share the highest value). In the expression $\color{blue}{\max_{a}Q(S_{t+1}, a)}$, we are basically selecting $Q(s, a^*)$ for $s = S_{t+1}$.
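As a minimal sketch of what the $\max$ operator does, assume the Q values of a single state are stored in a plain Python list indexed by action. The numbers and the helper name `greedy_action` are made up for illustration.

```python
def greedy_action(q_values):
    """Return the index of the action with the highest Q value.

    If several actions share the highest value, the lowest index wins.
    """
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical Q values for the next state S_{t+1}, one entry per action.
q_next = [0.2, 1.5, -0.3]

a_star = greedy_action(q_next)  # the greedy action a*
max_q = max(q_next)             # max_a Q(S_{t+1}, a), the value used in the target
```

Note that the target only needs `max_q`; the index `a_star` is shown just to make the link with the greedy action explicit.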

To explain all the components of this target, I will first have to explain other parts of the Q-learning algorithm. Hopefully, after this explanation, you will understand why Q-learning uses the greedy policy to update the Q values, and why it is called an "off-policy" algorithm (which is, IMHO, a rather unintuitive and confusing term).

Here's the Q-learning algorithm

[Image: pseudocode of the Q-learning algorithm]

The Q-learning algorithm estimates (and returns) a $Q$ function. In the pseudocode above, a policy $\pi$ is passed as input, but the target used in the update does not depend on it, so what Q-learning actually estimates is the optimal $Q$ function. Intuitively, it does that by simulating an "agent" which takes actions in the environment, "observes" the impact of those actions (in terms of rewards received and states visited), and then attempts to infer the optimal state-action value (or $Q$) function, that is, the $Q$ function associated with the optimal policy, which would give the agent the "highest amount of reward" in the long run.

Q-learning proceeds in episodes. So, initially, you need to pass the number of episodes as input. You can think of episodes as iterations (like in any iterative optimisation algorithm). However, in the context of RL, an episode is a little bit more specific: the start and end of an episode are associated with specific states of the environment. The episode starts when the agent is in a "starting state" and ends when it is in an "ending (or goal) state". In the pseudocode above, $S_0$ is the starting state.

In the pseudocode above, at the beginning of each episode, we initialise $t=0$, where $t$ represents the time step of a specific episode. Note also that the agent "moves" to the starting state, $S_0$, at the beginning of each episode.

We then have the following loop, which terminates when the agent reaches a goal state (denoted, in the pseudocode, by "terminal": "terminal state" and "goal state" are synonyms in this context):

[Image: the inner loop of the Q-learning pseudocode]

So, at each episode, we run the loop above. The block of code inside this loop contains the main logic of the Q-learning algorithm. On each iteration of this inner loop, the agent chooses an action $A_t$ (the action at time step $t$ of the current episode) using any policy which "ensures that all states are sufficiently visited" (you can ignore this condition for now, given that this answer is already quite long!). In this case, the $\epsilon$-greedy policy is used.

How does this $\epsilon$-greedy policy work? If you look at the pseudocode above, $\epsilon$ is initialised at the beginning of each episode. In the pseudocode, $\epsilon$ can change from episode to episode, but assume, for simplicity, that it is a fixed small number (e.g. $0.01$) at every episode. The statement "Choose action $A_t$ using policy derived from $Q$ (e.g., $\epsilon$-greedy)" means that, with probability $1 - \epsilon$, the "greedy action" is chosen from the current state (which is $S_0$ at the beginning of the episode), and, with probability $\epsilon$, a random action is taken.

What is the greedy action in this case? It is the action, in the current state, which is associated with the highest Q value (given the current estimate of the $Q$ function). This is the same kind of action as the action $a^*$ explained above, only computed for the current state $S_t$ rather than for the next state $S_{t+1}$. The difference is that, in this case, we choose $A_t$ using the $\epsilon$-greedy policy: so, most of the time, we choose the greedy action, but, sometimes, we choose a random action. Note that the $\epsilon$-greedy policy is indeed a policy, because it tells the agent which action to choose (possibly at random), given a state: this is roughly the definition of a policy.
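The $\epsilon$-greedy choice described above can be sketched as follows. This is only an illustration: the function name `epsilon_greedy` and the list-based representation of the Q values of the current state are assumptions, not part of the pseudocode.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index for the current state.

    With probability epsilon a uniformly random action is returned,
    otherwise the action with the highest current Q value.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore: random action
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit: greedy action
```

Since the random action can happen to be the greedy one, the greedy action is actually selected slightly more often than $1 - \epsilon$ of the time.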

The agent then executes the just chosen action $A_t$ in the environment, and it observes the "impact" of this action on the environment, which is determined by how the environment responds to this action: the response consists of a reward, $R_{t+1}$, and a next state, $S_{t+1}$.

To recapitulate, the agent chooses an action using the $\epsilon$-greedy policy, executes this action in the environment, and observes the response of the environment to this action (that is, a reward and a next state). This is the part of the Q-learning algorithm where the agent interacts with the environment in order to gather some information about it, so as to be able to estimate the $Q$ function.

After that, the agent can update its estimate of the $Q$ function. It does that using the following update rule

$$\color{orange}{Q(S_t, A_t)} \leftarrow \color{red}{Q(S_t, A_t)} + \alpha ([\color{green}{R_{t+1}} + \gamma \color{blue}{\max_{a}Q(S_{t+1}, a)}] - \color{red}{Q(S_t, A_t)})$$

where $S_t$ is the current state (of the current episode) the agent is in, $A_t$ is the action chosen using the $\epsilon$-greedy policy (as described above), and $S_{t+1}$ and $R_{t+1}$ are respectively the next state and the reward, which, together, are the response of the environment to the just taken action $A_t$.
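In code, one application of this update rule could look like the sketch below. It assumes `Q` is a list of lists with `Q[s][a]` holding the current estimate; the function name and the `terminal` flag (which uses the common convention that a terminal state has value $0$) are assumptions added for illustration.

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma, terminal=False):
    """Apply one Q-learning update to Q[s][a] in place and return the TD error."""
    # Greedy estimate for the next state (assumed to be 0 if that state is terminal).
    max_next = 0.0 if terminal else max(Q[s_next])
    target = r + gamma * max_next   # R_{t+1} + gamma * max_a Q(S_{t+1}, a)
    td_error = target - Q[s][a]     # target minus current estimate
    Q[s][a] += alpha * td_error     # move the estimate towards the target
    return td_error

# Made-up example: two states, two actions.
Q = [[0.0, 0.0], [1.0, 2.0]]
td_error = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.5, gamma=0.9)
```

Only the entry `Q[s][a]` for the visited state-action pair changes; all other entries are left as they were.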

So, how is the estimate of this $Q$ function updated?

First of all, I would like to note that, if you look at the beginning of the pseudocode above, $Q(s, a)$ is initialized arbitrarily for all states $s \in \mathcal{S}$ and for all actions $a \in \mathcal{A}$: it can e.g. be initialised to $0$. $Q(s, a)$ can e.g. be implemented as a matrix (or 2-dimensional array) $M \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{A}|}$, where $M[s, a] = Q(s, a)$, $|\mathcal{S}|$ is the number of states in your problem and $|\mathcal{A}|$ the number of actions.
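For instance, a zero-initialised table could be built as a list of lists, as in this small sketch (the sizes are hypothetical, and states and actions are assumed to be numbered from $0$):

```python
n_states, n_actions = 4, 2  # hypothetical problem sizes

# Q[s][a] plays the role of M[s, a]; every estimate starts at 0.
Q = [[0.0 for _ in range(n_actions)] for _ in range(n_states)]

Q[2][1] = 0.5  # overwrite the estimate for state 2 and action 1
```

The nested comprehension (rather than `[[0.0] * n_actions] * n_states`) keeps the rows independent, so changing one state's values does not silently change another's.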

Furthermore, note that the symbol $\leftarrow$ means "assignment" (like assignment to a variable, in the context of programming). So, in the update rule above, we are assigning to $\color{orange}{Q(S_t, A_t)}$ (which will be the next or updated estimate of the Q value for the current state $S_t$ and the just taken action from that state $A_t$) the value $\color{red}{Q(S_t, A_t)} + \alpha (\color{green}{R_{t+1}} + \gamma \color{blue}{\max_{a}Q(S_{t+1}, a)} - \color{red}{Q(S_t, A_t)})$. Let's break this value down.

$\color{red}{Q(S_t, A_t)}$ (on the right side of the assignment) is the estimate of the Q value for the state $S_t$ and action $A_t$ before the assignment. So, we are summing $\color{red}{Q(S_t, A_t)}$ and $\alpha (\color{green}{R_{t+1}} + \gamma \color{blue}{\max_{a}Q(S_{t+1}, a)} - \color{red}{Q(S_t, A_t)})$, and then we assign the result to $\color{orange}{Q(S_t, A_t)}$.

$\color{green}{R_{t+1}} + \gamma \color{blue}{\max_{a}Q(S_{t+1}, a)}$ is what is often called the "target". Q-learning is a "temporal-difference" (TD) algorithm, and TD algorithms update estimates of the value or action-value functions based on the difference between a "target" and the current estimate, which, in the case of Q-learning, is $\color{red}{Q(S_t, A_t)}$ (on the right side of the $\leftarrow$). We can roughly think of the target as "the value that $\color{red}{Q(S_t, A_t)}$ should have had". So, in a certain way, we are performing "supervised learning", where $\color{green}{R_{t+1}} + \gamma \color{blue}{\max_{a}Q(S_{t+1}, a)}$ would be the "ground-truth" label and $\color{red}{Q(S_t, A_t)}$ the current estimate, and so $[\color{green}{R_{t+1}} + \gamma \color{blue}{\max_{a}Q(S_{t+1}, a)}] - \color{red}{Q(S_t, A_t)}$ would be the "error" (or "loss"): in fact, it is often called the "TD error". However, note that this is not really supervised learning, because the target is not a ground-truth: it is partially an estimate (because of the part $\gamma \color{blue}{\max_{a}Q(S_{t+1}, a)}$) and partially a ground-truth (because of $\color{green}{R_{t+1}}$).

To recapitulate, $\color{green}{R_{t+1}} + \gamma \color{blue}{\max_{a}Q(S_{t+1}, a)}$ is the "target", $\color{red}{Q(S_t, A_t)}$ is the current estimate, and $[\color{green}{R_{t+1}} + \gamma \color{blue}{\max_{a}Q(S_{t+1}, a)}] - \color{red}{Q(S_t, A_t)}$ is the "error". We are thus summing the "error" (weighted by the hyper-parameter $\alpha$, which is, in this case, often called the "learning rate") and the current estimate $\color{red}{Q(S_t, A_t)}$ in order to produce the new estimate $\color{orange}{Q(S_t, A_t)}$.
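As a purely illustrative numerical example (all numbers are made up), suppose that the current estimate is $Q(S_t, A_t) = 2$, the reward is $R_{t+1} = 1$, the greedy estimate for the next state is $\max_{a}Q(S_{t+1}, a) = 3$, and that $\gamma = 0.9$ and $\alpha = 0.1$. Then

$$\text{target} = 1 + 0.9 \cdot 3 = 3.7, \qquad \text{error} = 3.7 - 2 = 1.7, \qquad Q(S_t, A_t) \leftarrow 2 + 0.1 \cdot 1.7 = 2.17$$

So the estimate moves from $2$ towards the target $3.7$, but only by a fraction $\alpha$ of the error.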

In the target, you can see that we are multiplying $\color{blue}{\max_{a}Q(S_{t+1}, a)}$ by $\gamma$. This is a hyper-parameter (a parameter which often needs to be chosen by the programmer before the algorithm is executed), usually called the "discount factor" and chosen in the range $[0, 1]$. It controls the contribution of $\color{blue}{\max_{a}Q(S_{t+1}, a)}$ to the "target": that is, how much of $\color{blue}{\max_{a}Q(S_{t+1}, a)}$ we want to include in the target. Recall that I've just said above that the "target" is composed of the reward $\color{green}{R_{t+1}}$ (which is a "ground-truth" or real-world "experience", because it is directly received from the environment) and $\color{blue}{\max_{a}Q(S_{t+1}, a)}$ (which actually uses an estimate of the $Q$ function, that is, it uses $Q(S_{t+1}, a)$). So, $\gamma$ controls how much this estimate contributes to the target, relative to the "ground-truth" reward.
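To see the effect of $\gamma$ concretely, the following sketch computes the target for the same (made-up) reward and next-state estimate under three values of $\gamma$. The helper name `td_target` is an assumption for illustration.

```python
def td_target(reward, max_next_q, gamma):
    """Target = observed reward + discounted greedy estimate for the next state."""
    return reward + gamma * max_next_q

# Same reward (1.0) and same greedy estimate (4.0), different discount factors.
targets = {gamma: td_target(1.0, 4.0, gamma) for gamma in (0.0, 0.5, 1.0)}
# gamma = 0.0 ignores the estimate entirely; gamma = 1.0 includes all of it.
```

With $\gamma = 0$ the target reduces to the reward alone, so the agent only cares about the immediate reward.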

As I said at the beginning of this answer, $\color{blue}{\max_{a}Q(S_{t+1}, a)}$ can be thought of as the $Q$ value associated with the next state $S_{t+1}$ (which was observed by the agent after it has taken the action $A_t$) and with the action $a$ such that $Q(S_{t+1}, a)$ is the highest among all possible actions from state $S_{t+1}$. In other words, $\color{blue}{\max_{a}Q(S_{t+1}, a)}$ can be thought of as the estimate of the Q value associated with the next state $S_{t+1}$ and the "greedy action" taken from that same state.

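I would like to note that, at this point, the agent is not really taking actions in the environment: the $\max$ is computed from the current estimates of the Q values only, and the greedy action is not necessarily the one the agent will execute next. People in the RL community often call Q-learning an "off-policy" algorithm because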

  1. It uses the $\epsilon$-greedy policy to interact with the environment (this is often called the "behaviour policy"). In this case, actions are really taken, and the responses of the environment are really produced, observed and used to update estimates of the $Q$ function.

  2. It uses a "target" that is based on an estimate which is "greedy" (i.e. it uses $\color{blue}{\max_{a}Q(S_{t+1}, a)}$).

Given that Q-learning uses estimates of the form $\color{blue}{\max_{a}Q(S_{t+1}, a)}$, Q-learning is often considered to be performing updates to the Q values as if those Q values were associated with the "greedy policy", that is, the policy that always chooses the action associated with the highest Q value. So, you will often hear that Q-learning finds a "target policy" (i.e. the policy that is derived from the last estimate of the $Q$ function) that is "greedy".
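As a small sketch of what "the policy derived from the last estimate of the $Q$ function" means, assume again that `Q` is a list of lists indexed as `Q[s][a]`. The table values and the name `greedy_policy` are made up for illustration.

```python
def greedy_policy(Q):
    """Map each state index to the action with the highest estimated Q value."""
    return [max(range(len(row)), key=lambda a: row[a]) for row in Q]

# Hypothetical final estimates for three states and two actions.
Q = [[0.1, 0.7],
     [0.4, 0.2],
     [0.0, 0.0]]

policy = greedy_policy(Q)  # one greedy action per state; ties go to the lowest index
```

This greedy policy never explores, which is why it differs from the $\epsilon$-greedy behaviour policy used to collect the experience.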

I would like to emphasise that, in the Q-learning algorithm, we have several episodes, and, at each episode, the agent interacts with the environment by taking actions and observing the responses of the environment, so as to produce an estimate of the action-value function (i.e. the $Q$ function).
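Putting the pieces together, here is a minimal, self-contained sketch of the whole procedure on a toy problem. Everything about the environment is an assumption made up for illustration (five states on a line, the agent starts in state 0, moving right from state 3 reaches the terminal state 4 and gives reward 1), as are the hyper-parameter values and the random tie-breaking in the behaviour policy.

```python
import random

N_STATES = 5      # states 0..4 on a line; state 4 is terminal (hypothetical toy task)
ACTIONS = (0, 1)  # 0 = move left, 1 = move right
GOAL = N_STATES - 1

def step(state, action):
    """Toy deterministic environment: returns (reward, next_state, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == GOAL
    return (1.0 if done else 0.0), next_state, done

def epsilon_greedy(q_values, epsilon, rng):
    """Behaviour policy: random action with probability epsilon, else greedy (ties broken at random)."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q_values)
    return rng.choice([a for a in ACTIONS if q_values[a] == best])

def q_learning(num_episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(num_episodes):
        state, done = 0, False  # S_0 is always state 0 in this toy task
        while not done:
            action = epsilon_greedy(Q[state], epsilon, rng)  # the behaviour policy acts
            reward, next_state, done = step(state, action)
            # Greedy target: max over the next state's Q values (0 if it is terminal).
            max_next = 0.0 if done else max(Q[next_state])
            Q[state][action] += alpha * (reward + gamma * max_next - Q[state][action])
            state = next_state
    return Q

Q = q_learning()
# Greedy (target) policy for the non-terminal states, derived from the final estimates.
greedy_policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(GOAL)]
```

The action that is executed comes from `epsilon_greedy`, while the update only ever looks at `max(Q[next_state])`: this separation between how the agent behaves and what the target assumes is what the term "off-policy" refers to.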


Posted 2018-11-17T16:23:25.537

Reputation: 19 783

Small note: the start and end of episodes do not necessarily have to be associated with specific states. In theory we can have a probability distribution over initial states $S_0$, although in practice the most common case certainly is that the starting state is always the same. It is also relatively common in practice that there are many different states that can all end the episode, though I believe that is already consistent with what you wrote :) – Dennis Soemers – 2019-02-13T18:48:47.270