4

I understand that in DQNs, the loss is measured by taking the MSE of outputted Q-values and target Q-values.

What do the target Q-values represent? And how are they obtained/calculated by the DQN?

3

What do the target Q-values represent?

In a DQN, which uses off-policy learning, they represent a *refined* estimate of the expected future reward from taking an action $a$ in state $s$, and from that point on following a target policy. The target policy in Q-learning is to always take the maximising action in each state, according to the current estimates of value.

The estimate is *refined* in that it is based on at least a little bit of data from experience - the immediate reward, and what transition happened next - but generally it is not going to be perfect.

And how are they obtained/calculated by the DQN?

There are lots of ways to do this. The simplest in DQN is to process a single step lookahead based on the experience replay table.

If your table contains the tuple *[state, action, immediate reward, next state, done?]* as $[s, a, r, s', d]$, then the formula for the TD target, $g_{t:t+1}$, is

$$r + \gamma \max_{a'}[Q_{target}(s',a')], \qquad \text{when } d \text{ is false}$$

$$r, \qquad \text{when } d \text{ is true}$$

Typically $Q_{target}$ is calculated using the "target network" which is a copy of the learning network for Q that is updated every N steps. This delayed update of the target predictions is done for numerical stability in DQN - conceptually it is an estimate for the same action values that you are learning.

This target value can change every time you use any specific memory from experience replay, so you have to recompute it for each minibatch; you cannot store the target values.
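As a rough illustration of the formula above, here is a minimal sketch in plain Python. The names `td_targets` and `toy_q_target` are made up for this example, and `toy_q_target` is only a stand-in for a real target network (it maps a state to one value per action). It is not taken from any particular DQN implementation.

```python
from typing import Callable, List, Sequence, Tuple

# One replay-memory entry: (state, action, reward, next_state, done)
Transition = Tuple[Sequence[float], int, float, Sequence[float], bool]


def td_targets(
    batch: List[Transition],
    q_target: Callable[[Sequence[float]], Sequence[float]],
    gamma: float = 0.99,
) -> List[float]:
    """Return the one-step TD target for every transition in the batch."""
    targets = []
    for _state, _action, reward, next_state, done in batch:
        if done:
            # Terminal transition: there is no future value to bootstrap from.
            targets.append(reward)
        else:
            # Bootstrap from the target network's best action value in s'.
            targets.append(reward + gamma * max(q_target(next_state)))
    return targets


# Hypothetical stand-in for a target network with two actions.
def toy_q_target(state: Sequence[float]) -> List[float]:
    return [state[0], 2.0 * state[0]]


batch = [
    ([0.0], 1, 1.0, [3.0], False),  # non-terminal: 1 + 0.5 * max(3, 6)
    ([3.0], 0, 5.0, [0.0], True),   # terminal: the target is just the reward
]
example_targets = td_targets(batch, toy_q_target, gamma=0.5)
```

Because `q_target` is called afresh for every minibatch, the targets automatically follow whatever the target network currently predicts, which is why they are recomputed rather than stored.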

3

The deep Q-learning (DQL) algorithm is really similar to the tabular Q-learning algorithm. I think that both algorithms are actually quite simple, at least, if you look at their pseudocode, which isn't longer than 10-20 lines.

Here's a screenshot of the pseudocode of DQL (from the original paper) that highlights the Q target.

Here's the screenshot of Q-learning (from Barto and Sutton's book) that highlights the Q target.

In both cases, the $\color{red}{\text{target}}$ is a **reward plus a discounted maximum future Q value** (apart from the exception of final states, in the case of DQL, where the target is just the reward).

There are at least 3 differences between these two algorithms.

- DQL uses gradient descent because the $Q$ function is represented by a neural network rather than a table, as in Q-learning, and so you have an explicit loss function (e.g. MSE).

- DQL typically uses experience replay (but, in principle, you could also do this in Q-learning).

- DQL encodes the states (i.e. $\phi$ encodes the states).

Apart from that, the logic of both algorithms is more or less the same, so, if you know Q-learning (and you should know it before diving into DQL), then it shouldn't be a problem to learn DQL (if you also have a decent knowledge of deep learning).
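To make the shared logic concrete, here is a minimal sketch of the tabular update in plain Python. The function name `q_learning_update`, the dictionary-based table and the toy states are assumptions made for this example, not code from the paper or the book. The `target` variable is the same quantity that DQL plugs into its loss.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Tuple

QTable = DefaultDict[Tuple[Hashable, int], float]


def q_learning_update(
    q_table: QTable,
    state: Hashable,
    action: int,
    reward: float,
    next_state: Hashable,
    done: bool,
    n_actions: int,
    alpha: float = 0.1,
    gamma: float = 0.99,
) -> float:
    """Apply one tabular Q-learning update in place and return the target."""
    if done:
        # Final state: the target is just the reward.
        target = reward
    else:
        # Reward plus discounted maximum future Q value.
        best_next = max(q_table[(next_state, a)] for a in range(n_actions))
        target = reward + gamma * best_next
    # Move the table entry a fraction alpha towards the target.
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])
    return target


q_table: QTable = defaultdict(float)
q_table[("s1", 0)] = 2.0
q_table[("s1", 1)] = 4.0
target = q_learning_update(
    q_table, "s0", 0, reward=1.0, next_state="s1", done=False,
    n_actions=2, alpha=0.5, gamma=0.5,
)
```

In DQL the table entry is replaced by a network output, and the "move towards the target" step becomes a gradient descent step on a loss such as the squared difference between that output and `target`.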

Ah, so for Q-learning I use the Q-learning update rule to make my Q function values converge, while for DQN I use gradient descent to make my Q function values converge. How does the batch of stored experiences come into play though? (Used to feed the neural net?) – BG10 – 2020-04-19T23:54:50.653

@BG10 Sorry, I realized I had already said that about gradient descent. Regarding the experience replay, it's used to stabilize the neural network learning, given that successive states in a typical RL task are correlated and NNs don't learn well when data is highly correlated, so, rather than performing gradient descent with successive tuples of experience, you perform it with randomly selected tuples from the experience replay buffer. In Q-learning, you don't have this issue because you don't have NNs. – nbro – 2020-04-20T00:02:49.697

I understand the part where we want the tuples of experiences sampled to be independent and identically distributed. However, when you say 'to stabilize the neural network learning' , does this mean you want the neural network to give better Q-function outputs, through gradient descent of the randomized samples? – BG10 – 2020-04-20T00:28:06.960

@BG10 By "stabilize" I mean that your performance metric doesn't go up and down, but more regularly increases or decreases. In this case, the performance could be the metric or the Q values. If you look at original paper for DQL, the authors say a similar thing. – nbro – 2020-04-20T00:32:42.833

So you use random samples to ensure outputted Q-values for the same state and action, but from different episodes, are less volatile? And gradient descent is used to make the estimated Q values(outputted from the Q-network) closer to the target Q values(outputted from the target network) – BG10 – 2020-04-20T00:41:03.610

@BG10 I wouldn't use the word "volatile" in this case. I would use "variable" or "unstable". Also, not necessarily only from different episodes. In general, during the whole training, you will be collecting rewards. You can e.g. print a plot of the amount of rewards you obtain at every step or episode. That will give you an idea of the performance (or not). Note: this is just an example to make you understand. I am not saying that the authors of DQN did this. Regarding your second question, GD is used to minimize a loss, which happens to be a function of the predicted and target Q values. – nbro – 2020-04-20T02:33:32.930

2

When training a Deep Q network with experience replay, you accumulate what are known as training experiences $e_t = (s_t, a_t, r_t, s_{t+1})$. You then sample a batch of such experiences, and for each sample you do the following.

- Feed $s_t$ into the network to get $Q(s_t, a_t; \theta)$.
- Feed $s_{t+1}$ into the network to get $Q(s_{t+1}, a'; \theta)$ for every action $a'$.
- Choose $\max_{a'} Q(s_{t+1}, a'; \theta)$ and set $r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta)$ as the target of the network.
- Train the network with $s_t$ as input to update $\theta$. The network's output for $s_t$ is $Q(s_t, a_t; \theta)$, and the gradient descent step minimises the squared distance between $Q(s_t, a_t; \theta)$ and $r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta)$.
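The four steps above can be sketched in plain Python. To keep the example self-contained, a linear function stands in for the neural network (an assumption of this sketch, as are the names `q_values` and `train_step`). As in the steps above, a single set of parameters $\theta$ is used for both the prediction and the target, and the target is treated as a constant when taking the gradient.

```python
from typing import List, Sequence

Weights = List[List[float]]  # one weight vector per action


def q_values(theta: Weights, state: Sequence[float]) -> List[float]:
    """Linear stand-in for the network: Q(s, a; theta) = theta[a] . s"""
    return [sum(w * x for w, x in zip(row, state)) for row in theta]


def train_step(
    theta: Weights,
    state: Sequence[float],
    action: int,
    reward: float,
    next_state: Sequence[float],
    gamma: float = 0.99,
    lr: float = 0.1,
) -> float:
    """Run one update on a single experience and return the squared error."""
    # Step 1: feed s_t in and keep the value of the action actually taken.
    q_sa = q_values(theta, state)[action]
    # Steps 2 and 3: feed s_{t+1} in, take the max over actions, build the target.
    target = reward + gamma * max(q_values(theta, next_state))
    # Step 4: gradient step on (Q(s_t, a_t; theta) - target)^2,
    # with the target held fixed.
    error = q_sa - target
    theta[action] = [
        w - lr * 2.0 * error * x for w, x in zip(theta[action], state)
    ]
    return error ** 2


# Toy problem: two actions, two state features.
theta = [[0.0, 0.0], [1.0, 0.0]]
experience = ([1.0, 0.0], 0, 1.0, [0.0, 1.0])  # (s_t, a_t, r_t, s_{t+1})
first_loss = train_step(theta, *experience, gamma=0.9, lr=0.25)
second_loss = train_step(theta, *experience, gamma=0.9, lr=0.25)
```

Repeating the step on the same experience shrinks the squared error, which is the behaviour the gradient descent step is meant to produce. A real DQN would do this over a sampled minibatch and with a neural network in place of `q_values`.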

Ah, so the target values of Q(s,a) = Reward for performing action a in state s + Max Q( s′,a′ ) for next state s′. I can find out the Max Q( s′,a′ ) by looking through my batch of stored experiences( since I have data on the Q values of next state s′), and hence find the target value of Q(s,a) – BG10 – 2020-04-19T12:27:45.833

Ur neural network outputs all possible Q(s’, a’) values over all actions. Hence, u can find the max of these values easily – calveeen – 2020-04-19T16:42:11.457

Oh, so my batch of stored experiences will be used as input to my neural net, which will output all the possible Q values of that state S( If I only inputted 1 state) / States S,S',...(If I inputted more than 1 state) – BG10 – 2020-04-19T23:56:56.087

ur batch of samples can contain different s, s' pairs. the input to a neural network is the state. I am not sure what u mean by "inputted 1 state" ? – calveeen – 2020-04-20T04:46:45.717

See also this related question: Why is the target $r + \gamma \max_{a'} Q(s', a'; \theta_i^-)$ in the loss function of the DQN architecture? – nbro – 2020-04-19T20:55:35.960