## How can a DQN backpropagate its loss?


I'm currently trying to take the next step in deep learning. So far I've managed to write my own basic feed-forward network in Python without any frameworks (just numpy and pandas), so I think I understand the math and intuition behind backpropagation. Now I'm stuck with deep Q-learning. I've tried to get an agent to learn in various environments, but nothing has worked out, so there must be something I'm getting wrong, and it seems I don't understand the critical part. The screenshot is from this video.

What I'm trying to draw here is my understanding of the very basic process of a simple DQN. Assuming this is right, how is the loss backpropagated? Since only the selected $$Q(s, a)$$ values (5 and 7) are processed further in the loss function, how is the contribution of the other output neurons calculated so that their weights can be adjusted to better predict the real Q-values?


In DQN it is quite common for the neural network, instead of representing the function $$f(s,a) = \hat{q}(s,a,\theta)$$ directly, to represent $$f(s)= [\hat{q}(s,1,\theta), \hat{q}(s,2,\theta), \hat{q}(s,3,\theta), \ldots, \hat{q}(s,N_a,\theta)]$$, where $$N_a$$ is the number of actions and the input is the current state. That is what is going on here. It is usually done for a performance gain, since calculating all the action values at once is faster than calculating them individually.
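As a minimal numpy sketch of this vector-output form (the one-hot state encoding, the single linear layer, and the sizes below are all hypothetical choices, not anything from the question), one forward pass produces one Q-value per action:

```python
import numpy as np

N_S, N_A = 4, 3  # hypothetical numbers of states and actions

# a single linear layer standing in for the network: theta has one
# column of weights per action, so one forward pass yields all Q-values
theta = np.random.default_rng(0).normal(size=(N_S, N_A))

def q_values(s):
    """f(s) = [q(s,1), ..., q(s,N_a)]: one-hot state -> all action values."""
    x = np.zeros(N_S)
    x[s] = 1.0
    return x @ theta

qvals = q_values(2)
print(qvals.shape)  # (3,): one value per action
```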

However, in a Q learning update, you cannot adjust this vector of output values for actions that you did not take. You can do one of two things:

• Figure out the gradient due to the one item with a TD error and propagate that backwards. This involves inserting a known gradient into the normal training update at a specific point and working from there. It works best if you are implementing your own backpropagation with low-level tools; otherwise it can be fiddly to figure out in a framework like Keras.

• Force the gradients of all other items to be zero by setting the target outputs to be whatever the learning network is currently generating.
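To see why the second approach has the desired effect, here is a small numpy sketch (the Q-values and TD target below are made-up numbers): under a squared-error loss, any output whose target equals the network's current prediction contributes zero gradient, so only the taken action's TD error propagates backwards.

```python
import numpy as np

# current network outputs for one state, and the TD target for action a=1
qvals = np.array([5.0, 7.0, 2.0])
a, td_target = 1, 9.0

# approach 2: targets equal the current outputs, except for the taken action
targets = qvals.copy()
targets[a] = td_target

# gradient of the MSE loss 0.5 * sum((qvals - targets)**2) w.r.t. the outputs
grad = qvals - targets
print(grad)  # [ 0. -2.  0.]: zero everywhere except the taken action
```

The nonzero entry is exactly the (negative) TD error, which is the same gradient the first approach would insert by hand.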

If you are using something like Keras, the second approach is the way to go. A concrete example, where you have two networks n_learn and n_target that each output an array of Q values, might look like this:

• For each sample (s, a, r, next_s, done) in your minibatch*

• Calculate array of action values from your learning network qvals = n_learn.predict(s)
• Calculate the TD target for $$(s,a)$$, e.g. td_target = r + max(n_target.predict(next_s)) (discount factor and handling of terminal states not shown)
• Alter the one array item that you know about from this sample: qvals[a] = td_target
• Append s to your train_X data and qvals to your train_Y data
• Fit the minibatch n_learn.fit(train_X, train_Y)

* It is possible to vectorise these calculations for efficiency. I show it as a for loop because it is simpler to describe that way.
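The steps above can be sketched end to end in numpy (everything here is a stand-in under stated assumptions: one-hot states, linear "networks" as plain weight matrices, a hand-written gradient step in place of the framework's fit call, and made-up sizes, minibatch, and discount factor):

```python
import numpy as np

rng = np.random.default_rng(1)
N_S, N_A, GAMMA = 5, 3, 0.99  # hypothetical sizes and discount factor

# hypothetical linear "networks": weight matrices mapping one-hot states to Q-values
w_learn = rng.normal(size=(N_S, N_A))
w_target = w_learn.copy()

def one_hot(s):
    x = np.zeros(N_S)
    x[s] = 1.0
    return x

def predict(w, s):
    return one_hot(s) @ w  # all Q-values for state s

minibatch = [(0, 1, 1.0, 2, False), (2, 0, 0.0, 4, True)]  # (s, a, r, next_s, done)
train_X, train_Y = [], []
for s, a, r, next_s, done in minibatch:
    qvals = predict(w_learn, s)                # current outputs for s
    td_target = r if done else r + GAMMA * max(predict(w_target, next_s))
    qvals[a] = td_target                       # alter only the item we know about
    train_X.append(one_hot(s))
    train_Y.append(qvals)

# one gradient step on the MSE, standing in for n_learn.fit(train_X, train_Y)
X, Y = np.stack(train_X), np.stack(train_Y)
w_learn -= 0.1 * X.T @ (X @ w_learn - Y) / len(X)
```

Because the targets equal the current outputs everywhere except the taken actions, only the visited (s, a) weights move; all other entries of w_learn are untouched by the update.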

Thank you again for answering. I think I got it, and I also think I implemented it the right way. Would you mind looking over it and telling me what I got wrong here? https://github.com/OleVoss/DeepTaxi

PS: I tried it with various architectures and hyperparameters.

– OleVoß – 2020-01-15T14:46:48.293

Looks OK on a first pass through, but I'm not familiar enough with PyTorch to comment on how you manipulated the Q values in vectorised form. There are a few things you could try, including simplifying your network right down to no hidden layers: since each state is one-hot encoded, it is essentially learning a Q table, and the weights will actually be equal to the learned Q values. There is no generalisation possible between states due to the way you have encoded them. – Neil Slater – 2020-01-15T15:46:41.843

Ok, I appreciate the attempt. Technically it is learning something, but every time it learns to take just one action... – OleVoß – 2020-01-15T16:05:45.743