It is quite common in DQN, instead of having the neural network represent the function $f(s,a) = \hat{q}(s,a,\theta)$ directly, to have it represent $f(s)= [\hat{q}(s,1,\theta), \hat{q}(s,2,\theta), \hat{q}(s,3,\theta), \ldots, \hat{q}(s,N_a,\theta)]$, where $N_a$ is the number of actions and the input is the current state. That is what is going on here. It is usually done for a performance gain, since calculating all the action values at once is faster than calculating them individually.
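As a minimal sketch of this vector-output idea, here is a toy "network" that maps a one-hot state to all $N_a$ action values in one call (the linear weight matrix `W` is purely illustrative, not part of any real DQN implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
N_S, N_A = 4, 3  # number of states and actions (hypothetical sizes)

# A stand-in "network" with a single linear layer and no bias:
# one forward pass returns Q values for every action at once.
W = rng.normal(size=(N_S, N_A))

def f(s_onehot):
    return s_onehot @ W  # shape (N_A,): [q(s,1), q(s,2), ..., q(s,N_A)]

s = np.eye(N_S)[2]  # one-hot encoding of state 2
qvals = f(s)        # all action values for state 2 in one call
```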

However, in a Q-learning update, you cannot adjust this vector of output values for actions that you did not take. You can do one of two things:

Figure out the gradient due to the one item with a TD error, and propagate that backwards. This involves inserting a known gradient into the normal training update step in a specific place and working from there. This works best if you are implementing your own backpropagation with low-level tools, otherwise it can be a bit fiddly figuring out how to do it in a framework like Keras.

Force the gradients of all other items to be zero by setting the target outputs to be whatever the learning network is currently generating.

If you are using something like Keras, the second approach is the way to go. A concrete example, where you have two networks `n_learn` and `n_target` that output arrays of Q values, might look like this:

For each sample `(s, a, r, next_s, done)` in your minibatch*:

- Calculate array of action values from your learning network
`qvals = n_learn.predict(s)`

- Calculate the TD target for $(s,a)$, e.g.
`td_target = r + max(n_target.predict(next_s))`

(discount factor and how to handle terminal states not shown)
- Alter the one array item that you know about from this sample
`qvals[a] = td_target`

- Append `s` to your `train_X` data and `qvals` to your `train_Y` data

Fit the minibatch: `n_learn.fit(train_X, train_Y)`

* It is possible to vectorise these calculations for efficiency; I show it as a for loop because it is simpler to describe that way.

Thank you again for answering. I think I got that. I also think that I implemented it the right way. Would you mind looking over it and telling me what I got wrong here? https://github.com/OleVoss/DeepTaxi

PS: I tried it with various architectures and hyperparameters.

Looks ok on a first pass through, but I'm not familiar enough with PyTorch to comment on how you manipulated the q values in vectorised form. There are a few things you could try, including simplifying your network right down to no hidden layers, because essentially it is learning a Q table when each state is one hot encoded, and the weights will actually be equal to the learned Q values. There is no generalisation possible between states due to the way you have encoded them. – Neil Slater – 2020-01-15T15:46:41.843

Ok. I appreciate you trying. Technically it is learning something, but every time it is learning to take just one action... – OleVoß – 2020-01-15T16:05:45.743