## My DQN is stuck and I can't see where the problem is

I'm trying to replicate the results of the DeepMind paper, so I implemented my own DQN. I left it training for more than 4 million frames (more than 2000 episodes) on SpaceInvaders-v4 (OpenAI Gym) and it couldn't finish a full episode. I tried two different learning rates (0.0001 and 0.00125); it seems to work better with 0.0001, but the median score never rises above 200. I'm using a double DQN. Here is my code and some photos of the graphs I'm getting each session. Between sessions I save the network weights, and I update the target network every 1000 steps. I can't see whether I'm doing something wrong, so any help would be appreciated. I'm using the same CNN architecture as the DQN paper.

Here's the action selection function; it takes a batch of 4 processed 80x80 grayscale frames and selects the action (s_batch stands for state batch):

    def action_selection(self, s_batch):
        action_values = self.parallel_model.predict(s_batch)
        best_action = np.argmax(action_values)
        best_action_value = action_values[0, best_action]
        random_value = np.random.random()

        if random_value < AI.epsilon:
            best_action = np.random.randint(0, AI.action_size)
        return best_action, best_action_value


Here is my training function. It trains on past experiences; I tried to implement it so that if the agent loses a life, it doesn't get any extra reward, so in theory it would try not to die:

    def training(self, replay_s_batch, replay_ns_batch):
        Q_values = []
        batch_size = len(AI.replay_s_batch)
        Q_values = np.zeros((batch_size, AI.action_size))

        for m in range(batch_size):
            Q_values[m] = self.parallel_model.predict(AI.replay_s_batch[m].reshape(AI.batch_shape))
            new_Q = self.parallel_target_model.predict(AI.replay_ns_batch[m].reshape(AI.batch_shape))
            Q_values[m, [item[0] for item in AI.replay_a_batch][m]] = AI.replay_r_batch[m]

            if np.all(AI.replay_d_batch[m] == True):
                Q_values[m, [item[0] for item in AI.replay_a_batch][m]] = AI.gamma * np.max(new_Q)

        if lives == 0:
            loss = self.parallel_model.fit(np.asarray(AI.replay_s_batch).reshape(batch_size, 80, 80, 4), Q_values, batch_size=batch_size, verbose=0)

        if AI.epsilon > AI.final_epsilon:
            AI.epsilon -= (AI.initial_epsilon - AI.final_epsilon) / AI.epsilon_decay


replay_s_batch is a batch of batch_size experience-replay states (stacks of 4 frames), and replay_ns_batch holds the corresponding next states. The batch size is 32.
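For reference, this is the per-sample target I believe the loop above should be computing: r plus gamma times the maximum next-state Q-value from the target network, with no bootstrapping at terminal states. A vectorized sketch (all function and array names here are illustrative, not my actual variables):

```python
import numpy as np

def dqn_targets(q_online, q_target_next, actions, rewards, dones, gamma=0.99):
    """Standard DQN targets: for each sample, the target for the taken action
    is r + gamma * max_a' Q_target(s', a'), or just r if the episode ended.
    Untaken actions keep the online network's predictions, so they contribute
    no gradient."""
    targets = q_online.copy()                # leave untaken actions unchanged
    max_next = q_target_next.max(axis=1)     # max over next-state Q-values
    targets[np.arange(len(actions)), actions] = (
        rewards + gamma * max_next * (1.0 - dones)  # zero bootstrap if done
    )
    return targets
```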

And here are some results, after training:
In blue, the loss (I think it's correct; it's near zero). The red dots are the individual match scores (as you can see, it sometimes plays really good matches). In green, the median (near 190 in this training run, with learning rate = 0.0001). Here is the last training run, with lr = 0.00125; the results are worse (its median is about 160). In any case the line is almost flat; I don't see any variation either way. Can anyone point me in the right direction? I tried a similar approach with the pendulum and it worked properly. I know that Atari games take more time, but I think a week or so should be enough, and it seems to be stuck. If someone needs to see another part of my code, just tell me.

Edit: With the suggestions provided, I modified the action_selection function. Here it is:

    def action_selection(self, s_batch):
        if np.random.rand() < AI.epsilon:
            best_action = env.action_space.sample()
        else:
            action_values = self.parallel_model.predict(s_batch)
            best_action = np.argmax(action_values[0])
        return best_action


To clarify my last edit: action_values holds the Q-values; best_action is the action corresponding to the maximum Q-value. Should I return that, or just the maximum Q-value?
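To illustrate the difference I'm asking about with toy numbers (the Q-values here are made up):

```python
import numpy as np

action_values = np.array([[0.1, 0.7, 0.2]])  # fake Q-values for one state
best_action = np.argmax(action_values[0])    # index of the best action: 1
best_value = np.max(action_values[0])        # the Q-value itself: 0.7
```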

That's a lot to read, and I cannot promise anyone will work through all of it, though kudos to you for giving all the information! One question: have you gone straight into implementing a DQN on this Atari problem, or have you tried some of the simpler environments first? There are quite a few hyper-parameters and implementation details where your attempt could go wrong, and trying a few simpler environments first gives you a chance to get some of this correct with less to debug all at once. – Neil Slater – 2019-02-22T23:49:40.527

I tried the pendulum first, but it doesn't use a CNN. It worked properly. Is that what you are referring to, Neil? I think if there is a problem, it's in my action selection or training. Should I edit the question? – JCP – 2019-02-23T00:04:39.167

Yes, that's the sort of thing I mean. You could add to the question that your code worked OK on inverted pendulum. Also worth saying which parts you have changed since then. Don't add too much detail though, this is already a very long question – Neil Slater – 2019-02-23T00:08:30.030

Done! Hope it's more readable, thanks Neil! – JCP – 2019-02-23T00:20:21.653

Just a piece of advice to make your code a bit faster: in the action_selection method you make a pass through the ANN and calculate the max action, and only afterwards take epsilon into account for the random action choice. During training, especially in early episodes, all that calculation is done for nothing because you end up taking a random action anyway. It's better to consider the epsilon-greedy action choice first and only do the ANN pass if the action isn't random. Also, in that method you seem to return the best action value regardless of whether the action is random or not. Not sure if that changes anything. – Brale – 2019-02-23T10:22:57.100

Thanks Brale_, I'll update my code with your suggestions; it should make it a bit faster. – JCP – 2019-02-23T14:45:22.823

After some research and reading this post, I see where my problem was: I was feeding in a full consecutive batch of experiences. The batch was selected at a random position, yes, but the experiences within it were consecutive. After redoing my experience-selection method, my DQN is actually working and has reached about +200 points after 400000 experiences (about 500 episodes; only 2-3 hours of training!). Before, I couldn't reach that score after days of training. I'll let it keep training to see if there is anything I can improve. Thanks to everyone who tried to help me! I leave this answer here in case someone has the same problem.
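In case it helps anyone, the fix amounts to sampling each transition independently instead of taking a consecutive slice of the buffer. A minimal sketch (function and variable names are illustrative, not my actual code):

```python
import random

def sample_batch(replay_buffer, batch_size=32):
    # Wrong (what I was doing): pick a random *starting point*, but the
    # transitions themselves stay consecutive and highly correlated:
    #   start = random.randrange(len(replay_buffer) - batch_size)
    #   return replay_buffer[start:start + batch_size]
    #
    # Right: sample each transition independently, without replacement,
    # so the batch is decorrelated:
    return random.sample(replay_buffer, batch_size)
```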

The mistake is that the Deep Q-Network isn't able to return the needed action directly; it can only make a spatio-temporal prediction. In line 2 of the OP's source code we can read:

    action_values = self.parallel_model.predict(s_batch)


This line means that the output of the neural network is treated as an action which is fed into the controller. This won't work. If the AI controller is to play Space Invaders, the overall system consists of two modules: prediction and controller. That is, the output of the DQN is not able to control the system directly.

• Oh, Junhyuk, et al. "Action-conditional video prediction using deep networks in atari games." Advances in neural information processing systems. 2015.

• Stadie, Bradly C., Sergey Levine, and Pieter Abbeel. "Incentivizing exploration in reinforcement learning with deep predictive models." arXiv preprint arXiv:1507.00814 (2015).

• Leibfried, Felix, et al. "Model-Based Stabilisation of Deep Reinforcement Learning." arXiv preprint arXiv:1809.01906 (2018).

Can you give me an example? I don't understand what you are trying to tell me. My CNN's outputs are the actions you can take in the game; I associate the prediction with each action and pick the one with maximum probability. – JCP – 2019-02-23T15:19:54.730

This kind of mistake is very common. The Q-table contains the actions that control the game, and the programmer tries to improve the policy without success. A direct policy doesn't work, and it isn't described in the deep-learning literature. If we read the papers carefully, we notice that DQN networks are similar to model predictive control. And this has to be realized in the Python source code in order to successfully replicate existing tutorials. – Manuel Rodriguez – 2019-02-23T17:05:25.487

I think I understand. You calculate the Q-values with the CNN, right? Then you search for the highest value, because it's supposed to give you the highest reward. Then you associate it with its corresponding action. Should something like my last edit work? – JCP – 2019-02-23T18:09:13.773

Yes, the description is correct. The cumulative reward is equal to the prediction error of the learned model against the game engine. And improving the score lets the agent see more steps into the future; see page 7 of "Leibfried/Kushman/Hofmann: A deep learning approach for joint video frame and reward prediction in Atari games, arXiv:1611.07078 (2016)." – Manuel Rodriguez – 2019-02-23T21:13:44.860

OK, I'll let it train tonight with the new function and see if it learns properly. I'll post whether it works or not! Anyway, I suspect there is something wrong with my training function, maybe the batches. – JCP – 2019-02-23T22:38:05.420

OK, I left it training for 9-10 hours and I don't see any improvements. Any ideas? Maybe my image processing is wrong? – JCP – 2019-02-24T13:16:31.310