
I have recently started working on a control problem using a Deep Q Network, as proposed by DeepMind (https://arxiv.org/abs/1312.5602). Initially, I implemented it without Experience Replay (ER), and the results were very satisfying. After implementing ER, however, the results became relatively bad, so I started experimenting with BATCH SIZE and MEMORY CAPACITY.

(1) I noticed that if I set BATCH SIZE = 1 and MEMORY CAPACITY = 1, i.e. the equivalent of plain online learning as before, the results are (almost) the same as initially.

(2) If I increase CAPACITY and BATCH SIZE, e.g. CAPACITY = 2000 and BATCH SIZE = 128, the Q values for all actions tend to converge to very similar negative values.

A small negative reward of -1 is received for every state transition, except for reaching the desired state, which yields a +10 reward. My gamma is 0.7. Every state is discrete, and after taking action a the environment can transition to any of X states, each with a significant probability.
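To illustrate (my own toy calculation, not part of the original question): with gamma = 0.7, the discounted contribution of the rare +10 goal reward decays quickly with the number of steps to the goal, so states far from the goal see an almost purely negative signal.

```python
# Discounted contribution of the +10 goal reward seen from k steps away.
gamma = 0.7
for k in [1, 5, 10, 20]:
    print(k, round(gamma ** k * 10, 4))
# 1  -> 7.0
# 5  -> 1.6807
# 10 -> 0.2825
# 20 -> 0.008
```

At 20 steps out, the goal reward's influence is already far smaller than the per-step -1 penalty.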

Receiving a positive reward is very rare, since reaching a desired state can take a long time. Thus, when sampling 128 experiences, only a small number of them (if any) may carry a positive reward.

Since, in mini-batch training, we average the loss over all samples and then update the DQN, I was wondering whether the positive rewards can effectively become meaningless because they are 'dominated' by the negative ones. Would this result in much slower convergence to the true Q values? And would it also explain the convergence to similar negative values in (2)? Is this something expected? I am looking to implement Prioritised ER as a potential solution, but is there something wrong in the above logic?
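A toy illustration of the dilution effect (hypothetical numbers, not from the actual implementation): suppose a batch of 128 reward signals contains only 2 transitions that carry the rare +10 reward, with the rest at -1. Averaging over the batch lets the 126 negative samples dominate.

```python
import numpy as np

# Hypothetical batch: 126 ordinary transitions (reward -1) and 2 rare
# rewarding transitions (reward +10), as might be sampled uniformly
# from the replay memory early in training.
rewards = np.full(128, -1.0)
rewards[:2] = 10.0

# The batch mean, which drives the averaged gradient update.
mean_update = rewards.mean()
print(mean_update)  # -0.828125: the positive signal is heavily diluted
```

Even though +10 is ten times larger in magnitude than -1, the averaged signal stays negative because the rewarding transitions are so rare in a uniform sample.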

I hope this makes sense. Please forgive me if I made a wrong assumption above, as I am new to the field.

**Edit**: The problem was indeed that rewards were found very rarely and therefore almost never sampled, especially at the beginning of training, which in turn resulted in very slow convergence to the actual Q values. The problem was successfully solved using Prioritised ER, but I believe any form of careful Stratified Sampling would give good results.
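For reference, a minimal proportional prioritised-replay sketch (my own simplification of the Schaul et al. scheme, not the asker's actual implementation; it omits importance-sampling weights and uses a plain list instead of a sum-tree): transitions are sampled with probability proportional to |TD error|^alpha, so rare high-error (rewarding) transitions are replayed far more often than under uniform sampling.

```python
import numpy as np

class PrioritizedReplay:
    """Toy proportional prioritised replay buffer (illustrative only)."""

    def __init__(self, capacity, alpha=0.6):
        self.capacity = capacity
        self.alpha = alpha          # how strongly priorities skew sampling
        self.buffer = []            # stored transitions
        self.priorities = []        # one priority per transition

    def push(self, transition, td_error=1.0):
        # Evict the oldest transition once capacity is reached.
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(transition)
        self.priorities.append(abs(td_error) ** self.alpha)

    def sample(self, batch_size):
        # Sample indices with probability proportional to priority.
        probs = np.array(self.priorities)
        probs = probs / probs.sum()
        idx = np.random.choice(len(self.buffer), batch_size, p=probs)
        return [self.buffer[i] for i in idx], idx

    def update_priorities(self, idx, td_errors):
        # Refresh priorities after each training step.
        for i, e in zip(idx, td_errors):
            self.priorities[i] = abs(e) ** self.alpha
```

With this scheme, a transition carrying the rare +10 reward (and hence a large TD error) keeps being replayed until the network has absorbed it, which addresses the slow-convergence problem described above.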

Just curious, are you using a duplicate network to store the network weights after every $C$ updates, and using the duplicate network to calculate the expected value of state action pairs for the DQN update error term? (Ie “q-hat”) – Hanzy – 2019-04-29T15:48:33.277

Thank you for your comment. If I understand correctly, yes, I am using a target net (fixed Q values) which I update every 10 episodes. – George Papagiannis – 2019-04-29T15:53:41.130

Sorry if I ask more questions, just trying to get a sense of your implementation. When you used batch size = 1 and capacity = 1, did you also use the same target network updated after 10 episodes? When doing the "online" version you are really training on-policy rather than off-policy, since every update follows a trajectory. So that's why I'm curious to know what factors changed between the two. Off-policy learning with function approximation and bootstrapping (what Sutton / Barto call the "deadly triad") is known to face issues of stability. – Hanzy – 2019-04-29T16:26:10.747

Yes, so when batch size = 1 and capacity = 1 I use the exact same configuration: I have the main policy net Q and target net Q_hat. I get the action to perform using Q, calculate the error using Q_hat for the new state, and update Q_hat every 10 episodes. I sample from ER and train every time I perform an action on the environment. – George Papagiannis – 2019-04-29T16:30:36.383

This sounds like an off-policy problem to me. What you proposed seems reasonable... since you're training off-policy and sampling out of trajectory AND averaging over the updates, it seems reasonable it could affect your performance. This is a bit of a hack, but can you try making your reward for goal states >= batch size? I'll expand more on this idea in an answer since I'm limited by characters here. – Hanzy – 2019-04-29T16:52:30.110

Thank you, much appreciated. – George Papagiannis – 2019-04-29T17:01:37.330

I hope what I said helps some. Without digging into your code more I may be missing some part of what you’re seeing, but it’s how I interpret what you describe. – Hanzy – 2019-04-29T17:25:27.350