What are the differences between the DQN variants?


There are several variants of the DQN model. For example, double DQN, duelling DQN, prioritized DQN, distributed prioritized DQN, episodic memory DQN, asynchronous n-step DQN and multiple DQN. What are the differences between all these variants? What are their advantages and disadvantages? When should I use one over the other?

I am looking for an answer that (briefly) describes all the variants (that we are aware of) and then compares them.


Posted 2019-03-23T12:38:31.063

Reputation: 19 783



I assume the OP understands the basics of DQN: temporal difference (TD) learning, the target network and the replay buffer. My experience is just my own observations; it does not necessarily generalize.

Some of those variants:

Double DQN: a small difference in how the TD target is formed. Instead of bootstrapping from the target network's own maximal action, the Q-value is taken from the target network at the action that maximizes the *current* (online) network. This is claimed to reduce the overestimation bias that plain TD bootstrapping is prone to. My experience: not really helpful.
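To make the difference concrete, here is a minimal sketch of the Double DQN target computation in numpy; the function name and array shapes are my own choices for illustration:

```python
import numpy as np

def double_dqn_target(q_online_next, q_target_next, rewards, dones, gamma=0.99):
    """Double DQN TD target for a batch of transitions.

    q_online_next: (batch, n_actions) Q-values of next states from the online net
    q_target_next: (batch, n_actions) Q-values of next states from the target net
    rewards, dones: (batch,) arrays; dones is 1.0 where the episode ended
    """
    # Action SELECTION uses the online network ...
    best_actions = np.argmax(q_online_next, axis=1)
    # ... but the action's VALUE is read from the target network.
    q_next = q_target_next[np.arange(len(best_actions)), best_actions]
    # Plain DQN would instead use q_target_next.max(axis=1) for both steps.
    return rewards + gamma * (1.0 - dones) * q_next
```

Vanilla DQN uses `q_target_next.max(axis=1)`, which both selects and evaluates with the same (noisy) estimate; decoupling the two is the entire trick.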

Dueling DQN: a difference in network architecture. Q is decomposed into an advantage stream A(s, a) = Q(s, a) - V(s) and a value stream V(s), which are computed separately and then recombined (with the mean advantage subtracted for identifiability). My experience: not really helpful.
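The recombination step is where implementations most often go wrong, so here is a sketch of just that aggregation (the function name is mine; the mean-subtraction follows the dueling architecture paper):

```python
import numpy as np

def dueling_q(value, advantage):
    """Combine the two streams of a dueling network into Q-values.

    value:     (batch, 1)         output of the value stream V(s)
    advantage: (batch, n_actions) output of the advantage stream A(s, a)

    The mean advantage is subtracted so that V and A are identifiable;
    otherwise adding a constant to V and subtracting it from A leaves Q
    unchanged and training is ill-posed.
    """
    return value + advantage - advantage.mean(axis=1, keepdims=True)
```

Note that the greedy action is unchanged by the aggregation (argmax over Q equals argmax over A), which is why the decomposition is purely an architectural prior.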

Prioritized DQN: instead of sampling the replay buffer uniformly, transitions are sampled with probability depending on their TD error. My experience: really helps; a must-have for any implementation.

Episodic memory DQN: this is a really ambitious method. It combines a memory-augmented network (a differentiable dictionary, similar in spirit to the Neural Turing Machine) with DQN. Difficult to implement correctly. My experience is ambiguous: I only tried it in a much simplified form, and it was not clear to me whether it helped. I intend to return to it later. It also incurs a considerable performance cost.
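To give a flavor of the differentiable dictionary, here is a deliberately simplified sketch in the spirit of episodic-control agents: store (key, Q) pairs and read Q as a kernel-weighted average over the k nearest keys. Everything here (class name, inverse-distance kernel, fixed keys with no learning) is a hypothetical minimal version, not any paper's actual implementation:

```python
import numpy as np

class EpisodicDictionary:
    """Toy episodic memory: k-nearest-neighbor lookup over stored Q-values."""

    def __init__(self, k=5, delta=1e-3):
        self.k, self.delta = k, delta
        self.keys, self.values = [], []

    def write(self, key, q):
        # Store a state embedding and the return/Q estimate observed for it.
        self.keys.append(np.asarray(key, dtype=float))
        self.values.append(float(q))

    def read(self, query):
        keys = np.stack(self.keys)
        d2 = np.sum((keys - np.asarray(query, dtype=float)) ** 2, axis=1)
        nn = np.argsort(d2)[: self.k]          # k nearest stored keys
        w = 1.0 / (d2[nn] + self.delta)        # inverse-distance kernel
        w /= w.sum()
        return float(np.dot(w, np.asarray(self.values)[nn]))
```

Since every read is a nearest-neighbor search over the whole memory, the performance cost mentioned above is easy to see even in this toy version.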

Missing here:

C51 architecture - this could be an important advance. It is an attempt to get rid of the regression in DQN (regression is hard for NNs) and replace it with classification over a discrete support of returns. It can be costly from a performance point of view. Some reports claim it is no worse than the best-performing deep RL methods (with the exception of tree-search-based methods like AlphaZero).
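The "classification instead of regression" idea can be sketched very compactly: the network outputs logits over a fixed support of return values (51 atoms in the original C51), and the scalar Q-value is recovered as the expectation. The function name and default bounds below are illustrative; the full algorithm additionally needs the categorical projection of the Bellman target, which is omitted here:

```python
import numpy as np

def c51_q_values(logits, v_min=-10.0, v_max=10.0):
    """Recover Q-values from C51-style categorical outputs.

    logits: (n_actions, n_atoms) unnormalized log-probabilities, one
    categorical distribution over the return support per action.
    """
    n_atoms = logits.shape[1]
    z = np.linspace(v_min, v_max, n_atoms)            # fixed support atoms
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)                 # softmax per action
    return p @ z                                      # Q(s, a) = E_p[Z]
```

Action selection still takes an argmax over these expected values; the distributional output only changes the training signal, which is a cross-entropy loss against the projected target distribution rather than a squared TD error.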


Posted 2019-03-23T12:38:31.063

Reputation: 547