> My best guess is that it's been done to reduce the computation time; otherwise we would have to find out the Q value for each action and then select the best one.

It has no real impact on computation time, other than a slight increase (due to extra memory used by two networks). You *could* cache results of the target network I suppose, but it probably would not be worth it for most environments, and I have not seen an implementation which does that.

> Am I missing something?

It is to do with the stability of the Q-learning algorithm when using function approximation (i.e. the neural network). Using a separate target network, updated every so many steps with a copy of the latest learned parameters, keeps the runaway bias that bootstrapping can introduce from dominating the system numerically and causing the estimated Q values to diverge.
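The "updated every so many steps" part is just a periodic hard copy. A minimal sketch (the function name, the dict representation of parameters, and the `copy_every` value are all illustrative, not from any particular library):

```python
def sync_target(online_params, target_params, step, copy_every=1000):
    """Hard update: every `copy_every` steps, overwrite the target
    network's parameters with a copy of the online network's.
    Between copies the target network stays frozen."""
    if step % copy_every == 0:
        target_params = dict(online_params)  # full copy, not a shared reference
    return target_params
```

Some implementations instead use a "soft" update, blending a small fraction of the online parameters into the target at every step; both serve the same stabilising purpose.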

Imagine one of the data points (at $s, a, r, s'$) causes a currently poor over-estimate of $q(s', a')$ to get worse. Maybe $s', a'$ has not even been visited yet, or the values of $r$ seen so far are higher than average, just by chance. If a sample of $(s, a)$ cropped up multiple times in experience replay, the over-estimate would get worse each time, because the update to $q(s,a)$ is based on the TD target $r + \max_{a'} q(s',a')$. Fixing the target network limits the damage that such over-estimates can do, giving the learning network time to converge and lose more of its initial bias.
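To make the TD target concrete, here is a small sketch of computing $r + \gamma \max_{a'} q(s',a')$ where the next-state Q values come from the frozen target network (the function name and the discount value are illustrative):

```python
import numpy as np

GAMMA = 0.99  # discount factor (illustrative value)

def td_target(reward, next_q_values, done):
    """TD target r + gamma * max_a' Q(s', a'), where next_q_values
    are the *target* network's estimates for the next state s'.
    Terminal transitions bootstrap nothing."""
    if done:
        return reward
    return reward + GAMMA * np.max(next_q_values)

# Hypothetical target-network estimates for three actions in s':
target = td_target(reward=1.0, next_q_values=np.array([0.2, 0.5, 0.1]), done=False)
# target = 1.0 + 0.99 * 0.5 = 1.495
```

Because `next_q_values` come from a network that only changes every N steps, a transient over-estimate in the learning network cannot immediately feed back into its own targets.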

In this respect, using a separate target network has a very similar purpose to experience replay. It stabilises an algorithm that otherwise has problems converging.
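For comparison, experience replay itself is just a fixed-size store of transitions sampled uniformly, which breaks the correlation between consecutive experiences. A minimal sketch (class and method names are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s', done) transitions.
    Uniform sampling decorrelates consecutive experiences, which,
    like the target network, helps stabilise learning."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # old transitions fall off the front

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```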

It is also possible to combine DQN with "double learning", which addresses a separate issue: maximisation bias. There, the learning network selects the maximising action and the target network evaluates it, so a chance over-estimate by one network is not blindly confirmed by the other. In that case you may see DQN implementations with 4 neural networks.
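The Double DQN target differs from the plain one only in who picks the action: the learning (online) network chooses $a'$, the target network scores it. A sketch, with illustrative names:

```python
import numpy as np

GAMMA = 0.99  # discount factor (illustrative value)

def double_dqn_target(reward, online_next_q, target_next_q, done):
    """Double-DQN target: the *online* network selects argmax a',
    the *target* network evaluates it, reducing maximisation bias."""
    if done:
        return reward
    best_action = int(np.argmax(online_next_q))  # selection by online net
    return reward + GAMMA * target_next_q[best_action]  # evaluation by target net
```

If the online network over-estimates an action that the target network rates as mediocre, the resulting target is pulled down rather than inflated.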

For additional reading, see "Deep Reinforcement Learning with Double Q-learning" (van Hasselt, Guez and Silver, AAAI 2016): http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/download/12389/11847

– amitection – 2018-08-06T17:00:51.483