I am referring to the paper T. P. Lillicrap et al., "Continuous control with deep reinforcement learning", where they discuss deep learning in the context of continuous action spaces ("Deep Deterministic Policy Gradient").
Building on the DPG approach ("Deterministic Policy Gradient", see D. Silver et al., "Deterministic Policy Gradient Algorithms"), which employs two neural networks to approximate the actor function
mu(s) and the critic function
Q(s,a), they use a similar structure.
However, one characteristic they found is that, in order to make the learning converge, it is necessary to have two additional "target" networks
mu'(s) and Q'(s,a), which are used to calculate the target ("true") value of the expected return:
y_t = r(s_t, a_t) + gamma * Q'(s_{t+1}, mu'(s_{t+1}))
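To make the target computation concrete, here is a minimal sketch in NumPy. The linear "networks" `mu_target` and `q_target`, their weight shapes, and the sample values are all hypothetical stand-ins for the target actor mu' and target critic Q':

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weights for the target networks (hypothetical shapes for illustration)
W_mu = rng.normal(size=(2, 3))   # target actor: maps 3-dim state to 2-dim action
W_q = rng.normal(size=(5,))      # target critic: linear in concatenated (state, action)

def mu_target(s):
    # mu'(s): deterministic target policy
    return np.tanh(W_mu @ s)

def q_target(s, a):
    # Q'(s, a): target critic evaluated on (state, action)
    return W_q @ np.concatenate([s, a])

gamma = 0.99                     # discount factor
s_next = rng.normal(size=3)      # s_{t+1}, the successor state
r = 1.0                          # observed reward r(s_t, a_t)

# y_t = r(s_t, a_t) + gamma * Q'(s_{t+1}, mu'(s_{t+1}))
y = r + gamma * q_target(s_next, mu_target(s_next))
```

The point is that only the primed (target) networks appear on the right-hand side, so y_t does not shift every time the learned networks are updated.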
Then after each training step a "soft" update of the target weights
w_mu', w_Q' with the actual weights
w_mu, w_Q is performed:
w' = (1 - tau)*w' + tau*w
with tau << 1. According to the paper, "this means that the target values are constrained to change slowly, greatly improving the stability of learning."
So the target networks
Q' are used to predict the "true" (target) value of the expected reward which the other two networks try to approximate during the learning phase.
They sketch the full training procedure in Algorithm 1 of the paper.
So my question now is: after the training is complete, which of the two actor networks,
mu or mu', should be used for making predictions?
Analogously to the training phase, I suppose that
mu should be used (without the exploration noise), but since it is
mu' that is used during training to predict the "true" (noise-free) action for the target computation, I'm apt to use mu' instead.
Or does this even matter? If the training lasts long enough, shouldn't both versions of the actor have converged to the same weights anyway?