
# The situation

I am referring to the paper T. P. Lillicrap et al., "Continuous control with deep reinforcement learning", where they discuss deep learning in the context of continuous action spaces ("Deep Deterministic Policy Gradient").

Based on the DPG approach ("Deterministic Policy Gradient", see D. Silver et al., "Deterministic Policy Gradient Algorithms"), which employs two neural networks to approximate the actor function `mu(s)` and the critic function `Q(s,a)`, they use a similar structure.

However, one characteristic they found is that, in order to make the learning converge, it is necessary to have two additional "target" networks `mu'(s)` and `Q'(s,a)`, which are used to calculate the target ("true") value of the reward:

```
y_t = r(s_t, a_t) + gamma * Q'(s_{t+1}, mu'(s_{t+1}))
```
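The target computation above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `q_target` and `mu_target` are hypothetical stand-ins for the target networks `Q'` and `mu'`, implemented here as toy callables so the example is self-contained.

```python
import numpy as np

def td_target(r, s_next, q_target, mu_target, gamma=0.99, done=False):
    """Compute the target y_t = r + gamma * Q'(s_{t+1}, mu'(s_{t+1})).

    q_target and mu_target stand in for the target networks Q' and mu'.
    """
    if done:  # no bootstrapping on terminal transitions
        return r
    a_next = mu_target(s_next)                    # mu'(s_{t+1})
    return r + gamma * q_target(s_next, a_next)   # y_t

# Toy stand-ins for the target networks (assumptions, not the paper's nets):
mu_target = lambda s: np.tanh(s)               # deterministic "policy"
q_target = lambda s, a: float(np.dot(s, a))    # linear "critic"

s_next = np.array([0.5, -0.2])
y = td_target(r=1.0, s_next=s_next, q_target=q_target, mu_target=mu_target)
```

In practice `y_t` is computed for a minibatch sampled from the replay buffer and used as the regression target for the critic `Q`.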

Then, after each training step, a "soft" update of the target weights `w_mu', w_Q'` with the actual weights `w_mu, w_Q` is performed:

```
w' = (1 - tau)*w' + tau*w
```

where `tau << 1`. According to the paper:

> This means that the target values are constrained to change slowly, greatly improving the stability of learning.
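The soft update `w' = (1 - tau)*w' + tau*w` can be sketched as follows; this is an illustrative version operating on plain NumPy arrays, with arbitrary toy shapes standing in for the actual network weights.

```python
import numpy as np

def soft_update(target_weights, weights, tau=0.001):
    """In-place soft update: w' <- (1 - tau) * w' + tau * w, per weight array."""
    for w_t, w in zip(target_weights, weights):
        w_t *= (1.0 - tau)
        w_t += tau * w

# Toy weight lists standing in for w_Q' and w_Q (shapes are arbitrary):
w_target = [np.zeros((2, 2)), np.zeros(2)]
w_online = [np.ones((2, 2)), np.ones(2)]

soft_update(w_target, w_online, tau=0.001)
# each target entry moves only 0.1% of the way toward the online weights
```

With `tau = 0.001` (the value used in the paper), the target networks trail the online networks slowly, which is exactly the stabilizing effect described in the quote above.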

So the target networks `mu'` and `Q'` are used to predict the "true" (target) value of the expected reward, which the other two networks try to approximate during the learning phase.

They also sketch the full training procedure in the paper.

# The question

So my question now is: after the training is complete, which of the two networks, `mu` or `mu'`, should be used for making predictions?

Analogously to the training phase, I suppose that `mu` should be used (without the exploration noise), but since it is `mu'` that is used during training to predict the "true" (noise-free) action for the reward computation, I am inclined to use `mu'`.

Or does this even matter? If the training lasts long enough, shouldn't both versions of the actor have converged to the same state?