Why is Q2 a more or less independent estimate in Twin Delayed DDPG (TD3)?

Twin Delayed Deep Deterministic policy gradient (TD3) is inspired by both double Q-learning and double DQN. In double Q-learning, I understand that Q1 and Q2 are independent because they are trained on different samples. In double DQN, I understand that the target network and the current network are relatively independent because their parameters differ substantially.

But in TD3, Q1 and Q2 are trained on exactly the same target. If their parameters were initialized identically, their outputs would never differ and the algorithm would reduce to DDPG with a single critic. The only source of independence between Q2 and Q1 that I can see is the randomness in their parameter initialization, and since both networks train toward the same target, I would expect that independence to shrink as they converge to the same target values. So I don't quite understand why TD3 works in combating overestimation in Q-learning.
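For concreteness, the shared target in question is TD3's clipped double-Q target, y = r + γ · min(Q1', Q2'). Below is a minimal numpy sketch of how that target is formed; the linear critics, feature shapes, and batch values are all hypothetical stand-ins for the actual networks, chosen only to make the minimum operation explicit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear critics Q_i(s, a) = [s, a] . w_i. As in TD3, the
# only thing distinguishing them is their random initialization.
w1 = rng.normal(size=4)
w2 = rng.normal(size=4)

# A toy batch: concatenated (next_state, next_action) features and rewards.
next_sa = rng.normal(size=(32, 4))
reward = rng.normal(size=32)
gamma = 0.99

q1_next = next_sa @ w1
q2_next = next_sa @ w2

# Clipped double-Q target: both critics regress toward this SAME value,
# built from the elementwise minimum of the two estimates. Because it
# takes the min, the target can never exceed either critic's own
# bootstrapped estimate.
target = reward + gamma * np.minimum(q1_next, q2_next)
```

Both critics are then updated to minimize their squared error against `target`, which is exactly why the question arises: the two networks share inputs and targets, so only the initialization separates them.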

Luke Guye

Posted 2019-03-24T05:26:49.420

Reputation: 31

Answers

I emailed the author of the paper, and he replied that randomness in the parameter initialization is the only difference between Q1 and Q2, and that this difference is enough in practice. Moreover, TD3 is concerned more with overestimation induced by function-approximation error than with stochasticity in the environment.
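The point about approximation error can be made concrete with a toy simulation. The setup below is a simplified discrete-action illustration (closer to double Q-learning than to TD3's actor-critic setting): all actions have true value 0, and approximation error is modeled as additive Gaussian noise. Maximizing over one noisy estimate overestimates the true value, while taking the elementwise minimum of two independently-noisy estimates first shrinks that bias considerably. All quantities here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# True Q-values for 5 actions: every action is equally good (value 0),
# so the true max is 0. Approximation error is modeled as N(0, 1) noise.
n_trials, n_actions = 100_000, 5
true_q = np.zeros(n_actions)

noise1 = rng.normal(size=(n_trials, n_actions))
noise2 = rng.normal(size=(n_trials, n_actions))

# Single-critic style target: max over one noisy estimate.
# Its mean is well above the true value 0 -- pure overestimation bias.
single = np.max(true_q + noise1, axis=1).mean()

# Twin-critic style target: max over the elementwise min of two
# independently-noisy estimates. The min is biased low, which offsets
# much of the upward bias introduced by the max.
clipped = np.max(np.minimum(true_q + noise1, true_q + noise2), axis=1).mean()
```

Even though the two noisy estimators here are fully independent (unlike TD3's critics, which share targets and decorrelate only through initialization), the simulation shows the mechanism the author describes: the bias being fought comes from the estimator's own noise, not from environment stochasticity.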

Luke Guye

Posted 2019-03-24T05:26:49.420

Reputation: 31