Twin Delayed Deep Deterministic policy gradient (TD3) is inspired by both double Q-learning and double DQN. In double Q-learning, I understand that Q1 and Q2 are independent because they are trained on different samples. In double DQN, I understand that the target Q and the current Q are relatively independent because their parameters are quite different.

But in TD3, Q1 and Q2 are trained on exactly the same target. If their parameters were initialized identically, their outputs would never differ and the algorithm would effectively reduce to a single-critic method such as DDPG. The only source of independence between Q1 and Q2 that I can see is the randomness in their parameter initialization. But since both are trained on the same target, I would expect this independence to shrink as they converge to the same target values. So I don't quite understand why TD3 works in combating overestimation in Q-learning.
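
To make the "same target" point concrete, here is a minimal sketch of the shared target as I understand it, y = r + gamma * (1 - done) * min(Q1', Q2'), where Q1' and Q2' are the target critics' outputs at the next state. The function name `td3_target`, the `done` mask, and the batch values are my own illustrative assumptions, and I have left out target policy smoothing and the delayed actor update.

```python
import numpy as np

def td3_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Shared target used to train both critics (hypothetical helper).

    q1_next and q2_next stand for the two target critics evaluated at
    the next state and the target policy's action.
    """
    # Element-wise minimum over the two critics' estimates.
    min_q_next = np.minimum(q1_next, q2_next)
    # Terminal transitions (done == 1) do not bootstrap.
    return reward + gamma * (1.0 - done) * min_q_next

# Made-up numbers for a batch of three transitions.
reward = np.array([1.0, 0.0, 2.0])
done = np.array([0.0, 0.0, 1.0])
q1_next = np.array([10.0, 5.0, 3.0])
q2_next = np.array([8.0, 6.0, 4.0])

# Both Q1 and Q2 would be regressed onto this same vector y.
y = td3_target(reward, done, q1_next, q2_next, gamma=0.5)

# If the two critics were identical, the minimum changes nothing and
# the target equals the ordinary single-critic target.
y_identical = td3_target(reward, done, q1_next, q1_next, gamma=0.5)
y_single = reward + 0.5 * (1.0 - done) * q1_next
```

In this sketch the minimum only has an effect where the two critics disagree, which is why I am asking where that disagreement is supposed to come from.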