Why is the target $r + \gamma \max_{a'} Q(s', a'; \theta_i^-)$ in the loss function of the DQN architecture?



In the paper *Human-level control through deep reinforcement learning*, the DQN architecture is presented, where the loss function is as follows

$$ L_i(\theta_i) = \mathbb{E}_{(s, a, r, s') \sim U(D)} \left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta_i^-) - Q(s, a; \theta_i) \right)^2\right] $$

where $r + \gamma \max_{a'} Q(s', a'; \theta_i^-)$ approximates the "target" of $Q(s, a; \theta_i)$. But it is not clear to me why. How can the existing weights approximate the target (ground truth)? Isn't $r$ a sample from the experience replay dataset? Is $r$ a scalar value?


Posted 2017-12-13T19:25:59.450

Reputation: 537



This is exactly the problem that reinforcement learning (RL) is trying to solve: what is the best way to behave when we don't know what the right action is and only receive a scalar reward $r$ telling us how well we have done?

RL approaches this problem by using temporal-difference learning, making predictions based on previous experience. An RL agent is trying to maximize the sum of future discounted rewards, called the return.

The term $r + \gamma \max_{a'} Q(s', a'; \theta_i^-)$ is essentially saying "the reward I just saw, plus $\gamma$ times my prediction of the return, given that I take what I think is the best action in the next state $s'$ and follow my policy from then on".
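A minimal sketch of that target computation, using illustrative numbers and a toy array standing in for the target network's outputs (all names and values here are hypothetical, not from the paper):

```python
import numpy as np

gamma = 0.99

# Hypothetical Q(s', a'; theta^-) values from the *target* network,
# one entry per possible next action a'.
q_target_next = np.array([0.5, 1.2, 0.3])

# r is a scalar reward sampled from the replay buffer D.
r = 1.0

# The target: r + gamma * max_a' Q(s', a'; theta^-)
td_target = r + gamma * q_target_next.max()  # ≈ 2.188
```

Note that the target is just a number once the sample $(s, a, r, s')$ is drawn; no gradient flows through $\theta_i^-$.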

By updating the Q function, the agent can better predict the consequences of its actions and can then choose the best action with greater probability.
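To make "updating the Q function" concrete, here is a hedged tabular sketch of one update step toward the TD target; it is an illustrative stand-in for the gradient step on the squared loss above, with made-up states, actions, and values:

```python
alpha, gamma = 0.5, 0.99  # illustrative learning rate and discount

q = {("s", "a"): 0.0}              # toy Q-table entry for the sampled (s, a)
q_next = {"a1": 0.5, "a2": 1.2}    # toy Q(s', a'; theta^-) for each a'

r = 1.0                                      # sampled scalar reward
target = r + gamma * max(q_next.values())    # r + gamma * max_a' Q(s', a')
td_error = target - q[("s", "a")]            # how wrong the current estimate is
q[("s", "a")] += alpha * td_error            # move Q(s, a) toward the target
```

In DQN proper, this tabular nudge is replaced by a gradient descent step on $\theta_i$, but the direction of the update is the same.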

The discount factor $\gamma$ (gamma) balances the immediate reward against future rewards. With $\gamma = 0$, the immediate reward is the only thing that matters, but the reward for a good action is usually delayed, so values of gamma that put higher importance on later rewards ($\gamma = 0.8, 0.9, 0.99$, etc.) are used.
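A quick way to see this is to compute the discounted return $G = \sum_t \gamma^t r_t$ for a toy episode where the only reward arrives two steps late (the numbers are illustrative):

```python
def discounted_return(rewards, gamma):
    # G = sum over t of gamma^t * r_t
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [0.0, 0.0, 1.0]  # the good reward arrives two steps later

g_myopic = discounted_return(rewards, 0.0)    # 0.0: the delayed reward is invisible
g_farsighted = discounted_return(rewards, 0.99)  # ≈ 0.9801: it still counts
```

With $\gamma = 0$ the agent never learns that the earlier actions led to the reward; with $\gamma$ near 1 the credit propagates back.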

Jaden Travnik

Posted 2017-12-13T19:25:59.450

Reputation: 3 242

my prediction of the return *given that I take what I think is the best action in the current state and follow my policy (theta) from then on* why do we need this disclaimer? Why can't it just be the predicted return? – echo – 2017-12-14T22:45:58.547

Because we want the return to be based on our current policy so that we can improve our policy. – Jaden Travnik – 2017-12-14T22:52:12.107

That doesn't make sense to me. We want the return to have a recursive formulation? And by having a recursive formulation, we can improve our policy? – echo – 2017-12-14T23:11:56.257

It’s recursive, yes, because it’s based on the previous steps. Reading chapter 6 of the *Reinforcement Learning: An Introduction* book by Sutton and Barto should help. http://ufal.mff.cuni.cz/~straka/courses/npfl114/2016/sutton-bookdraft2016sep.pdf

– Jaden Travnik – 2017-12-14T23:57:54.367

Very instructive breakdown of the concept! – DukeZhou – 2017-12-21T19:56:43.803