RL PPO Algorithm: Understanding the Value Function Loss term in PPO by OpenAI


In the PPO paper (Schulman et al., 2017), the final objective in Equation (9) includes a value function loss term, which the authors state is the squared error between the predicted value and a target value, $(V_\theta(s_t) - V_t^{target})^2$.
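For concreteness, here is how I understand that loss term (just a sketch of my reading; averaging over a batch of timesteps is my assumption):

```python
import numpy as np

def value_loss(v_pred, v_target):
    """Squared-error value loss as in Eq. (9): (V_theta(s_t) - V_t^target)^2,
    averaged over the batch (the averaging is my assumption)."""
    v_pred = np.asarray(v_pred, dtype=float)
    v_target = np.asarray(v_target, dtype=float)
    return np.mean((v_pred - v_target) ** 2)

# Example: predictions 0.5 and 1.0 against targets of 1.0
print(value_loss([0.5, 1.0], [1.0, 1.0]))  # 0.125
```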

My question is: how do you compute the $V_t^{target}$ term? I'm guessing it's the return, i.e., the sum of rewards collected from time $t$ onward. Would that be discounted, like

$V_t^{target} = \sum_{i=t}^T \gamma^{(i-t)} r_i$,

or $V_t^{target} = \sum_{i=t}^T r_i$,

or neither?
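To make the two candidate targets above concrete, here is a small sketch computing both the discounted and undiscounted return-to-go for a reward sequence (the function name and example rewards are mine, purely for illustration):

```python
import numpy as np

def returns_to_go(rewards, gamma):
    """V_t^target = sum_{i=t}^T gamma^(i-t) * r_i, computed by a
    backward pass. gamma=1.0 gives the undiscounted variant."""
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

rewards = [1.0, 1.0, 1.0]
print(returns_to_go(rewards, 0.99))  # discounted:   [2.9701, 1.99, 1.0]
print(returns_to_go(rewards, 1.0))   # undiscounted: [3.0, 2.0, 1.0]
```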


Posted 2021-01-19T07:02:42.817

