## Stability of value function approximation in policy gradients

In DQNs, function approximation of the Q-values is unstable because of correlated updates. In policy gradients with a baseline, won't the value function of the policy be plagued by the same correlated updates?

For example, in the REINFORCE with baseline algorithm, the updates are applied at each time step in temporally sequential order. I understand that in policy gradients the goal is to estimate the value of the policy and not necessarily of the entire state space; however, in a stochastic environment and/or under a stochastic policy, not all states will be sampled with the same probability, which can lead to overfitting to a specific trajectory, meaning the value function will not be a useful baseline for the policy's other trajectories. Are there algorithms that shuffle the trajectory before fitting the data, and/or collect batches of trajectories and then randomly sample from the batch as in DQNs?
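To make the idea concrete, here is a minimal sketch (not from any particular library; `value_net`, `fit_baseline`, and the data layout are my own assumptions) of fitting the value-function baseline on shuffled transitions pooled from several trajectories, instead of sweeping each trajectory in temporal order:

```python
# Hypothetical sketch: regress the baseline V(s) on Monte-Carlo returns using
# shuffled minibatches drawn from a pool of trajectories, so consecutive
# updates are not temporally correlated. Names are illustrative only.
import torch
import torch.nn as nn

def fit_baseline(value_net, trajectories, optimizer, epochs=4, minibatch_size=64):
    """trajectories: list of (states, returns) pairs, one per episode.
    states: tensor [T, state_dim]; returns: tensor [T] of Monte-Carlo returns."""
    # Pool all (state, return) pairs from every trajectory into one dataset.
    states = torch.cat([s for s, _ in trajectories])
    returns = torch.cat([g for _, g in trajectories])
    n = states.shape[0]
    for _ in range(epochs):
        # Shuffling mixes time steps and trajectories within each minibatch,
        # breaking the correlation of sequential per-step updates.
        idx = torch.randperm(n)
        for start in range(0, n, minibatch_size):
            batch = idx[start:start + minibatch_size]
            pred = value_net(states[batch]).squeeze(-1)
            loss = nn.functional.mse_loss(pred, returns[batch])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```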

The Process: Each agent (called a worker) collects its own experience up to a specified time step $$t_{max}$$, which is stored in a batch. Then a master network performs a training update using this batch. After the update, each worker resets its network to an identical copy of the master's and starts the task all over again. The update can be synchronous (as I describe here, using a single batch) or asynchronous, with each agent training its own parameters and then updating the master network asynchronously.
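Below is a minimal sketch of the synchronous variant of that process. It assumes `master_net` is a combined policy/value network returning `(logits, values)`, the per-worker `rollout` functions return `(state, action, return)` tuples, and the loss is a vanilla policy gradient with the value estimate as baseline; all of these names and choices are my own assumptions, not a specific library's API.

```python
# Sketch of one synchronous master/worker update: workers copy the master's
# parameters, collect up to t_max steps each, and the master trains on the
# pooled batch. Purely illustrative.
import torch

def synchronous_step(master_net, worker_nets, rollout_fns, optimizer, t_max=20):
    # Every worker resets to an identical copy of the master's parameters.
    for w in worker_nets:
        w.load_state_dict(master_net.state_dict())
    # Each worker collects its own experience (up to t_max steps) into one batch.
    batch = []
    for w, rollout in zip(worker_nets, rollout_fns):
        batch.extend(rollout(w, t_max))
    states = torch.stack([s for s, _, _ in batch])
    actions = torch.tensor([a for _, a, _ in batch])
    returns = torch.tensor([g for _, _, g in batch], dtype=torch.float32)
    # Master performs a single training update on the pooled batch:
    # policy-gradient loss with the value estimate as baseline, plus value loss.
    logits, values = master_net(states)
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    advantages = returns - values.squeeze(-1)
    loss = -(chosen * advantages.detach()).mean() + advantages.pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In the asynchronous variant, each worker would instead compute gradients on its own batch and apply them to the master's parameters as they become available, rather than waiting for all workers to finish.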