In DQNs, function approximation of the Q-values is unstable when the updates are correlated. In policy gradient methods with a baseline, won't the learned value function be plagued by the same correlated updates?
For example, in the REINFORCE with baseline algorithm, the updates are applied to each time step of an episode in temporal order. I understand that in policy gradients the goal is to estimate the value of the current policy and not necessarily of the entire state space. However, in a stochastic environment and/or under a stochastic policy, not all states will be sampled with the same probability. This seems likely to lead to overfitting to a specific trajectory, so the value function would not be useful as a baseline for the other trajectories of the policy. Are there algorithms that shuffle the trajectory before fitting the value function, and/or collect batches of trajectories and then randomly sample from the batch, as in DQNs?
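
To make the question concrete, here is a minimal sketch of the kind of procedure I have in mind. It assumes a linear value function and Monte Carlo returns as regression targets. The names `discounted_returns` and `fit_baseline_shuffled` are hypothetical and not taken from any library or paper.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Monte Carlo return G_t for every time step of one trajectory."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def fit_baseline_shuffled(trajectories, weights, gamma=0.99, lr=0.01,
                          epochs=50, batch_size=32, seed=0):
    """Fit a linear baseline v(s) = s @ weights on shuffled minibatches.

    trajectories: list of (states, rewards) pairs, states of shape (T, d).
    Assumption: all trajectories come from the same (current) policy.
    """
    rng = np.random.default_rng(seed)
    # Pool every (state, return) pair from the whole batch of trajectories.
    states = np.concatenate([s for s, _ in trajectories])
    targets = np.concatenate(
        [discounted_returns(r, gamma) for _, r in trajectories]
    )
    weights = weights.copy()
    for _ in range(epochs):
        order = rng.permutation(len(states))  # break the temporal ordering
        for start in range(0, len(order), batch_size):
            idx = order[start:start + batch_size]
            error = states[idx] @ weights - targets[idx]
            # Gradient step on the mean squared error of this minibatch.
            weights -= lr * states[idx].T @ error / len(idx)
    return weights
```

The policy gradient step itself would still use the on-policy trajectories as they are; only the regression that fits the baseline sees the pooled, shuffled data.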