I was reading the research paper Hindsight Experience Replay. This paper introduces a concept called Hindsight Experience Replay (HER), which attempts to alleviate the infamous sparse-reward problem. It is based on the intuition that human beings constantly try to learn something useful even from their past failed experiences.
I have almost completely understood the concept, but in the algorithm posited in the paper, I don't really understand how the optimization works. Once the fictitious (hindsight) trajectories are added, we have a state-goal-action dependency. This means our DQN should predict Q-values based on both an input state and the goal we're pursuing (the paper mentions how HER is extremely useful for multi-goal RL as well).
Does this mean I need to add the goal as another input feature to my DQN? That is, an input state and an input goal, as two inputs to my network, which is basically a CNN?
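To make the question concrete, here is a minimal sketch of what I imagine a goal-conditioned Q-function looks like: the goal is simply concatenated with the state as network input. All dimensions, weights, and names here are made up for illustration; a real DQN for Atari would run a CNN over stacked frames and concatenate the goal with the flattened convolutional features instead.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, GOAL_DIM, N_ACTIONS, HIDDEN = 8, 8, 4, 32

# Hypothetical weights for a tiny 2-layer MLP standing in for the DQN.
W1 = rng.normal(scale=0.1, size=(STATE_DIM + GOAL_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.1, size=(HIDDEN, N_ACTIONS))
b2 = np.zeros(N_ACTIONS)

def q_values(state, goal):
    """Goal-conditioned Q-function: the goal is just an extra input."""
    x = np.concatenate([state, goal])   # (STATE_DIM + GOAL_DIM,)
    h = np.maximum(0.0, x @ W1 + b1)    # ReLU hidden layer
    return h @ W2 + b2                  # one Q-value per action

s = rng.normal(size=STATE_DIM)
g = rng.normal(size=GOAL_DIM)
print(q_values(s, g).shape)  # one Q-value per action: (4,)
```

So the same state can map to different Q-values depending on which goal is fed in, which is what I understand the paper to require.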
I ask because, in the optimization step, they mention that we need to randomly sample transitions from the replay buffer and use those for computing the gradients. It wouldn't make sense to compute the Q-values without the goal now, because then we'd wind up with duplicate values for the same state-action pair under different goals.
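Here is my current understanding of how the buffer and sampling would work, as a sketch. I'm assuming HER's "final" relabeling strategy (the last achieved state of the episode becomes the substituted goal), and `achieved` and `reward_fn` are hypothetical problem-specific helpers, not anything from the paper's code:

```python
import random
from collections import deque

def achieved(state):
    # Assumption: the achieved goal is the state itself.
    return state

def reward_fn(state, goal):
    return 0.0 if state == goal else -1.0  # sparse reward

def store_episode(buffer, episode, goal):
    """Store each transition twice: with the original goal and with
    the episode's final achieved state relabeled as the goal."""
    final_goal = achieved(episode[-1][2])  # last next_state, in hindsight
    for state, action, next_state in episode:
        buffer.append((state, goal, action,
                       reward_fn(next_state, goal), next_state))
        buffer.append((state, final_goal, action,
                       reward_fn(next_state, final_goal), next_state))

buffer = deque(maxlen=10_000)
episode = [(0, 1, 1), (1, 0, 2), (2, 1, 3)]  # (state, action, next_state)
store_episode(buffer, episode, goal=7)       # goal 7 was never reached

batch = random.sample(buffer, 4)
# Each sampled tuple carries its own goal, so the TD target
# r + gamma * max_a Q(s', g, a) would be computed with that stored goal.
print(len(buffer))  # 3 original + 3 relabeled = 6 transitions
```

If this is right, the optimization itself is unchanged from ordinary DQN; the only difference is that every stored transition carries a goal, and that goal is fed into the network when computing both the prediction and the TD target.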
Could someone help me understand how exactly the optimization takes place here?
For context: I am training on Atari's "Montezuma's Revenge" using a double DQN with Hindsight Experience Replay (HER).