I'm new to reinforcement learning. For an internship, I am currently training an agent on Atari's "Montezuma's Revenge" using a double Deep Q-Network (DQN) with Hindsight Experience Replay (HER).
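For context, the core of HER is relabeling: transitions from a failed episode are stored again with goals the agent actually achieved, so the sparse reward fires more often. Here is a minimal sketch of the "future" relabeling strategy; the episode format and the use of a state as its own achieved goal are assumptions for illustration, not my actual code:

```python
import random

def her_relabel(episode, k=4):
    """HER 'future' strategy sketch: for each transition, sample k goals
    from states achieved later in the same episode and relabel the reward.

    episode: list of (state, action, next_state) tuples; here a state
    doubles as an achieved goal (an assumption of this sketch).
    """
    relabeled = []
    for t, (s, a, s_next) in enumerate(episode):
        # Sample k indices from this timestep onward ("future" strategy).
        future_idx = [random.randrange(t, len(episode)) for _ in range(k)]
        for i in future_idx:
            goal = episode[i][2]  # a later achieved state, reused as goal
            # Sparse reward: success only if the relabeled goal is reached.
            reward = 0.0 if s_next == goal else -1.0
            relabeled.append((s, a, reward, s_next, goal))
    return relabeled
```

These relabeled transitions go into the replay buffer alongside the originals, which is also why the mini-batch step gets slower: the buffer grows roughly (k+1)x faster.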
HER is supposed to alleviate the sparse-reward problem. But since the reward is still very sparse, I have also added Random Network Distillation (RND) to encourage the agent to explore new states: it receives a higher intrinsic reward when it reaches a previously undiscovered state and a lower one when it reaches a state it has already visited many times. This intrinsic reward is added to the extrinsic reward the game itself gives. I also use a decaying epsilon-greedy policy.
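To make the RND part concrete, here is a tiny sketch of the idea with linear "networks" in NumPy (the shapes, learning rate, and function name are all illustrative assumptions, not my actual implementation): the predictor's error against a frozen random target is high for novel states and shrinks with repeated visits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen, randomly initialized target network (never trained).
W_target = rng.normal(size=(8, 4))

# Predictor network, trained online to match the target's features.
W_pred = np.zeros((8, 4))

def intrinsic_reward(state, lr=0.01):
    """RND bonus: MSE between predictor and frozen target features.
    Novel states give a large error; repeated states give a small one."""
    global W_pred
    target_feat = W_target @ state
    pred_feat = W_pred @ state
    error = pred_feat - target_feat
    # One SGD step on the predictor's MSE loss toward the target.
    W_pred -= lr * np.outer(error, state)
    return float(np.mean(error ** 2))

# The total reward the agent trains on would then be something like
# r_total = r_extrinsic + beta * intrinsic_reward(state), with beta a
# hyperparameter weighting exploration (another assumed name).
```

Calling `intrinsic_reward` repeatedly on the same state drives the bonus toward zero, which is exactly the "visited many times" behavior described above.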
How well should this approach work? I've set it to run for 10,000 episodes, and the simulation is quite slow because of the mini-batch gradient descent step in HER, and there are many hyperparameters to tune. Before implementing RND, I considered reward shaping, but that is impractical in this case. What can I expect from my current approach? OpenAI's RND paper reports excellent results on Montezuma's Revenge, but they used PPO rather than a DQN.
Some links you may find useful:
- A resource on RND
- OpenAI's paper on Random Network Distillation (RND)
- The paper on Hindsight Experience Replay
- A blog post I found helpful for understanding HER