Is this a good approach to solving Atari's "Montezuma's Revenge"?


I'm new to Reinforcement Learning. For an internship, I am currently training an agent to play Atari's "Montezuma's Revenge" using a double Deep Q-Network (DDQN) with Hindsight Experience Replay (HER).

HER is supposed to alleviate the reward-sparseness problem, but since the reward is still extremely sparse, I have also added Random Network Distillation (RND) to encourage the agent to explore new states: it receives a higher bonus when it reaches a previously undiscovered state and a lower one when it reaches a state it has already visited many times. This intrinsic reward is added to the extrinsic reward the game itself gives. I am also using a decaying epsilon-greedy policy.
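To make the intrinsic-reward idea concrete, here is a minimal numpy sketch of RND (this is my own illustration, not your implementation; the linear "networks", sizes, and learning rate are all made-up assumptions; real RND uses conv nets plus observation/bonus normalisation):

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, EMB_DIM = 8, 4  # tiny, illustrative sizes

# Fixed, randomly initialised target network -- never trained.
W_target = rng.normal(size=(OBS_DIM, EMB_DIM))
# Predictor network, trained online to match the target's output.
W_pred = np.zeros((OBS_DIM, EMB_DIM))

def intrinsic_reward(obs, lr=0.01):
    """Return the RND bonus: the predictor's error against the fixed target.

    Novel states give a large prediction error (big bonus); frequently
    visited states have already been fitted, so the bonus decays."""
    global W_pred
    err = obs @ W_pred - obs @ W_target
    W_pred = W_pred - lr * np.outer(obs, err)  # one SGD step on the MSE
    return float(np.mean(err ** 2))
```

In your setup the bonus would be added to the game reward before the transition is stored, e.g. `r = r_extrinsic + beta * intrinsic_reward(next_obs)` for some hypothetical scale `beta`; the RND paper stresses that normalising the observations and the bonus matters a lot in practice.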

How well should this approach work? I've set it to run for 10,000 episodes, the simulation is quite slow because of the mini-batch gradient-descent step in HER, and there are multiple hyperparameters to tune. Before implementing RND, I considered reward shaping, but that is impractical in this case. What can I expect from my current approach? OpenAI's paper on RND reports excellent results on Montezuma's Revenge, but they used PPO rather than a DQN-based agent.
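For reference, the HER relabelling itself is cheap; it's the extra stored transitions and the gradient steps that cost time. A minimal sketch of the "future" relabelling strategy (the transition layout, goal comparison, and reward convention are assumptions for illustration, not your code; the HER paper uses 0 on success and -1 otherwise):

```python
import random

def her_relabel(episode, k=4):
    """Hindsight Experience Replay, 'future' strategy (sketch).

    episode: list of (state, action, reward, next_state, goal) tuples.
    For each transition, also store up to k copies whose goal is a state
    actually achieved later in the episode, with the sparse reward
    recomputed (0.0 on success, -1.0 otherwise)."""
    out = []
    for t, (s, a, r, s_next, g) in enumerate(episode):
        out.append((s, a, r, s_next, g))
        future = episode[t:]  # transitions from here to the episode's end
        for _ in range(min(k, len(future))):
            new_goal = random.choice(future)[3]  # an achieved next_state
            new_r = 0.0 if s_next == new_goal else -1.0
            out.append((s, a, new_r, s_next, new_goal))
    return out
```

Each stored episode thus grows by roughly a factor of k + 1, which is one reason the replay and training loop slow down.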

Here are some links I found useful:

- A link you might find useful for RND
- OpenAI's paper on Random Network Distillation (RND)
- The paper for Hindsight Experience Replay
- A blog post I found useful for understanding HER


Posted 2020-03-13T10:58:40.600
