For background: I've been reading the COBRA paper and have reached the section on the exploration policy. As I understand it, a uniformly random policy won't do us any good, since the action space is only sparsely populated with objects that the agent can act upon, so a random action is likely to result in no change (an object here occupies only about 1.7% of the screen). Hence, during the exploration phase, the agent needs to learn a policy that clicks on and moves objects more frequently.
I get that a random policy won't work, but I have difficulty understanding how and why the transition model is trained adversarially. Below is the relevant extract, followed by the parts that I don't completely understand:
"Our approach is to train the transition model adversarially with an exploration policy that learns to take actions on which the transition model has a high error. Such difficult-to-predict actions should be those that move objects (given that others would leave the scene unchanged). In this way the exploration policy and the transition model learn together in a virtuous cycle. This approach is a form of curiosity-driven exploration, as previously described in both the psychology (Gopnik et al., 1999) and the reinforcement learning literature (Pathak et al., 2017; Schmidhuber, 1990a,b)."
- How does it help to take actions on which the transition model has a high error?
- I don't exactly see how the exploration policy and the transition model end up in a virtuous cycle.
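To make the question concrete, here is a toy sketch of how I currently read that passage. Everything in it is my own assumption and not the paper's implementation: the screen is reduced to 64 cells with the object in one of them (about 1.6% of the screen, close to the 1.7% above), the transition model is a hypothetical per-cell gain that predicts how far a push moves things, and the exploration policy is a softmax over a running estimate of that model's error. The function name `explore` and all constants are made up for illustration.

```python
import numpy as np

def explore(curious, steps=1000, n_cells=64, object_cell=17, seed=0):
    """Toy loop: returns (fraction of clicks on the object, learned gain there)."""
    rng = np.random.default_rng(seed)
    gain = np.zeros(n_cells)     # transition model: predicted motion = gain[cell] * push
    err_est = np.zeros(n_cells)  # policy's running estimate of the model's error per cell
    hits = 0
    for _ in range(steps):
        if curious:
            # Exploration policy: softmax that favours cells with high estimated error.
            logits = err_est / 0.02
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
        else:
            probs = np.full(n_cells, 1.0 / n_cells)  # uniformly random baseline
        cell = int(rng.choice(n_cells, p=probs))
        push = rng.uniform(-1.0, 1.0)

        # Environment (assumed): only a click on the object moves anything.
        moved = push if cell == object_cell else 0.0

        # Transition model: predict, measure squared error, take one SGD step.
        residual = moved - gain[cell] * push
        error = residual ** 2
        gain[cell] += 0.1 * residual * push

        # Policy update: track the error just observed (the policy's "reward").
        err_est[cell] += 0.5 * (error - err_est[cell])
        hits += int(cell == object_cell)
    return hits / steps, float(gain[object_cell])

if __name__ == "__main__":
    for curious in (False, True):
        hit_rate, gain = explore(curious)
        print(f"curious={curious}: hit rate={hit_rate:.3f}, learned gain={gain:.2f}")
```

In this toy, empty cells never produce any error, so the error-seeking policy drifts toward the object cell, and the extra clicks there are exactly the data the model needs to fix its prediction. Once the model predicts the object's motion well, the error signal fades and the policy relaxes back toward uniform. I'm not sure whether this is the right picture of what the paper does.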
Could someone please explain? Thanks a lot!