Is there a way to train an RL agent without any environment?


Following Deep Q-learning from Demonstrations, I'd like to avoid potentially unsafe behavior during early learning by making use of supervised learning with demonstration data. However, the implementation I'm following still uses an environment. Can I train my agent without an environment at all?


Posted 2019-03-06T10:41:05.043

Reputation: 187

Most of this question is off-topic for this community, as we don't address implementation details. We can help you understand that paper and general things about environments, though. – Philip Raeisghasem – 2019-03-11T06:44:08.607

Sorry, I'm a newbie here. I mentioned the implementation to point out that people still use environments to "learn from demonstrations," while the devil is in learning only from a fixed dataset, isn't it? – Angelo – 2019-03-11T08:56:21.950

I didn't mention you. Sorry again @PhilipRaeisghasem. – Angelo – 2019-03-11T09:17:56.173

Hah, it's ok. I'm considering how to best answer your question. – Philip Raeisghasem – 2019-03-11T09:20:25.067



There are many techniques for training an RL agent without explicitly interacting with an environment, some of which are cited in the paper you linked. Heck, even using experience replay like in the foundational DQN paper is a way of doing this. However, while many models utilize some sort of pre-training for the sake of safety or speed, there are a couple of reasons why an environment is also used whenever possible.
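To make the idea concrete, here is a minimal sketch of offline (batch) Q-learning: the agent never calls an environment and only replays a fixed dataset of $(s, a, r, s', \text{done})$ transitions. The tiny three-state chain dataset below is hypothetical, purely for illustration.

```python
import numpy as np

# Hypothetical demonstration dataset: (state, action, reward, next_state, done)
# on a 3-state chain where action 1 moves right and state 2 is the goal.
demonstrations = [
    (0, 1, 0.0, 1, False),  # state 0, move right, no reward
    (1, 1, 1.0, 2, True),   # state 1, move right, reach the goal
    (0, 0, 0.0, 0, False),  # action 0 stays in place
    (1, 0, 0.0, 0, False),  # action 0 moves back
]

n_states, n_actions = 3, 2
gamma, alpha = 0.9, 0.5
Q = np.zeros((n_states, n_actions))

# Repeatedly sweep the fixed dataset -- no environment interaction at all.
for epoch in range(200):
    for s, a, r, s_next, done in demonstrations:
        target = r if done else r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])

greedy_policy = Q.argmax(axis=1)
print(greedy_policy)  # actions 1 (move right) in states 0 and 1
```

The same loop structure carries over to DQN-style function approximation: replace the table with a network and the sweep with minibatch gradient steps over the dataset. Note, though, that the learned $Q$ is only trustworthy on state-action pairs the dataset actually covers, which is exactly the limitation discussed below.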

Eventually, your RL agent will be placed in an environment to take its own actions. This is why we train RL agents. I'm assuming that, per your question, learning does not happen during this phase.

Maybe your agent encounters a novel situation
Hopefully, the experience your agent learns from is extensive enough to include every possible state-action pair $(s,a)$ that your agent will ever encounter. If it isn't, your agent won't have learned about these situations, and it will always perform suboptimally in them. This lack of coverage over the state-action space could be caused by stochasticity or nonstationarity in the environment.

Maybe the teacher isn't perfect
If you don't allow your agent to learn from its own experience, it will only ever perform as well as the agent that collected the demonstration data. That's an upper bound on performance that we have no reason to set for ourselves.
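A toy behavior-cloning sketch shows why this bound holds: the cloned policy can only reproduce whatever the demonstrations contain, mistakes included. The demo data here is hypothetical, and the "teacher" deliberately takes the worse action in state 1.

```python
from collections import Counter, defaultdict

# Hypothetical (state, action) demonstration pairs. Suppose action 1 is
# actually optimal everywhere, but the teacher chooses action 0 in state 1.
demo = [(0, 1), (0, 1), (1, 0), (1, 0), (1, 0)]

# Tabular behavior cloning: count the teacher's actions per state.
counts = defaultdict(Counter)
for s, a in demo:
    counts[s][a] += 1

def cloned_policy(s):
    # Pick the teacher's most frequent action in state s. With no
    # environment feedback, the clone faithfully copies the teacher's
    # suboptimal choice -- it has no signal telling it to do better.
    return counts[s].most_common(1)[0][0]

print(cloned_policy(1))  # 0, the teacher's (suboptimal) choice
```

Only by acting in an environment and observing rewards would the agent get the signal needed to discover that action 1 is better in state 1.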

Philip Raeisghasem


Reputation: 1 613

Thanks a lot Philip. More specifically, I'm planning to pre-train a recommender system from a generated dataset, using user embeddings and last interactions as the state and the items to recommend as the actions to take. Therefore, I'm very concerned about including every possible state-action pair, because although the states may be very close to one another, they won't be exactly the same. Does it really matter? – Angelo – 2019-03-11T10:15:55.283

It depends on how well your model can generalize between similar states. More training data always helps with this, and regularization could also help. But, again, it's very possible your system needs to learn from its own experience--through trial and error after deployment--in order to learn optimal behavior.

– Philip Raeisghasem – 2019-03-11T11:08:22.033

Of course, I'm only talking about the pre-deployment phase. Thanks a lot. – Angelo – 2019-03-11T13:25:12.533