From what I have seen, any results involving RL almost always take a massive number of simulations to reach a remotely good policy.

Will any form of RL be viable for real-time systems?

Short answer: **Yes, it is.**

**Explanation**

Reinforcement learning can be considered a form of online learning. That is, you can train your model on single data/reward pairs. As with any online learning algorithm, there are a few things to consider.

The model tends to forget previously gained knowledge. To overcome this problem, one can save new data in a circular buffer (often called a replay buffer, or history) and train the model on a mix of new and old data. This is actually the common way to train an RL model, and it can be adapted to real-time systems. There are also other techniques to overcome forgetting.
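As a rough sketch of the circular-buffer idea (the class and parameter names here are illustrative, not from any particular library):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size circular buffer of past transitions (experience replay)."""

    def __init__(self, capacity):
        # deque with maxlen silently drops the oldest entry when full
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Train on a random mix of new and old transitions to reduce forgetting
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

buf = ReplayBuffer(capacity=3)
for t in range(5):
    buf.add(t, 0, 1.0, t + 1)
# only the 3 most recent transitions remain; sample(2) mixes them randomly
batch = buf.sample(2)
```

In a real-time setting, new observations are appended as they arrive while training batches are drawn from the whole buffer, so the model keeps seeing older data.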

Another problem is that if only one data point is fed to the network at a time, some techniques, such as batch normalization, cannot be applied.

This doesn't address the massive, massive number of simulations required. IIRC the number of walking simulations for robots is way, way higher than the real-time constraints imposed on animals learning to walk. – FourierFlux – 2020-02-29T19:43:40.687

Where does this requirement come from? You can pre-train your model and then apply online learning so that it is always up to date. – Aray Karjauv – 2020-02-29T19:54:49.273

The issue, IMO, is that real systems have built-in priors (or models which better lead to correct parameter estimation), and this seems to be missing from the RL paradigm. The convergence rate for animal/human behavior is astronomically faster than any type of RL policy. – FourierFlux – 2020-02-29T20:22:34.053

All these simulations are required to collect data. Theoretically, your model has to visit all states of the world and learn the reward for each action in a given state. If we train a model using the Q-learning algorithm, it doesn't matter how you train it: you have a bunch of `(input, action, reward)` tuples, and you can keep feeding them as long as you get new ones. In this case the network is just an approximation of the Q-function.
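The learn-a-reward-per-state-action idea described in this comment can be sketched as a minimal tabular Q-learning update (states, actions, and the hyperparameter values below are illustrative; a network would replace the table as the Q-function approximator):

```python
# Q-table: maps (state, action) -> estimated return
q = {}

def q_update(state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """One Q-learning step: nudge Q(s, a) toward reward + gamma * max_a' Q(s', a')."""
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)

actions = [0, 1]
# feed one (state, action, reward, next_state) transition at a time, as collected
q_update("s0", 0, 1.0, "s1", actions)
```

Each transition can be fed as it arrives; the table (or network) slowly converges toward the true Q-values as more of the state space is visited.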

But do you dispute the fact that humans don't need nearly the same number of trials to reach a viable policy? – FourierFlux – 2020-02-29T20:45:29.923

We don't know how our brain works. Q-learning makes a Markov assumption: the next state depends only on the current one, which means the model has no memory. – Aray Karjauv – 2020-02-29T20:56:42.177

Humans may not need as much online training because we are capable of transferring knowledge from different domains to help in achieving the task. – KaneM – 2020-03-01T12:37:34.980

So do normal ML algos: they take a lot of time to train, and RL does too. But at test time you are only doing a single forward pass, which takes much less time than training, where backpropagation is performed repeatedly over the dataset. – DuttaA – 2020-02-29T14:00:27.687