Why is the access to the dynamics model unrealistic in Q-Learning?


Pieter Abbeel says that having access to the dynamics model, $P(s' \mid s,a)$, is unrealistic because it assumes we know the probability that we will reach all future states.

I don't understand why this is unreasonable. Could someone explain it to me in a simple way?


Posted 2017-12-13T13:26:14.637

Reputation: 537



For many scenarios, it is unreasonable to assume that you know the full transition model of the environment.

In the deterministic toy scenarios used for explaining RL, such as a grid world, it seems like a trivial requirement. However, it starts to get harder very quickly as problems and state features get more complex.

Even in relatively simple environments that can be modelled in this way, the calculations become difficult enough that there is an argument for not bothering. For instance, a common toy example is the game of Blackjack - in order to calculate the probability of different rewards after sticking, you would need to resolve the remaining game tree for the dealer, who will typically draw 0, 1 or 2 more cards but may draw more, accounting for their starting card, and perhaps accounting for which cards have been drawn so far (if you are simulating this more realistically, without replacement).
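To give a feel for the bookkeeping involved, here is a minimal sketch of resolving the dealer's game tree. It rests on simplifying assumptions that are not part of the discussion above: cards are drawn with replacement, and the dealer hits until reaching 17 or more and stands on every 17. The function names are made up for illustration.

```python
from fractions import Fraction
from functools import lru_cache

# Assumption: drawing with replacement, so every draw has the same distribution.
# Value 1 is an ace; value 10 covers 10, J, Q and K.
CARD_PROBS = {value: Fraction(1, 13) for value in range(1, 10)}
CARD_PROBS[10] = Fraction(4, 13)


def best_total(hard_total, has_ace):
    """Count one ace as 11 when that does not bust the hand."""
    if has_ace and hard_total + 10 <= 21:
        return hard_total + 10
    return hard_total


@lru_cache(maxsize=None)
def dealer_outcomes(hard_total, has_ace):
    """Return (outcome, probability) pairs for the dealer's final result.

    hard_total counts every ace as 1. Assumed rule: hit below 17, stand on 17+.
    """
    if hard_total > 21:
        return (("bust", Fraction(1)),)
    total = best_total(hard_total, has_ace)
    if total >= 17:
        return ((total, Fraction(1)),)
    dist = {}
    for card, p in CARD_PROBS.items():
        # Recurse into the sub-tree reached by drawing this card.
        for outcome, q in dealer_outcomes(hard_total + card, has_ace or card == 1):
            dist[outcome] = dist.get(outcome, Fraction(0)) + p * q
    return tuple(dist.items())


def dealer_distribution(upcard):
    """Distribution over the dealer's final result given the visible card."""
    return dict(dealer_outcomes(upcard, upcard == 1))
```

Calling `dealer_distribution(10)` returns the probabilities of the dealer finishing on 17 to 21 or going bust when showing a ten. Dropping the with-replacement assumption means the state must also track every card seen so far, which is where the calculation starts to get unwieldy.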

However, that could be resolved if we wanted. To make things harder, you need to consider agents that are set to solve problems where we don't have any meaningful probabilistic model of how the state will progress.

A good example would be any real world agent that uses a vision system as part of its state. Consider a flying drone tasked with locating and tracking migrating whales over some area of the ocean. It is rewarded for capturing footage of whales. Its state is the current image plus GPS and telemetry data.

Knowing $P(s' \mid s,a)$ in the whale-tracking example implies that you would know the probability distribution of the next video frame, somehow accounting for the chaotic movement of waves, the likely behaviour of any creatures in shot, and effects of wind turbulence on drone position etc. Even the most sophisticated physics simulations with ray-tracing rendering engines cannot rise to that challenge with much accuracy, and the computation required even for a rough render of expected image(s) based on a physics engine would be far beyond anything you could run in real time on the drone. However, it is still possible to build an agent which learns how to solve this kind of environment through gathering statistics on its task using images as state data.
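On a much smaller scale than the drone, the sketch below shows what "gathering statistics" looks like in tabular Q-learning. The `NoisyChain` environment, its slip probability and all hyperparameters are hypothetical and chosen only for illustration. The point is that the learner only ever calls `env.step` and observes sampled transitions; it never reads $P(s' \mid s,a)$.

```python
import random


class NoisyChain:
    """Hypothetical 5-state chain. The slip probability is hidden from the agent."""

    def __init__(self, n_states=5, slip=0.1, seed=0):
        self.n_states = n_states
        self._slip = slip
        self._rng = random.Random(seed)
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 0 = left, 1 = right; with probability slip the move is reversed.
        move = 1 if action == 1 else -1
        if self._rng.random() < self._slip:
            move = -move
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def q_learning(env, episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.1,
               max_steps=200, seed=1):
    """Tabular Q-learning that uses only sampled (s, a, r, s') tuples."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(env.n_states)]
    for _ in range(episodes):
        s = env.reset()
        for _ in range(max_steps):
            # Epsilon-greedy action choice, breaking ties at random.
            if rng.random() < epsilon or q[s][0] == q[s][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1
            s_next, r, done = env.step(a)
            # The sampled next state stands in for an expectation over P(s'|s,a).
            target = r if done else r + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
            s = s_next
            if done:
                break
    return q
```

Running `q_learning(NoisyChain())` returns a table of action values learned purely from experience. For image states like the drone's, the table would have to be replaced by a function approximator, but the update still needs only observed transitions.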

Scaling that challenge back a bit, you can view many of the Atari game-playing challenges like this. The input state is a collection of still frames from the game. Knowing $P(s' \mid s,a)$ implies knowing the probability distribution of the next set of frame images, given the current inputs. This is a near-impossible challenge. Even if you know the code for the game, providing an analytical result for $P(s' \mid s,a)$ is unrealistic - the best way to do it would be to have a memory clone of the game that you could use to look ahead then reset. Whilst this is theoretically possible in a computer game, it is not something you can consider when interacting with the real world.
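As a rough illustration of the "memory clone" idea, the sketch below copies a simulator, steps the copy, and leaves the original untouched. `TinyGame` and `peek` are hypothetical stand-ins, not an actual emulator API.

```python
import copy
import random


class TinyGame:
    """Hypothetical stand-in for a game whose internal state can be copied."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)  # hidden source of randomness
        self.position = 0

    def step(self, action):
        drift = self._rng.choice([-1, 0, 1])
        self.position += action + drift
        return self.position


def peek(game, action):
    """Look one step ahead on a deep copy, leaving the real game unchanged."""
    clone = copy.deepcopy(game)  # copies the hidden RNG state as well
    return clone.step(action)
```

Because the copy includes the hidden random state, `peek(game, 1)` returns exactly what `game.step(1)` would produce next. Nothing comparable is available for a physical drone or helicopter, since the world cannot be copied and reset.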

Neil Slater

Posted 2017-12-13T13:26:14.637

Reputation: 14 632


In the paper “Machine Learning for Helicopter Dynamics Models” the problem is described as “system identification”. The aim of machine learning is to find a relationship between the current situation and a goal state. It is unrealistic to find a transition function because the state space is too big. In such papers, only a small subset of the whole state space is visualized. In reality, the helicopter has at least 10 variables, 10 buttons and a time variable, and the overall problem is far bigger than solving the travelling salesman problem. Even with Matlab and Q-learning it is impossible to find the policy.

So what's the deal, why is the Stanford helicopter flying? The answer is that Pieter Abbeel enriches the reinforcement learning model with an ontology. In the literature this concept is sometimes called language grounding; see, for example, “Emergence of Grounded Compositional Language in Multi-Agent Populations”.

Manuel Rodriguez

Posted 2017-12-13T13:26:14.637

Reputation: 1

Welcome to AI! That paper on Emergence of Grounded Compositional Language in Multi-Agent Population looks quite tasty. (We've had several questions on how AI's might develop their own languages, if interested.) – DukeZhou – 2017-12-13T17:41:47.123

@DukeZhou could you point me to one of those questions? – echo – 2017-12-13T17:47:37.427

"How would an AI learn language?" and "Can an AI make a constructed (natural) language?". I seem to recall there was a question about chatbots developing their own intra-bot language, sparked by a news story about the same re: Facebook, but I'm still trying to track that one down. (The news coverage turned out to be overblown, but the underlying idea is quite interesting nonetheless!) – DukeZhou – 2017-12-13T18:00:21.480