Can Q-learning be used in a POMDP?


Can Q-learning (and SARSA) be directly used in a Partially Observable Markov Decision Process (POMDP)? If not, why not? My intuition is that the policies learned will be terrible because of partial observability. Are there ways to transform these algorithms so that they can be easily used in a POMDP?


Posted 2019-04-03T02:40:29.227

Reputation: 268

I think answers so far are missing the fact that classifying a problem as a POMDP is not a simple binary choice. There are both types and degrees of POMDP. For instance, Open AI Gym's LunarLander-v2 is strictly a POMDP as the agent does not observe the mountains which vary on each episode, and can have a strong effect on the outcome. However, a straightforward DQN can solve it well. Do you have a specific problem in mind when asking this question? Are you interested in these degrees of POMDP, or should we assume a difficult one, like Poker?

– Neil Slater – 2019-04-03T09:33:10.300


In the paper Deep Recurrent Q-Learning for Partially Observable MDPs, the DQN (an extension of Q-learning) is further extended to handle the partially observed nature of a single frame of an Atari 2600 game.

– Philip Raeisghasem – 2019-04-03T09:49:32.287

@NeilSlater What do you mean by "degrees"? Do you mean that full information about states are sometimes available even in the context of POMDPs? I don't get how this is strictly related to the question. – nbro – 2019-04-03T09:58:24.277

@nbro: I mean there is more than one way for a system to be a POMDP. Philip linked a paper above that I was going to refer as well - one way for an environment to be a POMDP is to have states that do not summarise history fully, but where actual state is relatively easy to infer from history - and you can see that DQN can be extended using a RNN in place of regular NN, without many other changes for that kind of environment. There are other ways that a system can be a POMDP, for example in some games, the environment will use additional knowledge against the agent, which is harder to solve. – Neil Slater – 2019-04-03T11:44:01.973

@NeilSlater Yes, I am aware of this. I added a paragraph to my answer that is related to this. – nbro – 2019-04-03T11:45:26.370

1@nbro: That's fine. It would be hard to cover all the ways a system can be a POMDP (and would require a different sort of answer). In quite a few cases, extensions to DQN may not even be required, or can be straightforward. Which is why I am asking OP for clarification. – Neil Slater – 2019-04-03T11:47:08.613

@NeilSlater I don't think the question needs more clarification. In general, Q-learning assumes that the current state is Markov and accessible, but this is not the case in POMDP, so Q-learning (tabular one, at least) shouldn't be directly applicable to POMDP. Nowadays, people use NNs to represent the Q functions, so a lot of approximations can be done, and Q-learning or variations of it are applied to partially observable problems/environments (where some engineering is made in order to build/represent artificial states). – nbro – 2019-04-03T11:50:03.080

@nbro: There's more to it than that, and I think clarification would help. – Neil Slater – 2019-04-03T11:51:43.503

@NeilSlater I think you brought up a good point that I wasn’t not aware of. Then I hope to know how applicable Q learning and SARSA are to each type and degree of POMDP. – drerD – 2019-04-03T16:59:01.910



The usual (as presented in Reinforcement Learning: An Introduction) $Q$-learning and SARSA algorithms use (and update) a function of a state $s$ and action $a$, $Q(s, a)$. These algorithms assume that the current state $s$ is known. However, in POMDP, at each time step, the agent does not know the current state, but it maintains a "belief" (which, mathematically, is represented as a probability distribution) in what the current state might be, so it cannot maintain (an approximation of) the function $Q(s, a)$. Hence, the usual Q-learning and SARSA algorithms shouldn't be directly applicable to a POMDP.

However, $Q$-learning is often used in the contexts where observations emitted by the environment (or transformations of the raw observations) are used to build the current state (which is assumed to be Markov, even if it is not). For example, in the original DQN, the action taken at the current step and the raw observation and the reward emitted by the environment (after this action is taken) are combined to produce the current (Markov) state. It might not be the case that the way they combine the action, the reward and the observation is sufficient to fully describe the current state (which might not even be Markov).

In this report, Deep Reinforcement Learning with POMDPs, the author attempts to use Q-learning in a POMDP setting. He suggests to represent a function, either $Q(b, a)$ or $Q(h, a)$, where $b$ is the "belief" over the states and $h$ the history of previously executed actions, using neural networks. So, the resulting parameterized functions would be denoted by $Q(b, a; \theta)$ or $Q(h, a; \theta)$, where $\theta$ is a vector representing the parameters of the corresponding neural network. Essentially, the author uses a DQN (with an experience replay buffer and target network), but the results are not great: $Q$ values converge, but policies do not and they are not robust (in that they are sensitive to small perturbations).


Posted 2019-04-03T02:40:29.227

Reputation: 19 783

Another approach (can't recall paper) is to factorize state space into observability classes, states of the same class have the same observation, assign reward and transition distributions to those classes, after that work with them as observable MDP – mirror2image – 2019-04-03T10:10:44.277

@mirror2image any reference to that? – drerD – 2020-04-04T02:34:51.163

@drerD google POMDP infosets (also infosets determinization) – mirror2image – 2020-04-06T07:48:01.400

@mirror2image You can provide a formal answer! I encourage you to do it. – nbro – 2020-04-06T12:47:22.007