Many examples work with a table-based method for Q-learning. This may be suitable for discrete states (observations) or actions, like a robot in a grid world, but is there a way to use Q-learning for continuous spaces, like the control of a pendulum?

Yes, this is possible, provided you use some mechanism of approximation. One approach is to discretise the state space, and that doesn't have to reduce the space to a small number of states. Provided you can sample and update enough times, then a few million states is not a major problem.
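As a sketch of that first approach, here is one hypothetical way to discretise a pendulum's continuous state (angle, angular velocity) so a plain Q-table still applies; the bin counts and value ranges are illustrative, not tuned:

```python
import numpy as np

# Illustrative discretisation of a pendulum's continuous state into a grid of
# bins, so that tabular Q-learning can be applied. Bin counts and ranges are
# assumptions for the sketch, not tuned values.
N_ANGLE_BINS = 50
N_VEL_BINS = 50

angle_bins = np.linspace(-np.pi, np.pi, N_ANGLE_BINS - 1)
vel_bins = np.linspace(-8.0, 8.0, N_VEL_BINS - 1)

def discretise(angle, velocity):
    """Map a continuous (angle, velocity) pair to a single discrete state id."""
    i = np.digitize(angle, angle_bins)    # 0 .. N_ANGLE_BINS - 1
    j = np.digitize(velocity, vel_bins)   # 0 .. N_VEL_BINS - 1
    return i * N_VEL_BINS + j

n_states = N_ANGLE_BINS * N_VEL_BINS      # 2500 states: fine for a table
q_table = np.zeros((n_states, 3))         # e.g. 3 discrete torque actions

s = discretise(0.1, -0.5)                 # index into q_table as usual
```

With a few million states this same scheme still works, provided you can sample and update enough times.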

However, with large state spaces it is more common to use some form of function approximation for the action value. This is often written $\hat{q}(s,a,\theta)$ to show both that it is an estimate (the circumflex over the $q$) and that you are learning some function parameters ($\theta$). There are broadly two popular approaches to Q-learning using function approximation:

Linear function approximation over a processed version of the state, turned into features. Many ways to generate features have been proposed and tested, including Fourier series, tile coding and radial basis functions. The advantage of these methods is that they are simple, and more robust than non-linear function approximations. Which one to choose depends on what your state space represents and how the value function is likely to vary with location within the state space.
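To make the linear case concrete, here is a minimal sketch using radial basis function features for a 2-D state, with one weight vector per discrete action; the centres, widths and learning rate are all made-up values for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical RBF feature setup for a 2-D state space: 16 Gaussian bumps,
# and a separate weight vector theta per discrete action.
centres = rng.uniform(-1.0, 1.0, size=(16, 2))
sigma = 0.5
n_actions = 3
theta = np.zeros((n_actions, centres.shape[0]))   # learned parameters

def features(state):
    """phi(s): Gaussian bumps centred on the RBF centres."""
    d2 = np.sum((centres - state) ** 2, axis=1)
    return np.exp(-d2 / (2 * sigma ** 2))

def q_hat(state, action):
    """Linear action-value estimate q_hat(s, a, theta)."""
    return theta[action] @ features(state)

def update(state, action, target, alpha=0.1):
    """One semi-gradient Q-learning step towards a TD target."""
    phi = features(state)
    td_error = target - theta[action] @ phi
    theta[action] += alpha * td_error * phi
```

The same `q_hat`/`update` interface would apply with tile coding or Fourier features; only `features` changes.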

Neural network function approximation. This is essentially what Deep Q Networks (DQN) are. Provided you have a Markov state description, you scale it to work sensibly with neural networks, and you follow other DQN best practices (experience replay table, slow changing target network) this can work well.

All of the above assumes a discrete action space. Unless you discretise a continuous action space, Q-learning becomes very unwieldy.

The problem is that, given $s,a,r,s'$, Q-learning needs to evaluate the TD target:

$$Q_{target}(s,a) = r + \gamma \max_{a'} \hat{q}(s',a',\theta)$$

The process for evaluating the maximum becomes less efficient and less accurate the larger the space that it needs to check.
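As an illustration, with a small discrete action set the TD target above is a direct maximisation over the per-action estimates (the numbers here are made up):

```python
import numpy as np

# Sketch of the Q-learning TD target with a small discrete action set; q_next
# holds illustrative estimates q_hat(s', a', theta), one entry per action a'.
gamma = 0.99
r = 1.0
q_next = np.array([0.2, 0.5, 0.1])

target = r + gamma * np.max(q_next)   # this max is the step that becomes
                                      # costly as the action space grows
```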

For somewhat large action spaces, double Q-learning can help. It maintains two estimates of Q: one picks the target action, the other estimates its value, and you alternate between them on different steps. This avoids maximisation bias, where picking an action because it has the highest estimated value and then using that same highest value in the update leads to over-estimating the true value.
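A minimal sketch of the double Q-learning target selection, with made-up values for the two estimators:

```python
import numpy as np

# One estimator (A) chooses the greedy action, the other (B) evaluates it.
# The values below are illustrative, standing in for two Q-tables or networks.
gamma = 0.99
r = 0.0
q_a_next = np.array([1.0, 3.0, 2.0])   # estimator A's values at s'
q_b_next = np.array([1.5, 1.0, 2.5])   # estimator B's values at s'

best_action = np.argmax(q_a_next)             # A selects the action...
target = r + gamma * q_b_next[best_action]    # ...B supplies its value

# Plain Q-learning would use max(q_a_next) = 3.0 here; the double estimator
# instead uses B's value 1.0 for that action, reducing the upward bias.
```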

For very large or continuous action spaces, it is not usually practical to check all values. The alternative to Q-learning in this case is to use a policy gradient method such as Actor-Critic which can cope with very large or continuous action spaces, and does not rely on maximising over all possible actions in order to enact or evaluate a policy.

For a discrete action space, e.g. applying one of a choice of forces on each time step, this can be done using a DQN approach or any other function approximation. The classic example here is an environment like OpenAI's CartPole-v1, where the state space is continuous but there are only two possible actions. This can be solved easily using DQN; it is something of a beginner's problem.

Adding a continuous action space leads to something like the Pendulum-v0 environment. This can be solved to some degree using DQN by discretising the action space (into e.g. 9 different actions). However, better solutions are possible using an Actor-Critic algorithm such as A3C.
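That discretisation is a one-liner: map Pendulum-v0's continuous torque range (which is [-2, 2]) onto a fixed grid, and let the Q-network maximise over those few values only. Nine actions is the illustrative count used above:

```python
import numpy as np

# Discretise Pendulum-v0's continuous torque range [-2, 2] into 9 actions,
# so a DQN only has to maximise over 9 Q-values per step.
torques = np.linspace(-2.0, 2.0, 9)   # [-2.0, -1.5, ..., 1.5, 2.0]

def to_continuous(action_index):
    """Map a discrete action index chosen by the Q-network to a torque."""
    return torques[action_index]
```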

Q-Learning for continuous state space

Reinforcement learning algorithms (e.g. Q-learning) can be applied to both discrete and continuous spaces. If you understand how they work in the discrete case, you can move to the continuous case. That's why in the literature all the introductory material focuses on the discrete case: it's easier to model (tables, grids, ...).

Supposing you have a discrete number of actions, **the only difference** in a continuous space is that you sample the state every X units of time (X being a number you can choose depending on your use case). So you effectively end up with a discrete-time process, but probably with an infinite number of states. You then apply the same approach you learned for the discrete case.

Let's take the example of self-driving cars: every X ms (e.g. X=1) you compute the state of the car, which gives your input features (e.g. direction, orientation, rotation, distance to the pavement, relative position on the lane, ...), and decide which action to take, as in the discrete case. The approach is the same in other use cases, like playing games or a walking robot.

Note (continuous action space):

If you have continuous actions, then in almost all use cases the best approach is to discretise them. I can't think of an example where discretising the actions leads to a considerable loss of performance.

More formally, you are suggesting to discretise the continuous state space. Do you have any references to works/papers that apply a discretisation of the state space? I think it might be helpful to cite them, if that's the case. – nbro – 2019-05-11T11:42:36.107

If the action space is continuous, you could model it by outputting a mean and variance of a suitable distribution, and then sample from the distribution parameterised by those outputs (e.g. applied voltages). – Hanzy – 2019-05-11T12:17:46.457

For references using discretisation, you can take a look at the Arcade Learning Environment paper, which uses frame-skipping techniques (also used in the well-known Playing Atari with Deep Reinforcement Learning paper). Current RL papers that take video as input process it at the level of each frame, thanks to the computational power now available.

– HLeb – 2019-05-11T12:23:38.590