A3C - Turning action probabilities into intensities



I'm experimenting with using an A3C network to learn to play old Atari video games. My network outputs a set of probabilities for each possible action (e.g. left, right, shoot), and I use this information to determine which action to take in the game.

However, I have gotten to thinking about how one would go about playing a game with non-binary actions. For example, steering a car left or right with a wheel instead of keyboard keys. I thought about simply translating probabilities into intensities (e.g. if I have values of 1.0/0.0 for left/right, make the hardest left turn possible, but make a much more gradual turn if my values are 0.6/0.4), but I'm not sure whether this makes sense mathematically, or even in practice.

Is there a standard approach to doing this?

Levi Botelho

Posted 2018-01-27T12:49:50.810

Reputation: 133



In general, Policy Gradients (PG) come in two 'flavors': stochastic and deterministic.

In your case, you can build a NN that outputs continuous actions directly, or one that approximates the sufficient statistics of a probability distribution from which you then sample the actions. The main references are: Theoretical Proofs and NNs with DPG.

There are no rules on which one to use. The papers I cite contain many examples where DPG outperforms SPG, but also the opposite. Sometimes you really want the output to be deterministic. For example, in pricing: if a client comes into a store and you use SPG for assigning prices, the client will encounter different prices every single time (small differences, but still differences).

In my personal experience I find it a bit hard to stabilize DPG with NNs, and you also need to come up with a good exploration strategy. However, once the network is stabilized you get the continuous action values that could control your vehicle. Here is a detailed example of controlling a car with DPG using TensorFlow and Keras.
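To make the exploration point concrete, here is a minimal sketch of the Ornstein-Uhlenbeck noise process commonly added to DDPG's deterministic actions. The class name and the `theta`/`sigma`/`dt` values are illustrative, not taken from any particular implementation:

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck exploration noise, commonly added to the
    deterministic actor's output in DDPG (parameter values are
    illustrative)."""
    def __init__(self, theta=0.15, sigma=0.2, dt=1e-2, seed=0):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.x = 0.0
        self.rng = np.random.default_rng(seed)

    def sample(self):
        # Mean-reverting random walk: drifts back toward 0 but wanders,
        # giving temporally correlated exploration for continuous control.
        self.x += (-self.theta * self.x * self.dt
                   + self.sigma * np.sqrt(self.dt) * self.rng.normal())
        return self.x
```

Because successive samples are correlated, the noise produces smooth steering perturbations rather than independent jitter at every timestep.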

For a general theoretical exploration of the two frameworks I suggest this Master's thesis. It provides a concise overview of the two methods (no NN implementations). You can also look at the answer I gave here if you decide to get on board with PGs.


Posted 2018-01-27T12:49:50.810

Reputation: 1 531

Thanks for the answer! Could you perhaps elaborate a bit on using deterministic or stochastic PGs? Would a deterministic one function as I suggested in my question? Are there differences in the use cases for each variant? – Levi Botelho – 2018-01-28T13:02:57.913

Levi, in my answer I give a link to a blog with a detailed implementation of a Deep Deterministic Policy Gradient algorithm for controlling a car. I think this will help you a lot! – Constantinos – 2018-01-28T18:19:29.853


The standard approach with policy gradients for continuous action spaces is to output a vector of parameters for a probability distribution. To resolve the policy into an action for the agent, you then sample from that distribution.

In policy gradients with discrete action spaces, this is actually already the case: the softmax layer provides a discrete distribution for you, which you must sample from to choose the action.
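The discrete case can be sketched in a few lines; the logits here are made-up values a policy network might output for three actions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative logits a policy network might output for three
# discrete actions (left, right, shoot).
logits = np.array([2.0, 0.5, -1.0])
probs = np.exp(logits) / np.exp(logits).sum()   # softmax

# Sample the action index from the distribution instead of taking
# the argmax, so the stochastic policy keeps exploring.
action = rng.choice(len(probs), p=probs)
```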

The general rule is that your probability distribution function needs to be differentiable. A common choice is the Normal distribution, and for the output vector to be the mean and standard deviation. This adds an extra "interpretation layer" to the agent's model in addition to the NN, which needs to be included in the gradient calculation.
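A rough sketch of that interpretation layer for a single continuous action (the mean and log-std values here are invented for illustration; real ones would come from the network's output layer):

```python
import numpy as np

rng = np.random.default_rng(0)

# Suppose the network's final layer outputs a mean and a log standard
# deviation for one continuous action (steering). These numbers are
# made up for illustration.
mean, log_std = 0.3, -1.0
sigma = np.exp(log_std)          # exponentiating keeps sigma positive

# Sample the action from the Normal distribution the network parameterises.
action = rng.normal(mean, sigma)

# Log-probability of the sampled action, which enters the policy-gradient
# update via grad log pi(a|s) * advantage.
log_prob = (-0.5 * np.log(2 * np.pi * sigma**2)
            - (action - mean) ** 2 / (2 * sigma**2))
```

Outputting the log of the standard deviation is a common trick: it lets the network produce any real number while guaranteeing a positive sigma.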

Your idea:

e.g. if I have values of 1.0/0.0 for left/right, then make the hardest left turn possible, but make a much more gradual turn if my values are 0.6/0.4

. . . is almost there. However, you need to interpret the output values stochastically, not deterministically, in order to use policy gradients. A deterministic output based on your parameters has no gradient with respect to improvements in the policy, so the policy cannot be adjusted*. Another way to think of this is that policy gradient methods must have exploration built into the policy function.

It would be quite difficult to turn the left/right outputs you have into a PDF that could be made progressively tighter around the optimal value as the agent homed in on the best actions. So I would instead suggest the common mean/standard deviation split, and have the environment cut off the action at min/max steer if the sampled action ended up as e.g. 1.7 times hard left.
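That sampling-then-clipping step might look like this (the function name and range convention of -1 for hard left, +1 for hard right are assumptions for the sketch):

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_steering(mean, std):
    """Sample a steering action from N(mean, std); the environment
    then clips it to the physical range [-1, 1] (hard left to hard
    right), so rare out-of-range samples are simply cut off."""
    raw = rng.normal(mean, std)
    return float(np.clip(raw, -1.0, 1.0))
```

Early in training a wide standard deviation explores the full steering range; as learning tightens the distribution, sampled actions concentrate near the mean.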

* Actually this is incorrect, as pointed out in Constantinos' answer. There are deterministic policy gradient solvers, and sometimes they are better. They work by learning off-policy. Your network could simply output a steering direction from -1.0 to 1.0, but you would also need a behaviour policy that adds some randomness to this output in order to learn.
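A toy sketch of that deterministic-actor-plus-behaviour-policy split (the linear-plus-tanh actor and Gaussian noise here are stand-ins; DDPG typically uses a deep network and Ornstein-Uhlenbeck noise):

```python
import numpy as np

rng = np.random.default_rng(0)

def deterministic_policy(features, weights):
    """Toy deterministic actor: tanh squashes a linear output into
    the steering range [-1, 1]."""
    return float(np.tanh(features @ weights))

def behaviour_action(features, weights, noise_std=0.2):
    """Behaviour policy for off-policy learning: the deterministic
    output plus Gaussian exploration noise, clipped back into range."""
    noisy = deterministic_policy(features, weights) + rng.normal(0.0, noise_std)
    return float(np.clip(noisy, -1.0, 1.0))
```

The agent acts (and gathers experience) with `behaviour_action`, while the gradient update improves the noise-free `deterministic_policy`.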

I also think you would need to switch from A3C to A2C in order to take advantage of deterministic policy gradient solvers.

Neil Slater

Posted 2018-01-27T12:49:50.810

Reputation: 24 613

This is not true. Deterministic policy gradients exist; you can take as reference the paper titled Deterministic policy gradients, in which you can read the proofs. The paper titled Continuous control with Deep Deterministic Policy Gradients provides examples with NNs that output continuous actions. – Constantinos – 2018-01-28T07:56:47.480

@Constantinos: I could not find a paper with that title. Do you mean this one: https://arxiv.org/abs/1509.02971 ? There is also this: https://deepmind.com/research/publications/deterministic-policy-gradient-algorithms/ - I'm happy to admit I'm wrong - just reading it now to see how it solves exploration and gradient issues from being single valued output . . .

– Neil Slater – 2018-01-28T08:22:56.457

Yes, Neil! These are the papers. I wasn't sure how to include them as links in the comments. I will elaborate a bit more in my answer; there I can put the links. DPG exists at the limit of the stochastic one. To use it you need a good stochastic exploration process that adds noise to the action selection. – Constantinos – 2018-01-28T17:00:44.793