## Can Q-learning be used to derive a stochastic policy?

In my understanding, Q-learning gives you a deterministic policy. However, can we use some technique to build a meaningful stochastic policy from the learned Q values? I think that simply using a softmax won't work.

You could just make a weighted random choice from your learned Q-values, rather than picking the maximum. – DrMcCleod – 2019-02-08T07:44:38.670

No, it is not possible to use Q-learning to build a deliberately stochastic policy, because the learning algorithm is designed around choosing solely the maximising action at each step, and that assumption is built into the action value update: $$Q_{k+1}(S_t,A_t) = Q_k(S_t,A_t) + \alpha(R_{t+1} +\gamma\max_{a'}Q_k(S_{t+1},a') - Q_k(S_t,A_t))$$ That is, the assumption is that the agent always chooses the highest-valued action, and this in turn is used to calculate the TD target. If you use a stochastic policy as the target policy, that assumption is broken, and the Q table (or function approximator) will not converge to estimates of the action values for that policy.*
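A minimal tabular sketch of this update (the function name and table shapes are illustrative, not from the question). The `max` over next actions is where the greedy target-policy assumption lives:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update. The max over a' encodes the
    assumption that the target policy is greedy w.r.t. Q."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```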

The policy produced by Q-learning can only be treated as stochastic when there is more than one maximum action value in a particular state - in which case you can select between the equivalent maximising actions using any distribution.
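For example, tie-breaking between equally-valued maximising actions could look like this (a small illustrative helper, not from the answer):

```python
import numpy as np

def greedy_with_random_ties(q_row, rng=None):
    """Pick uniformly at random among all actions that attain the
    maximum Q value for this state."""
    rng = rng or np.random.default_rng()
    best = np.flatnonzero(q_row == q_row.max())
    return int(rng.choice(best))
```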

In theory you could use the Q values to derive various distributions, such as a Boltzmann distribution, or the softmax you suggest (you will want to include a temperature parameter to make softmax work in general). These can work nicely for the behaviour policy during further training, as an alternative to the more common $\epsilon$-greedy approach. However, they are not optimal policies, and the training algorithm will not adjust their probabilities in any way that is meaningful to the problem you want to solve. You can set a value for e.g. $\epsilon$ in $\epsilon$-greedy, or use a more sophisticated action choice with more parameters, but no value-based method provides a way to adjust those parameters to make the action choice optimal.
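A sketch of such a Boltzmann (temperature-weighted softmax) behaviour policy over a state's Q values - the temperature parameter here is the weighting factor mentioned above, and is a free hyperparameter, not something value-based training will tune for you:

```python
import numpy as np

def boltzmann_policy(q_row, temperature=1.0):
    """Action probabilities proportional to exp(Q / temperature).
    Low temperature -> near-greedy; high temperature -> near-uniform."""
    prefs = q_row / temperature
    prefs -= prefs.max()          # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()
```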

In cases where a stochastic policy would perform better - e.g. Scissors, Paper, Stone against an opponent that exploits patterns in the agent's behaviour - value-based methods provide no mechanism to learn a correct distribution, and they typically fail to learn well in such environments. Instead you need to look into policy gradient methods, where the policy function is learned directly and can be stochastic. The most basic policy gradient algorithm is REINFORCE, and variations on Actor-Critic, such as A3C, are quite popular.
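To show the contrast, here is the core of a REINFORCE update for a single-state softmax policy (a simplified sketch under my own naming; real implementations parametrise the policy over all states). Unlike a value-based method, the update moves the action *probabilities* themselves in the direction of higher return:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()     # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(logits, action, G, lr=0.1):
    """One REINFORCE step for a softmax policy in a single state.
    grad of log pi(action) w.r.t. logits = one_hot(action) - pi,
    scaled by the return G."""
    pi = softmax(logits)
    grad_log_pi = -pi
    grad_log_pi[action] += 1.0
    return logits + lr * G * grad_log_pi
```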

* You could get around this limitation by using an estimator that does work with a stochastic target policy, e.g. SARSA or Expected SARSA. Expected SARSA can even be used off-policy, to learn one stochastic policy's Q values whilst behaving differently. However, neither of these provides the ability to change the probability distribution towards an optimal one.
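For comparison with the Q-learning update above, a tabular Expected SARSA sketch (names illustrative): the bootstrap term is the *expectation* of Q under the target policy's distribution, so a stochastic target policy is valid - but the distribution itself is still an input, not something the update improves:

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, target_probs,
                          alpha=0.1, gamma=0.99):
    """Expected SARSA: bootstrap on the expectation of Q(s', .) under
    the (possibly stochastic) target policy, instead of the max."""
    expected_q = np.dot(target_probs, Q[s_next])
    Q[s, a] += alpha * (r + gamma * expected_q - Q[s, a])
    return Q
```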

What do you mean by "drive" in "Q values to drive various distributions"? Maybe you meant "derive"? Also, what is the relation between the behaviour policy and the fact that Q-learning can or cannot be used to derive a stochastic target policy? "However, they are not optimal policies" - why can't these policies also be optimal? What are the optimal policies? Just the deterministic ones? – nbro – 2019-02-08T12:17:42.957

@nbro: Actually I did mean "drive" as in use to control, but you are right that "derive" fits better. Your other questions should be asked separately; I already explained the general reason why in the first paragraph - it seems you want worked details/proof? Either way, the Q values reflect the target optimal policy, so choosing a different policy is guaranteed to be non-optimal (just reverse the policy improvement theorem to see this). – Neil Slater – 2019-02-08T12:21:42.173