What happens when you select actions using softmax instead of epsilon greedy in DQN?

3

I understand the two major branches of RL are Q-Learning and Policy Gradient methods.

From my understanding (correct me if I'm wrong), policy gradient methods have exploration built in, as they select actions by sampling from a probability distribution.

On the other hand, DQN explores using the $\epsilon$-greedy policy: it selects either the estimated best action or a random action.

What if we use a softmax function to select the next action in DQN? Does that provide better exploration and policy convergence?

Linsu Han

Posted 2020-06-23T16:47:51.683

Reputation: 33

Answers

1

DQN, on the other hand, explores using epsilon-greedy exploration: selecting either the best action or a random action.

This is a very common choice, because it is simple to implement and quite robust. However, it is not a requirement of DQN. You can use other action choice mechanisms, provided all choices are covered with a non-zero probability of being selected.

What if we use a softmax function to select the next action in DQN? Does that provide better exploration and policy convergence?

It might in some circumstances. A key benefit is that it will tend to focus on action choices that are close to its current best guess at optimal. One problem is that if there is a large enough error in the Q value estimates, it can get stuck, because the exploration heavily favours the current best value estimate. For instance, if one estimate is accurate and relatively high, but another estimate is much lower when in reality it would be a good action choice, then the softmax probability of re-sampling the underestimated action will be very low, and it could take a very long time to correct.

A more major problem is that the Q values are not independent logits that define preferences (whereas the outputs of a Policy Gradient approach would be). The Q values have an inherent meaning and scale based on summed rewards. This means that differences between optimal and non-optimal Q value estimates could be at any scale: maybe just 0.1 difference in value, or maybe 100 or more. This makes plain softmax a poor choice - it might produce a near-random exploration policy in one problem and a near-deterministic policy in another, irrespective of what exploration would be useful at the current stage of learning.
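To illustrate the scale problem, here is a small sketch (the Q values are made up for demonstration) showing how plain softmax over Q values flips between near-uniform and near-deterministic behaviour purely because of the reward scale, even though the action *ordering* is identical:

```python
import numpy as np

def softmax(q):
    # Subtract the max for numerical stability; this does not change the result.
    z = q - np.max(q)
    e = np.exp(z)
    return e / e.sum()

# Same two-action preference ordering, different reward scales.
small_gap = np.array([0.1, 0.0])    # Q values differ by 0.1
large_gap = np.array([100.0, 0.0])  # Q values differ by 100

print(softmax(small_gap))  # roughly [0.525, 0.475] - near-random exploration
print(softmax(large_gap))  # roughly [1.0, 0.0]     - near-deterministic
```

Neither behaviour is chosen by the agent; it is dictated entirely by the scale of the returns in the environment.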

A fix for this is to use Gibbs/Boltzmann action selection, which modifies softmax by adding a scaling factor - often called temperature and denoted $T$ - to adjust the relative scale between action choices:

$$\pi(a|s) = \frac{e^{q(s,a)/T}}{\sum_{x \in \mathcal{A}} e^{q(s,x)/T}}$$

This can work nicely to focus later exploration towards refining the differences between actions that are likely to be good, whilst only rarely making obvious mistakes. However, it comes at a cost - you have to decide the starting $T$, the rate at which to decay $T$, and an end value of $T$. A rough idea of the minimum/maximum action values that the agent is likely to estimate can help.
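The formula above can be sketched as a small action-selection routine. This is just an illustration, not a fixed part of DQN - the function name and the geometric decay schedule are my own choices:

```python
import numpy as np

def boltzmann_action(q_values, temperature, rng):
    """Sample an action with probability proportional to exp(q / T)."""
    z = q_values / temperature
    z = z - np.max(z)          # numerical stability before exponentiating
    probs = np.exp(z)
    probs /= probs.sum()
    return rng.choice(len(q_values), p=probs), probs

rng = np.random.default_rng(0)
q = np.array([1.0, 1.5, 0.5])  # example Q value estimates for 3 actions

# High temperature early in training: probabilities close to uniform.
_, p_hot = boltzmann_action(q, temperature=10.0, rng=rng)

# Low temperature late in training: concentrates on the current best action.
_, p_cold = boltzmann_action(q, temperature=0.1, rng=rng)

# One possible decay schedule (geometric, clipped at a floor):
#   T = max(T_end, T_start * decay ** step)
```

As $T \to \infty$ the policy approaches uniform random; as $T \to 0$ it approaches greedy action selection, which is why the schedule for $T$ matters.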

Neil Slater

Posted 2020-06-23T16:47:51.683

Reputation: 14 632