## Does using the softmax function in Q learning not defeat the purpose of Q learning?

4

It is my understanding that, in Q-learning, you are trying to approximate the optimal action-value function $$Q^*$$, where $$Q^*(s,a)$$ is a measure of the expected return from taking action $$a$$ in state $$s$$ and acting to maximise reward thereafter.

I understand that, for this to be calculated exactly, you must visit all possible game states, and as that is obviously intractable, a neural network is used to approximate the function.

In a normal case, the network is updated based on the MSE between the actual reward received and the network's predicted reward. So a simple network that is meant to choose a direction to move would receive a gradient for its state predictions over the entire game and do a normal backprop step from there.
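The update described above can be sketched in tabular form, where the MSE gradient step reduces to nudging the stored estimate toward the TD target. This is a minimal sketch with made-up sizes, states, and learning rate, not anything from a specific library:

```python
import numpy as np

# Tabular sketch: with a Q-table instead of a network, the MSE gradient
# step reduces to moving Q[s, a] toward the TD target
# r + gamma * max_a' Q[s', a']. All values here are illustrative.

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

Q = np.zeros((2, 2))                        # 2 states x 2 actions
Q = q_update(Q, s=0, a=1, r=1.0, s_next=1)
print(Q[0, 1])                              # moved 10% of the way to the target
```

With a neural network the same TD target is used, but the step is taken through the network's parameters rather than a single table entry.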

However, to me, it makes intuitive sense to have the final layer of the network be a softmax function for some games. This is because in a lot of cases (like Go for example), only one "move" can be chosen per game state, and as such, only one neuron should be active. It also seems to me that would work well with the gradient update, and the network would learn appropriately.

But the big problem here is, this is no longer Q learning. The network no longer predicts the reward for each possible move, it now predicts which move is likely to give the greatest reward.

Am I wrong in my assumptions about Q learning? Is the softmax function used in Q learning at all?

1Just to let you know I have updated my answer as I thought of a way in which the softmax distribution is used in Q learning, indirectly. – Neil Slater – 2019-12-08T14:30:16.797

3

However, to me, it makes intuitive sense to have the final layer of the network be a softmax function for some games. This is because in a lot of cases (like Go for example), only one "move" can be chosen per game state, and as such, only one neuron should be active.

You are describing a network that approximates a policy function, $$\pi(a|s)$$, for a discrete set of actions.

It also seems to me that would work well with the gradient update, and the network would learn appropriately.

Yes, there are ways to do this, based on the Policy Gradient Theorem. If you read it, you will probably discover this is more complex than you first thought. The problem is that the agent is never directly told what the "best" action is, so it cannot simply learn in a supervised manner. Instead, the learning signal has to be inferred from rewards observed whilst acting. This is a bit harder to figure out than the Q learning update rules, which are just sampling from the Bellman optimality equation.

You can split Reinforcement Learning methods broadly into value-based methods and policy gradient methods. Q learning is a value-based method, whilst REINFORCE is a basic policy gradient method. It is also common to use a value-based method within a policy gradient method in order to help estimate the likely future return that drives the policy gradient updates - this combination is called Actor-Critic, where the actor learns a policy function $$\pi(a|s)$$ and the critic learns a value function, e.g. $$V(s)$$.
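To make the distinction concrete, here is a minimal REINFORCE sketch on a two-armed bandit. The bandit, its reward probabilities, and the learning rate are all hypothetical, and a softmax over a plain preference vector stands in for the policy network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-armed bandit: arm 1 pays off more often than arm 0.
true_rewards = np.array([0.2, 0.8])

# The "policy network" is just a preference vector theta; softmax(theta)
# gives the policy pi(a). These names are illustrative only.
theta = np.zeros(2)
alpha = 0.1

def softmax(x):
    e = np.exp(x - x.max())          # shift by max for numerical stability
    return e / e.sum()

for _ in range(2000):
    pi = softmax(theta)
    a = rng.choice(2, p=pi)                      # sample action from policy
    r = float(rng.random() < true_rewards[a])    # Bernoulli reward
    grad_log = -pi                               # gradient of log pi(a|theta)
    grad_log[a] += 1.0
    theta += alpha * r * grad_log                # REINFORCE update

print(softmax(theta))   # most of the probability mass ends up on arm 1
```

Note there is no value estimate anywhere: the policy is adjusted directly from sampled rewards, which is exactly what makes it a policy gradient method rather than Q learning.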

But the big problem here is, this is no longer Q learning. The network no longer predicts the reward for each possible move, it now predicts which move is likely to give the greatest reward.

This is true, but it is not a big problem. The main issue is that policy gradient methods are more complex than value-based methods. They may or may not be more effective; it depends on the environment you are trying to create an optimal agent for.

Is the softmax function used in Q learning at all?

I cannot think of any non-contrived environment in which this function would be useful for an action value approximation.

However, it is possible to use a variant of softmax to create a behaviour policy for Q learning. This uses a temperature hyperparameter $$T$$ to weight the Q values and produce a probability of selecting each action, as follows:

$$\pi(a_i|s) = \frac{e^{Q(s,a_i)/T}}{\sum_j e^{Q(s,a_j)/T}}$$

When $$T$$ is high, all action probabilities will be similar; when it is low, even a small difference in $$Q(s,a_i)$$ will make a big difference to the probability of selecting action $$a_i$$. This is quite a nice distribution for exploring whilst avoiding previously bad decisions. It will tend to focus the agent on exploring differences between similarly high-rated actions. The main issue with it is that it introduces hyperparameters for deciding the starting $$T$$, the ending $$T$$, and how to move between them.
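The formula above translates directly into code. A minimal sketch, with the Q values made up purely for illustration:

```python
import numpy as np

# Softmax (Boltzmann) behaviour policy from the formula above.

def softmax_policy(q_values, T):
    z = np.asarray(q_values, dtype=float) / T
    z -= z.max()                 # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

q = [1.0, 1.1, 0.5]                  # illustrative Q(s, a_i) values
p_hot = softmax_policy(q, T=10.0)    # high T: close to uniform exploration
p_cold = softmax_policy(q, T=0.1)    # low T: concentrates on the best action
print(p_hot)
print(p_cold)
```

An agent would sample its action from the returned distribution, e.g. with `np.random.default_rng().choice(len(q), p=p_cold)`, annealing `T` downward over training.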

0

This is still Q-learning: remember that Q-learning is an off-policy, value-based method. The Bellman optimality operator is $$\mathcal{T}Q = r + \gamma \max_{a'} Q(s', a')$$, and it is a contraction: provided you have enough exploration, repeated application takes any $$Q$$ to the optimal fixed point, regardless of the behaviour policy used to collect experience.
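The fixed-point claim can be illustrated on a tiny invented MDP: starting from two very different initial Q tables, repeated application of the Bellman optimality operator drives both to the same answer. Transitions and rewards below are made up for illustration:

```python
import numpy as np

# Tiny made-up 2-state, 2-action deterministic MDP.
gamma = 0.9
P = np.array([[0, 1], [1, 0]])           # P[s, a] = next state
R = np.array([[0.0, 1.0], [0.5, 0.0]])   # R[s, a] = reward

def bellman(Q):
    # (T Q)[s, a] = R[s, a] + gamma * max_a' Q[P[s, a], a']
    return R + gamma * np.max(Q[P], axis=-1)

Q1, Q2 = np.zeros((2, 2)), 10.0 * np.ones((2, 2))
for _ in range(300):
    Q1, Q2 = bellman(Q1), bellman(Q2)

print(np.allclose(Q1, Q2))   # True: both converge to the same fixed point
```

The initial gap of 10 shrinks by a factor of $$\gamma$$ per application, which is why the behaviour policy (softmax, $$\epsilon$$-greedy, or anything else sufficiently exploratory) does not change where $$Q$$ ends up.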