It is my understanding that, in Q-learning, you are trying to approximate the optimal $Q$ function $Q^*$, where $Q^*(s, a)$ is the expected cumulative reward from taking action $a$ in state $s$ and acting optimally from then on.
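In symbols, this is the standard definition I have in mind. The discount factor $\gamma \in [0, 1)$, the per-step reward $r_t$, and the policy $\pi$ are notation I am introducing here for illustration:

$$Q^*(s, a) = \max_{\pi} \, \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s,\ a_0 = a,\ \pi \right]$$

which, as far as I know, satisfies the Bellman optimality equation, with $s'$ the next state and $r$ the immediate reward:

$$Q^*(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a \right]$$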
I understand that to compute this exactly you would have to explore every possible game state, and since that is intractable for most games, a neural network is used to approximate the function.
In the usual setup, the network is updated based on the MSE between the actual reward received and the network's predicted reward. So a simple network that is meant to choose a direction to move would receive a positive gradient for all state predictions for the entire game and do a normal backprop step from there.
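To make the update concrete, here is a minimal sketch of a single step as I understand it. One caveat: in standard Q-learning the regression target is the bootstrapped value $r + \gamma \max_{a'} Q(s', a')$ rather than the raw reward alone, so I have written it that way. The linear model standing in for the network, the sizes, and all function and variable names are placeholders of my own, not taken from any particular library.

```python
import numpy as np

def q_values(W, state):
    """Linear stand-in for the network: one predicted value per action."""
    return W @ state

def q_learning_step(W, state, action, reward, next_state, done,
                    gamma=0.99, lr=0.1):
    """One gradient step on the squared TD error for the action taken."""
    # Bootstrapped target: reward plus discounted best predicted value
    # of the next state (just the reward if the episode has ended).
    if done:
        target = reward
    else:
        target = reward + gamma * np.max(q_values(W, next_state))
    prediction = q_values(W, state)[action]
    td_error = target - prediction
    # Gradient of 0.5 * td_error**2 w.r.t. W[action] is -td_error * state
    # (target treated as a constant), so only that action's row changes.
    W = W.copy()
    W[action] += lr * td_error * state
    return W, td_error

W = np.zeros((4, 3))                 # 4 actions, 3 state features (arbitrary)
s = np.array([1.0, 0.0, 0.0])
s_next = np.array([0.0, 1.0, 0.0])
W, err = q_learning_step(W, s, action=2, reward=1.0,
                         next_state=s_next, done=False)
```

Only the output for the action actually taken receives a gradient in this sketch; the predictions for the other actions in that state are left alone.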
However, to me, it makes intuitive sense to have the final layer of the network be a softmax function for some games. This is because in a lot of cases (Go, for example), only one "move" can be chosen per game state, and as such, only one neuron should be active. It also seems to me that this would work well with the gradient update, and the network would learn appropriately.
But the big problem here is that this is no longer Q-learning. The network no longer predicts the reward for each possible move; it now predicts which move is likely to give the greatest reward.
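Here is a small sketch of what I mean by the difference. The Q-value numbers are made up for illustration, and `softmax` is my own helper rather than a library call. Applying a softmax keeps the ranking of the moves but throws away the scale, so two states with very different expected rewards can produce identical outputs.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

# Hypothetical Q-value estimates for four moves in two different states.
q_good_state = np.array([10.0, 12.0, 11.0, 9.0])
q_bad_state = q_good_state - 20.0   # every move is 20 worse, same ranking

p_good = softmax(q_good_state)
p_bad = softmax(q_bad_state)

# The greedy move is the same whether picked from Q-values or softmax outputs.
best_from_q = int(np.argmax(q_good_state))
best_from_softmax = int(np.argmax(p_good))

# Softmax is invariant to adding a constant to every input, so both states
# map to the same output and the expected-reward information is lost.
same_distribution = bool(np.allclose(p_good, p_bad))
```

So if the network only needs to pick a move, the softmax output seems sufficient, but it cannot serve as the $\max_{a'} Q(s', a')$ term that the Q-learning target needs.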
Am I wrong in my assumptions about Q-learning? Is the softmax function used in Q-learning at all?