In some implementations of off-policy Q-learning, we need to know the action probabilities mu(a) given by the behavior policy (e.g., if we want to use importance sampling).
In my case, I am using Deep Q-Learning and selecting actions via Thompson Sampling. I implemented this following the approach in "What My Deep Model Doesn't Know...": I added dropout to my Q-network, and I select actions by performing a single stochastic forward pass through the Q-network (i.e., with dropout enabled) and choosing the action with the highest Q-value.
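To make the setup concrete, here is a minimal NumPy sketch of the selection procedure I described (the network shapes, weights, and dropout rate are just illustrative placeholders, not my actual model). Under this scheme, mu(a) is the probability that action a wins the argmax over the random dropout masks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy Q-network: one hidden layer with dropout on its
# activations. Dimensions and weights are illustrative only.
STATE_DIM, HIDDEN, N_ACTIONS = 4, 16, 3
W1 = rng.normal(size=(STATE_DIM, HIDDEN))
W2 = rng.normal(size=(HIDDEN, N_ACTIONS))
P_DROP = 0.5

def stochastic_q_values(state):
    """One forward pass with dropout ENABLED, i.e. one posterior sample of Q."""
    h = np.maximum(0.0, state @ W1)       # ReLU hidden layer
    mask = rng.random(HIDDEN) >= P_DROP   # Bernoulli dropout mask
    h = h * mask / (1.0 - P_DROP)         # inverted-dropout scaling
    return h @ W2

def select_action(state):
    """Thompson Sampling: act greedily w.r.t. a single stochastic pass.

    mu(a) is then P(argmax of a stochastic pass == a), marginalized
    over the dropout masks -- this is the quantity I am asking about.
    """
    return int(np.argmax(stochastic_q_values(state)))

state = rng.normal(size=STATE_DIM)
action = select_action(state)
```

In a real implementation the same effect comes from keeping the dropout layers active at inference time (e.g., leaving the network in training mode for the action-selection pass).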
So, how can I calculate
mu(a) when using Thompson Sampling based on dropout?