Action Probability with Thompson Sampling in Deep Reinforcement Learning


In some implementations of off-policy Q-learning, we need to know the action probabilities mu(a) given by the behavior policy (e.g., if we want to use importance sampling).

In my case, I am using Deep Q-Learning and selecting actions using Thompson Sampling. I implemented this following the approach in "What My Deep Model Doesn't Know...": I added dropout to my Q-network and select actions by performing a single stochastic forward pass through the Q-network (i.e., with dropout enabled) and choosing the action with the highest Q-value.
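For concreteness, the action selection described above might look roughly like the sketch below. It is a minimal NumPy illustration, not the actual network: the two-layer architecture, the layer sizes, the dropout rate `p_drop` and the function names are all assumptions made for the example.

```python
import numpy as np

def q_values_with_dropout(state, params, rng, p_drop=0.5):
    """One stochastic forward pass of a tiny two-layer Q-network (hypothetical)."""
    W1, b1, W2, b2 = params
    hidden = np.maximum(0.0, state @ W1 + b1)      # ReLU hidden layer
    mask = rng.random(hidden.shape) >= p_drop      # keep each unit with prob 1 - p_drop
    hidden = hidden * mask / (1.0 - p_drop)        # inverted-dropout scaling
    return hidden @ W2 + b2                        # one Q-value per action

def thompson_action(state, params, rng, p_drop=0.5):
    """Sample one dropout mask, then act greedily on the sampled Q-values."""
    q = q_values_with_dropout(state, params, rng, p_drop)
    return int(np.argmax(q))

# Toy setup: random weights stand in for a trained network (assumption).
rng = np.random.default_rng(0)
n_state, n_hidden, n_actions = 4, 16, 3
params = (rng.normal(size=(n_state, n_hidden)), np.zeros(n_hidden),
          rng.normal(size=(n_hidden, n_actions)), np.zeros(n_actions))
state = rng.normal(size=n_state)
action = thompson_action(state, params, rng)
```

The randomness of the policy comes entirely from the dropout mask, which is why mu(a) is not available in closed form.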

So, how can I calculate mu(a) when using Thompson Sampling based on dropout?


Posted 2018-06-15T09:11:56.317

Reputation: 33



So, how can I calculate mu(a) when using Thompson Sampling based on dropout?

The only way I can see to calculate this is to iterate over all possible dropout masks (2^n of them for n units subject to dropout) or, as an approximation, to sample say 100 or 1000 actions with different dropout masks and use their frequencies as a rough estimate of the distribution.
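A rough sketch of the sampling approximation, in NumPy, could look like this. The tiny two-layer network, its random weights, `p_drop` and the function name `estimate_behaviour_probs` are all assumptions for illustration; a real agent would run its own Q-network with dropout enabled instead.

```python
import numpy as np

def estimate_behaviour_probs(state, params, rng, n_samples=1000, p_drop=0.5):
    """Monte Carlo estimate of mu(a): fraction of dropout masks for which a is greedy."""
    W1, b1, W2, b2 = params
    hidden = np.maximum(0.0, state @ W1 + b1)                    # shape (n_hidden,)
    # Draw one independent dropout mask per sample.
    masks = rng.random((n_samples, hidden.shape[0])) >= p_drop   # (n_samples, n_hidden)
    dropped = hidden * masks / (1.0 - p_drop)                    # inverted-dropout scaling
    q = dropped @ W2 + b2                                        # (n_samples, n_actions)
    greedy = np.argmax(q, axis=1)                                # greedy action per mask
    counts = np.bincount(greedy, minlength=W2.shape[1])
    return counts / n_samples

# Toy setup: random weights stand in for a trained network (assumption).
rng = np.random.default_rng(0)
n_state, n_hidden, n_actions = 4, 16, 3
params = (rng.normal(size=(n_state, n_hidden)), np.zeros(n_hidden),
          rng.normal(size=(n_hidden, n_actions)), np.zeros(n_actions))
state = rng.normal(size=n_state)
mu_hat = estimate_behaviour_probs(state, params, rng, n_samples=1000)
```

Note that this costs `n_samples` forward passes per state, and the estimate can come out as exactly zero for an action the agent actually took, so some smoothing would be needed before dividing by it in an importance ratio.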

I don't think this is feasible in practice: the extra forward passes would slow learning so much that you may as well abandon Thompson Sampling and use epsilon-greedy. If you want to use action-selection techniques that offer no easy way to calculate a distribution, you will have to avoid importance sampling.

Many forms of Q-learning do not use importance sampling. They typically just reset eligibility traces whenever the selected action differs from the maximising action.

Neil Slater

Posted 2018-06-15T09:11:56.317

Reputation: 14 632