
Here's another interesting multiple-choice question that puzzles me a bit.

In tabular MDPs, if using a decision policy that visits all states an infinite number of times, and in each state randomly selects an action, then:

- Q-learning will converge to the optimal Q-values
- SARSA will converge to the optimal Q-values
- Q-learning is learning off-policy
- SARSA is learning off-policy

*My thoughts, and question:* Since the actions are sampled randomly from the action space, the learning definitely seems to be **off-policy** (correct me if I'm wrong, please!), which would rule options 3 and 4 out as incorrect. Coming to the first two options, I'm not sure whether Q-learning and/or SARSA would converge in this case. All I can tell from the question is that the agent **explores more than it exploits**: it visits all states infinitely often and takes random actions rather than the greedy one. How does this information help me deduce whether either algorithm converges to the optimal Q-values?
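For intuition, here is a minimal sketch of both updates under a uniform-random behaviour policy (the two-state toy MDP, all names, and the step sizes below are my own invention, not from the slides). Q-learning bootstraps with the *max* over next actions, so its fixed point is $Q^*$ regardless of the behaviour policy; SARSA bootstraps with the action *actually taken*, so under a random behaviour policy its fixed point is $Q^\pi$ of that random policy:

```python
import random

random.seed(0)
GAMMA = 0.9
# Tiny deterministic toy MDP (my own example): STEP[(s, a)] = (next_state, reward)
STEP = {(0, 0): (0, 0.0), (0, 1): (1, 0.0),
        (1, 0): (0, 1.0), (1, 1): (1, 2.0)}

def train(target, steps=300_000, alpha=0.02):
    """Run TD updates while BEHAVING uniformly at random in every state."""
    Q = {sa: 0.0 for sa in STEP}
    s, a = 0, random.choice([0, 1])
    for _ in range(steps):
        s2, r = STEP[(s, a)]
        a2 = random.choice([0, 1])  # behaviour policy: uniform random
        Q[(s, a)] += alpha * (target(Q, r, s2, a2) - Q[(s, a)])
        s, a = s2, a2
    return Q

# Q-learning target: greedy (max) bootstrap -> estimates Q*
ql = train(lambda Q, r, s2, a2: r + GAMMA * max(Q[(s2, 0)], Q[(s2, 1)]))
# SARSA target: bootstrap with the sampled next action -> estimates Q^pi
sa = train(lambda Q, r, s2, a2: r + GAMMA * Q[(s2, a2)])

print(ql[(1, 1)])  # ~20  = 2/(1-0.9): value of taking reward 2 forever
print(sa[(1, 1)])  # ~9.4: 2 + 0.9 * V of the random policy, well below Q*
```

So in this sketch both algorithms see exactly the same random experience, but only Q-learning's estimates approach the optimal Q-values; SARSA's estimates settle on the (much lower) values of the random policy itself.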

Thanks a lot!

Source: Slide 2/55

So a completely random policy counts as a policy too? Isn't that as good as not having a policy and sampling from the action space itself? – cogito_ai – 2020-08-09T17:50:28.830

That’s exactly what the random policy does, it picks from all viable actions in the given state with equal probability. It’s good for guaranteeing exploration in off-policy algorithms. – David Ireland – 2020-08-09T18:05:54.183

Great, thanks! Also, when would SARSA converge to the optimal Q values? From what I remember, it should happen when (i) all state-action pairs are visited infinitely often (ii) the policy converges to the greedy policy. Could you confirm/elaborate more on this? – cogito_ai – 2020-08-10T02:43:04.563

Yes, that is correct. In practice you would learn about an $\epsilon$-greedy policy, which is not technically optimal but very close to it. – David Ireland – 2020-08-10T09:07:55.843
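As a concrete illustration of the two conditions above (this is a hedged sketch; the $\epsilon_k = 1/k$ schedule is one common GLIE choice, not necessarily the one from the slides): with a decaying $\epsilon$, every action keeps a nonzero probability at any finite step, so all state-action pairs keep being visited, while the policy becomes greedy in the limit.

```python
import random

def epsilon_greedy(Q, s, actions, k):
    """Epsilon-greedy selection with a decaying schedule (GLIE-style).

    eps = 1/k is an assumed schedule: exploration never fully stops at
    any finite k, but eps -> 0, so the policy -> greedy in the limit.
    """
    eps = 1.0 / k
    if random.random() < eps:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(s, a)])      # exploit

random.seed(0)
Q = {(0, 0): 0.0, (0, 1): 1.0}
early = epsilon_greedy(Q, 0, [0, 1], k=1)      # k=1: eps=1, always explores
late = epsilon_greedy(Q, 0, [0, 1], k=10**9)   # large k: almost surely greedy
print(early, late)
```

Early on this picks actions uniformly; late in training it almost always returns the greedy action (here action 1, the one with the higher Q-value).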