## When do SARSA and Q-Learning converge to optimal Q values?


Here's another interesting multiple-choice question that puzzles me a bit.

In tabular MDPs, if using a decision policy that visits all states an infinite number of times, and in each state, randomly selects an action, then:

1. Q-learning will converge to the optimal Q-values
2. SARSA will converge to the optimal Q-values
3. Q-learning is learning off-policy
4. SARSA is learning off-policy

My thoughts, and question: since the actions are being sampled randomly from the action space, the learning definitely seems to be off-policy (correct me if I'm wrong, please!). So that would make 3. and 4. incorrect. Coming to the first two options, I'm not quite sure whether Q-learning and/or SARSA would converge in this case. All that I'm able to gather from the question is that the agent explores more than it exploits, since it visits all states (an infinite number of times) and also takes random actions (not the best action!). How can this piece of information help me deduce whether either process converges to the optimal Q-values?

Thanks a lot!

Source: Slide 2/55


The correct answers are 1 and 3. Option 1 holds because the conditions required for tabular Q-learning to converge are that each state-action pair is visited infinitely often (together with appropriately decaying step sizes), and Q-learning learns directly about the greedy policy, $$\pi(s) := \arg \max_a Q(s,a).$$ Since Q-learning converges to the optimal Q-value function, the greedy policy it learns about is optimal (the optimal policy is the greedy policy with respect to the optimal Q-function).
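To make this concrete, here is a minimal sketch (not from the question's slides; the chain MDP, rewards, and $\gamma = 0.9$ are all invented for illustration) of tabular Q-learning driven by a uniformly random behaviour policy. Despite never acting greedily, the estimates approach the optimal values, e.g. $Q^*(0, a_0) = 0 + 0.9 \cdot 1 = 0.9$:

```python
import random

# Hypothetical deterministic chain MDP: states 0 and 1, terminal state 2.
# P[s][a] = (next_state, reward); all numbers are invented for illustration.
P = {
    0: [(1, 0.0), (2, 0.5)],  # a0: step right for 0 reward; a1: exit for 0.5
    1: [(2, 1.0), (2, 0.0)],  # a0: exit for reward 1; a1: exit for 0
}
GAMMA = 0.9

def q_learning(episodes=20000, alpha=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in P for a in (0, 1)}
    for _ in range(episodes):
        s = 0
        while s != 2:
            a = rng.choice((0, 1))  # behaviour policy: uniformly random
            s2, r = P[s][a]
            # off-policy target: bootstrap with the greedy (max) next action
            best_next = 0.0 if s2 == 2 else max(Q[(s2, 0)], Q[(s2, 1)])
            Q[(s, a)] += alpha * (r + GAMMA * best_next - Q[(s, a)])
            s = s2
    return Q
```

Because the backup uses $\max_a Q(s', a)$ rather than the action the random policy actually takes next, the estimates converge towards $Q^*$ even though the behaviour policy is never greedy.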

Option 3 is true because Q-learning is by definition an off-policy algorithm: we learn about the greedy policy whilst following some arbitrary behaviour policy, here the uniformly random one.

Option 2 is false because SARSA is on-policy, so it will learn the Q-function of the random policy itself, not the optimal one. Option 4 is false because SARSA is strictly on-policy, for reasons analogous to why Q-learning is off-policy.
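To see the contrast concretely, here is a sketch of SARSA under a uniformly random behaviour policy on a small hypothetical MDP (states, rewards, and $\gamma = 0.9$ invented for illustration). Its fixed point is $Q_\pi$ for the random policy, which can even rank actions differently from $Q^*$:

```python
import random

# Hypothetical deterministic chain MDP: states 0 and 1, terminal state 2.
# P[s][a] = (next_state, reward); all numbers invented for illustration.
P = {
    0: [(1, 0.0), (2, 0.5)],
    1: [(2, 1.0), (2, 0.0)],
}
GAMMA = 0.9

def sarsa_random(episodes=100000, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in P for a in (0, 1)}
    n = {k: 0 for k in Q}  # visit counts, for 1/n step sizes
    for _ in range(episodes):
        s, a = 0, rng.choice((0, 1))  # behaviour = target = random policy
        while s != 2:
            s2, r = P[s][a]
            if s2 == 2:
                a2, target = None, r
            else:
                a2 = rng.choice((0, 1))  # next action from the SAME policy
                target = r + GAMMA * Q[(s2, a2)]  # on-policy backup
            n[(s, a)] += 1
            Q[(s, a)] += (target - Q[(s, a)]) / n[(s, a)]
            s, a = s2, a2
    return Q
```

Under the random policy the expected return from state 1 is $0.5$, so SARSA's estimate of $Q(0, a_0)$ settles near $0.9 \times 0.5 = 0.45$, below $Q(0, a_1) = 0.5$: acting greedily on SARSA's limit would pick the wrong action here, while Q-learning's limit ($0.9 > 0.5$) ranks the actions correctly.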

So a completely random policy counts as a policy too? Isn't that as good as not having a policy and sampling from the action space itself? – cogito_ai – 2020-08-09T17:50:28.830

That’s exactly what the random policy does, it picks from all viable actions in the given state with equal probability. It’s good for guaranteeing exploration in off-policy algorithms. – David Ireland – 2020-08-09T18:05:54.183

Great, thanks! Also, when would SARSA converge to the optimal Q values? From what I remember, it should happen when (i) all state-action pairs are visited infinitely often (ii) the policy converges to the greedy policy. Could you confirm/elaborate more on this? – cogito_ai – 2020-08-10T02:43:04.563

Yes, that is correct. In practice you would learn about an $\epsilon$-greedy policy, which is not technically optimal but very close to it. – David Ireland – 2020-08-10T09:07:55.843
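As a sketch of this last point: running SARSA with a fixed $\epsilon$-greedy policy ($\epsilon = 0.1$; the small chain MDP, its rewards, and $\gamma = 0.9$ are again invented for illustration) converges to the Q-function of the $\epsilon$-greedy policy, which is close to, but slightly below, the optimal values:

```python
import random

# Hypothetical deterministic chain MDP: states 0 and 1, terminal state 2.
P = {
    0: [(1, 0.0), (2, 0.5)],
    1: [(2, 1.0), (2, 0.0)],
}
GAMMA = 0.9
EPS = 0.1  # fixed exploration rate (not decayed, so not a GLIE schedule)

def eps_greedy(Q, s, rng):
    if rng.random() < EPS:
        return rng.choice((0, 1))
    return max((0, 1), key=lambda a: Q[(s, a)])  # ties broken towards a0

def sarsa_eps(episodes=100000, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in P for a in (0, 1)}
    n = {k: 0 for k in Q}  # visit counts, for 1/n step sizes
    for _ in range(episodes):
        s = 0
        a = eps_greedy(Q, s, rng)
        while s != 2:
            s2, r = P[s][a]
            if s2 == 2:
                a2, target = None, r
            else:
                a2 = eps_greedy(Q, s2, rng)  # on-policy: same eps-greedy policy
                target = r + GAMMA * Q[(s2, a2)]
            n[(s, a)] += 1
            Q[(s, a)] += (target - Q[(s, a)]) / n[(s, a)]
            s, a = s2, a2
    return Q
```

With $\epsilon = 0.1$ the greedy action at state 1 is taken with probability $0.95$, so the estimate of $Q(0, a_0)$ settles near $0.9 \times 0.95 = 0.855$ rather than the optimal $0.9$; decaying $\epsilon$ towards zero (a GLIE schedule, satisfying the two conditions in the comment above) would recover the optimal values in the limit.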