Here's another interesting multiple-choice question that puzzles me a bit.
In tabular MDPs, if using a decision policy that visits all states an infinite number of times, and in each state, randomly selects an action, then:
1. Q-learning will converge to the optimal Q-values
2. SARSA will converge to the optimal Q-values
3. Q-learning is learning off-policy
4. SARSA is learning off-policy
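For reference, the standard tabular update rules, as I understand them (my restatement, in case I'm misremembering):

$$Q(s,a) \leftarrow Q(s,a) + \alpha \big[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \big] \qquad \text{(Q-learning)}$$

$$Q(s,a) \leftarrow Q(s,a) + \alpha \big[ r + \gamma \, Q(s',a') - Q(s,a) \big] \qquad \text{(SARSA)}$$

where, in SARSA, $a'$ is the action the behavior policy actually takes in $s'$.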
My thoughts, and my question: since the actions are being sampled uniformly at random from the action space, the learning definitely seems to be off-policy (correct me if I'm wrong, please!). So, to me, that rules out options 3 and 4. Coming to the first two options, I'm not quite sure whether Q-learning and/or SARSA would converge in this case. All I'm able to take from the question is that the agent explores more than it exploits, since it visits all states infinitely often and always takes random actions (never the greedy action!). How can this piece of information help me deduce whether either process converges to the optimal Q-values? To make the two update rules concrete, I've added a small sketch below.
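Here's the sketch: a tiny, hypothetical 2-state, 2-action MDP of my own (the dynamics `P`, rewards `R`, and all constants are made up for illustration), with both algorithms trained side by side under a uniformly random behavior policy, exactly as the question describes:

```python
# Toy comparison (my own sketch, not from the slides): tabular Q-learning vs.
# SARSA under a uniformly random behavior policy on a made-up 2-state MDP.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 2, 2
gamma, alpha = 0.9, 0.1  # discount factor and (fixed) step size

# Hypothetical deterministic dynamics: action 0 leads to state 0, action 1
# to state 1. P[s, a] is the next state, R[s, a] the reward.
P = np.array([[0, 1],
              [0, 1]])
R = np.array([[0.0, 1.0],
              [0.0, 2.0]])

Q_qlearn = np.zeros((n_states, n_actions))
Q_sarsa = np.zeros((n_states, n_actions))

s = 0
a = rng.integers(n_actions)  # behavior policy: action chosen uniformly at random
for _ in range(100_000):
    s_next, r = P[s, a], R[s, a]
    a_next = rng.integers(n_actions)  # next action, again uniformly random

    # Q-learning bootstraps on max_a' Q(s', a'): its target evaluates the
    # greedy policy no matter which action the behavior policy takes next.
    Q_qlearn[s, a] += alpha * (r + gamma * Q_qlearn[s_next].max() - Q_qlearn[s, a])

    # SARSA bootstraps on Q(s', a') for the action actually taken next: its
    # target evaluates the (random) behavior policy itself.
    Q_sarsa[s, a] += alpha * (r + gamma * Q_sarsa[s_next, a_next] - Q_sarsa[s, a])

    s, a = s_next, a_next

print("Q-learning:\n", Q_qlearn)  # hovers near Q* = [[17.1, 19.0], [17.1, 20.0]]
print("SARSA:\n", Q_sarsa)        # hovers near the Q-values of the random policy
```

When I run this, the Q-learning table ends up near the optimal Q-values, while the SARSA table ends up near the Q-values of the uniform random policy itself. Is that the intended takeaway, and does exact convergence additionally require a decaying step size $\alpha$?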
Thanks a lot!
Source: Slide 2/55