Why does Q-learning converge to the optimal policy even if I am acting suboptimally?



In Q-learning, it doesn’t seem to matter how I select actions during training. The algorithm always converges to the optimal policy. Why does this happen?

Shifat E Arman

Posted 2018-11-17T22:06:42.443

Reputation: 53


You can find a proof of the convergence of the Q-learning algorithm in the paper Convergence of Q-learning: A Simple Proof by Francisco S. Melo.

– nbro – 2018-11-18T01:38:47.577

Your statement "it doesn’t matter how I select actions" is not really true. Q-learning "requires that all state-action pairs be visited infinitely often", as mentioned in the paper I linked above and, e.g., in the book Reinforcement Learning: An Introduction by Sutton and Barto. – nbro – 2018-11-18T01:40:40.503

So, what is your real question? Are you looking for a proof? If yes, then you can find it in the paper above. Or are you looking for an intuition behind the convergence of Q-learning? – nbro – 2018-11-18T01:41:50.150

I am actually looking for an intuition behind this. – Shifat E Arman – 2018-11-20T14:45:51.123
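
For a rough intuition, here is a minimal sketch, not a proof. It assumes a made-up 4-state deterministic chain (states 0 to 3, reward 1 for reaching the terminal state 3) and a behaviour policy that picks actions uniformly at random. The names `step`, `q_learning_random_behaviour`, `ALPHA` and `GAMMA` are hypothetical and chosen only for this illustration. The point it tries to show is that the update target uses `max(q[next_state])`, the value of the best next action, rather than the value of whatever action the behaviour policy takes next. So the table is pulled towards the optimal values as long as every state-action pair keeps being visited.

```python
import random

N_STATES = 4        # states 0..3; state 3 is terminal (assumed toy environment)
ACTIONS = (0, 1)    # 0 = move left, 1 = move right
GAMMA = 0.9         # discount factor (assumed)
ALPHA = 0.5         # constant step size; fine here because the toy MDP is deterministic


def step(state, action):
    """Deterministic chain: left is clipped at state 0, reaching state 3 gives reward 1."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def q_learning_random_behaviour(episodes=2000, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Behaviour policy: uniformly random, i.e. deliberately suboptimal.
            action = rng.choice(ACTIONS)
            next_state, reward, done = step(state, action)
            # Target bootstraps from the greedy value of the next state,
            # independent of which action will actually be taken there.
            target = reward + (0.0 if done else GAMMA * max(q[next_state]))
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
    return q


q = q_learning_random_behaviour()
# Greedy policy read off the learned table for the non-terminal states.
greedy_policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
```

In this toy setting the greedy policy read off the table is "always move right", even though the agent never followed that policy while learning. The random behaviour policy only matters in that it visits every state-action pair again and again, which is the condition mentioned in the comments above.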

No answers