Does TD(0) prediction require Robbins-Monro conditions to converge to the value function?



Does the learning-rate parameter $\alpha$ need to satisfy the Robbins-Monro conditions below in order for the TD(0) algorithm to converge to the true value function of a policy?

$$\sum_t \alpha_t = \infty \quad \text{and} \quad \sum_t \alpha_t^{2} < \infty$$


Posted 2020-02-24T18:00:33.417

Reputation: 169




The paper Convergence of Q-learning: A Simple Proof (by Francisco S. Melo) shows (Theorem 1) that Q-learning, a TD(0) algorithm, converges with probability 1 to the optimal Q-function, provided the Robbins-Monro conditions hold for every state-action pair. In other words, the Robbins-Monro conditions are sufficient for Q-learning to converge to the optimal Q-function in the case of a finite MDP. The proof of Theorem 1 relies on another theorem from stochastic approximation (Theorem 2).

You are interested in the prediction problem, that is, the problem of estimating the expected return (i.e. a value function) of a fixed policy. Note, however, that Q-learning is a control algorithm, given that it can derive the optimal policy from the learned Q-function, so Melo's result does not directly cover pure TD(0) prediction.

See also the question Why doesn't Q-learning converge when using function approximation?.


Posted 2020-02-24T18:00:33.417

Reputation: 19,783

Thanks for that. I was wondering: is it true that TD(0) converges to the value function in expected value when $\alpha$ is constant? Is there a condition that $\alpha \in (0,1)$? – KaneM – 2020-02-24T23:23:39.107

@KaneM Right now, I am not aware of this result. Please ask another question on the site, because maybe someone else can answer it. – nbro – 2020-02-24T23:28:09.190


The result is in Sutton's Learning to Predict by the Methods of Temporal Differences, on page 24 of this version. I think I understand the result now. It says that there is some range of step sizes $\alpha \in (0, t)$ such that TD(0) converges to the value function in expected value.

– KaneM – 2020-02-25T00:05:21.327