## How to show temporal difference methods converge to MLE?

In Chapter 6 of Sutton and Barto (p. 128), the authors claim that temporal-difference learning converges to the maximum likelihood estimate (MLE). How can this be shown formally?

The convergence and optimality proofs for (linear) temporal-difference methods under batch training (that is, not online learning) can be found in the paper Learning to predict by the methods of temporal differences (1988) by Richard Sutton, specifically in section 4 (p. 23). The notation in that paper differs from the notation in the well-known book Reinforcement Learning: An Introduction (2nd ed.) by Sutton and Barto, so I suggest you get familiar with it before attempting to understand the theorem and the proof. For example, Sutton uses letters such as $$i$$ and $$j$$ to denote states (rather than $$s$$), $$z$$ to denote (scalar) outcomes and $$x$$ to denote (vector) observations (see section 3.2 for an example of this notation in use).
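
To make the claim concrete, here is a minimal numerical sketch (not a proof, and not taken from Sutton's paper). It runs tabular batch TD(0) on the eight episodes of Example 6.4 in Sutton and Barto and compares the result with the value function of the maximum-likelihood model fitted to the same data. The function names `batch_td0` and `certainty_equivalence`, the step size and the number of sweeps are my own assumptions, and the sketch assumes an undiscounted task in which every state appears in the data.

```python
import numpy as np

# Episodes from Example 6.4 of Sutton and Barto, stored as
# (state, reward, next_state) transitions; next_state=None means termination.
A, B = 0, 1
episodes = (
    [[(A, 0.0, B), (B, 0.0, None)]]          # A, 0, B, 0
    + [[(B, 1.0, None)] for _ in range(6)]   # B, 1 (six times)
    + [[(B, 0.0, None)]]                     # B, 0
)


def batch_td0(episodes, n_states, alpha=0.01, sweeps=5000):
    """Batch TD(0): accumulate all TD(0) increments over the fixed data set,
    apply them once per sweep, and repeat (hypothetical helper, gamma = 1)."""
    V = np.zeros(n_states)
    for _ in range(sweeps):
        delta = np.zeros(n_states)
        for episode in episodes:
            for s, r, s_next in episode:
                target = r + (V[s_next] if s_next is not None else 0.0)
                delta[s] += alpha * (target - V[s])
        V += delta
    return V


def certainty_equivalence(episodes, n_states):
    """Value function of the maximum-likelihood model: estimate transition
    probabilities and mean rewards by counting, then solve (I - P) V = r.
    Assumes every state is visited at least once (hypothetical helper)."""
    counts = np.zeros(n_states)
    P = np.zeros((n_states, n_states))
    r_sum = np.zeros(n_states)
    for episode in episodes:
        for s, r, s_next in episode:
            counts[s] += 1
            r_sum[s] += r
            if s_next is not None:
                P[s, s_next] += 1
    P /= counts[:, None]      # empirical transition probabilities
    r_bar = r_sum / counts    # empirical expected one-step reward
    return np.linalg.solve(np.eye(n_states) - P, r_bar)


V_td = batch_td0(episodes, n_states=2)
V_ce = certainty_equivalence(episodes, n_states=2)
print("batch TD(0):          ", V_td)
print("certainty equivalence:", V_ce)
```

On this data both estimates come out at approximately $$V(A) = V(B) = 0.75$$, whereas averaging the observed returns from $$A$$ (batch Monte Carlo) would give $$V(A) = 0$$, which is the contrast the book draws in that example.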

In the paper The Convergence of TD($$\lambda$$) for General $$\lambda$$ (1992), Peter Dayan recapitulates the convergence proof provided by Sutton, shows the convergence properties of TD($$\lambda$$), and extends Watkins' Q-learning convergence theorem to provide the first strong convergence guarantee for TD(0). A sketch of Watkins' theorem is presented in his PhD thesis Learning from Delayed Rewards (1989), and the theorem is proved in detail in Technical Note: Q-learning (1992) by Watkins and Dayan.

There is much more research on the convergence properties of TD methods such as Q-learning and SARSA. For example, the paper On the Convergence of Stochastic Iterative Dynamic Programming Algorithms (1994) presents Q-learning as a stochastic form of dynamic programming and proves its convergence by making direct use of stochastic approximation theory. See also Convergence of Q-learning: a simple proof by Francisco S. Melo. In the paper Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms, the authors prove convergence for on-policy temporal-difference methods (e.g. SARSA).
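
For orientation, the stochastic-approximation proofs of Q-learning mentioned above concern an update of the following form. This is written in the notation of the Sutton and Barto book rather than that of the cited papers, so treat the symbols as my assumption:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha_t(s_t, a_t) \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$

Roughly speaking, convergence with probability one is obtained for bounded rewards when every state-action pair is updated infinitely often and the step sizes satisfy the usual stochastic-approximation conditions:

$$\sum_{t} \alpha_t(s, a) = \infty, \qquad \sum_{t} \alpha_t^2(s, a) < \infty \qquad \text{for all } (s, a).$$

The exact assumptions differ from paper to paper, so check each theorem statement for the precise conditions.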