## How to show temporal difference methods converge to MLE?


In chapter 6 of Sutton and Barto (p. 128), they claim temporal difference converges to the maximum likelihood estimate (MLE). How can this be shown formally?

The convergence and optimality proofs of (linear) temporal-difference methods (under batch training, so not online learning) can be found in the paper Learning to predict by the methods of temporal differences (1988) by Richard Sutton, specifically section 4 (p. 23). In this paper, Sutton uses a different notation than the one used in the famous book Reinforcement Learning: An Introduction (2nd ed.), by Sutton and Barto, so I suggest you get familiar with that notation before attempting to understand the theorem and the proof. For example, Sutton uses letters such as $$i$$ and $$j$$ to denote states (rather than $$s$$), $$z$$ to denote (scalar) outcomes and $$x$$ to denote (vector) observations (see section 3.2 for examples of this notation in use).
In the paper The Convergence of TD($$\lambda$$) for General $$\lambda$$ (1992), Peter Dayan, apart from recapitulating the convergence proof provided by Sutton, also shows the convergence properties of TD($$\lambda$$) for general $$\lambda$$. Moreover, he extends Watkins' Q-learning convergence theorem, whose sketch is presented in Watkins' PhD thesis Learning from Delayed Rewards (1989) and which is proved in detail in Technical Note: Q-learning (1992), by Watkins and Dayan, to provide what was then the strongest convergence guarantee for TD(0).
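To see concretely what the claim in Sutton and Barto (p. 128) means, here is a small sketch (my own illustration, not code from the book) of batch TD(0) on the two-state example the book uses in that chapter (Example 6.4). The batch contains one episode A → B with rewards 0, 0, six episodes B → terminal with reward 1, and one with reward 0. The certainty-equivalence (MLE) estimate is V(B) = 6/8 = 0.75 and, since A always transitions to B with reward 0, V(A) = 0.75 as well; repeatedly sweeping TD(0) updates over the fixed batch drives the value estimates to exactly those numbers.

```python
# Batch TD(0) on a tiny two-state chain (cf. Sutton & Barto, Example 6.4).
# Episodes are lists of (state, reward, next_state); next_state None = terminal.
episodes = [[("A", 0.0, "B"), ("B", 0.0, None)]]  # one episode A -> B -> end
episodes += [[("B", 1.0, None)]] * 6              # six episodes B -> end, reward 1
episodes += [[("B", 0.0, None)]]                  # one episode B -> end, reward 0

V = {"A": 0.0, "B": 0.0}
alpha = 0.01

# Batch updating: accumulate TD(0) increments over the whole batch,
# apply them once per sweep, and repeat until the values stop moving.
for _ in range(5000):
    delta = {s: 0.0 for s in V}
    for ep in episodes:
        for s, r, s_next in ep:
            target = r + (V[s_next] if s_next is not None else 0.0)
            delta[s] += alpha * (target - V[s])
    for s in V:
        V[s] += delta[s]

# Both values converge to 0.75, the certainty-equivalence (MLE) estimate.
print(round(V["A"], 2), round(V["B"], 2))
```

Note that batch Monte Carlo on the same data would instead give V(A) = 0 (the only observed return from A), which is the contrast the book draws between the two batch methods.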