## Understanding the TD(0) update equation in the paper "Learning to Predict by the Methods of Temporal Differences"

In the paper *Learning to Predict by the Methods of Temporal Differences* (p. 15), the weights in temporal-difference learning are updated according to $$\Delta w_t = \alpha \left(P_{t+1} - P_t\right) \sum_{k=1}^{t}{\lambda^{t-k} \nabla_w P_k} \tag{4} \,.$$ When $\lambda = 0$, as in TD(0), how does the method learn? It appears that, with $\lambda = 0$, there would never be any change in the weights and hence no learning.

Am I missing anything?

> When $\lambda = 0$, as in TD(0), how does the method learn? As it appears, with $\lambda = 0$, there will never be a change in weight and hence no learning.

I think the detail that you're missing is that one of the terms in the sum (the final "iteration" of the sum, the case where $k = t$) has $\lambda$ raised to the power $0$, and anything raised to the power $0$ (even $0$ itself) is equal to $1$. So, for $\lambda = 0$, your update equation becomes

$$\Delta w_t = \alpha \left( P_{t+1} - P_t \right) \nabla_w P_t,$$

which is a one-step update that uses only the most recent gradient $\nabla_w P_t$ (much like one-step methods such as Sarsa).
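To make this concrete, here is a minimal sketch of Eq. 4, assuming linear predictions $P_k = w \cdot x_k$ so that $\nabla_w P_k = x_k$. The function name and arguments are illustrative, not from the paper; it just demonstrates that with $\lambda = 0$ only the $k = t$ term of the sum survives.

```python
import numpy as np

def delta_w(alpha, lam, grads, P_next, P_t):
    """Eq. 4: alpha * (P_{t+1} - P_t) * sum_{k=1}^{t} lam^(t-k) * grad_w P_k.

    grads: list of gradient vectors [grad_w P_1, ..., grad_w P_t].
    """
    t = len(grads)
    # Note: lam ** (t - k) with k == t is lam ** 0 == 1, even for lam == 0.
    trace = sum(lam ** (t - k) * g for k, g in enumerate(grads, start=1))
    return alpha * (P_next - P_t) * trace

rng = np.random.default_rng(0)
grads = [rng.standard_normal(3) for _ in range(4)]  # grad_w P_1 .. grad_w P_4

# With lam = 0, only the k = t term survives, so the update reduces to the
# one-step rule alpha * (P_{t+1} - P_t) * grad_w P_t.
update_lam0 = delta_w(alpha=0.1, lam=0.0, grads=grads, P_next=1.0, P_t=0.4)
one_step = 0.1 * (1.0 - 0.4) * grads[-1]
assert np.allclose(update_lam0, one_step)
```

For $0 < \lambda \le 1$ the same function produces the full exponentially decaying sum over past gradients, which is exactly the eligibility-trace view of TD($\lambda$).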

On page 16 of the same paper, *Learning to Predict by the Methods of Temporal Differences* (1988), Sutton explicitly states that $\Delta w_t = \alpha \left( P_{t+1} - P_t \right) \nabla_w P_t$ is the learning rule when $\lambda = 0$.

– nbro – 2019-06-01T16:51:36.850

He starts with the supervised setting and then derives the Widrow-Hoff (or delta) rule. The TD rule is then a special case of the delta rule, where the errors $z - P_t$ are replaced with a summation of the successive temporal-difference predictions. However, how is that specific 1-step TD learning rule exactly related to the usual learning rules of (tabular) temporal difference methods, where apparently no gradient is needed? – nbro – 2019-06-01T16:58:09.543

@nbro You can view tabular methods as methods using linear function "approximation", where there is a single binary feature for every possible state-action pair. Then there would be a gradient needed, but the gradient would simply be $1$ for the "binary feature" corresponding to the state-action pair, and $0$ everywhere else. – Dennis Soemers – 2019-06-01T17:33:29.070
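The one-hot observation above can be sketched in a few lines, here for state values rather than state-action pairs. The state count, learning rate, and discount factor are arbitrary illustrative choices; the point is that the gradient of $w \cdot x$ is the one-hot vector $x$ itself, so the linear TD(0) update only touches the table entry for the current state.

```python
import numpy as np

n_states = 5

def one_hot(s):
    x = np.zeros(n_states)
    x[s] = 1.0  # gradient of w . x with respect to w is this one-hot vector
    return x

w = np.zeros(n_states)   # with one-hot features, w[s] IS the table entry V(s)
alpha, gamma = 0.5, 0.9  # illustrative values

def td0_update(s, r, s_next):
    x = one_hot(s)
    td_error = r + gamma * (w @ one_hot(s_next)) - (w @ x)
    # The gradient is 1 at index s and 0 elsewhere, so only w[s] changes:
    # exactly the tabular TD(0) rule V(s) += alpha * td_error.
    w[...] = w + alpha * td_error * x

td0_update(s=2, r=1.0, s_next=3)
# only w[2] changed: 0.5 * (1.0 + 0.9*0 - 0) = 0.5
```

So the "gradient-free" tabular rule and the gradient-based linear rule are the same update, just written with different bookkeeping.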