Understanding the equation of TD(0) in the paper "Learning to predict by the methods of temporal differences"

In the paper Learning to Predict by the Methods of Temporal Differences (p. 15), the weights in temporal-difference learning are updated according to the equation $$ \Delta w_t = \alpha \left(P_{t+1} - P_t\right) \sum_{k=1}^{t}{\lambda^{t-k} \nabla_w P_k} \tag{4} \,.$$ When $\lambda = 0$, as in TD(0), how does the method learn? It appears that, with $\lambda = 0$, there will never be a change in the weights and hence no learning.
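For concreteness, here is a minimal sketch of equation (4) for a hypothetical linear predictor $P_t = w^\top x_t$ (so that $\nabla_w P_k = x_k$); the features, horizon, and step size below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, lam = 0.1, 0.0                      # lam = 0.0 gives the TD(0) case
n_features, T = 4, 10
x = rng.normal(size=(T + 1, n_features))   # hypothetical features x_1 .. x_{T+1}
w = np.zeros(n_features)

for t in range(1, T + 1):
    P_t, P_next = w @ x[t - 1], w @ x[t]   # P_t and P_{t+1}
    # For a linear predictor, grad_w P_k = x_k, so equation (4) reads:
    trace = sum(lam ** (t - k) * x[k - 1] for k in range(1, t + 1))
    w += alpha * (P_next - P_t) * trace
```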

Am I missing anything?

Amanda

Posted 2019-06-01T14:41:50.517


Answers

> When $\lambda = 0$, as in TD(0), how does the method learn? It appears that, with $\lambda = 0$, there will never be a change in the weights and hence no learning.

I think the detail you're missing is that the final term of the sum, the case $k = t$, has $\lambda$ raised to the power $0$, and anything raised to the power $0$ (even $0$ itself, by the usual convention $0^0 = 1$) is equal to $1$. So, for $\lambda = 0$, every earlier term vanishes and your update equation becomes

$$\Delta w_t = \alpha \left( P_{t+1} - P_t \right) \nabla_w P_t,$$

which is a one-step update that uses only the most recent gradient (much like other one-step methods such as Sarsa).
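As a minimal sketch, assuming a hypothetical linear predictor $P_t = w^\top x_t$ (so $\nabla_w P_t = x_t$):

```python
import numpy as np

def td0_update(w, x_t, x_next, alpha):
    """One-step TD(0) update for a linear predictor P = w . x.

    With lambda = 0, only the k = t term of the sum in equation (4)
    survives (since 0^0 = 1), so the trace collapses to grad_w P_t = x_t.
    """
    td_error = w @ x_next - w @ x_t   # P_{t+1} - P_t
    return w + alpha * td_error * x_t

w = td0_update(np.zeros(3), np.array([1.0, 0, 0]), np.array([0, 1.0, 0]), alpha=0.1)
```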

Dennis Soemers

Posted 2019-06-01T14:41:50.517


On page 16 of the same paper, Learning to Predict by the Methods of Temporal Differences (1988), Sutton explicitly states that $\Delta w_t = \alpha \left( P_{t+1} - P_t \right) \nabla_w P_t$ is the learning rule when $\lambda = 0$.

– nbro – 2019-06-01T16:51:36.850

He starts with the supervised setting and then derives the Widrow-Hoff (or delta) rule. The TD rule is then a special case of the delta rule, where the error $z - P_t$ is replaced with a sum of successive temporal-difference terms. However, how is that specific one-step TD learning rule related to the usual learning rules of (tabular) temporal-difference methods, where apparently no gradient is needed? – nbro – 2019-06-01T16:58:09.543
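For reference, the substitution nbro describes is the telescoping identity Sutton uses: for a sequence that ends in outcome $z$ after $m$ steps, with the convention $P_{m+1} \equiv z$, $$ z - P_t = \sum_{k=t}^{m} \left(P_{k+1} - P_k\right) \,. $$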

@nbro You can view tabular methods as methods using linear function "approximation", where there is a single binary feature for every possible state-action pair. A gradient is then still needed, but it is simply $1$ for the binary feature corresponding to the current state-action pair and $0$ everywhere else. – Dennis Soemers – 2019-06-01T17:33:29.070
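A minimal sketch of that correspondence, using a hypothetical five-state prediction problem (states rather than state-action pairs, for brevity):

```python
import numpy as np

n_states, alpha = 5, 0.1
w = np.zeros(n_states)            # with one-hot features, w is just the value table

def one_hot(s, n=n_states):
    x = np.zeros(n)
    x[s] = 1.0                    # the single binary feature for state s
    return x

def linear_td0(w, s, s_next, alpha=alpha):
    x_t, x_next = one_hot(s), one_hot(s_next)
    td_error = w @ x_next - w @ x_t        # P_{t+1} - P_t
    # grad_w (w . x_t) = x_t, which is 1 at index s and 0 everywhere else,
    # so only w[s] changes -- exactly the tabular TD(0) update.
    return w + alpha * td_error * x_t
```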