## Equivalence between expected parameter increments in "Off-Policy Temporal-Difference Learning with Function Approximation"


I am having a hard time understanding the proof of Theorem 1 in the "Off-Policy Temporal-Difference Learning with Function Approximation" paper.

Let $$\Delta \theta$$ and $$\Delta \bar{\theta}$$ be the sum of the parameter increments over an episode under on-policy $$T D(\lambda)$$ and importance sampled $$T D(\lambda)$$ respectively, assuming that the starting weight vector is $$\theta$$ in both cases. Then

$$E_{b}\left\{\Delta \bar{\theta} | s_{0}, a_{0}\right\}=E_{\pi}\left\{\Delta \theta | s_{0}, a_{0}\right\}, \quad \forall s_{0} \in \mathcal{S}, a_{0} \in \mathcal{A}$$

We know that: \begin{aligned} &\Delta \theta_{t}=\alpha\left(R_{t}^{\lambda}-\theta^{T} \phi_{t}\right) \phi_{t}\\ &R_{t}^{\lambda}=(1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_{t}^{(n)}\\ &R_{t}^{(n)}=r_{t+1}+\gamma r_{t+2}+\cdots+\gamma^{n-1} r_{t+n}+\gamma^{n} \theta^{T} \phi_{t+n} \end{aligned}

and $$\Delta \bar{\theta}_{t}=\alpha\left(\bar{R}_{t}^{\lambda}-\theta^{T} \phi_{t}\right) \phi_{t} \rho_{1} \rho_{2} \cdots \rho_{t}$$ where $$\bar{R}_{t}^{\lambda}=(1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} \bar{R}_{t}^{(n)}$$ and \begin{aligned} \bar{R}_{t}^{(n)}=& r_{t+1}+\gamma r_{t+2} \rho_{t+1}+\cdots \\ &+\gamma^{n-1} r_{t+n} \rho_{t+1} \cdots \rho_{t+n-1} \\ &+\gamma^{n} \rho_{t+1} \cdots \rho_{t+n} \theta^{T} \phi_{t+n} \end{aligned}

And it is proven that: $$E_{b}\left\{\bar{R}_{t}^{\lambda} | s_{t}, a_{t}\right\}=E_{\pi}\left\{R_{t}^{\lambda} | s_{t}, a_{t}\right\}$$
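To convince myself of this identity, I checked the one-step mechanism behind it numerically: rewards corrected by $$\rho = \pi/b$$ and sampled under $$b$$ reproduce the expectation under $$\pi$$. This is a minimal sketch with made-up policies and rewards, not the paper's setup:

```python
import random

# Single-state, two-action illustration of E_b{rho * r} = E_pi{r},
# where rho = pi(a)/b(a). All numbers are made-up illustration values.
pi = {"left": 0.8, "right": 0.2}   # target policy
b  = {"left": 0.3, "right": 0.7}   # behavior policy
reward = {"left": 1.0, "right": -1.0}

random.seed(0)
n = 200_000
actions = ["left", "right"]

# Sample actions from b, correct each reward by rho = pi(a)/b(a).
is_estimate = 0.0
for _ in range(n):
    a = random.choices(actions, weights=[b[x] for x in actions])[0]
    rho = pi[a] / b[a]
    is_estimate += rho * reward[a]
is_estimate /= n

# Exact on-policy expectation under pi.
on_policy = sum(pi[a] * reward[a] for a in actions)
print(is_estimate, on_policy)  # the two values agree up to sampling noise
```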

Here is the proof, it begins with:

$$E_{b}\{\Delta \bar{\theta}\}=E_{b}\left\{\sum_{t=0}^{\infty} \alpha\left(\bar{R}_{t}^{\lambda}-\theta^{T} \phi_{t}\right) \phi_{t} \rho_{1} \rho_{2} \cdots \rho_{t}\right\}$$ $$=E_{b}\left\{\sum_{t=0}^{\infty} \sum_{n=1}^{\infty} \alpha(1-\lambda) \lambda^{n-1}\left(\bar{R}_{t}^{(n)}-\theta^{T} \phi_{t}\right) \phi_{t} \rho_{1} \rho_{2} \cdots \rho_{t}\right\}$$.

which I believe is incorrect since,

$$E_{b}\{\Delta \bar{\theta}\}=E_{b}\left\{\sum_{t=0}^{\infty} \alpha\left(\bar{R}_{t}^{\lambda}-\theta^{T} \phi_{t}\right) \phi_{t} \rho_{1} \rho_{2} \cdots \rho_{t}\right\}$$ $$=E_{b}\left\{\sum_{t=0}^{\infty} \alpha \left(\sum_{n=1}^{\infty}(1-\lambda) \lambda^{n-1}\bar{R}_{t}^{(n)}-\theta^{T} \phi_{t}\right) \phi_{t} \rho_{1} \rho_{2} \cdots \rho_{t}\right\}$$.

and pulling the second sum outside the parentheses would mean summing the constant term $$\theta^{T} \phi_{t}$$ infinitely many times.

Furthermore, it is claimed that in order to prove the equivalence above, it is enough to prove the equivalence below: $$\begin{array}{c} E_{b}\left\{\sum_{t=0}^{\infty}\left(\bar{R}_{t}^{(n)}-\theta^{T} \phi_{t}\right) \phi_{t} \rho_{1} \rho_{2} \cdots \rho_{t}\right\} \\ =E_{\pi}\left\{\sum_{t=0}^{\infty}\left(R_{t}^{(n)}-\theta^{T} \phi_{t}\right) \phi_{t}\right\} \end{array}$$

I don't understand why this is sufficient. And even if it is, there are further ambiguities in the proof:

$$E_{b}\left\{\sum_{t=0}^{\infty}\left(\bar{R}_{t}^{(n)}-\theta^{T} \phi_{t}\right) \phi_{t} \rho_{1} \rho_{2} \cdots \rho_{t}\right\}$$ $$=\sum_{t=0}^{\infty} \sum_{\omega \in \Omega_{t}} p_{b}(\omega) \phi_{t} \prod_{k=1}^{t} \rho_{k} E_{b}\left\{\bar{R}_{t}^{(n)}-\theta^{T} \phi_{t} | s_{t}, a_{t}\right\}$$ (this step invokes the Markov property, and I don't understand why the Markov property leads to this conditional independence!) $$=\sum_{t=0}^{\infty} \sum_{\omega \in \Omega_{t}} \prod_{j=1}^{t} p_{s_{j-1}, s_{j}}^{a_{j-1}} b\left(s_{j}, a_{j}\right) \phi_{t} \prod_{k=1}^{t} \frac{\pi\left(s_{k}, a_{k}\right)}{b\left(s_{k}, a_{k}\right)} \cdot \left(E_{b}\left\{\bar{R}_{t}^{(n)} | s_{t}, a_{t}\right\}-\theta^{T} \phi_{t}\right)$$

$$= \sum_{t=0}^{\infty} \sum_{\omega \in \Omega_{t}} \prod_{j=1}^{t} p_{s_{j-1}, s_{j}}^{a_{j-1}} \pi\left(s_{j}, a_{j}\right) \phi_{t} \cdot\left(E_{b}\left\{\bar{R}_{t}^{(n)} | s_{t}, a_{t}\right\}-\theta^{T} \phi_{t}\right)$$

$$=\sum_{t=0}^{\infty} \sum_{\omega \in \Omega_{t}} p_{\pi}(\omega) \phi_{t}\left(E_{\pi}\left\{R_{t}^{(n)} | s_{t}, a_{t}\right\}-\theta^{T} \phi_{t}\right)$$ (using our previous result) $$=E_{\pi}\left\{\sum_{t=0}^{\infty}\left(R_{t}^{(n)}-\theta^{T} \phi_{t}\right) \phi_{t}\right\} . \diamond$$

I'd be grateful if anyone could shed some light on this.

The first part is correct: \begin{align} &\sum_{n=1}^{\infty} \alpha(1-\lambda)\lambda^{n-1} (\bar R_t^{(n)} - \theta^T \phi_t)\\ =& \alpha\left[\sum_{n=1}^{\infty} (1-\lambda)\lambda^{n-1} \bar R_t^{(n)} - \sum_{n=1}^{\infty} (1-\lambda)\lambda^{n-1} \theta^T \phi_t\right] \end{align} Since $$\sum_{n=1}^{\infty} (1-\lambda)\lambda^{n-1}$$ is a geometric series that sums to $$1$$, this equals $$\begin{equation} \alpha\left[\sum_{n=1}^{\infty} (1-\lambda)\lambda^{n-1} \bar R_t^{(n)} - \theta^T \phi_t\right] \end{equation}$$ so the two expressions agree after all. For the second part, it is enough to prove the equivalence for each fixed $$n$$ because the result is a sum over $$n$$: if two sums $$\sum_n x_n$$ and $$\sum_n y_n$$ satisfy $$x_n = y_n$$ for every $$n$$, then the sums are equal.
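The geometric-weight claim is easy to verify numerically (the value of $$\lambda$$ below is chosen arbitrarily):

```python
# Numeric check that the lambda-return weights (1 - lam) * lam**(n - 1)
# sum to 1, so pulling theta^T phi_t out of the sum leaves it unchanged.
lam = 0.9  # any 0 <= lam < 1 works
weights = [(1 - lam) * lam ** (n - 1) for n in range(1, 500)]
total = sum(weights)
print(total)  # approaches 1 as the number of terms grows
```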
For the third part: once we are in state $$s_t$$ and have taken action $$a_t$$, we have \begin{align} &E_b \left\{ \sum_{t=0}^{\infty} (\bar R_t^{(n)} - \theta^T\phi_t)\phi_t \rho_1\rho_2\cdots\rho_t\right\}\\ =& \sum_{t=0}^{\infty} E_b \left\{(\bar R_t^{(n)} - \theta^T\phi_t)\phi_t \rho_1\rho_2\cdots\rho_t\right\}\\ =& \sum_{t=0}^{\infty} E_b \left\{\phi_t \rho_1\rho_2\cdots \rho_t \, E_b \{\bar R_t^{(n)} - \theta^T\phi_t \mid s_t, a_t\}\right\} \end{align} by the tower property (law of total expectation). The factors $$\rho_i,\ i = 1, \ldots, t$$, depend only on $$s_i, a_i$$. By the Markov property, the conditional expectation of $$\bar R_t^{(n)}$$ given the whole trajectory up to time $$t$$ depends only on $$s_t, a_t$$, so conditioning on $$s_t, a_t$$ alone suffices. We also don't need to include $$\phi_t$$ and $$\rho_t$$ inside the inner expectation: given $$s_t$$ and $$a_t$$ they are already determined, so they are constants and can be pulled out. Writing the outer expectation as a sum over trajectories $$\omega \in \Omega_t$$ then splits the total expectation into the part $$p_b(\omega)\,\phi_t \prod_{k=1}^{t} \rho_k$$ for reaching state $$s_t$$ and taking action $$a_t$$, and the part $$E_b \{\bar R_t^{(n)} - \theta^T\phi_t \mid s_t, a_t\}$$ for what happens afterwards, which is exactly the expression in the paper.
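To see the whole theorem in action, here is a small sanity check of my own (a toy example, not from the paper): exact enumeration over a two-step episodic MDP with $$n = 1$$ returns, confirming $$E_b\{\Delta\bar\theta \mid s_0, a_0\} = E_\pi\{\Delta\theta \mid s_0, a_0\}$$. All numbers (policies, rewards, features) are made up for illustration.

```python
import numpy as np

# Two-step episode s0 -> s1 -> terminal, with n = 1 (TD(0)-style) returns.
# The theorem conditions on s0, a0, so the first action needs no correction;
# only the action in s1 is importance-sampled (rho_1 = pi/b).
gamma, alpha = 0.9, 0.1
theta = np.array([0.5, -0.3])
phi = {"s0": np.array([1.0, 0.0]), "s1": np.array([0.0, 1.0])}

pi_s1 = {"u": 0.7, "d": 0.3}   # target policy in s1
b_s1  = {"u": 0.4, "d": 0.6}   # behavior policy in s1
r0 = 1.0                       # reward for the given first action a0
r1 = {"u": 2.0, "d": -1.0}     # final reward; terminal state has phi = 0

def on_policy_update(a1):
    # Delta theta for one on-policy episode.
    d0 = alpha * (r0 + gamma * theta @ phi["s1"] - theta @ phi["s0"]) * phi["s0"]
    d1 = alpha * (r1[a1] - theta @ phi["s1"]) * phi["s1"]
    return d0 + d1

def off_policy_update(a1):
    # Importance-sampled Delta theta-bar for the same episode under b.
    rho1 = pi_s1[a1] / b_s1[a1]
    d0 = alpha * (r0 + gamma * rho1 * (theta @ phi["s1"]) - theta @ phi["s0"]) * phi["s0"]
    d1 = alpha * (r1[a1] - theta @ phi["s1"]) * phi["s1"] * rho1
    return d0 + d1

# Exact expectations: enumerate the action in s1 under each policy.
E_pi = sum(pi_s1[a] * on_policy_update(a) for a in pi_s1)
E_b  = sum(b_s1[a] * off_policy_update(a) for a in b_s1)
print(np.allclose(E_pi, E_b))  # the two expected increments match
```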