Equivalence between expected parameter increments in "Off-Policy Temporal-Difference Learning with Function Approximation"


I am having a hard time understanding the proof of Theorem 1 in the paper "Off-Policy Temporal-Difference Learning with Function Approximation".

Let $\Delta \theta$ and $\Delta \bar{\theta}$ be the sum of the parameter increments over an episode under on-policy $T D(\lambda)$ and importance sampled $T D(\lambda)$ respectively, assuming that the starting weight vector is $\theta$ in both cases. Then

$E_{b}\left\{\Delta \bar{\theta} | s_{0}, a_{0}\right\}=E_{\pi}\left\{\Delta \theta | s_{0}, a_{0}\right\}, \quad \forall s_{0} \in \mathcal{S}, a_{0} \in \mathcal{A}$

We know that: $$ \begin{aligned} &\Delta \theta_{t}=\alpha\left(R_{t}^{\lambda}-\theta^{T} \phi_{t}\right) \phi_{t}\\ &R_{t}^{\lambda}=(1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_{t}^{(n)}\\ &R_{t}^{(n)}=r_{t+1}+\gamma r_{t+2}+\cdots+\gamma^{n-1} r_{t+n}+\gamma^{n} \theta^{T} \phi_{t+n} \end{aligned} $$
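These definitions translate directly into code. Below is a minimal Python sketch (my own conventions, not from the paper): `rewards[k]` holds $r_{k+1}$, `phis[k]` holds $\phi_k$, and the infinite sum in $R_t^{\lambda}$ is truncated after `N` terms:

```python
import numpy as np

def n_step_return(rewards, phis, theta, t, n, gamma):
    """R_t^(n): n-step return bootstrapped with theta^T phi_{t+n}.
    Convention: rewards[k] holds r_{k+1}, phis[k] holds phi_k."""
    G = sum(gamma ** i * rewards[t + i] for i in range(n))
    return G + gamma ** n * (theta @ phis[t + n])

def lambda_return(rewards, phis, theta, t, gamma, lam, N):
    """R_t^lambda = (1-lam) * sum_n lam^(n-1) * R_t^(n), truncated at n = N."""
    return (1 - lam) * sum(
        lam ** (n - 1) * n_step_return(rewards, phis, theta, t, n, gamma)
        for n in range(1, N + 1)
    )
```

With $\lambda = 0$ this collapses to the one-step return $R_t^{(1)}$, as expected.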

and $$\Delta \bar{\theta}_{t}=\alpha\left(\bar{R}_{t}^{\lambda}-\theta^{T} \phi_{t}\right) \phi_{t} \rho_{1} \rho_{2} \cdots \rho_{t}, \qquad \bar{R}_{t}^{\lambda}=(1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} \bar{R}_{t}^{(n)}$$ with $$ \begin{aligned} \bar{R}_{t}^{(n)}=& r_{t+1}+\gamma r_{t+2} \rho_{t+1}+\cdots \\ &+\gamma^{n-1} r_{t+n} \rho_{t+1} \cdots \rho_{t+n-1} \\ &+\gamma^{n} \rho_{t+1} \cdots \rho_{t+n} \theta^{T} \phi_{t+n} \end{aligned} $$
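With the same hypothetical conventions as above, plus `rhos[k]` holding $\rho_k$, the importance-sampled return $\bar{R}_t^{(n)}$ can be sketched as:

```python
import numpy as np

def is_n_step_return(rewards, phis, rhos, theta, t, n, gamma):
    """bar{R}_t^(n): reward r_{t+k} is weighted by rho_{t+1}...rho_{t+k-1},
    the bootstrap term by rho_{t+1}...rho_{t+n}.
    Conventions: rewards[k] holds r_{k+1}, phis[k] holds phi_k, rhos[k] holds rho_k."""
    G = 0.0
    w = 1.0  # running product rho_{t+1} ... rho_{t+k-1}
    for k in range(1, n + 1):
        G += gamma ** (k - 1) * rewards[t + k - 1] * w
        w *= rhos[t + k]  # extend the product to rho_{t+k}
    return G + gamma ** n * w * (theta @ phis[t + n])
```

When every $\rho_k = 1$ (behaviour policy equals target policy), this reduces to the ordinary $R_t^{(n)}$.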

And it is proven that: $$ E_{b}\left\{\bar{R}_{t}^{\lambda} | s_{t}, a_{t}\right\}=E_{\pi}\left\{R_{t}^{\lambda} | s_{t}, a_{t}\right\} $$

Here is the proof; it begins with:

$E_{b}\{\Delta \bar{\theta}\}=E_{b}\left\{\sum_{t=0}^{\infty} \alpha\left(\bar{R}_{t}^{\lambda}-\theta^{T} \phi_{t}\right) \phi_{t} \rho_{1} \rho_{2} \cdots \rho_{t}\right\}$ $=E_{b}\left\{\sum_{t=0}^{\infty} \sum_{n=1}^{\infty} \alpha(1-\lambda) \lambda^{n-1}\left(\bar{R}_{t}^{(n)}-\theta^{T} \phi_{t}\right) \phi_{t} \rho_{1} \rho_{2} \cdots \rho_{t}\right\}$.

which I believe is incorrect, since

$E_{b}\{\Delta \bar{\theta}\}=E_{b}\left\{\sum_{t=0}^{\infty} \alpha\left(\bar{R}_{t}^{\lambda}-\theta^{T} \phi_{t}\right) \phi_{t} \rho_{1} \rho_{2} \cdots \rho_{t}\right\}$ $=E_{b}\left\{\sum_{t=0}^{\infty} \alpha \left(\sum_{n=1}^{\infty}(1-\lambda) \lambda^{n-1}\bar{R}_{t}^{(n)}-\theta^{T} \phi_{t}\right) \phi_{t} \rho_{1} \rho_{2} \cdots \rho_{t}\right\}$.

and pulling the sum over $n$ outside the parentheses would seem to turn the single constant term $\theta^{T} \phi_{t}$ into an infinite sum of constant terms.

Furthermore, it is claimed that in order to prove the equivalence above, it is enough to prove the equivalence below: $$ \begin{array}{c} E_{b}\left\{\sum_{t=0}^{\infty}\left(\bar{R}_{t}^{(n)}-\theta^{T} \phi_{t}\right) \phi_{t} \rho_{1} \rho_{2} \cdots \rho_{t}\right\} \\ =E_{\pi}\left\{\sum_{t=0}^{\infty}\left(R_{t}^{(n)}-\theta^{T} \phi_{t}\right) \phi_{t}\right\} \end{array} $$

I don't see why this is sufficient. And even if it is, there are further ambiguities in the proof:

$E_{b}\left\{\sum_{t=0}^{\infty}\left(\bar{R}_{t}^{(n)}-\theta^{T} \phi_{t}\right) \phi_{t} \rho_{1} \rho_{2} \cdots \rho_{t}\right\}$ $$=\sum_{t=0}^{\infty} \sum_{\omega \in \Omega_{t}} p_{b}(\omega) \phi_{t} \prod_{k=1}^{t} \rho_{k} E_{b}\left\{\bar{R}_{t}^{(n)}-\theta^{T} \phi_{t} | s_{t}, a_{t}\right\}$$ (this step invokes the Markov property, and I don't understand why the Markov property leads to this conditional independence!) $$=\sum_{t=0}^{\infty} \sum_{\omega \in \Omega_{t}} \prod_{j=1}^{t} p_{s_{j-1}, s_{j}}^{a_{j-1}} b\left(s_{j}, a_{j}\right) \phi_{t} \prod_{k=1}^{t} \frac{\pi\left(s_{k}, a_{k}\right)}{b\left(s_{k}, a_{k}\right)} \cdot \left(E_{b}\left\{\bar{R}_{t}^{(n)} | s_{t}, a_{t}\right\}-\theta^{T} \phi_{t}\right)$$

$$= \sum_{t=0}^{\infty} \sum_{\omega \in \Omega_{t}} \prod_{j=1}^{t} p_{s_{j-1}, s_{j}}^{a_{j-1}} \pi\left(s_{j}, a_{j}\right) \phi_{t} \cdot\left(E_{b}\left\{\bar{R}_{t}^{(n)} | s_{t}, a_{t}\right\}-\theta^{T} \phi_{t}\right)$$

$$=\sum_{t=0}^{\infty} \sum_{\omega \in \Omega_{t}} p_{\pi}(\omega) \phi_{t}\left(E_{\pi}\left\{R_{t}^{(n)} | s_{t}, a_{t}\right\}-\theta^{T} \phi_{t}\right)$$ (using the previous result, $E_{b}\{\bar{R}_{t}^{(n)} | s_{t}, a_{t}\}=E_{\pi}\{R_{t}^{(n)} | s_{t}, a_{t}\}$) $$=E_{\pi}\left\{\sum_{t=0}^{\infty}\left(R_{t}^{(n)}-\theta^{T} \phi_{t}\right) \phi_{t}\right\} . \diamond$$

I'd be grateful if anyone could shed some light on this.


Posted 2020-04-07T10:36:06.187

Reputation: 23



The first step is correct: \begin{align} &\sum_{n=1}^{\infty} \alpha(1-\lambda)\lambda^{n-1} (\bar R_t^{(n)} - \theta^T \phi_t)\\ =& \alpha\left[\sum_{n=1}^{\infty} (1-\lambda)\lambda^{n-1} \bar R_t^{(n)} - \sum_{n=1}^{\infty} (1-\lambda)\lambda^{n-1} \theta^T \phi_t\right]. \end{align} The geometric series $\sum_{n=1}^{\infty} (1-\lambda)\lambda^{n-1}$ sums to $1$, so this equals \begin{equation} \alpha\left[\sum_{n=1}^{\infty} (1-\lambda)\lambda^{n-1} \bar R_t^{(n)} - \theta^T \phi_t\right] = \alpha(\bar R_t^{\lambda} - \theta^T \phi_t). \end{equation} In other words, distributing the sum over $n$ does not blow up the constant term, because the weights $(1-\lambda)\lambda^{n-1}$ form a probability distribution.

For the second part, it is enough to prove the equivalence for each fixed $n$, because both sides are sums over $n$ with the same weights $(1-\lambda)\lambda^{n-1}$: if you have two sums $\sum_n c_n x_n$ and $\sum_n c_n y_n$ with $x_n = y_n$ for every $n$, then the sums are equal.
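The weight-sum argument is easy to verify numerically (the choice $\lambda = 0.7$ and the truncation length below are arbitrary):

```python
# The lambda-weights form a probability distribution over n:
# (1 - lam) * sum_{n>=1} lam**(n-1) = 1,
# so absorbing theta^T phi_t into the weighted sum changes nothing.
lam = 0.7
weights = [(1 - lam) * lam ** (n - 1) for n in range(1, 2000)]
total = sum(weights)
# total equals 1 up to the truncation error lam**1999
```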

For the third part: we condition on being in state $s_t$ having already taken action $a_t$, so \begin{align} &E_b \{ \sum_{t=0}^{\infty} (\bar R_t^{(n)} - \theta^T\phi_t)\phi_t \rho_1\rho_2\cdots\rho_t\}\\ =& \sum_{t=0}^{\infty} E_b \{(\bar R_t^{(n)} - \theta^T\phi_t)\phi_t \rho_1\rho_2\cdots\rho_t\}\\ =& \sum_{t=0}^{\infty} E_b \{\phi_t \rho_1\rho_2\cdots \rho_t\} E_b \{(\bar R_t^{(n)} - \theta^T\phi_t)|s_t, a_t\} \end{align} The factorization works because each $\rho_i$, $i = 1, \ldots, t$, depends only on $(s_i, a_i)$, i.e. on the trajectory up to time $t$. By the Markov property, the conditional expectation of $\bar R_t^{(n)}$ given $(s_t, a_t)$ depends only on $(s_t, a_t)$ and not on the earlier states and actions, so the two factors are independent. We also do not need to carry $\phi_t$ and $\rho_t$ inside the expectation over $\bar R_t^{(n)}$: once $s_t$ and $a_t$ are fixed, $\phi_t$ and $\rho_t$ are determined, so they act as constants. The total expectation therefore splits into a factor $E_b \{\phi_t \rho_1\rho_2\cdots \rho_t\}$ for reaching state $s_t$ and taking action $a_t$, and a factor $E_b \{(\bar R_t^{(n)} - \theta^T\phi_t)|s_t, a_t\}$ for what happens afterwards.
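The mechanism used in the remaining steps — multiplying by the product of importance ratios converts an expectation under $b$ into one under $\pi$ — can be checked on a one-step toy example (the distributions $\pi$, $b$ and the payoff $f$ below are arbitrary, made up for illustration):

```python
import random

# Check E_b{ rho * f(A) } = E_pi{ f(A) } with rho = pi(A)/b(A),
# for a single decision with two actions.
random.seed(0)
pi = {0: 0.9, 1: 0.1}   # target policy (hypothetical)
b  = {0: 0.5, 1: 0.5}   # behaviour policy (hypothetical)
f  = {0: 1.0, 1: 5.0}   # arbitrary payoff per action

exact = sum(pi[a] * f[a] for a in pi)        # E_pi{f(A)} = 0.9*1 + 0.1*5 = 1.4

n = 200_000
est = 0.0
for _ in range(n):
    a = 0 if random.random() < b[0] else 1   # sample action from b
    est += (pi[a] / b[a]) * f[a]             # reweight by rho = pi(a)/b(a)
est /= n
# est matches exact up to Monte Carlo noise of order 1/sqrt(n)
```

The estimate built under $b$ converges to $E_\pi\{f(A)\}$, which is exactly what the products $\rho_1\cdots\rho_t$ accomplish trajectory-wise in the proof.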


Posted 2020-04-07T10:36:06.187

Reputation: 1 664