I am trying to understand the proof of Theorem 2.1 in this paper:

Ross, Stéphane, and Drew Bagnell. "Efficient reductions for imitation learning." Proceedings of the thirteenth international conference on artificial intelligence and statistics. 2010.

The cost-to-go is given as

$$J(\pi) = \sum_{t=1}^{T}\mathbb{E}_{s\,\sim\, d^t_{\pi}(s)}\left[C_\pi(s)\right].$$
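To make sure I am reading this definition correctly, here is a minimal Python sketch of how I would evaluate $J(\pi)$ on a small tabular MDP. The function name `cost_to_go`, the array layouts, and the two-state toy MDP are my own illustration and are not taken from the paper; I am assuming $C_\pi(s)$ is the expected immediate cost of following $\pi$ in state $s$, with costs in $[0,1]$.

```python
def cost_to_go(d1, P, policy, C, T):
    """Evaluate J(pi) = sum_{t=1}^T E_{s ~ d_pi^t}[C_pi(s)] on a tabular MDP.

    Assumed (hypothetical) layouts:
      d1[s]         initial state distribution
      P[s][a][s2]   probability of moving from s to s2 under action a
      policy[s][a]  probability that the policy picks action a in state s
      C[s][a]       immediate cost of action a in state s, in [0, 1]
    """
    n_states = len(d1)
    n_actions = len(policy[0])
    # C_pi(s) = E_{a ~ pi(s)}[C(s, a)]
    C_pi = [sum(policy[s][a] * C[s][a] for a in range(n_actions))
            for s in range(n_states)]
    d = list(d1)  # d holds d_pi^t, starting at t = 1
    total = 0.0
    for _ in range(T):
        # add E_{s ~ d_pi^t}[C_pi(s)]
        total += sum(d[s] * C_pi[s] for s in range(n_states))
        # push the state distribution one step forward under the policy
        d = [sum(d[s] * policy[s][a] * P[s][a][s2]
                 for s in range(n_states) for a in range(n_actions))
             for s2 in range(n_states)]
    return total


# Toy MDP: state 0 is "on track", state 1 is an absorbing "off track" state.
# Action 0 stays on track at cost 0; action 1 costs 1 and leads off track.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
C = [[0.0, 1.0],
     [1.0, 1.0]]
d1 = [1.0, 0.0]

expert = [[1.0, 0.0], [1.0, 0.0]]    # always picks action 0
learned = [[0.9, 0.1], [1.0, 0.0]]   # errs with probability 0.1 in state 0

j_expert = cost_to_go(d1, P, expert, C, T=3)
j_learned = cost_to_go(d1, P, learned, C, T=3)
print(j_expert, j_learned)
```

In this toy case the learned policy's cost grows at every step, because once it leaves state 0 it keeps paying the maximal cost, which I take to be the compounding effect the proof is accounting for.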

In the paper they use $\hat{\pi}=\pi$ for the learned policy and $\pi^*$ for the expert policy.

In the derivation they write

$$\begin{aligned}
J(\pi) &\leq \sum_{t=1}^{T}\left\{ p_{t-1}\,\mathbb{E}_{s\,\sim\,d_t(s)}\left[C_\pi(s)\right]+(1-p_{t-1})\right\} \\
&\leq \sum_{t=1}^{T}\left\{ p_{t-1}\,\mathbb{E}_{s\,\sim\,d_t(s)}\left[C_{\pi^*}(s)\right]+p_{t-1}\,\ell_t(s,\pi)+(1-p_{t-1})\right\},
\end{aligned}$$

where $p_{t-1}$ is the probability that policy $\pi$ has made no error up to time $t-1$, and $\ell$ is the surrogate 0-1 loss.

The steps that follow these are easy to follow, but how did they arrive at these two inequalities?