Equations in "Introduction to RL": What is the meaning and difference between E, and E with subscript?


This question is from An introduction to RL, page 78. In the formula at the bottom of the page, both

$\mathbb{E}$ and $\mathbb{E}_\pi$

are mentioned. Could you help me understand the difference between these two, both on this page and in general?

Melanie A

Posted 2018-09-07T14:16:55.560

Reputation: 37



In general, an expectation is taken with respect to some random variable $X$ (more precisely, with respect to its distribution). When only a single random variable is involved, it is clear from context which one is being integrated over, so writing $\mathbb{E}$ suffices. With multiple random variables this is no longer the case, and a subscript then denotes the random variable with respect to which the expectation is taken.

However, the subscript can also denote which random variable you are conditioning on. This seems to me to be the case on the page you are referring to: there, $\mathbb{E}_{\pi'}$ means that you are conditioning on the actions being distributed according to $\pi'$.

(Just as a side note: sometimes you might see something along the lines of $\mathbb{E}_{a\sim \pi'}$; the subscript here is the actual notation for "$a$ distributed according to $\pi'$".)
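To make the side note concrete, here is a sketch of what that notation expands to. It assumes a finite set of actions, a fixed state $s$, and an arbitrary function $f$ of the action; $f$ is only a placeholder and does not come from the book:

$$\mathbb{E}_{a \sim \pi'}\left[f(a)\right] = \sum_{a} \pi'(a \mid s)\, f(a)$$

In words, each value $f(a)$ is weighted by the probability that $\pi'$ assigns to action $a$ in state $s$.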

For a more technical answer, have a look at this question on Cross Validated.

Andrei Poehlmann

Posted 2018-09-07T14:16:55.560

Reputation: 146


From the notation section starting on page $xix$, the subscript $_{\pi}$ seems to be read:

... under the policy $\pi$

So $\mathbb{E}$ is the expectation, whereas

$\mathbb{E}_\pi$ is the expectation under the policy $\pi$.

We could compute the expectation of a random number selected from 1 to 10. If every number in that range is equally likely to be selected, we can simply take the arithmetic mean, which equals 5.5.

However, if the selection is based on unequal weights, so that some numbers are more likely than others (just as a policy makes some actions more likely than others), the answer is no longer the simple mean. It is a weighted mean, skewed towards the heavier weights, i.e. the more likely selections.
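The two cases above can be checked with a few lines of Python. This is only an illustrative sketch: the helper `expectation` and the `skewed` weights (probability proportional to the number itself) are my own choices for the example, not anything taken from the book.

```python
# Numbers 1..10, as in the example above.
values = list(range(1, 11))


def expectation(values, probs):
    """Expected value of a discrete random variable: sum of p(x) * x."""
    # The probabilities must sum to 1 (up to floating-point error).
    assert abs(sum(probs) - 1.0) < 1e-9
    return sum(p * v for p, v in zip(probs, values))


# Case 1: every number is equally likely.
uniform = [1 / len(values)] * len(values)

# Case 2 (hypothetical weights): larger numbers are more likely,
# with probability proportional to the number itself.
total = sum(values)
skewed = [v / total for v in values]

e_uniform = expectation(values, uniform)  # approximately 5.5
e_skewed = expectation(values, skewed)    # approximately 7.0

print(e_uniform, e_skewed)
```

The set of possible outcomes is the same in both cases; only the distribution over them changes, and that distribution is what a subscript such as $\pi$ specifies.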

In those specific equations, if I am not mistaken, the authors simply put the subscript there in $\mathbb{E}_{\pi'}$ to make it clear that we are working under the policy $\pi'$. The subscript appears only on the third line because that is where the policy is removed from the conditional part of the expectation (the part after the vertical bar). So we remove the relevant policy term, $A_t = \pi'(s)$, and just indicate that we are working with the policy via the subscript.
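Schematically, the step described above looks like the following. Here $X$ is just a placeholder for whatever stands inside the brackets in the book, so this is a sketch of the notation change rather than a quotation of the formula:

$$\mathbb{E}\left[\,X \mid S_t = s,\ A_t = \pi'(s)\,\right] = \mathbb{E}_{\pi'}\left[\,X \mid S_t = s\,\right]$$

Both sides describe the same quantity; the information about how the action is chosen has only moved from the conditioning part into the subscript.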


Posted 2018-09-07T14:16:55.560

Reputation: 12 573