
I am trying to solve some questions about an MRP (i.e. a Markov decision process with only one possible action at each state). The setup is as follows:

There are two states, $a$ and $b$; stepping to $a$ is terminal.

All rewards are zero; the discount for stepping from $b \to b$ is $1$, and the discount for all other steps is zero.

The possible transitions are: $a\to b$ (probability $1$), $b\to a$ (probability $p$), and $b\to b$ (probability $1-p$).

My first question is whether it is true that the optimal values for each state here are zero. If not, how would one derive them?

My second question: suppose we have a parameter $\lambda$ and a feature $\phi$ such that

$\lambda \times \phi(a)$ and $\lambda \times \phi(b)$

approximate the optimal values at states $a$ and $b$, and we attempt to learn such a $\lambda$ by means of TD(0), starting with $\lambda_0 = 1$. How can I find the expected value of $\lambda$ after one episode of updating, in terms of $p$? (That is, $E[\lambda_T]$, where $T$ is a random variable representing the duration of the episode.)

By an episode I mean that the episode ends once we step from state $b$ back to $a$.
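For concreteness, the TD(0) update in question can be simulated. This is only a sketch under assumptions the question does not specify: I pick $\phi(a)=0$, $\phi(b)=1$, and a learning rate $\alpha=0.1$ purely for illustration, and apply the standard linear TD(0) rule $\lambda \leftarrow \lambda + \alpha\,[r + \gamma\,\lambda\phi(s') - \lambda\phi(s)]\,\phi(s)$ with the per-transition discounts described above.

```python
import random

def run_episode(p, lam=1.0, alpha=0.1, phi_a=0.0, phi_b=1.0):
    """Run one episode starting in b, applying TD(0) at each step.

    Rewards are all zero; gamma is 1 for b->b and 0 for b->a,
    as in the problem statement. phi and alpha are assumed values.
    """
    while True:
        if random.random() < p:
            # b -> a (terminal): gamma = 0, so the target is just r = 0
            delta = 0.0 + 0.0 * (lam * phi_a) - lam * phi_b
            lam += alpha * delta * phi_b
            return lam
        else:
            # b -> b: gamma = 1, so delta = 0 + lam*phi_b - lam*phi_b = 0
            delta = 0.0 + 1.0 * (lam * phi_b) - lam * phi_b
            lam += alpha * delta * phi_b  # no change

def estimate_expected_lambda(p, n=10_000):
    """Monte Carlo estimate of E[lambda_T] over n episodes."""
    return sum(run_episode(p) for _ in range(n)) / n
```

Note what the simulation suggests: every $b \to b$ step has zero TD error (the target $0 + 1 \cdot \lambda\phi(b)$ equals the current estimate $\lambda\phi(b)$), so under these assumptions only the final $b \to a$ step changes $\lambda$, and $\lambda_T$ comes out the same for every episode length $T$.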

I am also not sure about your definition of "optimal" in "optimal values". There is nothing to optimise here. You can calculate the expected discounted future reward; you cannot optimise it. – Neil Slater – 2018-04-14T08:09:03.687

Looking at the problem definition, and imagining it to be part of some course on RL, I would guess that the setup is undiscounted (i.e. a discount factor $\gamma$ of $1$, which can be ignored), although perhaps you want to work with immediate reward only (in which case $\gamma = 0$), and that where you have used "discount" you meant "reward"? Otherwise the problem makes little sense to me. In case I am wrong, could you give a link to the definitions of MRP, reward and discount that you are using? – Neil Slater – 2018-04-14T08:13:39.733

I am not allowed to edit, but I meant to say in the above that $\lambda_0 = 1$. – BlagBlug1987 – 2018-04-13T21:54:12.177

The optimal value function should be the maximum value with respect to the possible actions at the current state. As Neil states, your value will be the expected return, which in all cases is zero because your reward is zero. Also, you mention only one action per state, but I see two while in $b$: $b \to b$ and $b \to a$. It also seems somewhat odd to me to define a different discount for each transition. By the way, while $b \to a$ terminates the episode, so the discount does not matter there, the fact that you assign discount one to $b \to b$ means you end up with a sum of rewards that never converges. – Constantinos – 2018-04-15T18:19:21.797