## What is the optimal value of a Markov decision process with a single action at each state?


I am trying to solve some questions about an MRP (Markov reward process, i.e. a Markov decision process with only one possible action at each state). The setup is as follows:

• There are two states, $a$ and $b$; stepping to $a$ is terminal.

• All rewards are zero; the discount for the step $b \to b$ is $1$, and the discount for all other steps is $0$.

• The possible scenarios are: $a\to b$ (probability $1$), $b\to a$ (probability $p$) and $b\to b$ (probability $1-p$).

The first question I have is whether it's true that the optimal values for each state here are zero. If not, how would you derive them?

My second question: suppose we have a parameter $\lambda$ and a feature $\phi$ such that

$\lambda \times \phi(a)$ and $\lambda \times \phi(b)$

approximate the optimal values at states $a$ and $b$, and we try to learn such a $\lambda$ by means of TD(0), starting with $\lambda_0 = 1$. How can I find the expected value of $\lambda$ after one episode of updating, in terms of $p$? (That is, $E[\lambda_T]$, where $T$ is a random number representing the duration of the episode.)

By an episode I mean that once we go from state $b$ back to $a$, the episode is over.
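To make the TD(0) part concrete, here is a rough sketch of one episode under one possible reading of the setup. Several things are assumptions of mine and not given in the problem: the episode starts in $a$, the feature values `phi_a` and `phi_b`, the step size `alpha`, and the function name `td0_one_episode`. The update used is the usual linear TD(0) rule, $\lambda \leftarrow \lambda + \alpha\,\delta\,\phi(s)$ with $\delta = r + \gamma(s,s')\,\lambda\,\phi(s') - \lambda\,\phi(s)$.

```python
import random

def td0_one_episode(p, phi_a, phi_b, alpha, lam=1.0, rng=random):
    """Run one TD(0) episode a -> b -> ... -> a and return the final weight.

    phi_a, phi_b and alpha are hypothetical inputs; p must be > 0,
    otherwise the episode never terminates.
    """
    phi = {"a": phi_a, "b": phi_b}
    state = "a"  # assumption: every episode starts in state a
    while True:
        if state == "a":
            nxt, discount = "b", 0.0   # a -> b, discount 0
        elif rng.random() < p:
            nxt, discount = "a", 0.0   # b -> a, discount 0, ends the episode
        else:
            nxt, discount = "b", 1.0   # b -> b, discount 1
        # All rewards are zero, so the TD error has no reward term.
        delta = discount * lam * phi[nxt] - lam * phi[state]
        lam += alpha * delta * phi[state]
        if nxt == "a":
            return lam
        state = nxt

print(td0_one_episode(p=0.3, phi_a=1.0, phi_b=2.0, alpha=0.1))
```

If this reading is right, every $b \to b$ step has TD error $\lambda\phi(b) - \lambda\phi(b) = 0$, so only the first step and the final $b \to a$ step change $\lambda$. That would give $\lambda_T = \lambda_0\,(1 - \alpha\phi(a)^2)(1 - \alpha\phi(b)^2)$ regardless of $T$, and hence an expectation that does not depend on $p$. A different start state or a different discount convention would change this.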

I am also not sure about your definition of "optimal" in "optimal values". There is nothing to optimise here. You can calculate expected discounted future reward. You cannot optimise it. – Neil Slater – 2018-04-14T08:09:03.687

Looking at the problem definition, and imagining it to be part of some course on RL, I would guess that the setup is undiscounted (i.e. a discount factor $\gamma$ of 1, which can be ignored), although perhaps you want to work with immediate reward only (in which case $\gamma = 0$), and that where you have used "discount" you meant "reward"? Otherwise the problem makes little sense to me. In case I am wrong, could you give a link to the definitions of MRP, reward and discount that you are using? – Neil Slater – 2018-04-14T08:13:39.733

I am not allowed to edit but I meant to say in the above that $\lambda_0 = 1$ – BlagBlug1987 – 2018-04-13T21:54:12.177

The optimal value function should be the maximum value with respect to the possible actions at the current state. As Neil states, your value will be the expected return, which in all cases is zero as your reward is zero. Also, you mention only one action per state, but I see two while in b: b-->b and b-->a. It also seems kind of weird to me to define different discounts for each transition. By the way, while b-->a terminates the episode, so you do not care about its discount, assigning discount one to b-->b means you will end up with a sum of rewards that never converges. – Constantinos – 2018-04-15T18:19:21.797


Hi guys, thanks for your comments, and sorry for the slow reply. I cannot reply to them directly because I am not allowed.

@Constantinos, on your point about b-->b and b-->a being different actions: I think you misinterpreted what I am asking. There is only one action (what you call it is arbitrary) but two possible outcomes.

@Neil Slater, yes, I think the question is badly written. An MRP is defined as an MDP with only one possible action in each state (or no action, depending on how you look at it). Reward is the output of a function that maps a pair of states (say $(a,b)$ or $(b,b)$) to a real value. Discount is defined similarly.

Really, the crux of the question is: find a linear function approximation parameter $\lambda$ such that $\lambda \phi(s) \approx v(s)$, where $v(s)$ is the value associated with a state $s$. I believe these values are defined by $v(s) = \sum_{s'}P(s,s')(R(s,s') + \gamma v(s'))$, where $R(\cdot,\cdot)$ is the reward function on transitions, $P(\cdot,\cdot)$ maps state pairs to transition probabilities, and $\gamma$ is the discount (which in this setup depends on the transition).
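As a sanity check on the values themselves, the equation above can be solved directly as a small linear system. This is only a sketch under my reading of the setup: the discount is taken per transition as a matrix `G`, the state order (a, b) and the function name `mrp_values` are my own choices, and `p` must be greater than zero for the system to have a unique solution.

```python
import numpy as np

def mrp_values(p):
    """Solve v(s) = sum_s' P(s,s') * (R(s,s') + G(s,s') * v(s')) for states (a, b)."""
    # State order: index 0 is a, index 1 is b.
    P = np.array([[0.0, 1.0],       # a -> b with probability 1
                  [p, 1.0 - p]])    # b -> a with prob p, b -> b with prob 1 - p
    R = np.zeros((2, 2))            # all rewards are zero
    G = np.array([[0.0, 0.0],       # discount 0 for a -> b (a -> a never occurs)
                  [0.0, 1.0]])      # discount 0 for b -> a, discount 1 for b -> b
    # Rearranged: (I - P * G) v = sum_s' P(s,s') R(s,s'), with * elementwise.
    A = np.eye(2) - P * G
    rhs = (P * R).sum(axis=1)
    return np.linalg.solve(A, rhs)  # raises LinAlgError when p == 0 (singular)

print(mrp_values(0.3))
```

By hand, the same system reads $v(a) = 0$ and $v(b) = (1-p)\,v(b)$, so $p\,v(b) = 0$ and both values are zero whenever $p > 0$.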

I think the main reason I am lost is that the question itself is pretty ambiguous, so I am trying to fill in the gaps heuristically.

Thanks, guys