What happens to the optimal value function if the reward is multiplied by a constant?

2

What happens to the optimal action-value function, $q_*$, if the reward is multiplied by a constant $c$? Is the optimal action-value function also multiplied by that constant?

nbro

Posted 2019-09-15T20:58:34.960

Reputation: 19 783

What is the reward here, though? Is it a single reward at the end, or can there be multiple rewards? Does this include a negative reward/penalty at each step (if someone chooses to include it)? – DuttaA – 2019-09-16T04:32:28.340

@DuttaA If you multiply all outputs of the reward function by a constant. – nbro – 2019-09-16T11:52:35.530

Answers

2

The Bellman optimality equation is given by

$$q_*(s,a) = \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}p(s',r \mid s,a)(r + \gamma \max_{a'\in\mathcal{A}(s')}q_*(s',a')) \tag{1}\label{1}.$$
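To make this concrete, here is a minimal Python sketch (not part of the original answer) that solves equation \ref{1} by Q-value iteration on a small, made-up tabular MDP. The numbers, the function name `q_value_iteration`, and the use of expected rewards $r(s,a,s')$ in place of the full joint distribution $p(s', r \mid s, a)$ are all illustrative assumptions.

```python
import numpy as np

# A tiny, made-up tabular MDP (all numbers are hypothetical):
# 2 states, 2 actions; P[s, a, s'] are transition probabilities,
# R[s, a, s'] are the expected rewards for each transition.
P = np.array([[[0.8, 0.2],
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.3, 0.7]]])
R = np.array([[[1.0, 0.0],
               [0.0, 2.0]],
              [[0.5, 1.0],
               [2.0, 0.0]]])
gamma = 0.9

def q_value_iteration(P, R, gamma, n_iters=1000):
    """Repeatedly apply the Bellman optimality backup of equation (1)."""
    n_states, n_actions = P.shape[0], P.shape[1]
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        # q(s,a) = sum_{s'} p(s'|s,a) * (r(s,a,s') + gamma * max_{a'} q(s',a'))
        Q = np.sum(P * (R + gamma * Q.max(axis=1)), axis=2)
    return Q

Q_star = q_value_iteration(P, R, gamma)
print(Q_star)  # the (approximate) optimal action-value function q_*
```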

If the reward is multiplied by a constant $c \in \mathbb{R}$ with $c > 0$, then the new optimal action-value function is given by $cq_*(s, a)$.

To prove this, we just need to show that equation \ref{1} still holds when the reward is replaced by $cr$ and the optimal action-value function by $c q_*(s, a)$, that is,

\begin{align} c q_*(s,a) &= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}p(s',r \mid s,a)(c r + \gamma \max_{a'\in\mathcal{A}(s')} c q_*(s',a')) \tag{2}\label{2} \end{align}

Given that $c > 0$, then $\max_{a'\in\mathcal{A}(s')} c q_*(s',a') = c\max_{a'\in\mathcal{A}(s')}q_*(s',a')$, so $c$ can be taken out of the $\operatorname{max}$ operator. Therefore, equation \ref{2} becomes

\begin{align} c q_*(s,a) &= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}p(s',r \mid s,a)(c r + \gamma c \max_{a'\in\mathcal{A}(s')} q_*(s',a')) \\ &= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}}c p(s',r \mid s,a)(r + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a')) \\ &= c \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s',r \mid s,a)(r + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a')). \end{align} Dividing both sides by $c > 0$ gives \begin{align} q_*(s,a) &= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s',r \mid s,a)(r + \gamma \max_{a'\in\mathcal{A}(s')} q_*(s',a')), \tag{3}\label{3} \end{align} which is exactly the Bellman optimality equation in \ref{1}. This means that, when the reward is $cr$, $c q_*(s,a)$ satisfies the Bellman optimality equation, i.e. it is the optimal action-value function of the scaled problem.
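As a quick numerical sanity check (reusing the hypothetical `q_value_iteration`, `P`, `R` and `gamma` from the sketch above), scaling all rewards by a positive constant should scale the resulting optimal action-values by the same constant:

```python
c = 3.0  # any positive constant
Q_scaled = q_value_iteration(P, c * R, gamma)
# The optimal action-values of the scaled problem coincide with c * Q_star.
print(np.allclose(Q_scaled, c * Q_star))  # expected: True
```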

If $c=0$, then \ref{2} reduces to $0=0$, which holds trivially (with zero rewards, the optimal action-value function is identically zero). If $c < 0$, then $\max_{a'\in\mathcal{A}(s')} c q_*(s',a') = c\min_{a'\in\mathcal{A}(s')}q_*(s',a')$, so repeating the same steps turns equation \ref{3} into

\begin{align} q_*(s,a) &= \sum_{s' \in \mathcal{S}, r \in \mathcal{R}} p(s',r \mid s,a)(r + \gamma \min_{a'\in\mathcal{A}(s')} q_*(s',a')) \end{align}

which is not the Bellman optimality equation in \ref{1} (the $\max$ has become a $\min$), so, for $c < 0$, $c q_*(s,a)$ is in general not the optimal action-value function of the problem with reward $cr$.
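Reusing the same hypothetical sketch (`q_value_iteration`, `P`, `R` and `gamma` from above) with a negative constant illustrates this failure numerically:

```python
c = -3.0  # a negative constant
Q_neg = q_value_iteration(P, c * R, gamma)
# For c < 0, c * Q_star is in general NOT the optimal action-value
# function of the scaled problem, so this check is expected to fail.
print(np.allclose(Q_neg, c * Q_star))  # expected: False for this toy MDP
```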

nbro

Posted 2019-09-15T20:58:34.960

Reputation: 19 783