## Upper limit to the maximum cumulative reward in a deep reinforcement learning problem

4

Is there an upper limit to the maximum cumulative reward in a deep reinforcement learning problem? For example, you want to train a DQN agent in an environment and you want to know the highest possible value of the cumulative reward, so you can compare this with your agent's performance.

2

In any reinforcement learning problem, not just deep RL, there is an upper bound for the cumulative reward, provided that the problem is episodic rather than continuing.

If the problem is episodic and the rewards are designed such that the problem has a natural ending, i.e. the episode will end regardless of how well the agent does in the environment, then you could work out the bound by summing the maximum possible reward at each step of the episode; however, this is potentially non-trivial depending on your environment.

For an example in a trivial setting, imagine the cart-pole problem: I could define the MDP to give a reward of +1 for every time step that the agent keeps the pole balanced upright, and 0 when the pole falls. If I also define that the problem terminates after 200 time steps, then the upper bound on cumulative reward for this problem is 200.
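As a quick sanity check of the reasoning above (the numbers `200` and `+1` are the ones defined in this example, not a property of any particular library), the bound is just the per-step maximum summed over the episode cap:

```python
# Sketch: upper bound on cumulative reward for the cart-pole example above.
# The episode is capped at 200 steps and the best per-step reward is +1,
# so the best possible return is simply their sum.
max_steps = 200
max_reward_per_step = 1.0

upper_bound = sum(max_reward_per_step for _ in range(max_steps))
print(upper_bound)  # 200.0
```

For richer environments the per-step maximum may depend on the state, in which case this sum becomes a maximization over trajectories rather than a simple product.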

In general, if the problem is continuing then in theory it goes on infinitely and so there is no upper bound, as the episode never ends -- this is partly why we use the discount factor, to ensure that $$\sum_{k=0}^{\infty} \gamma^k R_{t+k}$$ converges.
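To see the effect of discounting numerically (the values of `gamma` and `r_max` below are arbitrary choices for illustration), note that even if the agent earns the maximum reward forever, the discounted partial sums approach the finite limit $r_{\max}/(1-\gamma)$:

```python
# Sketch: with rewards capped at r_max, the discounted return is bounded
# by the geometric-series limit r_max / (1 - gamma).
gamma = 0.9
r_max = 1.0

# Partial sum over many steps -- converges rather than growing without bound.
partial = sum(gamma ** k * r_max for k in range(10_000))
closed_form = r_max / (1 - gamma)
print(partial, closed_form)  # both approximately 10.0
```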

1"In any deep reinforcement learning problem, not just Deep RL" -- you mean in any reinforcement learning problem, not just deep RL, right? Also, just because the problem is episodic doesn't guarantee that the sum is finite. You could set up your rewards such that the agent never wants to end the episode, even though there is a terminal state that makes the environment episodic. You would need to add a discount factor to guarantee boundedness. – Brale – 2020-07-19T05:44:10.757

2

My answer to: Is there an upper limit to the maximum cumulative reward in a deep reinforcement learning problem?

Yes, but it depends on the environment -- in particular, on whether you are dealing with a theoretical environment that runs for an infinite number of time steps.

Calculating the upper bound

In reinforcement learning (deep RL included), we want to maximize the discounted cumulative reward, i.e. find the upper bound of $$\sum_{k=0}^\infty \gamma^k R_{t+k+1}, \quad \text{where } \gamma \in [0, 1)$$

Before we find the upper bound of the series above, we need to find out whether the upper bound exists at all, i.e. whether the series converges given the environment's specifications, such as its reward function.

I will provide one example environment where the series converges. It is an environment with simple rules that goes on for infinite time steps. Its reward function is defined as follows:

-> A reward of +2 for every favorable action.

-> A reward of 0 for every unfavorable action.


So, the path through the MDP that gives us the upper bound is the one where we only ever receive rewards of +2.

Let's say $$\gamma$$ is a constant, for example $$\gamma = 0.5$$; note that $$\gamma \in [0, 1)$$.

Now, we have a geometric series which converges:

$$\sum_{k=0}^\infty \gamma^k R_{t+k+1} = \sum_{k=0}^\infty 2\gamma^{k} = \frac{2}{1 - \gamma} = \frac{2}{1 - 0.5} = 4$$

Thus the upper bound is 4.
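A quick numerical check of this geometric series (using the same $\gamma = 0.5$ and per-step reward of 2 defined above):

```python
# Sketch: partial sums of the series 2 * gamma^k with gamma = 0.5
# approach the closed-form limit 2 / (1 - gamma) = 4.
gamma = 0.5
reward = 2.0

partial = sum(reward * gamma ** k for k in range(100))
print(partial)  # very close to 4.0
```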

For environments that run for a finite number of time steps, the upper bound does exist. For certain environments, however, as with the infinite-horizon ones, it may be difficult (though not necessarily impossible) to calculate. The environments I mean are those with complicated reward functions and dynamics, e.g. stochastic environments, or those where the set of achievable reward values depends on the state. Strictly speaking the reward function always depends on the state, but we can loosely call it state-independent when every possible reward value is attainable in any state, given the appropriate action.

2

Let's assume the rewards are bounded above, i.e. $$b = \sup_{s,a} r(s,a) < \infty.$$ Then for continuing problems the upper bound can be obtained by \begin{align} \sum_{t=0}^{\infty} \gamma^{t}r(s_t,a_t) &\le \sum_{t=0}^{\infty} \gamma^{t} \sup_{s,a}r(s,a) \nonumber \\ &=\sum_{t=0}^{\infty} \gamma^{t} b = \frac{b}{1-\gamma}. \end{align}

We can use the same bound for episodic tasks with discounted return. For episodic tasks without discounting ($$\gamma=1$$), the above sum diverges. However, if we know the episode length $$T$$, we can use $$Tb$$ as an upper bound.
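Both bounds from this answer are one-liners to compute. The values of `b`, `gamma`, and `T` below are placeholder assumptions, since in practice they come from your environment's specification:

```python
# Sketch: the two upper bounds derived above, for assumed problem constants.
b = 1.0       # assumed sup over (s, a) of r(s, a)
gamma = 0.99  # discount factor for the continuing case
T = 500       # known episode length for the undiscounted episodic case

continuing_bound = b / (1 - gamma)  # discounted, infinite horizon
episodic_bound = T * b              # undiscounted, fixed-length episode
print(continuing_bound, episodic_bound)
```

As the comments below note, these bounds hold for any problem with bounded rewards, but they may be loose; tighter bounds require more information about the specific environment.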

1The last part, where you say $Tb$ can be used as an upper bound, may not be very useful in edge cases: e.g. if the reward is 2 for every favorable action during the episode and 100 is given for the action that takes your agent to the goal state. Technically $Tb$ is still an upper bound, but it won't be accurate enough for practical use. I think the solution for continuing problems is good. – rert588 – 2020-07-18T22:02:06.270

1I agree. There might exist tighter upper bounds, but they require further information about the problem. $Tb$ can be used for any problem. – M.S. – 2020-07-18T22:30:56.223