My answer to:`Is there an upper limit to the maximum cumulative reward in a deep reinforcement learning problem?`

Yes, but it depends on the environment. Consider the theoretical case of an environment that runs for an infinite number of time steps.

**Calculating the upper bound**

In reinforcement learning (deep RL included), we want to maximize the discounted cumulative reward, i.e. find the upper bound of $\sum_{k=0}^\infty \gamma^k R_{t+k+1}$, where $\gamma \in [0, 1)$.

Before we can find the upper bound of the series above, we need to establish that it exists, i.e. that the series converges, which depends on the environment's specifications such as its reward function.

I will provide one example environment where the series converges. It has simple rules and runs for an infinite number of time steps. Its reward function is defined as follows:

```
-> A reward of +2 for every favorable action.
-> A reward of 0 for every unfavorable action.
```

So, the path through the MDP that yields the upper bound is the one where we receive a reward of 2 at every step.

Let's say $\gamma$ is a constant, for example $\gamma = 0.5$; note that $\gamma \in [0, 1)$.

Now, we have a geometric series which converges:

$\sum_{k=0}^\infty \gamma^k R_{t+k+1} = \sum_{k=0}^\infty 2\gamma^k = \frac{2}{1 - \gamma} = \frac{2}{1 - 0.5} = 4$

**Thus the upper bound is 4.**
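As a quick sanity check, we can compare a truncated partial sum of the series against the closed form $\frac{2}{1-\gamma}$; the function name below is mine, not from any library:

```python
def discounted_return_bound(r_max, gamma, n_terms=1000):
    """Partial sum of the discounted return when every step pays r_max,
    i.e. the first n_terms terms of sum_{k=0}^inf r_max * gamma**k."""
    return sum(r_max * gamma**k for k in range(n_terms))

gamma = 0.5
approx = discounted_return_bound(2, gamma)   # truncated series
exact = 2 / (1 - gamma)                      # closed-form geometric sum
print(approx, exact)  # both 4.0 (the partial sum converges quickly)
```

With $\gamma = 0.5$ the terms halve at every step, so even a modest truncation already agrees with the closed form to machine precision.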

*For environments that run for a finite number of time steps, the upper bound also exists. For certain environments, finite or infinite, it may be difficult (though not necessarily impossible) to calculate: namely, environments with complicated dynamics or reward functions, e.g. stochastic environments, or ones where the set of possible reward values depends on the state. Strictly speaking, rewards always depend on the state, but we can loosely call a reward function state-independent when every possible reward value can be obtained in any state, given the appropriate action.*
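For the finite-horizon case, if the maximum per-step reward is known, the bound is just the finite geometric sum. A minimal sketch, assuming a horizon of `T` steps and a known maximum reward `r_max` (both names are mine for illustration):

```python
def finite_horizon_bound(r_max, gamma, T):
    """Upper bound on the discounted return over T steps when every
    step pays r_max: r_max * (1 - gamma**T) / (1 - gamma)."""
    return r_max * (1 - gamma**T) / (1 - gamma)

# With r_max = 2, gamma = 0.5 and a 10-step horizon, the bound is
# slightly below the infinite-horizon value of 4.
print(finite_horizon_bound(2, 0.5, 10))  # 3.99609375
```

As `T` grows, this approaches the infinite-horizon bound $\frac{r_{\max}}{1-\gamma}$.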

"In any deep reinforcement learning problem, not just Deep RL", you mean in any reinforcement learning problem without deep right ? Also, just because problem is episodic doesn't guarantee that sum is finite. You could set up your rewards such that agent never wants to end episode even though there is a terminal state which would make environment episodic. You would need to add discount factor to guarantee boundness. – Brale – 2020-07-19T05:44:10.757