The motivation for adding the discount factor $\gamma$ is generally, at least initially, one of "theoretical convenience". Ideally, we'd like to define the objective of an RL agent as maximizing the sum of all the rewards it gathers; its *return*, defined as:

$$\sum_{t = 0}^{\infty} R_t,$$

where $R_t$ denotes the immediate reward at time $t$. As you also already noted in your question, this is inconvenient from a theoretical point of view, because we can have many different such sums that all end up being equal to $\infty$, and then the objective of "maximizing" that quantity becomes quite meaningless. So, by far the most common solution is to introduce a discount factor $0 \leq \gamma < 1$, and formulate our objective as maximizing the *discounted return*:

$$\sum_{t = 0}^{\infty} \gamma^t R_t.$$

Now we have an objective that will never be equal to $\infty$ (assuming the individual rewards are bounded), so maximizing that objective always has a well-defined meaning.
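A quick numerical sketch of this (the function name `discounted_return` and the constant-reward stream are just illustrative, not from the answer): with $\gamma = 1$ the sum of a long stream of constant rewards grows without bound, while with $\gamma < 1$ it stays below the geometric-series limit $R_{\max} / (1 - \gamma)$.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * R_t over a (finite prefix of a) reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

rewards = [1.0] * 10_000  # a long stream of constant reward 1

# Undiscounted: keeps growing as the stream gets longer.
print(discounted_return(rewards, gamma=1.0))  # 10000.0

# Discounted: bounded by 1 / (1 - 0.9) = 10, no matter how long the stream is.
print(discounted_return(rewards, gamma=0.9))
```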

As far as I am aware, the motivation described above is the **only motivation for a discount factor being strictly necessary / needed**. This is **not related to the problem being stochastic or deterministic**.

If we have a stochastic environment, which is guaranteed to have a finite duration of at most $T$, we can define our objective as maximizing the following quantity:

$$\sum_{t = 0}^{T} R_t,$$

where $R_t$ is a random variable drawn from some distribution. **Even in the case of stochastic environments, this is well-defined, we do not strictly need a discount factor**.
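To make the finite-horizon case concrete, here is a small Monte Carlo sketch (the `sample_return` helper and the uniform reward distribution are assumptions for illustration): with a horizon of at most $T$ steps, the undiscounted return is a finite random variable, so its expectation is a perfectly meaningful objective.

```python
import random

def sample_return(T, seed):
    """One sampled episode return: sum_{t=0}^{T} R_t with R_t ~ Uniform(0, 1)."""
    rng = random.Random(seed)
    return sum(rng.uniform(0.0, 1.0) for _ in range(T + 1))

# Monte Carlo estimate of the expected undiscounted return over the horizon.
T = 10
n_episodes = 2000
estimate = sum(sample_return(T, seed) for seed in range(n_episodes)) / n_episodes
print(estimate)  # close to (T + 1) * 0.5 = 5.5
```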

Above, I addressed the question of whether or not a discount factor is **necessary**. This does not tell the full story though. Even in cases where a discount factor is not strictly **necessary**, it still might be **useful**.

Intuitively, discount factors $\gamma < 1$ tell us that **rewards that are nearby in a temporal sense** (reachable in a low number of time steps) **are more important than rewards that are far away**. In problems with a finite time horizon $T$, this is probably not true, **but it can still be a useful heuristic / rule of thumb**.

Such a rule of thumb is particularly useful in stochastic environments, because **stochasticity can introduce greater variance / uncertainty over long amounts of time than over short amounts of time**. So, even if in an ideal world we'd prefer to maximize our expected sum of undiscounted rewards, **it is often easier to learn how to effectively maximize a discounted sum**; we'll learn behaviour that mitigates uncertainty caused by stochasticity because it prioritizes short-term rewards over long-term rewards.

This rule of thumb especially makes a lot of sense in stochastic environments, but I don't agree with the implication in that book that it would be restricted to stochastic environments. A discount factor $\gamma < 1$ has also often been found to be beneficial for learning performance in deterministic environments, **even if afterwards we evaluate an algorithm's performance according to the undiscounted returns**, likely because it leads to a "simpler" learning problem. In a deterministic environment there may not be any uncertainty / variance that grows over time due to the environment itself, but **during a training process there is still uncertainty / variance in our agent's behaviour** which grows over time. For example, it will often be selecting suboptimal actions for the sake of exploration.

Quite elucidating. So glad to see the math formatting getting immediate use. Possibly dumb question, but can I ask why t is superscripted with the gamma? – DukeZhou – 2018-08-15T20:23:20.450

@DukeZhou It's $\gamma$ raised to the power $t$ (time). Suppose, for example, that $\gamma = 0.9$. Then our first reward ($R_0$) will be multiplied by $0.9^0 = 1$ (fully valued). The second reward ($R_1$) is multiplied by $0.9^1 = 0.9$ (only "90% important"). The third reward is multiplied by $0.9^2 = 0.81$ (only "81% important"), etc. Such a sum can be proven to never reach $\infty$ (assuming that none of the individual rewards $R_t$ are equal to $\infty$) – Dennis Soemers – 2018-08-16T08:11:39.813
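The $\gamma^t$ weighting described in that comment, written out as a one-liner (rounded for readability):

```python
gamma = 0.9
# Weight applied to the reward at each time step t = 0, 1, 2, 3.
weights = [round(gamma ** t, 4) for t in range(4)]
print(weights)  # [1.0, 0.9, 0.81, 0.729]
```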