First, some preliminary questions: in this case, what is the optimal policy?

It is the policy that maximises the expected *return* $G_t$ from any given time step $t$. You need to be careful with your definition of *return* in continuing environments: the simple expected sum of all future rewards is likely to be positive or negative infinity.

There are three basic approaches:

* Set an arbitrary finite horizon $h$, so $G_t = \sum_{k=1}^{h} R_{t+k}$

* Use discounting, with discount factor $\gamma < 1$, so $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$

* Use the average reward per time step, $\bar{R} = \lim\limits_{h \to \infty}\frac{1}{h} \sum_{k=1}^{h} R_{t+k}$, which leads to thinking in terms of the *differential return* $G_t = \sum_{k=1}^{\infty} (R_{t+k} - \bar{R})$
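As a rough sketch, here is how the three notions of return compare on a made-up, truncated reward sample (the reward values and horizon are arbitrary choices for illustration; in a real continuing task the sums are infinite):

```python
# Sketch: the three notions of return on a hypothetical finite reward sample.
rewards = [1.0, 0.0, 2.0, 1.0, 0.0, 2.0, 1.0, 0.0, 2.0]  # R_{t+1}, R_{t+2}, ...

# 1. Finite horizon h: G_t is the sum of the first h rewards
h = 6
finite_horizon_return = sum(rewards[:h])

# 2. Discounted: G_t = sum of gamma^k * R_{t+k+1}
gamma = 0.9
discounted_return = sum(gamma**k * r for k, r in enumerate(rewards))

# 3. Average reward and differential return
r_bar = sum(rewards) / len(rewards)              # estimate of average reward per step
differential_return = sum(r - r_bar for r in rewards)

print(finite_horizon_return, round(discounted_return, 3), r_bar)
```

Note that the differential return measures how much better than average the rewards from this point are, so for a sample that exactly matches the average it is zero.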

With a large enough horizon (long enough for the long-term, ergodic behaviour of the system to dominate) or a large enough $\gamma$ (close to $1$), these approaches are similar and should result in approximately the same policy. The difference is in how you construct an agent, in detail, to solve the problem of maximising that value.

Given the infinite horizon, there is no terminal state but only an objective to maximise the rewards, so I can't run more than one episode. Is that correct?

The term *episode* becomes meaningless here. It may be simpler to think of this differently though - you are trying to solve a non-episodic problem, in that there is no natural separation of the process into separate, meaningful episodes. No physical process is actually infinite; that is just a theoretical nicety.

In practice, if you can run your environment in simulation, or multiple versions of it for training purposes, then you do start and stop pseudo-episodes. You don't treat them as episodes mathematically - i.e. there is no terminal state, and you can never obtain a simple episodic return value. However, you can decide to stop an environment and start a new one from a different state.
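A minimal sketch of this idea, on a made-up 4-state ring MDP (all names and parameters here are my own illustrative choices): the training loop cuts each pseudo-episode off after a fixed number of steps, but because the cut-off is not a terminal state, the TD target always bootstraps from the next state's value - there is no `(1 - done)` factor as there would be in episodic code.

```python
import random

# Hypothetical 4-state ring MDP: action 0 stays put, action 1 moves
# clockwise; reward 1.0 whenever the next state is state 0, else 0.0.
# There is no terminal state.
N_STATES, N_ACTIONS = 4, 2
gamma, alpha, epsilon = 0.95, 0.1, 0.1
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

def step(s, a):
    s2 = (s + 1) % N_STATES if a == 1 else s
    return s2, 1.0 if s2 == 0 else 0.0

random.seed(0)
for pseudo_episode in range(200):
    s = random.randrange(N_STATES)      # restart from an arbitrary state
    for t in range(50):                 # arbitrary cut-off, NOT a terminal state
        a = random.randrange(N_ACTIONS) if random.random() < epsilon \
            else max(range(N_ACTIONS), key=lambda x: Q[s][x])
        s2, r = step(s, a)
        # The target always bootstraps from Q[s2]; we never zero it out
        # at the pseudo-episode boundary the way episodic code would.
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

print([round(v, 2) for v in Q[0]])  # learned action values at state 0
```

The restarts only control which states get visited during training; mathematically the agent is still solving a single continuing task.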

Even if there is only one real environment, you can sample sections of it for training, or attempt to use different agents over time, each of which is necessarily finite in nature.

Consequently, what is the difference between on-policy and off-policy learning given this framework?

The notions of on-policy and off-policy are entirely separate from episodic vs continuing environments.

On-policy agents use a single policy both to select actions (the behaviour policy) and as the learning target. When the learned policy is updated with new information, that update immediately affects the behaviour of the agent.

Off-policy agents use two or more policies: one or more *behaviour* policies that select actions, and a *target* policy that is learned, typically the best guess at the optimal policy given the data so far.
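The classic illustration of this contrast is SARSA (on-policy) versus Q-learning (off-policy). Both can act with the same $\epsilon$-greedy behaviour policy; the only difference is what they bootstrap from in the TD target. A minimal sketch (the function names and the dictionary-of-lists layout for `Q` are my own choices):

```python
import random

gamma = 0.95

def epsilon_greedy(Q, s, n_actions, epsilon=0.1):
    # Shared behaviour policy: explore with probability epsilon, else act greedily
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[s][a])

def sarsa_target(Q, r, s2, a2):
    # On-policy: bootstrap from the action a2 the behaviour policy actually took
    return r + gamma * Q[s2][a2]

def q_learning_target(Q, r, s2):
    # Off-policy: bootstrap from the greedy (target-policy) action,
    # regardless of what the behaviour policy does next
    return r + gamma * max(Q[s2])
```

Whenever the behaviour policy explores, the two targets can differ, which is exactly the on-policy/off-policy distinction in miniature.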

These things do not change between episodic and continuing tasks, and many algorithms remain identical when solving episodic vs continuing problems. For example, DQN requires no special changes to support continuing tasks: you can set a high enough discount factor and use it as-is.

You cannot wait until the end of an episode, so certain update methods - such as Monte Carlo updates over complete episodes - won't work. However, value bootstrapping as used in temporal difference (TD) learning still works.

In some cases, you will want to address the differences in the definition of return. Using the average reward setting typically means using the differential return to calculate TD targets, for example.
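For concreteness, a tabular differential TD(0) update for state values could look roughly like this (a sketch in the spirit of the differential semi-gradient methods in Sutton & Barto; the variable names and step sizes are my own):

```python
# Sketch: tabular differential TD(0) for the average-reward setting.
# The TD error subtracts the running average-reward estimate r_bar
# instead of applying a discount factor.
alpha, beta = 0.1, 0.01   # step sizes for the values and for r_bar

def differential_td_update(V, r_bar, s, r, s2):
    delta = r - r_bar + V[s2] - V[s]   # differential TD error
    V[s] += alpha * delta              # move V[s] toward the differential target
    r_bar += beta * delta              # refine the average-reward estimate
    return r_bar
```

Note that the average reward $\bar{R}$ is itself estimated online here, nudged by the same TD error that drives the value updates.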

Although all your questions are related, please, next time, ask one question per post! You're asking at least 3 distinct questions here. – nbro – 2020-05-19T19:21:57.787