This would lead me to believe that the only appropriate activation functions would be either linear or tanh. However, I see the use of ReLU in many RL papers.

Generally you want a linear output, unless you can guarantee that the total possible return is scaled to within a limited range such as $[-1,1]$ for $\tanh$. As a reminder, this is not for estimating individual rewards but the total expected return when following the policy you want to predict (typically the optimal policy eventually, but you will want the function to be able to estimate returns for other policies visited during optimisation).

Check carefully in the papers you mention whether the activation function is applied in the output layer. If all returns are positive, there should be no problem using ReLU for regression, and it may in fact help stabilise the network in one direction if the output is capped at a realistic minimum. However, you should not find in the literature a network with ReLU on the output layer that needs to predict a negative return.
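To see why a ReLU output layer cannot work here, a minimal sketch (the pre-activation values are made up for illustration):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Hypothetical final-layer pre-activations for three states;
# suppose the middle state's true return is -5.0.
pre_activations = np.array([2.3, -5.0, 0.7])

outputs = relu(pre_activations)
print(outputs)  # the -5.0 target is unreachable: every output is clipped to >= 0
```

No matter how the weights are trained, the network's prediction for the middle state can never go below zero, so the regression target of -5.0 is simply outside the function's range.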

If you do want to have both negative and positive outputs, are you limited to just tanh and linear?

There are likely others, but linear will be by far the most common.

Is it a better strategy (if possible) to scale rewards up so that they are all in the positive domain (i.e. instead of [-1,0,1], [0, 1, 2]) in order for the model to leverage alternative activation functions?

It may sometimes be worth considering scaling rewards by a factor, or normalising them, to limit gradients, so that learning is stable. This was used in the Atari-games-playing DQN network to help the same algorithm tackle multiple games with different ranges of scoring.
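The DQN approach was to clip each reward into a fixed range before learning. A minimal sketch of that idea (the bounds $[-1, 1]$ match the Atari DQN setup; the function name is mine):

```python
import numpy as np

def clip_reward(r, low=-1.0, high=1.0):
    # DQN-style reward clipping: keeps TD-error (and hence gradient)
    # magnitudes comparable across games with very different score scales.
    return float(np.clip(r, low, high))

print(clip_reward(100.0))  # a +100 game score becomes +1.0
print(clip_reward(-7.5))   # a -7.5 penalty becomes -1.0
```

Note that clipping (unlike pure scaling) changes the relative value of large versus small rewards, so it alters the objective slightly; it is a pragmatic stability trade-off rather than a neutral transform.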

In continuous problems, the absolute value of reward is usually flexible; you are generally interested in getting the best mean reward per time step. So in that case you could scale so that the minimum reward is 0, and use ReLU or another range-limited transform in the output - as above, that *might* help with numeric stability.
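Concretely, shifting every per-step reward by the same constant preserves the ranking of policies by mean reward per step, which is why the transform is safe in the continuing case. A tiny sketch using the reward set from the question:

```python
# Continuing (non-episodic) problem: adding a constant to every per-step
# reward shifts all policies' mean reward per step by the same amount,
# so the optimal policy is unchanged.
rewards = [-1.0, 0.0, 1.0]               # original per-step rewards
r_min = min(rewards)
shifted = [r - r_min for r in rewards]   # all rewards now >= 0

print(shifted)  # [0.0, 1.0, 2.0]
```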

In episodic problems without a fixed length, you typically don't have such a free choice, because negative rewards encourage the agent to end the episode quickly. This is something you might want, for instance, if the goal is to complete a task as quickly or as energy-efficiently as possible. A good example of this is "Mountain Car" - granting only positive rewards in that scenario would be counter-productive, although you might still get acceptable results with a positive reward only at the end and discounting.

The general case is that rewards can be arbitrarily scaled and centred for continuous problems without changing the agent's goal meaningfully, but only arbitrarily scaled for episodic problems.

Great answer across the board - thank you @Neil. Regarding the first point, findings in the literature: there is the study by OpenAI on HER: https://arxiv.org/pdf/1707.01495.pdf and the human-level control deep RL paper: https://www.nature.com/articles/nature14236. They use ReLU in the hidden layers, but not the output. My mistake was thinking that once you have one ReLU layer, all further output would have to be between 0 and infinity. But you certainly can have negative weights in subsequent layers - so as long as the ReLU isn't the output layer, you can still have negative output values... thanks!

– ZAR – 2017-12-27T00:16:26.200
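The point in that comment can be checked numerically: a ReLU hidden layer followed by a linear output with negative weights happily produces negative values. A toy sketch with hand-picked (untrained, illustrative) weights:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Toy network: 1 input, 2 ReLU hidden units, linear output.
W1 = np.array([[1.0], [-1.0]])  # hidden-layer weights
b1 = np.zeros(2)
W2 = np.array([-2.0, 0.5])      # note the negative output weight
b2 = 0.0

def predict(x):
    h = relu(W1 @ np.array([x]) + b1)  # hidden activations are all >= 0
    return float(W2 @ h + b2)          # linear output can still be negative

print(predict(3.0))  # -6.0: a negative prediction despite the ReLU hidden layer
```

The non-negativity of ReLU only constrains the hidden activations; the subsequent linear layer can map them to any real value.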