Having a reward structure which gives high positive rewards compared to the negative rewards


I am training an RL agent with the PPO algorithm for a control problem. The objective of the agent is to maintain the temperature in a room. It is an episodic task with an episode length of 9 hrs and a step size (an action being taken) of 15 mins. During training, the agent takes an action from a given state. I then check the temperature of the room after 15 mins (the step size): if this temperature is within limits, I give the action a very high positive reward, and if the temperature is not in the limits, I give a negative reward. The episode ends after 36 actions (9 hrs * 4 actions/hour, with a 15-min step size).

My formulation of the reward structure:

zone_temperature = output[4]  # temperature of the zone 15 mins after the action is taken

thermal_coefficient = -10

if zone_temperature < self.temp_limit_min:
    temp_penalty = self.temp_limit_min - zone_temperature  # positive deviation below the band
elif zone_temperature > self.temp_limit_max:
    temp_penalty = zone_temperature - self.temp_limit_max  # positive deviation above the band
else:
    temp_penalty = -100  # in-band: sign flips, so the reward becomes positive

reward = thermal_coefficient * temp_penalty  # out of band: 0 to -50; in band: +1000

The value of zone_temperature deviates from the limits by 0 to 5 degrees. So the reward when the actions are bad (temperature not in limits) varies from 0 to -50, but when the actions are good (temperature is in limits) the reward is +1000. I chose such a formulation so that the agent can easily distinguish a good action from a bad one. Is my understanding correct, and is it recommended to have such a reward structure for my use case?



Posted 2019-11-27T04:39:14.807

Reputation: 455



Is my understanding correct and is it recommended to have such a reward structure for my use case ?

Your understanding is not correct, and setting extremely high rewards for the goal state in this case can backfire.

Probably the most important way it could backfire in your case is that your scaling of bad results becomes irrelevant. The difference between 0 and -50 is not significant compared to the +1000 result. In turn, that means the agent will not really care by how much it fails when it does, except as a matter of fine tuning once it is already close to an optimal solution.

If the environment is stochastic, then the agent will prioritise a small chance of being at the target temperatures, over a large chance of ending up at an extreme bad temperature.

If you are using a discount factor, $\gamma$, then the agent will prioritise being at the target temperatures immediately, maybe overshooting and ending up with an unwanted temperature within a few timesteps.

Working in your favour, your environment is one where the goal is some managed stability, like the "cartpole" environment, with a negative feedback loop (the correction to the measured quantities is always to force in the opposite direction). Agents for these are often quite robust to changes in hyperparameters, so you may still find your agent learns successfully.

However, I would advise sticking with a simple and relatively small scale for the reward function. Experimenting with it, after you are certain that it expresses your goals for the agent, is unlikely to lead to better solutions. Instead you should focus your efforts on how the agent is performing, and what changes you can make to the learning algorithm.

What I would do (without knowing more about your environment):

  • Reward +1 per time step when temperature is in acceptable range

  • Reward -0.1 * temperature difference per time step when temperature is outside acceptable range. It doesn't really matter whether you measure that in Fahrenheit or Celsius.

  • No discounting (set discount factor $\gamma =1$ if you are using a formula that includes discounting)
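As a minimal sketch, the suggested scheme could look like this (the function name and temperature limits here are illustrative, not taken from your code):

```python
def step_reward(zone_temperature, temp_limit_min=20.0, temp_limit_max=24.0):
    """Per-step reward: +1 inside the acceptable band,
    -0.1 per degree of deviation outside it."""
    if zone_temperature < temp_limit_min:
        return -0.1 * (temp_limit_min - zone_temperature)
    elif zone_temperature > temp_limit_max:
        return -0.1 * (zone_temperature - temp_limit_max)
    else:
        return 1.0

# An episode is 36 steps (9 hrs at 15-min steps); with no discounting
# (gamma = 1) the best possible undiscounted return is +36.
best_return = sum(step_reward(22.0) for _ in range(36))
print(best_return)  # 36.0
```

Note how the per-step values stay within roughly one order of magnitude of each other, which keeps the scale of the out-of-band penalty meaningful.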

The maximum total reward possible is then +36, and you probably don't expect a worse episode than around -100 or so. This will plot neatly on a graph and be easy to interpret (every unit below 36 is roughly equivalent to performance of an agent spending 15 mins per day just outside acceptable temperatures). More importantly, these lower numbers should not cause massive error values whilst the agent is learning, which will help when training a neural network to predict future reward.

As an aside (as you didn't ask), if you are using a value-based method, like DQN, then you will need to include the current timestep (or timesteps remaining) in the state features. That is because the total remaining reward - as represented by action value Q - depends on the remaining time that the agent has to act. It also doesn't matter to the agent what happens after the last time step, so it is OK for it to choose actions just before then that would make the system go outside acceptable temperatures at that point.
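One way to sketch that idea (the helper name and observation layout are illustrative assumptions) is to append the normalised time remaining to the observation before it is fed to the value network:

```python
import numpy as np

EPISODE_STEPS = 36  # 9 hrs * 4 steps/hour

def augment_state(observation, steps_taken):
    """Append normalised time-remaining so that predicted action values Q
    can depend on how many steps are left in the episode."""
    time_remaining = (EPISODE_STEPS - steps_taken) / EPISODE_STEPS
    return np.append(np.asarray(observation, dtype=np.float32),
                     np.float32(time_remaining))

# e.g. observation = [zone temperature, outside temperature], 30 steps taken
s = augment_state([21.5, 0.3], steps_taken=30)
print(s)
```

The extra feature costs almost nothing and removes a source of aliasing where the same physical state would otherwise need two different values depending on time left.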

Neil Slater

Posted 2019-11-27T04:39:14.807

Reputation: 24 613

I have trained agents with your suggestions. I have trained two agents with discount factor 0.9 in one and discount factor 0 in the other. Strangely, the agent with discount factor 0.9 performed better than the agent with discount factor 0. I am using PPO algorithm for training the agents. – cvg – 2019-11-28T07:32:13.493

@cvg No discounting means discount factor 1, not 0 – Neil Slater – 2019-11-28T08:08:01.300

The discount factor accounts for the importance given to future states, so by "no discounting" what I understood was that we focus only on the current action and its results, not on the future. So, can you help me understand the logic behind having a discount factor of 1? Thanks! – cvg – 2019-11-28T09:33:54.967

@cvg If there is no discounting, then you do not want discounting have an effect. As it is a multiplier, the "no effect" value is 1. If it were an additive parameter, the "no effect" value would be 0. In the literature, if you read "no discounting" it always means use $\gamma = 1$ – Neil Slater – 2019-11-28T13:11:59.043

Okay, thanks for correcting me, but why is it advised to have no discounting in this case? – cvg – 2019-11-29T10:09:27.003

@cvg: Discounting can be applied in episodic environments if you care more about immediate rewards at any time step. It is part of the problem description, and you don't need to treat it as a learning hyperparameter (it becomes more like a solution hyperparameter in continuous environments or in long-running episodic problems if you use neural networks). I am assuming you are treating it as a learning hyper-parameter, because you have seen it used in long-running or continuous environments with DQN? Basically you don't need it here, as far as I can tell. – Neil Slater – 2019-11-29T10:28:59.527