Deciding on a reward per each action in a given state (Q-learning)



I looked at existing posts on Stack Exchange that partly answer questions about reward systems and reward functions, but not specifically what I want to ask here: how do you actually decide what reward value to give for each action in a given state of an environment? Is this purely experimental and down to the programmer of the environment? So it's a heuristic approach of simply trying different reward values and seeing how the learning process shapes up? Of course I understand that the reward values have to make sense and cannot be completely random, e.g. if the agent makes mistakes then deduct points, etc. So am I right in saying it's just about trying different reward values for actions encoded in the environment and seeing how it affects the learning?


Posted 2019-05-12T00:15:10.860

Reputation: 121

Please review the answer here which might give you some idea as to how you can define the reward function. Also, as an example, you can check here for a deterministic reward function example for a tic-tac-toe (numerical) game.

– Anugraha Sinha – 2019-05-14T07:52:11.390



In Reinforcement Learning (RL), a reward function is part of the problem definition and should:

  • Be based primarily on the goals of the agent.

  • Take into account any combination of starting state $s$, action taken $a$, resulting state $s'$ and/or a random amount (a constant amount is just a random amount with a fixed value having probability 1). You should not use other data than those four things, but you also do not have to use any of them. This is important, as using any other data stops your environment from being a Markov Decision Process (MDP).

  • Be as direct and simple as possible, given the first point. In many situations this is all you need. A reward of +1 for winning a game, 0 for a draw and -1 for losing is enough to fully define the goals of most 2-player games.

  • In general, have positive rewards for things you want the agent to achieve or repeat, and negative rewards for things you want the agent to avoid or minimise doing. It is common for instance to have a fixed reward of -1 per time step when the objective is to complete a task as fast as possible.

  • In general, reward 0 for anything which is not directly related to the goals. This allows the agent to learn for itself whether a trajectory that uses particular states/actions or time resources is worthwhile or not.

  • Be scaled for convenience. Scaling all rewards by a common factor does not matter at a theoretical level, as the exact same behaviour will be optimal. In practice you want the sums of reward to be easy to assess by yourself as you analyse results of learning, and you also want whatever technical solution you implement to be able to cope with the range of values. Simple numbers such as +1/-1 achieve that for basic rewards.
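As a rough illustration of the points above, here is a minimal Python sketch of two reward functions that depend only on $s$, $a$ and $s'$. The state labels and function names are hypothetical and chosen only for this example; they do not come from any particular library.

```python
# Hypothetical labels for the terminal states of a two-player game.
WIN, DRAW, LOSS = "win", "draw", "loss"


def game_reward(state, action, next_state) -> float:
    """Sparse reward based only on the goal: +1 win, -1 loss, 0 otherwise."""
    if next_state == WIN:
        return 1.0
    if next_state == LOSS:
        return -1.0
    return 0.0  # a draw, or any non-terminal state


def time_penalty_reward(state, action, next_state) -> float:
    """Fixed -1 per time step, for tasks that should finish as fast as possible."""
    return -1.0


def episode_return(rewards) -> float:
    """Undiscounted sum of the rewards collected in one episode."""
    return sum(rewards)
```

With the time penalty, an episode that takes 3 steps has a return of -3 and one that takes 5 steps has a return of -5, so the faster episode is preferred without any extra hints about how to be fast.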

Ideally, you should avoid using heuristic functions that reward an agent for interim goals or results, as that inserts your opinion about how the problem should be solved into the system, and may not in fact be optimal given the goals. In fact, you can view the purpose of value-based RL as learning a good heuristic function (the value function) from the more sparse reward function. If you already had a good heuristic function then you may not need RL at all.

You may need to compare very different parts of the outcome in a single reward function. This can be hard to balance correctly, as the reward function is a single scalar value and you have to define what it means to balance between different objectives within a single scenario. If you do have very different metrics that you want to combine then you need to think harder about what that means:

  • Where possible, flatten the reward signal into the same units and base your goals around them. For instance, in business and production processes it may be possible to use currency as the units of reward and convert things such as energy used, transport distance, etc. into that currency.

  • For highly negative/unwanted outcomes, instead of assigning a negative reward, consider whether a constraint on the environment is more appropriate.
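To make the first bullet concrete, here is a small sketch of flattening several metrics into one currency-valued reward. The `StepOutcome` fields and all conversion rates are made-up assumptions for illustration; in a real problem they would come from your own domain knowledge, and the outcome itself would be derived from $s$, $a$ and $s'$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepOutcome:
    """Hypothetical summary of what happened during one time step."""
    units_delivered: int
    energy_kwh: float
    distance_km: float


# Assumed conversion rates into a single currency (illustrative values only).
PRICE_PER_UNIT = 5.0
COST_PER_KWH = 0.2
COST_PER_KM = 0.5


def reward_in_currency(outcome: StepOutcome) -> float:
    """Express every metric in currency, then return revenue minus costs."""
    revenue = outcome.units_delivered * PRICE_PER_UNIT
    costs = outcome.energy_kwh * COST_PER_KWH + outcome.distance_km * COST_PER_KM
    return revenue - costs
```

Because every term is in the same unit, the trade-off between energy use, distance and deliveries is set by real prices rather than by hand-tuned weights.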

You may have opinions about valid solutions to the environment that you want the agent to use. In which case you can extend or modify the system of rewards to reflect that - e.g. provide a reward for achieving some interim sub-goal, even if it is not directly a result that you care about. This is called reward shaping, and can help in practical ways in difficult problems, but you have to take extra care not to break things.
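One well-known way to take that extra care is potential-based reward shaping, where the bonus has the form $\gamma \Phi(s') - \Phi(s)$ for some potential function $\Phi$. This is a specific technique that the answer does not name, so treat the following as one possible sketch rather than the recommended method. The grid goal, the Manhattan-distance potential and all names are assumptions made for this example.

```python
GAMMA = 0.9    # assumed discount factor
GOAL = (3, 3)  # hypothetical goal cell in a grid world


def potential(state) -> float:
    """Hypothetical potential: negative Manhattan distance to the goal."""
    return -float(abs(state[0] - GOAL[0]) + abs(state[1] - GOAL[1]))


def base_reward(state, action, next_state) -> float:
    """The 'real' sparse reward: +1 only on reaching the goal."""
    return 1.0 if next_state == GOAL else 0.0


def shaped_reward(state, action, next_state) -> float:
    """Base reward plus the potential-based shaping term."""
    shaping = GAMMA * potential(next_state) - potential(state)
    return base_reward(state, action, next_state) + shaping
```

With this form, a step towards the goal earns more than a step away from it, and the potential at the goal is zero, which is the usual condition for the shaping term not to change which policy is optimal. Ad hoc bonuses for sub-goals do not come with that guarantee.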

There are also more sophisticated approaches that use multiple value schemes or no externally applied ones, such as hierarchical reinforcement learning or intrinsic rewards. These may be necessary to address more complex "real life" environments, but are still the subject of active research. So bear in mind that all the above advice describes the current mainstream of RL, and there are more options the deeper you research the topic.

Is this purely experimental and down to the programmer of the environment? So it's a heuristic approach of simply trying different reward values and seeing how the learning process shapes up?

Generally no. You should base the reward function on analysis of the problem and your learning goals. And this should be done at the start, before experimenting with hyperparameters which define the learning process.

If you are trying different values, especially different relative values between different aspects of a problem, then you may be changing what it means for the agent to behave optimally. That might be what you want to do, because you are looking at how you want to frame the original problem to achieve a specific behaviour.
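A toy calculation can show the difference between rescaling all rewards and changing their relative values. The scenario below is invented for illustration: a short route reaches the goal in 1 step but crosses a hazard, and a long route takes 3 steps and avoids it. Returns are undiscounted.

```python
def best_route(goal: float, hazard: float, step: float) -> str:
    """Return the route with the higher undiscounted return."""
    returns = {
        "short": goal + hazard + 1 * step,  # 1 step, crosses the hazard
        "long": goal + 3 * step,            # 3 steps, avoids the hazard
    }
    return max(returns, key=returns.get)


# Baseline rewards: short = 6, long = 7, so the long route is optimal.
baseline = best_route(goal=10, hazard=-3, step=-1)

# All rewards multiplied by 10: short = 60, long = 70, same optimal route.
scaled = best_route(goal=100, hazard=-30, step=-10)

# Only the step penalty changed: short = 5, long = 4, the optimum flips.
reweighted = best_route(goal=10, hazard=-3, step=-2)
```

Here `baseline` and `scaled` both select the long route, while `reweighted` selects the short one. Changing one reward relative to the others has redefined what "optimal" means, which is a change to the problem and not to the learning process.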

However, outside of inverse reinforcement learning, it is more usual to want an optimal solution to a well-defined problem, as opposed to a solution that matches some other observation that you are willing to change the problem definition to meet.

So am I right in saying it's just about trying different reward values for actions encoded in the environment and seeing how it affects the learning?

This is usually not the case.

Instead, think about how you want to define the goals of the agent. Write reward functions that encapsulate those goals. Then focus on changes to the agent that allow it to better learn how to achieve those goals.
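As a sketch of that workflow, the tabular Q-learning example below keeps the reward fixed inside the environment and only varies agent settings such as the learning rate and exploration rate. The corridor environment, its size and the parameter values are all assumptions chosen to keep the example small.

```python
import random

N_STATES = 5        # hypothetical corridor: states 0..4, state 4 is the goal
ACTIONS = (-1, +1)  # action index 0 = move left, 1 = move right


def step(state, action_index):
    """Environment step. The reward is part of the problem: -1 per time step."""
    next_state = min(max(state + ACTIONS[action_index], 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, -1.0, done


def greedy_action(q, state):
    """Pick the higher-valued action, breaking ties towards 'right'."""
    return 1 if q[state][1] >= q[state][0] else 0


def train(alpha, epsilon, gamma=0.9, episodes=500, seed=0):
    """Tabular Q-learning; only agent parameters are arguments, not rewards."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.randrange(2)          # explore
            else:
                action = greedy_action(q, state)   # exploit
            next_state, reward, done = step(state, action)
            bootstrap = 0.0 if done else gamma * max(q[next_state])
            q[state][action] += alpha * (reward + bootstrap - q[state][action])
            state = next_state
    return q


def greedy_policy(q):
    """Greedy action index for every non-terminal state."""
    return [greedy_action(q, s) for s in range(N_STATES - 1)]


policy_a = greedy_policy(train(alpha=0.5, epsilon=0.1))
policy_b = greedy_policy(train(alpha=0.2, epsilon=0.3))
```

Both agents are trained against the same reward definition, so any difference in how quickly or reliably they learn to move right can be attributed to the agent settings and not to a change in the goal.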

Now, you can do it the other way round, as you suggest. But what you are doing in that case is changing the problem definition, and seeing how well a certain kind of agent can cope with solving each kind of problem.

Neil Slater

Posted 2019-05-12T00:15:10.860

Reputation: 14 632

many thanks for the v detailed and comprehensive explanation. As with the other post/question that you replied to, apologies for my v late reply. This is due to me realising that I had gaps in my knowledge on RL, upon reviewing your answer. As such I went away to read more on the subject before coming back to review your answer again and make sure I understood each point. Most of it made sense. There are some explanations which I felt are a bit abstract, and maybe an example would've sufficed, for example the point regarding "hierarchical reinforcement learning or intrinsic rewards" ... – Hazzaldo – 2019-06-04T11:46:54.343

... but in general great detailed explanation. By the way do you know of any good resources to learn more about this subject: i.e. a guide on defining a reward system in a RL problem. Something that I can always refer to whenever coming across a new RL problem and utilise as a guide to follow a good process for analysing the problem, defining the agent's goals and reward system? Many thanks again for your help. Much appreciated. – Hazzaldo – 2019-06-04T11:52:05.720

@Hazzaldo: Thanks for the feedback. I didn't want to extend details on hierarchical reinforcement learning, other than giving you the name. It's an advanced topic. The reason for including it is that the advice in this answer is not absolute, just a start. Unfortunately I don't know of any resources for analysing and deciding on reward schemes - it's as much art as science, as it relies on at least some domain knowledge from your problem. – Neil Slater – 2019-06-04T12:25:20.697

I understand. Thanks again Neil. – Hazzaldo – 2019-06-04T14:16:35.777


Yes, you are exactly right. It is basically an arbitrary choice, although you should consider the reasonable numerical ranges of your activation functions if you decide to go beyond the values +/- 1. You can also have a think about whether you want to add a small reward for the agent reaching states that are near the goal, if you have an environment where such states are discernible.

If you want a more machine-learning-based approach to reward values, consider using an Actor-Critic arrangement, in which a second network learns reward values for non-goal states (by observing the results of agent exploration), although you still need to determine end state values according to your hand-crafted heuristic.


Posted 2019-05-12T00:15:10.860

Reputation: 605

Many thanks for the answer @DrMcCleod. Definitely gives me something to go on and learn more about the topic using your answer. My apologies for the v late reply. – Hazzaldo – 2019-06-04T11:54:29.487

@Hazzaldo You are most welcome. – DrMcCleod – 2019-06-04T18:11:28.640