Why is Reward Engineering considered "bad practice" in RL?


Feature engineering is an important part of supervised learning:

Coming up with features is difficult, time-consuming, requires expert knowledge. "Applied machine learning" is basically feature engineering. — Andrew Ng

However, tweaking the reward function by hand is generally frowned upon in RL. I'm not sure I understand why.

One argument is that we generally don't know, a priori, what the best solution to an RL problem will be. So by tweaking the reward function, we may bias the agent towards what we think is the best approach, when it is actually sub-optimal for solving the original problem. This is different from supervised learning, where we know from the start what we want to optimize.

Another argument would be that it's conceptually better to treat the problem as a black box, since the goal is to develop a solution that is as general as possible. However, this argument could also be made for supervised learning!

Am I missing anything?


Posted 2019-03-10T22:55:04.887

Reputation: 215

I also asked the question on /r/reinforcementlearning: https://www.reddit.com/r/reinforcementlearning/comments/azllha/why_is_reward_engineering_taboo_in_rl/

– MasterScrat – 2019-03-12T10:15:57.303



One argument is that we generally don't know, a priori, what the best solution to an RL problem will be. So by tweaking the reward function, we may bias the agent towards what we think is the best approach, when it is actually sub-optimal for solving the original problem.

This is the main issue.

Am I missing anything?

I think you are missing the main point of setting a reward function. It should be a value that is maximised by an agent achieving the goals it has been set. Each time you change the reward function you may be explicitly setting new and different goals.

Changing a reward function should not be compared to feature engineering in supervised learning. Instead a change to the reward function is more similar to changing the objective function (e.g. from cross-entropy to least squares, or perhaps by weighting records differently) or selection metric (e.g. from accuracy to f1 score). Those kinds of changes may be valid, but have different motivations to feature engineering.
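For illustration, here is a small sketch (the functions and numbers are my own, assuming a simple binary-classification setup, not anything from the question) of how swapping the objective changes what counts as a bad prediction:

```python
import math

def cross_entropy(y_true, p_pred):
    # Mean binary cross-entropy; heavily penalises confident mistakes,
    # growing without bound as the predicted probability approaches 0 or 1.
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, p_pred)) / len(y_true)

def least_squares(y_true, p_pred):
    # Mean squared error; the penalty for any single prediction in [0, 1]
    # is bounded by 1, however confident the mistake.
    return sum((y - p) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

# A confident wrong prediction (true label 1, predicted probability 0.01):
# cross-entropy is roughly 4.6 while squared error stays below 1, so the
# two objectives disagree about how bad this model is, and optimising
# them can lead to different models - just as changing a reward function
# can lead to different policies.
```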

Tweaking a reward function is also called reward shaping, and it can sometimes give good results. It carries the same risks as above, but done carefully it can improve the learning rate. For example, if achieving the main goal A absolutely requires achieving B and C as interim steps, then it should be OK to reward achieving B and C. The thing you have to worry about is whether the agent can repeatedly collect reward for B or C via some loop through states, so you may need to add to the state vector whether B or C has already been achieved, and only grant the reward on the first visit to each.
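A minimal sketch of that first-visit idea (the `ShapedReward` helper and the subgoal names are made up for illustration, not a standard API; in practice the achieved-subgoal flags would also be folded into the state vector so the problem stays Markov):

```python
class ShapedReward:
    """Adds a one-time bonus the first time each subgoal (e.g. B, C)
    is achieved, so the agent cannot farm shaping reward by looping
    back through the same subgoal states."""

    def __init__(self, subgoal_bonus=1.0):
        self.subgoal_bonus = subgoal_bonus
        self.achieved = set()  # subgoals that have already paid out

    def reset(self):
        # Call at the start of each episode.
        self.achieved = set()

    def shape(self, base_reward, subgoals_hit):
        """base_reward: reward from the original task (goal A).
        subgoals_hit: subgoals satisfied by this transition, e.g. {"B"}."""
        bonus = 0.0
        for g in subgoals_hit:
            if g not in self.achieved:
                self.achieved.add(g)
                bonus += self.subgoal_bonus
        return base_reward + bonus
```

Repeated visits to B then pay nothing: the first `shape(0.0, {"B"})` returns the bonus, subsequent ones return only the base reward.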

Feature engineering does still exist in RL, and it concerns how you represent the state and action spaces. For instance, in a chase scenario where an agent needs to get close to a moving target, you should find that representing the state as polar co-ordinates of the target relative to the agent's position makes it far easier for RL to learn an optimal policy than representing the state as cartesian co-ordinates of both agent and target.
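As a sketch of that transformation (the function name and exact feature choice are mine, for illustration):

```python
import math

def polar_features(agent_xy, target_xy):
    """Represent the target relative to the agent as (distance, bearing):
    two numbers that directly answer "how far?" and "which way?", rather
    than four raw cartesian coordinates the agent would have to combine."""
    dx = target_xy[0] - agent_xy[0]
    dy = target_xy[1] - agent_xy[1]
    distance = math.hypot(dx, dy)          # euclidean distance to target
    bearing = math.atan2(dy, dx)           # angle in radians from the x axis
    return distance, bearing
```

The shaped features are also translation-invariant: moving agent and target together leaves them unchanged, which is part of why they are easier to learn from.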

Neil Slater

Posted 2019-03-10T22:55:04.887

Reputation: 24 613

Also, notice that because RL (real-world) problems often involve so many implicit assumptions, it is close to impossible to prove that B and C are "absolutely necessary" steps towards getting to A. In practice it is too strong to assume that there is no way to A that does not go through B and/or C, which is why it might be more reasonable to leave the options open. – mapto – 2019-03-11T08:20:05.923

Very informative. I think it would be fruitful to build a connection to this article that seemingly takes the opposite stance and categorizes reward engineering https://ai-alignment.com/the-reward-engineering-problem-30285c779450

– Esmailian – 2019-03-11T17:38:40.477

@Esmailian: it is interesting, but it is using a different definition of "reward engineering" to OP. The article you link is looking into approaches for autonomous reward function discovery (to which I might add recent work on intrinsic rewards such as curiosity-driven RL), whilst OP appears concerned with human construction of reward functions using domain knowledge. It may be possible to blur the lines between these two things. I'll have a think about a way of adding to the answer . . . – Neil Slater – 2019-03-11T18:34:08.890

@NeilSlater It is clear now. So OP's intention is "reward shaping", I think it helps if you mention this difference too. – Esmailian – 2019-03-11T18:44:30.183

@Esmailian So you would say injecting domain knowledge (eg cartesian to polar coordinates) is "reward shaping", while automatic reward function augmentation (eg curiosity, maximum entropy) is "reward engineering"? – MasterScrat – 2019-03-12T10:19:43.500

@MasterScrat: Cartesian to polar coordinates is "feature engineering". Changing your reward function using domain knowledge (my example of adding reward for B and C because they are necessary to get to A) is "reward shaping". I don't think "reward engineering" is as well-known a term, but Esmailian's link seems to be using it for ideas that manipulate the reward function automatically – Neil Slater – 2019-03-12T10:21:37.373

Ah yes, bad example of injecting knowledge. Let's say: I know Mario needs to move right, so I add a bonus to the reward function when the position increases on the x axis. Is that reward shaping? Reward engineering? – MasterScrat – 2019-03-12T10:27:54.600

@MasterScrat: I would call that "reward shaping", especially if your true goal is to get the highest score. I think you will see that term more often in the literature. I would not get too hung up on what people call these things, though; the field of RL merges similar work from multiple disciplines and there are variations in what things are called. E.g. "return" is essentially the same as "utility", and you will see "utility" used a lot when RL is an extension of control theory or game theory – Neil Slater – 2019-03-12T10:31:13.303