Effect of the order of the reward function

I have implemented a simple Q-learning algorithm to minimize a cost function, setting the reward to the inverse of the cost of the action taken by the agent. The algorithm converges nicely, but the global cost it converges to differs depending on the order of the reward function. If I use the reward:

reward = 1/(cost+1)**2

the algorithm converges to a better solution (lower global cost, the objective of the process) than when I use the reward:

reward = 1/(cost+1)

What could be the explanation to this difference? Is it the issue of optimism in the face of uncertainty?

Reinforcement learning (RL) control maximises the expected sum of rewards. If you change the reward metric, you change what counts as optimal. Your two reward functions are not equivalent, so in some cases they will rank solutions differently.

As a simple example, consider a choice between trajectories with costs A(0,4,4,4) and B(1,1,1,1). In the original cost formula B is clearly better, with a total cost of 4 compared with A's total cost of 12. A just has one low cost at the beginning, which I put in deliberately because it exposes the problem with your conversion.

reward = 1/(cost+1)**2.
A: 1.0 + 0.04 + 0.04 + 0.04 = 1.12
B: 0.25 + 0.25 + 0.25 + 0.25 = 1.0

reward = 1/(cost+1).
A: 1.0 + 0.2 + 0.2 + 0.2 = 1.6
B: 0.5 + 0.5 + 0.5 + 0.5 = 2.0


So with this example (numbers carefully chosen), maximising the total reward favours A for sum of inverse squares but B for sum of inverses, whilst B should be the clear preference for minimising sum of costs. It is possible to find examples for both of your formulae where the best sum of rewards does not give you the lowest cost.
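The arithmetic above can be checked directly. Here is a minimal sketch (the trajectory costs are the made-up values from the example, and the helper names are mine) showing that the two conversions rank A and B in opposite orders:

```python
# Trajectory costs from the example above (made-up illustrative values).
A = [0, 4, 4, 4]  # total cost 12
B = [1, 1, 1, 1]  # total cost 4

def total_reward(costs, convert):
    """Sum of per-step rewards under a given cost-to-reward conversion."""
    return sum(convert(c) for c in costs)

inv_sq = lambda c: 1 / (c + 1) ** 2  # reward = 1/(cost+1)**2
inv    = lambda c: 1 / (c + 1)       # reward = 1/(cost+1)

# Under inverse squares, A scores higher (1.12 vs 1.0) despite its higher cost.
print(total_reward(A, inv_sq), total_reward(B, inv_sq))
# Under plain inverses, B scores higher (1.6 vs 2.0), matching the cost ordering.
print(total_reward(A, inv), total_reward(B, inv))
```

Running it confirms the flip: maximising total reward picks A under one conversion and B under the other, even though B always has the lower total cost.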

In your case, if you truly want to minimise total cost, then your conversion to rewards should be:

reward = -cost


Anything else is technically changing the nature of the problem, and will result in different solutions that may not be optimal with respect to your initial goal.
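In a tabular setting, the conversion plugs straight into the standard Q-learning update. A minimal sketch (the two-state MDP and all names here are made-up illustrations, not the asker's environment), assuming a greedy bootstrap target:

```python
def q_update(Q, s, a, cost, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step with reward = -cost."""
    reward = -cost                       # sign flip only; preserves solution ordering
    best_next = max(Q[s_next].values())  # greedy bootstrap over next-state actions
    Q[s][a] += alpha * (reward + gamma * best_next - Q[s][a])

# Toy usage: two states, two actions, Q-table initialised to zero.
Q = {s: {a: 0.0 for a in (0, 1)} for s in (0, 1)}
q_update(Q, s=0, a=1, cost=4, s_next=1)
# Q[0][1] moves toward the negative cost: 0 + 0.1 * (-4 + 0.9 * 0 - 0) = -0.4
```

Because `reward = -cost` is just a sign flip, maximising the discounted return is exactly minimising the discounted total cost; no reordering of solutions can occur.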

Thank you for that illustration. The reward function you have proposed works fairly well with a neural actor-critic, but when I am using a Q-table it gives poorer convergence (a higher global cost) than my two functions. What do you think is the matter? I initialized the Q-values with zeros. – EArwa – 2020-05-15T10:14:55.917

@EArwa: Sorry I don't know. I suggest you ask a new question about that if you are stuck. – Neil Slater – 2020-05-15T11:24:09.883