Reinforcement learning (RL) control maximises the expected sum of rewards. If you change the reward metric, you change what counts as optimal. Your two reward functions are not equivalent, so in some cases they will rank solutions differently.

As a simple example, consider a choice between trajectories with per-step costs A = (0, 4, 4, 4) and B = (1, 1, 1, 1). Under the original cost formulation B is clearly better, with a total cost of 4 compared with A's 12. A has a single zero cost at the start, which I chose deliberately because it exposes the problem with your conversion.

In your two reward formulae:

```
reward = 1/(cost+1)**2
A: 1.0  + 0.04 + 0.04 + 0.04 = 1.12
B: 0.25 + 0.25 + 0.25 + 0.25 = 1.00

reward = 1/(cost+1)
A: 1.0 + 0.2 + 0.2 + 0.2 = 1.6
B: 0.5 + 0.5 + 0.5 + 0.5 = 2.0
```

So with this example (numbers carefully chosen), maximising total reward favours A under the sum of inverse squares but B under the sum of inverses, whilst B should be the clear preference when minimising the sum of costs. For both of your formulae it is possible to construct examples where the trajectory with the highest sum of rewards does not have the lowest total cost.
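A quick sketch to verify the arithmetic above (function names here are my own, purely for illustration):

```python
# Compare the two reward conversions on the example trajectories.
A = [0, 4, 4, 4]  # total cost 12
B = [1, 1, 1, 1]  # total cost 4

def sum_inverse(costs):
    """Total reward under reward = 1/(cost+1)."""
    return sum(1 / (c + 1) for c in costs)

def sum_inverse_square(costs):
    """Total reward under reward = 1/(cost+1)**2."""
    return sum(1 / (c + 1) ** 2 for c in costs)

print(sum_inverse_square(A), sum_inverse_square(B))  # ~1.12 vs ~1.00: favours A
print(sum_inverse(A), sum_inverse(B))                # ~1.60 vs ~2.00: favours B
```

The two conversions disagree with each other, and the inverse-square one disagrees with the cost-minimisation objective, exactly as in the table above.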

In your case, if you truly want to minimise total cost, then your conversion to rewards should be:

```
reward = -cost
```

Anything else technically changes the nature of the problem, and may produce solutions that are not optimal with respect to your initial goal.
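To see why this conversion is safe, note that negation is order-reversing over sums, so maximising total reward is exactly minimising total cost. A minimal check on the same example:

```python
# With reward = -cost, the trajectory rankings can never flip.
A = [0, 4, 4, 4]  # total cost 12
B = [1, 1, 1, 1]  # total cost 4

def total_reward(costs):
    """Total reward under reward = -cost."""
    return sum(-c for c in costs)

# B has the higher total reward (-4 > -12), matching the lower total cost.
print(total_reward(A), total_reward(B))
```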

Thank you for that illustration. The reward function you have proposed works fairly well with a neural actor-critic, but when I am using a Q-table it converges more poorly (with higher global cost) than my two functions. What do you think is the matter? I initialised the Q-values with zeros. – EArwa – 2020-05-15T10:14:55.917

@EArwa: Sorry, I don't know. I suggest you ask a new question about that if you are stuck. – Neil Slater – 2020-05-15T11:24:09.883