I have a decision problem where the results are measured as a cost that I want to minimise. It seems like a good fit for Q-learning, but I am not sure how to adjust it to deal with a cost instead of a reward.

The simplest way to do this, without changing anything else about your learning algorithm, is to notice that

$$Reward = -Cost$$

So literally just optimise against the expected return based on sums over your negative costs, using standard Q-learning. Everything will work as normal. Your best agent may still end up with a negative expected return (and negative Q values), but maximising this will still result in an optimal policy.
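As a minimal sketch of that approach (the `env` interface with `reset()` and `step()` here is an assumption, as are the hyperparameter values), the negation can live on a single line where the cost is observed:

```
import random
from collections import defaultdict

# Hypothetical environment interface: reset() -> state,
# step(action) -> (next_state, cost, done)
def q_learning_on_negated_cost(env, actions, episodes=500,
                               alpha=0.1, gamma=0.9, epsilon=0.1):
    q_dict = defaultdict(lambda: {a: 0.0 for a in actions})
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # standard epsilon-greedy behaviour policy (still a max)
            if random.random() < epsilon:
                act = random.choice(actions)
            else:
                act = max(q_dict[state], key=q_dict[state].get)
            state2, cost, done = env.step(act)
            r = -cost  # the only cost-specific line
            future = 0.0 if done else gamma * max(q_dict[state2].values())
            q_dict[state][act] += alpha * (r + future - q_dict[state][act])
            state = state2
    return q_dict
```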

If you really *must* use minimising cost as your objective for some reason, then there are a few small changes you need to make for that to work with Q-learning.

The definition of your Q function becomes the expected discounted sum of future costs:

$$Q(s,a) = \mathbb{E}_{\pi}[\sum_{k=0}^{\infty} \gamma^k C_{t+k+1} | S_t=s, A_t=a]$$

(This literally just replaces $R_t$ with $C_t$)
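For completeness, the matching Bellman optimality equation swaps the usual $\max$ for a $\min$:

$$Q^*(s,a) = \mathbb{E}[C_{t+1} + \gamma \min_{a'} Q^*(S_{t+1},a') \mid S_t=s, A_t=a]$$

which is exactly the negation of the reward-based version.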

Then your best guess at the optimal policy is the one that minimises expected future costs:

$$\pi(s) = \text{argmin}_a Q(s,a)$$
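In code, with the same nested-dict Q table as your snippets below, extracting that greedy action is a one-liner:

```
# greedy (cost-minimising) action for a given state
def greedy_action(q_dict, state):
    return min(q_dict[state], key=q_dict[state].get)
```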

And your Q-learning update rule also assumes the next step is chosen to minimise future cost:

$$Q(s, a) \leftarrow Q(s, a) + \alpha(c + \gamma \text{min}_{a'}[Q(s',a')] - Q(s, a))$$

This doesn't match either of your suggestions. If I were to correct your code, then using rewards (with `r = -c`), it would look like this:

```
q_dict[state1][act1] += alpha * (r + max(q_dict[state2].values()) - q_dict[state1][act1])
```

where `alpha` is the learning rate, and I assume there is no discounting (so it must be an episodic problem, not a continuing one).

If you wanted to use cost `c` directly, and find policies that minimise total cost, then it looks like this:

```
q_dict[state1][act1] += alpha * (c + min(q_dict[state2].values()) - q_dict[state1][act1])
```

i.e. you substitute `c` for `r` and `min` for `max`.
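To check the two variants agree, here is a self-contained sketch on a made-up deterministic chain (the states, actions and costs are invented purely for illustration), using the cost/`min` update with no discounting:

```
import random

# transitions[state][action] = (next_state, cost); state 2 is terminal
transitions = {
    0: {'a': (1, 1.0), 'b': (2, 5.0)},
    1: {'a': (2, 1.0), 'b': (2, 3.0)},
}

def train(episodes=2000, alpha=0.1, epsilon=0.2):
    q_dict = {s: {a: 0.0 for a in acts} for s, acts in transitions.items()}
    for _ in range(episodes):
        state = 0
        while state in transitions:
            if random.random() < epsilon:
                act = random.choice(list(transitions[state]))
            else:
                act = min(q_dict[state], key=q_dict[state].get)  # min, not max
            state2, c = transitions[state][act]
            future = min(q_dict[state2].values()) if state2 in q_dict else 0.0
            q_dict[state][act] += alpha * (c + future - q_dict[state][act])
            state = state2
    return q_dict

q_dict = train()
print({s: min(q_dict[s], key=q_dict[s].get) for s in q_dict})
# expect {0: 'a', 1: 'a'}: the minimum-total-cost path
```

Running the same loop with `r = -c` and `max` everywhere learns the negated Q values and the identical policy.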

Your ideas about having different starting values might make some difference to convergence rates. However, this is not related to whether you use a cost or a reward.

I advise against using the cost directly like this. Although it is simple and will work, whenever you read any RL articles you will have to keep adjusting whether to switch `max` for `min`. The convention of maximising the sum (or average) reward is much more common in RL tutorials, so while you are learning, you will save yourself a little effort by following it.

Okay, I want to change reward to cost and learn a policy that minimizes the cost instead of maximize total reward. Edit it for me please, I did not know the notation. – EArwa – 2019-07-31T12:30:02.817

I have edited your post. In case I have misunderstood, please review and make sure it still asks what you want to ask – Neil Slater – 2019-07-31T13:23:11.237

It is okay. That is exactly what I wanted to ask. – EArwa – 2019-07-31T14:00:25.920