Q-learning when minimising a total cost instead of maximising a total reward


I have a decision problem where the results are measured as a cost that I want to minimise. It seems like a good fit to Q-learning, but I am not sure how to adjust it to deal with a cost instead of a reward.

Which one is better:

1. Initializing Q-values for all actions with zeros, then getting the agent to learn the actions that maximize the Q-values, and later filter out the actions with minimum Q-values. The Q-values update would then be:
q_dict['state1']['act1'] +=
r + (max([q_dict['state2'][u] for u in q_dict['state2']]))

2. Initializing Q-values with a big number, then getting the agent to learn actions that minimize the Q-values, and later selecting the actions with minimum Q-values. The Q-values update would then be:
q_dict['state1']['act1'] -=
r + (max([q_dict['state2'][u] for u in q_dict['state2']]))


Okay, I want to change reward to cost and learn a policy that minimizes the cost instead of maximizing total reward. Please edit it for me; I did not know the notation. – EArwa – 2019-07-31T12:30:02.817

I have edited your post. In case I have misunderstood, please review and make sure it still asks what you want to ask – Neil Slater – 2019-07-31T13:23:11.237

It is okay. That is exactly what I wanted to ask. – EArwa – 2019-07-31T14:00:25.920


> I have a decision problem where the results are measured as a cost that I want to minimise. It seems like a good fit to Q-learning, but I am not sure how to adjust it to deal with a cost instead of a reward.

The simplest way to do this, without changing anything else about your learning algorithm, is to notice that

$$\text{Reward} = -\text{Cost}$$

So literally just optimise the expected return computed as the sum of your negated costs, using standard Q-learning. Everything will work as normal. Your best agent may still end up with a negative expected return (and negative Q-values), but maximising it will still result in an optimal policy.
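Concretely, the cost just gets negated before the usual update and everything downstream is untouched. A minimal sketch (the learning rate alpha, discount gamma, and the example transition are all assumed values for illustration):

```python
from collections import defaultdict

# Standard Q-learning, except the environment reports a cost, which we
# negate into a reward before the usual update. alpha, gamma and the
# example state/action names are made up for illustration.

alpha, gamma = 0.1, 0.9
q = defaultdict(lambda: defaultdict(float))  # q[state][action] -> value

def update(state, action, cost, next_state):
    r = -cost  # reward = -cost; nothing else changes
    best_next = max(q[next_state].values(), default=0.0)
    q[state][action] += alpha * (r + gamma * best_next - q[state][action])

update('state1', 'act1', 5.0, 'state2')
print(q['state1']['act1'])  # -> -0.5 after one step on a cost of 5
```

The Q-values come out negative, but the argmax over them still picks the action with the lowest expected total cost.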

If you really must use minimising cost as your objective for some reason, then there are a few small changes you need to make for that to work with Q-learning.

The definition of your Q function becomes the expected discounted sum of future costs:

$$Q(s,a) = \mathbb{E}_{\pi}[\sum_{k=0}^{\infty} \gamma^k C_{t+k+1} | S_t=s, A_t=a]$$

(This literally just replaces $R_t$ with $C_t$.)

Then your best guess at the optimal policy is the one that minimises expected future costs:

$$\pi(s) = \text{argmin}_a Q(s,a)$$
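In code, that greedy policy is just an argmin over the state's action values. A minimal sketch, with the dict layout mirroring the q_dict in the question and made-up numbers:

```python
# Greedy policy under the cost formulation: pick the action whose
# Q-value (estimated future cost) is lowest. The example values are
# made up for illustration.

def greedy_action(action_values):
    """action_values maps action -> estimated discounted future cost."""
    return min(action_values, key=action_values.get)

q_state = {'act1': 2.0, 'act2': 5.0}
print(greedy_action(q_state))  # -> 'act1'
```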

And your Q-learning update rule likewise assumes the minimising action is taken at the next step:

$$Q(s, a) \leftarrow Q(s, a) + \alpha(c + \gamma \text{min}_{a'}[Q(s',a')] - Q(s, a))$$

This doesn't match either of your suggestions. If I were to correct your code, then using rewards (with r = -c), it would look like this:

q_dict[state1][act1] += alpha * (r + max(q_dict[state2].values()) - q_dict[state1][act1])


where alpha is the learning rate, and I have assumed there is no discounting (so it must be an episodic problem, not a continuing one).

If you wanted to use cost c directly, and find policies that minimise total cost, then it looks like this:

q_dict[state1][act1] += alpha * (c + min(q_dict[state2].values()) - q_dict[state1][act1])


i.e. you substitute c for r and min for max.
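The two formulations mirror each other exactly: if one table is learned on negated costs and another on raw costs, every entry of the cost table stays the negative of the reward table. A quick check, where alpha, the cost and the initial Q-values are all made up for illustration:

```python
# Illustrative check that the reward/max update and the cost/min update
# mirror each other: starting from tables that are negations of one
# another and applying both updates to the same transition, the
# cost-based Q-value stays equal to minus the reward-based one.

alpha = 0.5
c = 4.0   # observed cost
r = -c    # equivalent reward

q_r = {'state1': {'act1': 0.0}, 'state2': {'a': -1.0, 'b': -3.0}}
q_c = {'state1': {'act1': 0.0}, 'state2': {'a': 1.0, 'b': 3.0}}  # negated copy

q_r['state1']['act1'] += alpha * (r + max(q_r['state2'].values()) - q_r['state1']['act1'])
q_c['state1']['act1'] += alpha * (c + min(q_c['state2'].values()) - q_c['state1']['act1'])

print(q_r['state1']['act1'], q_c['state1']['act1'])  # -> -2.5 2.5
```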

Your ideas about having different starting values might make some difference to convergence rates. However, this is not related to whether you use a cost or a reward.

I advise against using the cost directly like this. Although it is simple and will work, whenever you read RL articles you will have to keep track of where to swap max for min. Maximising the sum (or average) of rewards is by far the more common convention in RL tutorials, and while you are still learning, following that convention will save you a little effort.

Okay, thank you. I have made the change; let me see how the results look. – EArwa – 2019-08-05T11:17:07.763