How to define the final / terminal state for Q learning?


I'm training an agent with reinforcement learning, using Sarsa to update a Q-function, but I'm confused about how to handle the final state: the case where the game ends and there is no S'.

For example, the agent performs an action based on state S, and because of that the agent wins or loses, so there is no S' to transition to. How do you update the Q-function with the very last reward in that scenario, given that the state hasn't actually changed? In that case S' would equal S, even though an action was performed and the agent received a reward (it ultimately won or lost, so it's quite an important update to make!).

Do I add extra inputs to the state, say agent_won and game_finished, so that they become the difference between S and S' for the final Q update?

EDIT: to make clear, this is in reference to a multi-agent/multi-player system. The final action the agent takes could have a cost/reward associated with it, but the subsequent actions other agents take could further determine a greater gain or loss for this agent and whether it wins or loses. So the final state and chosen action could, in effect, generate different rewards without the agent taking further actions.


Posted 2019-01-31T13:23:24.273

Reputation: 395


See the linked question. That question is about Sarsa rather than $Q$-learning, but exactly the same concept applies.

– Dennis Soemers – 2019-01-31T13:36:45.427

Perfect, thanks Dennis. I am using Sarsa to update a Q-function, so my question is identical to that one. Taking what I learned there, I set my final reward (the final gain/loss of the game) on the transition into the final state. I then add another state, set a flag on it, e.g. agent_state, depending on whether the agent won/lost, and set the reward into that state as zero, since we expect to receive zero future reward from a terminal state. – BigBadMe – 2019-01-31T13:47:45.883

That's one way to implement it yes, although arguably a bit "hacky". Quite often people implement two cases of the update rule: they explicitly check if the episode has ended (which in many formulations does actually involve reaching a terminal state $S'$ in which there are no more legal actions), and if so run the update rule with only the reward term (no $Q(S', A')$ term). If the episode has not yet ended, they run the normal update rule. Such an implementation may be slightly cleaner / more easily readable / more explicit. – Dennis Soemers – 2019-01-31T13:53:15.907

Ah okay. So, say I know I'm transitioning to the final state, and there will be no S', instead of doing this: target = reward + self.y * np.max(self.model.predict(state_prime)) I do this: target = reward + self.y ? – BigBadMe – 2019-01-31T13:58:17.787

I would say that there still is in fact an $S'$: that's your terminal state. However, there's no subsequent action $A'$ anymore, so that's why you can't compute $Q(S', A')$ anymore and simply set that term to $0$. Anyway... self.y, is that the discount factor $\gamma$? If so, that shouldn't be added, since it gets multiplied by the "fake" $Q(S', A') = 0$. You'd just want target = reward. – Dennis Soemers – 2019-01-31T14:15:40.917
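The two cases described in the comments above can be sketched like this. This is a minimal illustration, not the asker's actual code: the class, its `model` attribute (anything with a `predict` method standing in for the Q-network) and `y` (the discount factor $\gamma$) are assumed from the snippet in the comment.

```python
import numpy as np

class Agent:
    """Minimal sketch of the target computation discussed above.

    `model` stands in for a Q-network (anything with a `predict`
    method) and `y` is the discount factor gamma.
    """
    def __init__(self, model, y=0.99):
        self.model = model
        self.y = y

    def compute_target(self, reward, state_prime, done):
        if done:
            # Terminal transition: the "fake" Q(S', A') is 0, so the
            # target is just the reward, with no discount term at all.
            return reward
        # Non-terminal: bootstrap from the best predicted action value.
        return reward + self.y * np.max(self.model.predict(state_prime))
```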

You're a star, Dennis. Thanks so much for your help. – BigBadMe – 2019-01-31T14:19:23.487

I've thought about this a bit more, and the one problem here is, taking a specific action into the final state could have a cost (negative reward) attached, e.g. spending money to build a unit. Only once other players have performed their actions will I then ultimately know if there was a reward from performing the final action. 1/2 – BigBadMe – 2019-01-31T14:37:37.120

So it's almost as if two exact same states, taking the same action then ultimately produce different rewards, which was why I was thinking I'd need to add another state once the game has finished with the game result in order to differentiate between them in some way. So performing that last action and the cost/reward, isn't actually the final interaction the agent has, because the agent could then receive a reward based on the other agents taking their own actions. It's a bit of an odd un... 2/2 – BigBadMe – 2019-01-31T14:38:24.147

My comments were assuming a traditional, single-agent Markov Decision Process. If you have two or more agents, and/or other complicating factors, the implementation may have to change a bit. You can edit such details into your question, and I'll probably be able to have a look at that later today (or maybe someone else already does in the meantime) – Dennis Soemers – 2019-01-31T14:42:21.663

I realise now I didn't make clear that it was multi-agent, sorry about that. I'll update my question, and if you have any thoughts how I could implement it then I'd certainly welcome the input. – BigBadMe – 2019-01-31T14:59:20.393



The Sarsa update rule looks like:

$$Q(S, A) \gets Q(S, A) + \alpha \left[ R + \gamma Q(S', A') - Q(S, A) \right].$$

Very similarly, the $Q$-learning update rule looks like:

$$Q(S, A) \gets Q(S, A) + \alpha \left[ R + \gamma \max_{A'} Q(S', A') - Q(S, A) \right].$$
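In a tabular implementation, these update rules with the explicit terminal-state check mentioned in the comments might look like the following sketch (the function names and array layout are illustrative, not from the original post):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, terminal, alpha=0.1, gamma=0.99):
    """One Sarsa update on table Q (states x actions).

    If the transition ends the episode, there is no Q(S', A') term:
    the target is just the reward.
    """
    target = r if terminal else r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, terminal, alpha=0.1, gamma=0.99):
    """One Q-learning update; the max over A' replaces the sampled A'."""
    target = r if terminal else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
```

Implementing the terminal case as a separate branch, rather than adding an artificial post-game state, is the "cleaner" option described in the comments above.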

Both of these update rules are formulated for single-agent Markov Decision Processes. Sometimes you can make them work reasonably well in multi-agent settings, but it is crucial to remember that these update rules should still always be implemented "from the perspective" of a single learning agent, who is oblivious to the presence of other agents and treats them as part of the environment.

What this means is that the states $S$ and $S'$ that you provide in the update rules really must both be states in which the learning agent is allowed to make the next move (with the exception that $S'$ is permitted to be a terminal game state).

So, suppose that you have three subsequent states $S_1$, $S_2$, and $S_3$, where the learning agent gets to select actions in states $S_1$ and $S_3$, and the opponent gets to select an action in state $S_2$. In the update rule, you should completely ignore $S_2$. This means that you should take $S = S_1$, and $S' = S_3$.
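One way to collect transitions that respect this (skipping opponent decision points, and crediting the end-of-game reward to the agent's last own move) is sketched below. The episode representation and the helper name are hypothetical, purely for illustration:

```python
def agent_transitions(events, terminal_state, terminal_reward, agent_id):
    """Build (S, A, R, S') tuples from only the learning agent's own
    decision points; opponent moves are folded into the environment.

    `events` is an assumed format: a list of (state, actor, action,
    reward) in game order, where `reward` is the immediate reward the
    learning agent observes at that step.
    """
    transitions = []
    prev = None          # (S, A) at the agent's previous decision point
    reward_acc = 0.0     # reward accumulated since that decision
    for state, actor, action, reward in events:
        if actor == agent_id:
            if prev is not None:
                s, a = prev
                transitions.append((s, a, reward_acc, state))
            prev = (state, action)
            reward_acc = reward
        else:
            # Opponent move: skip its state entirely, but credit any
            # reward it causes to the agent's last (S, A).
            reward_acc += reward
    # Episode end: credit the final win/loss reward to the agent's
    # last own transition, with the terminal state as S'.
    if prev is not None:
        s, a = prev
        transitions.append((s, a, reward_acc + terminal_reward,
                            terminal_state))
    return transitions
```

In the three-state example above, this produces a single transition from $S_1$ directly to $S_3$, with any reward incurred at $S_2$ attributed to the agent's action in $S_1$.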

Following the reasoning described above literally may indeed lead to a tricky situation with rewards from transitioning into terminal states, since technically, in every episode, only one agent directly causes the transition into a terminal state. This issue (along with a repetition of some of the explanation above) is discussed in the "How to see terminal reward in self-play reinforcement learning?" question on this site.

Dennis Soemers

Posted 2019-01-31T13:23:24.273

Reputation: 7 644

Thanks Dennis. I understand what you're saying, but my question is: how does the agent receive the final reward, which could ultimately be determined by another player making a mistake with their action? In a literal sense, how do I perform the update in that scenario? I keep a list of all of the experience tuples (S,a,r,S'), and I only update them once the game is finished, so I could conceivably just add the ultimate reward to the existing reward value for the final S a S' transition. Would that work? I must somehow give the final reward to the agent, I'm just not sure how... – BigBadMe – 2019-01-31T20:35:19.667

@BigBadMe Yes, with the standard Sarsa/$Q$-learning updates you simply give "credit" for the final reward (i.e. the win/loss) to the last transition caused by your learning agent. For these algorithms to be applicable, you have to pretend that there is no other agent, they're just a part of "the environment" and any actions they select are just a part of "the environment's transition dynamics". That may not be ideal, but that's how it works when you try to apply a single-agent algorithm to a multi-agent setting. It may still work out in practice (especially with a proper self-play setup). – Dennis Soemers – 2019-01-31T20:46:32.243

Perfect, I think I have the pieces put together now. Thanks again for all your input on this, much appreciated. I find this whole field absolutely fascinating; I really enjoy learning about it! – BigBadMe – 2019-01-31T22:14:43.773