How to see terminal reward in self-play reinforcement learning?


There seems to be a major difference in how the terminal reward is received/handled in self-play RL vs. "normal" RL, which confuses me.

I implemented TicTacToe the normal way, where a single agent plays against an environment that manages the state and also replies with a move of its own. In this scenario the agent receives a final reward of +1, 0, or -1 for a win, draw, or loss.

Next I implemented TicTacToe in self-play mode, where two agents take turns making moves and the environment only manages the state and returns the reward. In this scenario an agent can only receive a final reward of +1 or 0, because after its own move it is never in a terminal state in which it lost (only the other agent could end the game that way). That means:

  1. In self-play, an episode ends in such a way that only one of the players sees the terminal state and terminal reward.
  2. Because of point 1, an agent cannot learn that it made a bad move that enabled its opponent to win the episode, simply because it never receives a negative reward.

This seems very strange to me. What am I doing wrong? And if I'm not wrong, how do I handle this problem?



Posted 2018-05-30T10:50:16.373

Reputation: 51




When one agent makes a move, that move should be perceived as part of the "state transition" executed "by the environment" from the perspective of the other agent.

For example, suppose that, as a "neutral third party" we view the game as follows, as a sequence of states, actions and a terminal reward. I use A to denote actions selected by the first player, and B to denote actions selected by the second player:

S1 -> A1 -> S2 -> B1 -> S3 -> A2 -> S4 -> B2 -> S5 -> A3 -> Terminal Reward

Then, the first player should only get the following observations:

S1 -> A1 -> S3 -> A2 -> S5 -> A3 -> Terminal Reward

Note how states S2 and S4 are skipped entirely: they are not really states from the perspective of the first player, just the halfway point of the transition caused by the first player's action, and they are not interesting to the first player.

Similarly, the second player should only get the following observations:

S2 -> B1 -> S4 -> B2 -> Terminal Reward
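The observation sequences above can be sketched in code. This is a minimal illustrative example (the function name and the abstract state labels are made up for this answer, not taken from any library): each player records only the (state, action) pairs from its own turns, so the opponent's in-between states are folded into the environment transition.

```python
def play_episode():
    """Simulate a generic alternating game over abstract states S1..S5
    and record, per player, the (state, action) pairs that player sees."""
    states = [f"S{i}" for i in range(1, 6)]
    per_player = {0: [], 1: []}
    for t, s in enumerate(states):
        player = t % 2  # players strictly alternate moves
        action = f"{'A' if player == 0 else 'B'}{t // 2 + 1}"
        per_player[player].append((s, action))
    return per_player

obs = play_episode()
# Player A only ever observes S1, S3, S5; Player B only S2, S4.
print([s for s, _ in obs[0]])  # ['S1', 'S3', 'S5']
print([s for s, _ in obs[1]])  # ['S2', 'S4']
```

At the end of the episode, the terminal reward is then delivered to both players' final transitions, so the loser does see its -1.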

Dennis Soemers


Reputation: 7 644

That makes sense. Nice explanation! – Axel – 2018-05-31T05:28:57.180


If you are running self-play in a two player zero sum game, then you can do the following:

  • Arbitrarily decide that the reward scheme for winning, drawing, losing is +1, 0, -1 for Player A.

  • Have Player A's goal be to maximise reward, and Player B's goal be to minimise it.

This means you can combine both players' views of the values of positions and plays into a single metric, which can be learned and/or searched depending on your algorithm. When searching, you can use MCTS and/or minimax algorithms. When using Q-learning, the only tweak is that instead of picking the maximising action, Player B will pick the minimising action (using the min and argmin functions where Player A would use max and argmax). Remember, when calculating the TD error, that you are evaluating a position for one player, but will be bootstrapping from reward + max/min over the *next* player's move.
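A hedged sketch of this scheme in tabular Q-learning (the function names and the tiny state labels are illustrative, not TicTacToe-specific): one shared Q-table, Player A bootstraps with max, Player B with min.

```python
from collections import defaultdict

ALPHA, GAMMA = 0.5, 1.0
Q = defaultdict(float)  # Q[(state, action)], shared by both players

def best_value(state, actions, player):
    """Player A (0) takes the max over available actions, Player B (1) the min."""
    vals = [Q[(state, a)] for a in actions]
    return max(vals) if player == 0 else min(vals)

def td_update(s, a, r, s_next, next_actions, next_player, done):
    """Standard TD(0) update, except the bootstrap term uses the max/min of
    whichever player moves in s_next."""
    if done:
        target = r
    else:
        target = r + GAMMA * best_value(s_next, next_actions, next_player)
    Q[(s, a)] += ALPHA * (target - Q[(s, a)])

# One terminal update: suppose Player A wins by playing action "a" in state "s5".
td_update("s5", "a", +1, None, [], next_player=1, done=True)
print(Q[("s5", "a")])  # 0.5 after one update with ALPHA = 0.5
```

Note that with GAMMA = 1.0 and only terminal rewards, values simply propagate the eventual game result backwards; intermediate updates use the opposing player's min (or max) as the target, which is exactly the minimax idea.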

Neil Slater


Reputation: 14 632