There seems to be a major difference in how the terminal reward is received/handled in self-play RL vs. "normal" RL, and it confuses me.
I implemented TicTacToe the normal way, where a single agent plays against an environment that manages the state and also replies with its own move. In this scenario the agent receives a final reward of +1, 0, or -1 for a win, draw, or loss.
Next I implemented TicTacToe in a self-play mode where two agents make moves one after the other and the environment only manages the state and hands back the reward. In this scenario an agent can only ever receive a final reward of +1 or 0, because after its own move it can never be in a terminal state in which it lost (only the opponent's move can end the game that way). That means:
- In self-play, episodes end in such a way that only one of the players sees the terminal state and the terminal reward.
- Because of the first point, an agent cannot learn that it made a bad move which enabled its opponent to win the episode, simply because it never receives a negative reward.
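To make the issue concrete, here is a simplified sketch of the kind of self-play loop I mean (random moves, hypothetical names; not my exact code). The mover who completes a winning line sees +1, the mover who fills the last cell in a draw sees 0, but the losing player's final move happened one ply earlier, so no -1 is ever recorded for anyone:

```python
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return the winning player's id, or None."""
    for a, b, c in WIN_LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None

def self_play_episode(rng):
    """One random self-play game; returns the rewards each player observed."""
    board = [None] * 9
    rewards = {1: [], 2: []}
    player = 1
    while True:
        move = rng.choice([i for i, v in enumerate(board) if v is None])
        board[move] = player
        if winner(board) == player:
            rewards[player].append(1)   # winner sees the terminal +1
            # the loser's last move was one ply earlier; it never
            # observes this terminal state, so no -1 is ever assigned
            return rewards
        if all(v is not None for v in board):
            rewards[player].append(0)   # last mover sees the draw
            return rewards
        rewards[player].append(0)       # ordinary non-terminal step
        player = 2 if player == 1 else 1

rng = random.Random(0)
for _ in range(100):
    r = self_play_episode(rng)
    # neither player ever receives a negative reward
    assert -1 not in r[1] and -1 not in r[2]
```

As the assertion shows, with this naive loop the set of observable rewards is {+1, 0}, which is exactly the asymmetry I'm describing.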
This seems very weird to me. What am I doing wrong? Or, if I'm not doing anything wrong, how do I handle this problem?