How to deal with invalid output in a policy network?


I am interested in creating a neural network-based engine for chess. It uses an $8 \times 8 \times 73$ output space encoding every possible move, as proposed in the AlphaZero paper: Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm.

However, when I run the network, the first move it selects is sometimes invalid. How should this be dealt with? I see two basic options.

  1. Pick the next-highest-scoring move until a valid one is found. Over time, the network might then learn on its own not to rank illegal moves at the top.
  2. Score the game as a loss for the player who picked the illegal move. The disadvantage is that the network might get 'stuck' exploring only a few legal moves.
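A minimal sketch of option 1, assuming the policy head's output has been flattened to one score per move and that a boolean legality mask is available from the move generator (both argument names are illustrative, not from the paper):

```python
import numpy as np

def pick_legal_move(policy_scores, legal_mask):
    """Return the index of the highest-scoring legal move.

    policy_scores: flat array of scores over the 8*8*73 move space
    legal_mask: boolean array of the same shape, True where a move is legal
    """
    # Rank all moves from best to worst by score
    order = np.argsort(policy_scores)[::-1]
    for idx in order:
        if legal_mask[idx]:
            return idx
    raise ValueError("no legal move available")
```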

What is the preferred solution to this particular problem?


Posted 2019-06-10T12:19:14.283

Reputation: 11

Question was closed 2019-07-11T22:15:53.363



You should have a method that generates a mask of legal moves from the current board state. Apply this mask to the policy head's output before normalization, so that illegal moves receive zero probability.
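A minimal sketch of this masking, assuming softmax normalization over flattened logits and a boolean legality mask with at least one legal move (the function and argument names are illustrative):

```python
import numpy as np

def masked_policy(logits, legal_mask):
    """Mask illegal moves before softmax so they get zero probability.

    logits: raw policy-head outputs over the 8*8*73 move space, flattened
    legal_mask: boolean array of the same shape, True where a move is legal
    """
    # Illegal moves get -inf, which becomes exactly 0 after exponentiation
    masked = np.where(legal_mask, logits, -np.inf)
    # Subtract the max for numerical stability (illegal entries stay -inf)
    masked = masked - masked.max()
    probs = np.exp(masked)
    return probs / probs.sum()
```

During training, the same mask keeps the cross-entropy loss from pushing probability mass onto moves the network could never play.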


Posted 2019-06-10T12:19:14.283

Reputation: 1 845