I want to create an AI that can play five-in-a-row (gomoku). As I mentioned in the title, I want to use reinforcement learning for this.
I use a policy gradient method, namely REINFORCE with a baseline. For the value and policy function approximation, I use a neural network with convolutional and fully connected layers. All of the layers except the output layers are shared. The policy's output layer has $8 \times 8 = 64$ units (one per board square) with a softmax over them, so the policy is stochastic. But what if the network produces a very high probability for an invalid move? An invalid move is when the agent tries to mark a square that already contains an "X" or an "O". I think the agent can get stuck in that game state.
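To make the concern concrete, here is a minimal sketch of the failure mode, assuming a flat 64-element board encoding (0 = empty, 1 = "X", -1 = "O") and hand-picked logits standing in for the network's output. The names `softmax`, `invalid_mass`, and the logit values are illustrative assumptions, not part of my actual network.

```python
import math
import random

BOARD_SIZE = 8  # 8 x 8 board, as described above


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def invalid_mass(probs, board):
    """Total probability the policy assigns to occupied squares."""
    return sum(p for p, cell in zip(probs, board) if cell != 0)


# Assumed encoding: 0 = empty, 1 = "X", -1 = "O". Square 0 is already taken.
board = [0] * (BOARD_SIZE * BOARD_SIZE)
board[0] = 1

# Hypothetical logits: the policy head strongly prefers the occupied square 0.
logits = [0.0] * (BOARD_SIZE * BOARD_SIZE)
logits[0] = 10.0

probs = softmax(logits)

# Sample 1000 moves from the stochastic policy and count the invalid ones.
rng = random.Random(0)
samples = rng.choices(range(len(probs)), weights=probs, k=1000)
invalid_rate = sum(board[a] != 0 for a in samples) / len(samples)
```

With logits like these, almost every sampled move lands on the occupied square, so simply resampling until a valid move appears could take a very long time.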
Could you recommend any solution for this problem?
My guess is to use the actor-critic method. For an invalid move, we should give a negative reward and pass the turn to the opponent.
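The reward rule I have in mind could be sketched roughly as follows. This is only the environment side of the idea, under the same assumed flat board encoding (0 = empty, 1 = "X", -1 = "O"); the function name `step` and the penalty value `INVALID_MOVE_REWARD` are hypothetical, and win detection is left out.

```python
INVALID_MOVE_REWARD = -1.0  # assumed penalty; the exact value is a tuning choice


def step(board, player, action):
    """Apply `action` (a flat square index) for `player` (1 or -1).

    Returns (new_board, reward, next_player). An invalid move leaves the
    board unchanged, yields a negative reward, and passes the turn to the
    opponent. Win detection is omitted to keep the sketch short.
    """
    if board[action] != 0:
        # Occupied square: penalize and hand the turn to the opponent.
        return board, INVALID_MOVE_REWARD, -player
    new_board = list(board)  # copy so the caller's board is not mutated
    new_board[action] = player
    return new_board, 0.0, -player


# Usage: "X" takes square 0, then "O" tries the same square.
empty = [0] * 64
board_after_x, reward_x, next_player = step(empty, 1, 0)
board_after_o, reward_o, player_after_o = step(board_after_x, next_player, 0)
```

Here the second call returns the penalty and gives the move back to "X" without changing the board.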