I want to create an AI which can play five-in-a-row/gomoku. As I mentioned in the title, I want to use reinforcement learning for this.

I use a *policy gradient* method, namely REINFORCE with baseline. For the value and policy function approximation, I use a *neural network*. It has convolutional and fully connected layers. All of the layers, except for the output layers, are shared. The policy's output layer has $8 \times 8 = 64$ (the size of the board) output units with a *softmax* over them, so the policy is stochastic. But what if the network produces a very high probability for an invalid move? An invalid move is when the agent wants to mark a square that already contains an "X" or "O". I think the agent can get stuck in that game state.

Could you recommend any solution for this problem?

My guess is to use the *actor-critic* method. For an invalid move, we should give a negative reward and pass the turn to the opponent.

Thank you. Probably I wasn't clear, but I choose the move randomly according to the output probabilities. I will try your advice to set the probability of the illegal moves to zero and see what happens. Have a nice day. – Molnár István – 2017-03-14T17:03:11.347

Do you think we need to include invalid moves, with negative rewards, as samples for training? – shtse8 – 2020-08-16T19:04:51.383

Maybe if there is consistently a lot of probability mass on illegal moves. Otherwise probably not. – BlindKungFuMaster – 2020-08-18T08:12:53.837
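
The advice mentioned in the comments, setting the probability of illegal moves to zero before sampling, could look roughly like the sketch below. It is only an illustration under stated assumptions: the names `masked_policy` and `sample_move`, the board encoding (0 for empty, 1 for "X", -1 for "O"), and the use of NumPy are not from the original post. Occupied squares get a logit of negative infinity, so the softmax assigns them exactly zero probability and renormalizes over the remaining squares.

```python
import numpy as np

def masked_policy(logits, board):
    """Softmax restricted to legal moves (hypothetical helper).

    logits: shape (64,), raw outputs of the policy head.
    board:  shape (8, 8); assumed encoding 0 = empty, 1 = "X", -1 = "O".
    """
    legal = board.reshape(-1) == 0
    if not legal.any():
        raise ValueError("no legal moves: the board is full")
    # Occupied squares get -inf, so exp(-inf) = 0 after the softmax.
    masked = np.where(legal, logits, -np.inf)
    masked = masked - masked.max()  # subtract the max for numerical stability
    exp = np.exp(masked)
    return exp / exp.sum()

def sample_move(logits, board, rng):
    """Sample a square index in [0, 64) from the masked distribution."""
    probs = masked_policy(logits, board)
    return int(rng.choice(probs.size, p=probs))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    board = np.zeros((8, 8), dtype=int)
    board[0, 0] = 1           # square 0 is occupied
    logits = np.zeros(64)
    logits[0] = 100.0         # the network strongly prefers the occupied square
    move = sample_move(logits, board, rng)
    print(divmod(move, 8))    # (row, column) of a legal square
```

With this masking the agent can no longer get stuck on an occupied square. If the same masked probabilities are used when computing the log-probability term in the REINFORCE update, the gradient is taken with respect to the distribution that was actually sampled from.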